The trouble with timestamps when Hive and Impala operate on Parquet files

阿新 • Published: 2019-02-06

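Preface: I was planning to use Hive as the data warehouse. For historical reasons, all of the existing data processing had been done in Impala, with the data stored as Parquet files. Because the cluster has few resources and the files to process are large, the plan was to run the analysis offline in Hive and push the resulting small files to a DB or to Impala for display.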

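Preparation: set up CDH 5.9 and migrate the existing data from the old cluster to the new one. The data is dynamically partitioned by day, and the partitioned data still uses the Parquet format.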

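Problem: because the partition field is of type timestamp, I noticed a strange problem by chance: the times returned by Hive queries were 8 hours later than the times returned by Impala queries. Comparing both against the original data showed that the timestamp values as processed by Hive were the wrong ones.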

It seems that when support for saving timestamps in Parquet was added to Hive, the primary goal was to be compatible with Impala's implementation, which probably predates the addition of the timestamp_millis type to the Parquet specification.

Impala's timestamp representation maps to the int96 Parquet type (4 bytes for the date, 8 bytes for the time, details in the linked discussion).

So no, storing a Hive timestamp in Parquet does not use the timestamp_millis type, but Impala's int96 timestamp representation instead.

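The passage above is the cause of the problem as far as I could find. My English is not great and the text is not hard to read, so I will not translate it.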
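To make the int96 layout from the quote more concrete, here is a minimal Python sketch. It assumes the commonly documented layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian); the helper names encode_int96 and decode_int96 are made up for illustration and are not part of Hive, Impala, or any Parquet library. The point to notice is that the 12 bytes carry no time zone at all, so the writer and the reader have to agree on what the value means.

```python
import struct
from datetime import datetime, timedelta

# Julian day number of 1970-01-01, used to convert to and from calendar dates.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588


def encode_int96(ts: datetime) -> bytes:
    """Pack a naive datetime into 12 bytes: nanos-of-day (8) + Julian day (4)."""
    midnight = datetime(ts.year, ts.month, ts.day)
    julian_day = (midnight - datetime(1970, 1, 1)).days + JULIAN_DAY_OF_UNIX_EPOCH
    since_midnight = ts - midnight
    nanos = (since_midnight.seconds * 1_000_000 + since_midnight.microseconds) * 1_000
    # "<" = little-endian without padding, "q" = 8-byte int, "i" = 4-byte int.
    return struct.pack("<qi", nanos, julian_day)


def decode_int96(raw: bytes) -> datetime:
    """Unpack the 12-byte value back into a naive datetime (no zone is stored)."""
    nanos, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return datetime(1970, 1, 1) + timedelta(days=days, microseconds=nanos // 1_000)


sample = datetime(2019, 2, 6, 10, 30, 0)
raw_value = encode_int96(sample)
round_trip = decode_int96(raw_value)
```

Since the stored value is zone-less, an 8-hour gap like the one above is what you would expect on a GMT+8 cluster if one engine writes the local wall-clock time as-is and the other treats the stored value as UTC and converts it to local time on read. That is my reading of the behaviour, not something the quote states outright.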

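Now for my fix. Since I plan to use Hive rather than Impala over the long term, I converted the timestamp data with to_utc_timestamp(insert_time, 'GMT+8'), which takes the extra 8 hours back off (look the function up if it is unfamiliar). I then re-partitioned the data and stored it in ORC file format. Briefly, ORC is a columnar storage format whose data files take up little space.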
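The arithmetic behind this fix can be sketched in Python. This is only a model under stated assumptions: hive_read is a hypothetical stand-in for how Hive appeared to interpret the Impala-written values (stored value treated as UTC, shown in the cluster's GMT+8 zone), and the Python to_utc_timestamp below imitates the Hive function of the same name rather than calling Hive.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the cluster runs in GMT+8, matching the 'GMT+8' argument above.
CLUSTER_ZONE = timezone(timedelta(hours=8))


def hive_read(stored: datetime) -> datetime:
    """Model of the symptom: treat the stored value as UTC, show it in local time."""
    return stored.replace(tzinfo=timezone.utc).astimezone(CLUSTER_ZONE).replace(tzinfo=None)


def to_utc_timestamp(ts: datetime, zone: timezone) -> datetime:
    """Imitation of Hive's to_utc_timestamp: read ts as local time in zone, return UTC."""
    return ts.replace(tzinfo=zone).astimezone(timezone.utc).replace(tzinfo=None)


original = datetime(2019, 2, 6, 10, 30, 0)        # value as written by Impala
seen_in_hive = hive_read(original)                # 8 hours later than the original
fixed = to_utc_timestamp(seen_in_hive, CLUSTER_ZONE)

# A late-evening row also lands on the next calendar day before the fix,
# which matters when the data is partitioned by day.
late_evening = datetime(2019, 2, 6, 20, 0, 0)
late_seen_in_hive = hive_read(late_evening)
late_fixed = to_utc_timestamp(late_seen_in_hive, CLUSTER_ZONE)
```

In this model the conversion restores the original value, and it also moves late-evening rows back to the correct day, so applying it before deriving the daily partition keeps rows in the partition they belong to.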

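The frustrating part is that Impala, at least in the version on this cluster, does not support ORC data files, so I had to settle for a compromise: large data files are processed offline in Hive, and the results are pushed to Impala or a DB and stored in a format that Impala supports.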

This post is dedicated to the brain cells that were lost solving this problem!
