Storing Hive tables in Parquet format
阿新 · Published 2019-01-25
Hive 0.13 and later
Create a Hive table stored in Parquet format:
CREATE TABLE parquet_test (
  id int,
  str string,
  mp MAP<STRING,STRING>,
  lst ARRAY<STRING>,
  strct STRUCT<A:STRING,B:STRING>)
PARTITIONED BY (part string)
STORED AS PARQUET;
Testing:
Generate a Parquet file locally
>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({'one': ['test', 'lisi', 'wangwu'],
...                    'two': ['foo', 'bar', 'baz']})
>>> table = pa.Table.from_pandas(df)
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet2')
>>> # Specify the compression codec (snappy is the default)
>>> pq.write_table(table, 'example.parquet2', compression='snappy')
>>> # pq.write_table(table, 'example.parquet2', compression='gzip')
>>> # pq.write_table(table, 'example.parquet2', compression='brotli')
>>> # pq.write_table(table, 'example.parquet2', compression='none')
>>> table2 = pq.read_table('example.parquet2')
>>> table2.to_pandas()
      one  two
0    test  foo
1    lisi  bar
2  wangwu  baz
Snappy compression offers better performance; Gzip offers a better compression ratio.
Create a Hive table and load the generated Parquet data
hive> create table parquet_example(one string, two string) STORED AS PARQUET;
hive> load data local inpath './example.parquet2' overwrite into table parquet_example;
hive> select * from parquet_example;
OK
test	foo
lisi	bar
wangwu	baz
Time taken: 0.071 seconds, Fetched: 3 row(s)
Hive Parquet configuration
Hive supports Parquet-related settings, mainly:
parquet.compression
parquet.block.size
parquet.page.size
These can be set directly in Hive:
hive> set parquet.compression=snappy;
Parameters that control the block size in Hive:
parquet.block.size
dfs.blocksize
mapred.max.split.size
Reference:
Hive support for the Parquet format: Parquet