Data Compression in Hive
阿新 • Published: 2019-01-06
1. Data file storage formats
Here is a brief overview of the storage formats Hive supports.
file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
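Storage formats fall into two groups: row-oriented and column-oriented.
(1) ORCFile (Optimized Row Columnar File): supported by Hive, Shark, and Spark. ORCFile is a good fit for tables with many columns.
(2) Parquet (open-sourced by Twitter and Cloudera; supported by Hive, Spark, Drill, Impala, Pig, and others). Parquet is more complex and was inspired mainly by Dremel. Its main strengths are support for nested data structures and a rich set of efficient encoding and compression algorithms, which lets it handle columns with different value distributions.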
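To make the row-oriented versus column-oriented distinction concrete, here is a minimal Python sketch. It is only an illustration, not how ORC or Parquet lay out bytes on disk: the column names are borrowed from the page_views table used later, the values are synthetic, and helper names such as `make_rows` and `to_columns` are hypothetical. It serializes the same records both ways and prints the zlib-compressed size of each layout so the two can be compared.

```python
import zlib

# Column names borrowed from the page_views table; the values are made up.
COLUMNS = ["track_time", "url", "city_id"]


def make_rows(n):
    """Build n synthetic rows (hypothetical data, not the real page_views file)."""
    return [
        (f"2013-05-19 13:00:{i % 60:02d}", f"http://example.com/page/{i}", str(i % 5))
        for i in range(n)
    ]


def to_columns(rows):
    """Regroup the records so that each column's values sit together."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def row_layout(rows):
    """Row-oriented: all fields of one record are stored next to each other."""
    return "\n".join("\t".join(row) for row in rows).encode("utf-8")


def column_layout(rows):
    """Column-oriented: all values of one column are stored next to each other."""
    columns = to_columns(rows)
    return "\n".join("\t".join(columns[name]) for name in COLUMNS).encode("utf-8")


if __name__ == "__main__":
    rows = make_rows(10000)
    for label, blob in (("row", row_layout(rows)), ("column", column_layout(rows))):
        compressed = zlib.compress(blob)
        print(f"{label:>6}: {len(blob)} bytes raw, {len(compressed)} bytes with zlib")
```

Both layouts hold exactly the same bytes of data, only in a different order. A query that needs just `city_id` can read a single entry of `to_columns(rows)` in the columnar form, whereas the row form forces it to scan every record.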
(1) Stored as TEXTFILE
create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;
load data local inpath '/opt/datas/page_views.data' into table page_views ;
dfs -du -h /user/hive/warehouse/page_views/ ;
18.1 M /user/hive/warehouse/page_views/page_views.data
(2) Stored as ORC
create table page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc ;
insert into table page_views_orc select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc/ ;
2.6 M /user/hive/warehouse/page_views_orc/000000_0
(3) Stored as Parquet
create table page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET ;
insert into table page_views_parquet select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet/ ;
13.1 M /user/hive/warehouse/page_views_parquet/000000_0
(4) Stored as ORC with Snappy compression
create table page_views_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
insert into table page_views_orc_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_snappy/ ;
3.8 M /user/hive/warehouse/page_views_orc_snappy/000000_0
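Why is this file larger than the 2.6 M ORC table in (2)? Because ORC compresses with ZLIB by default (the same DEFLATE algorithm that Gzip uses), and ZLIB compresses more tightly than Snappy at the cost of more CPU time.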
(5) Stored as ORC without compression
create table page_views_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");
insert into table page_views_orc_none select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_none/ ;
7.6 M /user/hive/warehouse/page_views_orc_none/000000_0
(6) Stored as Parquet with Snappy compression
set parquet.compression=SNAPPY ;
create table page_views_parquet_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;
insert into table page_views_parquet_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet_snappy/ ;
2.7 M /user/hive/warehouse/page_views_parquet_snappy/000000_0
In real-world projects, Hive tables are usually stored as ORCFile or Parquet, with Snappy as the compression codec.
Reposted from https://blog.csdn.net/gongxifacai_believe/article/details/80833480
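The six measurements above are easier to compare side by side. The following Python sketch does only the arithmetic: the sizes are copied from the `dfs -du -h` output in this post, while the labels and the helper name `compression_report` are assumptions made for illustration. It ranks each format by size and reports its ratio relative to the TEXTFILE baseline.

```python
# Sizes in MB, copied from the `dfs -du -h` output in sections (1) to (6).
SIZES_MB = {
    "TEXTFILE": 18.1,
    "ORC (default)": 2.6,
    "PARQUET (default)": 13.1,
    "ORC + SNAPPY": 3.8,
    "ORC + NONE": 7.6,
    "PARQUET + SNAPPY": 2.7,
}


def compression_report(sizes, baseline="TEXTFILE"):
    """Return (name, size, ratio, percent_saved) tuples, smallest file first."""
    base = sizes[baseline]
    report = []
    for name, size in sorted(sizes.items(), key=lambda item: item[1]):
        ratio = round(base / size, 2)               # how many times smaller than the baseline
        saved = round(100 * (1 - size / base), 1)   # percent of the baseline size saved
        report.append((name, size, ratio, saved))
    return report


if __name__ == "__main__":
    for name, size, ratio, saved in compression_report(SIZES_MB):
        print(f"{name:<18} {size:>5.1f} MB  {ratio:>5.2f}x  {saved:>5.1f}% smaller")
```

These numbers cover storage size for one 18.1 M sample file only. They say nothing about read or write speed, which is the other half of the trade-off when choosing between codecs.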