
Data Compression in Hive

1. Data file storage formats

Below is a brief overview of the storage formats Hive supports.

file_format:
	: SEQUENCEFILE
	| TEXTFILE        -- (Default, depending on hive.default.fileformat configuration)
	| RCFILE          -- (Note: Available in Hive 0.6.0 and later)
	| ORC             -- (Note: Available in Hive 0.11.0 and later)
	| PARQUET         -- (Note: Available in Hive 0.13.0 and later)
	| AVRO            -- (Note: Available in Hive 0.14.0 and later)
	| INPUTFORMAT     input_format_classname OUTPUTFORMAT output_format_classname

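Storage formats fall into two groups: row-oriented and column-oriented. The two formats below are both column-oriented.
(1) ORCFile (Optimized Row Columnar File): supported by Hive, Shark, and Spark. ORCFile is a good fit for tables with many columns.
(2) Parquet (open-sourced by Twitter and Cloudera; supported by Hive, Spark, Drill, Impala, Pig, and others). Parquet is the more complex of the two, and its design was inspired mainly by Dremel. Its main strengths are support for nested data structures and a rich set of efficient encoding and compression algorithms, chosen to suit different value distributions.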
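To make the row-versus-column distinction concrete, here is a minimal Python sketch. It is not Hive code and does not reproduce the actual ORC or Parquet file layout; the three sample records and all variable names are made up for illustration. It only shows how the same values are ordered under each approach.

```python
# Illustrative only: made-up records using three of the page_views columns
# (track_time, url, city_id).
rows = [
    ("2013-05-19 13:00:00", "http://example.com/a", "1"),
    ("2013-05-19 13:00:01", "http://example.com/b", "1"),
    ("2013-05-19 13:00:02", "http://example.com/a", "2"),
]

# Row-oriented layout (as in TEXTFILE): all fields of one record sit together.
row_layout = [value for row in rows for value in row]

# Column-oriented layout (the idea behind ORC and Parquet, applied there
# per stripe or row group): all values of one column sit together.
columns = list(zip(*rows))
column_layout = [value for column in columns for value in column]

# A query that needs only city_id can read one contiguous run of values.
city_ids = columns[2]

print(row_layout)
print(column_layout)
print(city_ids)
```

Because similar values end up next to each other in the column layout, a compressor generally has more repetition to exploit, which is one reason columnar formats tend to produce smaller files.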
(1) Stored as TEXTFILE

create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;

load data local inpath '/opt/datas/page_views.data' into table page_views ;
dfs -du -h /user/hive/warehouse/page_views/ ;
18.1 M  /user/hive/warehouse/page_views/page_views.data

(2) Stored as ORC

create table page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc ;

insert into table page_views_orc select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc/ ;
2.6 M  /user/hive/warehouse/page_views_orc/000000_0

(3) Stored as Parquet

create table page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET ;

insert into table page_views_parquet select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet/ ;
13.1 M  /user/hive/warehouse/page_views_parquet/000000_0

(4) Stored as ORC with Snappy compression

create table page_views_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");

insert into table page_views_orc_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_snappy/ ;
3.8 M  /user/hive/warehouse/page_views_orc_snappy/000000_0

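Why is this file larger than the plain ORC table in (2)? Because ORC already compresses with ZLIB by default (the same DEFLATE algorithm that gzip uses), and ZLIB compresses more tightly than Snappy. Switching to Snappy therefore trades some file size for faster compression and decompression.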

(5) Stored as ORC without compression

create table page_views_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");

insert into table page_views_orc_none select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_none/ ;
7.6 M  /user/hive/warehouse/page_views_orc_none/000000_0

(6) Stored as Parquet with Snappy compression

set parquet.compression=SNAPPY ;
create table page_views_parquet_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;
insert into table page_views_parquet_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet_snappy/ ;
2.7 M  /user/hive/warehouse/page_views_parquet_snappy/000000_0

In real-world projects, Hive tables are usually stored as ORC or Parquet, and the data is usually compressed with Snappy.
Reposted from https://blog.csdn.net/gongxifacai_believe/article/details/80833480
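The six measurements are easier to compare side by side. The following Python sketch is a small helper for that comparison, not part of the original workflow. The sizes are copied from the dfs -du -h output in this article, while the labels and the function name ratio_vs_text are made up here. It computes how many times smaller each table is than the TEXTFILE baseline.

```python
# Sizes in MB, copied from the dfs -du -h output shown in this article.
# The labels are descriptive names chosen for this sketch.
sizes_mb = {
    "TEXTFILE": 18.1,
    "ORC (default compression)": 2.6,
    "ORC + SNAPPY": 3.8,
    "ORC + NONE": 7.6,
    "PARQUET (no compression set)": 13.1,
    "PARQUET + SNAPPY": 2.7,
}


def ratio_vs_text(size_mb, baseline_mb=sizes_mb["TEXTFILE"]):
    """Return how many times smaller a table is than the TEXTFILE baseline."""
    return round(baseline_mb / size_mb, 1)


ratios = {name: ratio_vs_text(size) for name, size in sizes_mb.items()}

# Print the tables from smallest to largest.
for name, size in sorted(sizes_mb.items(), key=lambda item: item[1]):
    print(f"{name:<30}{size:>5.1f} MB  {ratios[name]:.1f}x")
```

These ratios describe this one 18.1 MB sample only; results on other data sets will vary with the column types and value distribution.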
