The Four Hive File Formats

阿新 • Published: 2019-02-02

Hive supports the following file storage formats:

1. TEXTFILE

2. SEQUENCEFILE

3. RCFILE

4. ORCFILE (available since Hive 0.11)

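TEXTFILE is the default format: a table created without a STORED AS clause uses it. Loading data into such a table simply copies the data file to HDFS without processing it.

Tables stored as SEQUENCEFILE, RCFILE, or ORCFILE cannot load data directly from a local text file, because LOAD DATA copies the file as-is without converting it. The data must first be loaded into a TEXTFILE table and then written into the SequenceFile, RCFile, or ORCFile table with INSERT ... SELECT.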

Test environment:

Hive 0.8

Create a table named testfile_table, stored as TEXTFILE, and load the raw data into it:

create table if not exists testfile_table( site string, url string, pv bigint, label string) row format delimited fields terminated by '\t' stored as textfile;

load data local inpath '/app/weibo.txt' overwrite into table testfile_table;

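I. TEXTFILE
The default format. Data is stored uncompressed, so both disk usage and parsing overhead are high.
TEXTFILE can be combined with Gzip or Bzip2 compression (Hive detects the compression automatically and decompresses the data when a query runs). A Gzip-compressed file, however, cannot be split, so each file is read by a single task and the data in it cannot be processed in parallel. Bzip2 files, by contrast, are splittable in newer Hadoop releases.
Example: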

create table if not exists textfile_table(
site string,
url  string,
pv   bigint,
label string)
row format delimited
fields terminated by '\t'
stored as textfile;
Insert data:
set hive.exec.compress.output=true;  
set mapred.output.compress=true;  
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;  
insert overwrite table textfile_table select * from testfile_table;
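
The remark above that Gzip-compressed text cannot be split can be illustrated outside Hive. The following Python sketch is only an illustration, not Hive or Hadoop code, and the sample rows are made up. It compresses some tab-separated lines and then tries to start decoding in the middle of the stream, which is what a second parallel task would have to do.

```python
import gzip
import zlib

# Made-up tab-separated rows standing in for the real data file.
rows = "".join(
    f"site{i}\thttp://example.com/{i}\t{i}\tlabel{i}\n" for i in range(10000)
)
compressed = gzip.compress(rows.encode("utf-8"))


def can_decompress_from(data: bytes, offset: int) -> bool:
    """Return True if gzip decoding succeeds when started at `offset`."""
    try:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(data[offset:])
        return True
    except zlib.error:
        # The bytes at an arbitrary offset are not a gzip header, and the
        # deflate stream carries no sync markers to resume from.
        return False


from_start = can_decompress_from(compressed, 0)
from_middle = can_decompress_from(compressed, len(compressed) // 2)
print("decodable from start:", from_start)
print("decodable from middle:", from_middle)
```

Because decoding has to begin at byte 0, one Gzip file ends up as one unit of work, however large it is.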

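II. SEQUENCEFILE
SequenceFile is a binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible.
SequenceFile supports three compression types: NONE, RECORD, and BLOCK. RECORD compression achieves a low compression ratio, so BLOCK compression is generally recommended.
Example: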

create table if not exists seqfile_table(
site string,
url  string,
pv   bigint,
label string)
row format delimited
fields terminated by '\t'
stored as sequencefile;
Insert data:
set hive.exec.compress.output=true;  
set mapred.output.compress=true;  
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;  
SET mapred.output.compression.type=BLOCK;
insert overwrite table seqfile_table select * from textfile_table;
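
Why RECORD compression compresses worse than BLOCK compression can be sketched in a few lines of Python. This is a simplified illustration that uses zlib as a stand-in codec; it does not reproduce the SequenceFile on-disk layout, and the records and the `block_size` value are made up.

```python
import zlib

# Made-up records resembling the (site, url, pv, label) rows used above.
records = [
    f"site{i % 50}\thttp://example.com/page/{i}\t{i * 7}\tlabel{i % 10}".encode("utf-8")
    for i in range(2000)
]

raw_total = sum(len(r) for r in records)

# RECORD-style: each record is compressed on its own, so redundancy
# shared between records cannot be exploited.
record_total = sum(len(zlib.compress(r)) for r in records)

# BLOCK-style: many records are buffered and compressed together.
block_size = 500  # hypothetical number of records per block
block_total = sum(
    len(zlib.compress(b"\n".join(records[i:i + block_size])))
    for i in range(0, len(records), block_size)
)

print(f"raw={raw_total} record={record_total} block={block_total}")
```

Short records give the codec very little context to work with, which is why batching them into blocks pays off.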

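III. RCFILE
RCFILE is a storage format that combines row-oriented and column-oriented storage. First, it partitions the data horizontally into row groups, which guarantees that all fields of a record sit in the same block, so reading one record never requires reading multiple blocks. Second, within each row group the data is stored column by column, which favors compression and fast access to individual columns.
RCFILE example: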

create table if not exists rcfile_table(
site string,
url  string,
pv   bigint,
label string)
row format delimited
fields terminated by '\t'
stored as rcfile;
Insert data:
set hive.exec.compress.output=true;  
set mapred.output.compress=true;  
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;  
insert overwrite table rcfile_table select * from textfile_table;
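
The row-group idea described above can be sketched with plain Python lists. This is a conceptual model only, not the RCFile binary format; the function names, the sample rows, and the `group_size` value are all hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int, str]
COLUMNS = ("site", "url", "pv", "label")


def to_row_groups(rows: List[Row], group_size: int) -> List[Dict[str, list]]:
    """Split rows horizontally into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the rows of this group into columns.
        groups.append({name: list(values) for name, values in zip(COLUMNS, zip(*chunk))})
    return groups


def read_column(groups: List[Dict[str, list]], name: str) -> list:
    """Read one column by touching only that column's data in each group."""
    return [value for group in groups for value in group[name]]


def read_row(groups: List[Dict[str, list]], index: int, group_size: int) -> tuple:
    """Rebuild one record; all of its fields live in a single row group."""
    group = groups[index // group_size]
    offset = index % group_size
    return tuple(group[name][offset] for name in COLUMNS)


rows = [(f"site{i}", f"http://example.com/{i}", i * 10, f"label{i % 3}") for i in range(7)]
groups = to_row_groups(rows, group_size=3)
print(len(groups))
print(read_column(groups, "pv"))
print(read_row(groups, 4, group_size=3))
```

A query that needs only `pv` reads one list per group, while a full record is still rebuilt from a single group. Values of the same column stored next to each other also tend to compress well.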

IV. ORCFILE (available since Hive 0.11)
V. Storage used by the TEXTFILE, SEQUENCEFILE, and RCFILE tables:

$ hadoop dfs -dus /user/hive/warehouse/*
hdfs://node1:19000/user/hive/warehouse/hbase_table_1    0
hdfs://node1:19000/user/hive/warehouse/hbase_table_2    0
hdfs://node1:19000/user/hive/warehouse/orcfile_table    0
hdfs://node1:19000/user/hive/warehouse/rcfile_table    102638073
hdfs://node1:19000/user/hive/warehouse/seqfile_table   112497695
hdfs://node1:19000/user/hive/warehouse/testfile_table  536799616
hdfs://node1:19000/user/hive/warehouse/textfile_table  107308067
$ hadoop dfs -ls /user/hive/warehouse/*/
-rw-r--r--   2 hadoop supergroup   51328177 2014-03-20 00:42 /user/hive/warehouse/rcfile_table/000000_0
-rw-r--r--   2 hadoop supergroup   51309896 2014-03-20 00:43 /user/hive/warehouse/rcfile_table/000001_0
-rw-r--r--   2 hadoop supergroup   56263711 2014-03-20 01:20 /user/hive/warehouse/seqfile_table/000000_0
-rw-r--r--   2 hadoop supergroup   56233984 2014-03-20 01:21 /user/hive/warehouse/seqfile_table/000001_0
-rw-r--r--   2 hadoop supergroup  536799616 2014-03-19 23:15 /user/hive/warehouse/testfile_table/weibo.txt
-rw-r--r--   2 hadoop supergroup   53659758 2014-03-19 23:24 /user/hive/warehouse/textfile_table/000000_0.gz
-rw-r--r--   2 hadoop supergroup   53648309 2014-03-19 23:26 /user/hive/warehouse/textfile_table/000001_1.gz

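Summary:
Compared with TEXTFILE and SEQUENCEFILE, RCFILE costs more when data is loaded because of its columnar layout, but it offers a better compression ratio and faster query response. A data warehouse is typically written once and read many times, so overall RCFILE has a clear advantage over the other two formats.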
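
The compression-ratio claim can be checked against the sizes in the `hadoop dfs -dus` listing above. The short Python calculation below only redoes that arithmetic; the byte counts are copied from the listing, and the labels in parentheses reflect the settings used in the insert statements.

```python
# Table sizes in bytes, copied from the `hadoop dfs -dus` output above.
raw_size = 536799616  # testfile_table (uncompressed weibo.txt)
compressed = {
    "textfile_table (Gzip)": 107308067,
    "seqfile_table (Gzip, BLOCK)": 112497695,
    "rcfile_table (Gzip)": 102638073,
}

# Fraction of the raw size that each table occupies on HDFS.
ratios = {name: size / raw_size for name, size in compressed.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.1%} of the raw size")
```

On this data set the three compressed tables land at roughly 19% (RCFILE), 20% (TEXTFILE with Gzip), and 21% (SEQUENCEFILE) of the raw size, so RCFILE is the smallest, though by a modest margin.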
