
Hive File Compression Test

hadoop hive

Hive tables can be stored in several formats, such as plain text, LZO-compressed text, and ORC. To understand how these formats relate to one another, I ran the tests below.


I. Create a sample table

hive> create table tbl( id int, name string ) row format delimited fields terminated by '|' stored as textfile;

OK

Time taken: 0.338 seconds


hive> load data local inpath '/home/grid/users.txt' into table tbl;

Copying data from file:/home/grid/users.txt

Copying file: file:/home/grid/users.txt

Loading data to table default.tbl

Table default.tbl stats: [numFiles=1, numRows=0, totalSize=111, rawDataSize=0]

OK

Time taken: 0.567 seconds


hive> select * from tbl;

OK

1 Awyp

2 Azs

3 Als

4 Aww

5 Awyp2

6 Awyp3

7 Awyp4

8 Awyp5

9 Awyp6

10 Awyp7

11 Awyp8

12 Awyp5

13 Awyp9

14 Awyp20

Time taken: 0.237 seconds, Fetched: 14 row(s)
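The post never shows /home/grid/users.txt itself, but its contents follow from the query output above: pipe-delimited id|name pairs, one per line. A hypothetical reconstruction (the file name is illustrative):

```python
# Reconstruct an equivalent users.txt from the rows returned above:
# pipe-delimited id|name pairs, one row per line.  This is a sketch --
# the original /home/grid/users.txt is not shown in the post.
names = ["Awyp", "Azs", "Als", "Aww", "Awyp2", "Awyp3", "Awyp4",
         "Awyp5", "Awyp6", "Awyp7", "Awyp8", "Awyp5", "Awyp9", "Awyp20"]
lines = [f"{i}|{name}" for i, name in enumerate(names, start=1)]
data = "\n".join(lines) + "\n"

with open("users.txt", "w", encoding="utf-8") as f:
    f.write(data)

# Hive reported totalSize=111 for the loaded file; this reconstruction
# happens to match that size exactly.
print(len(data.encode("utf-8")))  # -> 111
```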

II. Testing writes

1. No compression

hive> set hive.exec.compress.output;

hive.exec.compress.output=false


hive>

>

> create table tbltxt as select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498527794024_0001, Tracking URL = http://hadoop1:8088/proxy/application_1498527794024_0001/

Kill Command = /opt/hadoop/bin/hadoop job -kill job_1498527794024_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-27 10:55:29,906 Stage-1 map = 0%, reduce = 0%

2017-06-27 10:55:39,532 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.66 sec

MapReduce Total cumulative CPU time: 2 seconds 660 msec

Ended Job = job_1498527794024_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-27_10-55-18_962_2187345348997213497-1/-ext-10001

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/tbltxt

Table default.tbltxt stats: [numFiles=1, numRows=14, totalSize=111, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 2.66 sec HDFS Read: 318 HDFS Write: 181 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 660 msec

OK

Time taken: 22.056 seconds


hive>

> show create table tbltxt;

OK

CREATE TABLE `tbltxt`(
  `id` int,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hadoop1:9000/user/hive/warehouse/tbltxt'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'numRows'='14',
  'rawDataSize'='97',
  'totalSize'='111',
  'transient_lastDdlTime'='1498532140')

Time taken: 0.202 seconds, Fetched: 18 row(s)


hive>

>

> select * from tbltxt;

OK

1 Awyp

2 Azs

3 Als

4 Aww

5 Awyp2

6 Awyp3

7 Awyp4

8 Awyp5

9 Awyp6

10 Awyp7

11 Awyp8

12 Awyp5

13 Awyp9

14 Awyp20

Time taken: 0.059 seconds, Fetched: 14 row(s)


hive>

>

> dfs -ls /user/hive/warehouse/tbltxt;

Found 1 items

-rwxr-xr-x 1 grid supergroup 111 2017-06-27 10:55 /user/hive/warehouse/tbltxt/000000_0


hive>

>

> dfs -cat /user/hive/warehouse/tbltxt/000000_0;

1Awyp

2Azs

3Als

4Aww

5Awyp2

6Awyp3

7Awyp4

8Awyp5

9Awyp6

10Awyp7

11Awyp8

12Awyp5

13Awyp9

14Awyp20


The read and write formats are:

STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

The data reads back correctly, and the file is plain text that can be viewed directly with cat. (The fields appear fused together in the cat output because a CTAS table uses Hive's default field delimiter, the non-printing Ctrl-A/\x01 character, rather than the original '|'.)


2. Compression with the default codec

hive>

> set hive.exec.compress.output=true;

hive>

>

> set mapred.output.compression.codec;

mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec


So the current codec is the default, DefaultCodec.


hive>

> create table tbldefault as select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498527794024_0002, Tracking URL = http://hadoop1:8088/proxy/application_1498527794024_0002/

Kill Command = /opt/hadoop/bin/hadoop job -kill job_1498527794024_0002

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-27 11:14:44,845 Stage-1 map = 0%, reduce = 0%

2017-06-27 11:14:48,964 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.08 sec

MapReduce Total cumulative CPU time: 1 seconds 80 msec

Ended Job = job_1498527794024_0002

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-27_11-14-39_351_6035948930260680086-1/-ext-10001

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/tbldefault

Table default.tbldefault stats: [numFiles=1, numRows=14, totalSize=76, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 1.08 sec HDFS Read: 318 HDFS Write: 150 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 80 msec

OK

Time taken: 10.842 seconds


hive>

>

> show create table tbldefault;

OK

CREATE TABLE `tbldefault`(
  `id` int,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hadoop1:9000/user/hive/warehouse/tbldefault'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'numRows'='14',
  'rawDataSize'='97',
  'totalSize'='76',
  'transient_lastDdlTime'='1498533290')

Time taken: 0.044 seconds, Fetched: 18 row(s)


hive>

>

> select * from tbldefault;

OK

1 Awyp

2 Azs

3 Als

4 Aww

5 Awyp2

6 Awyp3

7 Awyp4

8 Awyp5

9 Awyp6

10 Awyp7

11 Awyp8

12 Awyp5

13 Awyp9

14 Awyp20

Time taken: 0.037 seconds, Fetched: 14 row(s)


hive>

>

> dfs -ls /user/hive/warehouse/tbldefault;

Found 1 items

-rwxr-xr-x 1 grid supergroup 76 2017-06-27 11:14 /user/hive/warehouse/tbldefault/000000_0.deflate

hive>

> dfs -cat /user/hive/warehouse/tbldefault/000000_0.deflate;

xws
dfX0)60K:HB

hive>

With the default compression, the table's read and write formats are the same as for plain text, but the data file is compressed with the default codec and carries the .deflate suffix, so its contents cannot be viewed directly. In other words, org.apache.hadoop.mapred.TextInputFormat recognizes the default compression from the suffix and reads the content back.
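Hadoop's DefaultCodec writes standard zlib-framed DEFLATE streams, so a .deflate file copied out of HDFS can usually be decompressed with any zlib implementation. A minimal round-trip sketch, with illustrative row bytes rather than the exact contents of 000000_0.deflate:

```python
import zlib

# Rows roughly as Hive writes them for a CTAS text table: fields joined
# by the default Ctrl-A (\x01) delimiter, one row per line.
rows = ["1\x01Awyp", "2\x01Azs", "3\x01Als"]
raw = ("\n".join(rows) + "\n").encode("utf-8")

compressed = zlib.compress(raw)        # roughly what DefaultCodec emits
restored = zlib.decompress(compressed) # what TextInputFormat recovers

assert restored == raw
print(len(raw), "->", len(compressed), "bytes")
```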


3. LZO compression

hive>

> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;


hive>

>

> create table tbllzo as select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498527794024_0003, Tracking URL = http://hadoop1:8088/proxy/application_1498527794024_0003/

Kill Command = /opt/hadoop/bin/hadoop job -kill job_1498527794024_0003

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-27 11:29:08,436 Stage-1 map = 0%, reduce = 0%

2017-06-27 11:29:14,638 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.87 sec

MapReduce Total cumulative CPU time: 1 seconds 870 msec

Ended Job = job_1498527794024_0003

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-27_11-29-03_249_4340474818139134521-1/-ext-10001

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/tbllzo

Table default.tbllzo stats: [numFiles=1, numRows=14, totalSize=106, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 1.87 sec HDFS Read: 318 HDFS Write: 176 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 870 msec

OK

Time taken: 13.744 seconds


hive>

>

> show create table tbllzo;

OK

CREATE TABLE `tbllzo`(
  `id` int,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hadoop1:9000/user/hive/warehouse/tbllzo'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'numRows'='14',
  'rawDataSize'='97',
  'totalSize'='106',
  'transient_lastDdlTime'='1498534156')

Time taken: 0.044 seconds, Fetched: 18 row(s)


hive>

> select * from tbllzo;

OK

1 Awyp

2 Azs

3 Als

4 Aww

5 Awyp2

6 Awyp3

7 Awyp4

8 Awyp5

9 Awyp6

10 Awyp7

11 Awyp8

12 Awyp5

13 Awyp9

14 Awyp20

Time taken: 0.032 seconds, Fetched: 14 row(s)


hive>

>

> dfs -ls /user/hive/warehouse/tbllzo;

Found 1 items

-rwxr-xr-x 1 grid supergroup 106 2017-06-27 11:29 /user/hive/warehouse/tbllzo/000000_0.lzo_deflate

hive>

>

> dfs -cat /user/hive/warehouse/tbllzo/000000_0.lzo_deflate;

ob1Awyp

2Azs

3Als

4Aww

5Awyp2

6

7

8

9

10

1

125

13Awyp9

14Awyp20


With LZO compression, the table's read and write format is still org.apache.hadoop.mapred.TextInputFormat; the data file carries the .lzo_deflate suffix and cannot be viewed directly. In other words, org.apache.hadoop.mapred.TextInputFormat can recognize LZO compression and read the content back. (Impressive!)


4. LZOP compression

hive>

> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;


hive>

> create table tbllzop as select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498527794024_0004, Tracking URL = http://hadoop1:8088/proxy/application_1498527794024_0004/

Kill Command = /opt/hadoop/bin/hadoop job -kill job_1498527794024_0004

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-27 11:37:28,010 Stage-1 map = 0%, reduce = 0%

2017-06-27 11:37:32,127 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.1 sec

MapReduce Total cumulative CPU time: 2 seconds 100 msec

Ended Job = job_1498527794024_0004

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-27_11-37-23_099_3493082162039010112-1/-ext-10001

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/tbllzop

Table default.tbllzop stats: [numFiles=1, numRows=14, totalSize=148, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 2.1 sec HDFS Read: 318 HDFS Write: 219 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 100 msec

OK

Time taken: 10.233 seconds


hive>

>

> show create table tbllzop;

OK

CREATE TABLE `tbllzop`(
  `id` int,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hadoop1:9000/user/hive/warehouse/tbllzop'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'numRows'='14',
  'rawDataSize'='97',
  'totalSize'='148',
  'transient_lastDdlTime'='1498534653')

Time taken: 0.046 seconds, Fetched: 18 row(s)


hive>

>

>

> select * from tbllzop;

OK

1 Awyp

2 Azs

3 Als

4 Aww

5 Awyp2

6 Awyp3

7 Awyp4

8 Awyp5

9 Awyp6

10 Awyp7

11 Awyp8

12 Awyp5

13 Awyp9

14 Awyp20

Time taken: 0.033 seconds, Fetched: 14 row(s)


hive>

>

> dfs -ls /user/hive/warehouse/tbllzop;

Found 1 items

-rwxr-xr-x 1 grid supergroup 148 2017-06-27 11:37 /user/hive/warehouse/tbllzop/000000_0.lzo

hive>

>

> dfs -cat /user/hive/warehouse/tbllzop/000000_0.lzo;

ob1Awyp

2Azs

3Als

4Aww

5Awyp2

6

7

8

9

10

1

125

13Awyp9

14Awyp20


Likewise, with LZOP compression the table's read and write format is still org.apache.hadoop.mapred.TextInputFormat; the data file carries the .lzo suffix and cannot be viewed directly. org.apache.hadoop.mapred.TextInputFormat recognizes LZOP compression and reads the content back.
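All four cases fit one pattern: Hadoop chooses a decompression codec from the file-name suffix (this is what CompressionCodecFactory does), and TextInputFormat then reads through whichever codec matched. A rough sketch of that lookup; the suffix table mirrors the default codec extensions, but the helper function itself is hypothetical, not Hadoop's actual code:

```python
# Suffix -> codec class, mirroring the default extensions each codec
# registers.  Hadoop's CompressionCodecFactory performs an equivalent
# lookup before TextInputFormat reads a split.
SUFFIX_TO_CODEC = {
    ".deflate":     "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz":          "org.apache.hadoop.io.compress.GzipCodec",
    ".lzo_deflate": "com.hadoop.compression.lzo.LzoCodec",
    ".lzo":         "com.hadoop.compression.lzo.LzopCodec",
}

def codec_for(path):
    """Return the codec class name matching the file suffix, or None."""
    for suffix, codec in SUFFIX_TO_CODEC.items():
        if path.endswith(suffix):
            return codec
    return None  # no suffix match: read the file as uncompressed text

print(codec_for("/user/hive/warehouse/tbllzo/000000_0.lzo_deflate"))
print(codec_for("/user/hive/warehouse/tbltxt/000000_0"))
```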



The cases above show that, whatever codec is used, Hive still treats the data as plain text (merely compressed in different ways): org.apache.hadoop.mapred.TextInputFormat can read all of it, and on insert Hive compresses solely according to mapred.output.compression.codec, ignoring the input format declared on the table. The following experiments verify this:


1. With set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec, inserted data is LZOP-compressed and reads back correctly.


hive> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;


hive>

> create table tbltest1( id int, name string )

> stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'

> outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

OK

Time taken: 0.493 seconds


hive>

> insert into table tbltest1 select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498660018952_0001, Tracking URL = http://hadoop1:8088/proxy/application_1498660018952_0001/

Kill Command = /opt/hadoop/bin/hadoop job -kill job_1498660018952_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-28 22:59:27,886 Stage-1 map = 0%, reduce = 0%

2017-06-28 22:59:36,427 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.25 sec

MapReduce Total cumulative CPU time: 2 seconds 250 msec

Ended Job = job_1498660018952_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-28_22-59-14_730_4437480099583255943-1/-ext-10000

Loading data to table default.tbltest1

Table default.tbltest1 stats: [numFiles=1, numRows=14, totalSize=148, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 2.25 sec HDFS Read: 318 HDFS Write: 220 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 250 msec

OK

Time taken: 24.151 seconds


hive>

> dfs -ls /user/hive/warehouse/tbltest1;

Found 1 items

-rwxr-xr-x 1 grid supergroup 148 2017-06-28 22:59 /user/hive/warehouse/tbltest1/000000_0.lzo


hive>

> select * from tbltest1;

OK

1 Awyp

2 Azs

3 Als

4 Aww

5 Awyp2

6 Awyp3

7 Awyp4

8 Awyp5

9 Awyp6

10 Awyp7

11 Awyp8

12 Awyp5

13 Awyp9

14 Awyp20

Time taken: 0.055 seconds, Fetched: 14 row(s)


2. With set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec, inserted data is compressed with the default codec and reads back correctly.


hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;


hive> create table tbltest2( id int, name string )

> stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'

> outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

OK

Time taken: 0.142 seconds


hive> insert into table tbltest2 select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498660018952_0002, Tracking URL = http://hadoop1:8088/proxy/application_1498660018952_0002/

Kill Command = /opt/hadoop/bin/hadoop job -kill job_1498660018952_0002

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-28 23:09:06,439 Stage-1 map = 0%, reduce = 0%

2017-06-28 23:09:11,668 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.15 sec

MapReduce Total cumulative CPU time: 1 seconds 150 msec

Ended Job = job_1498660018952_0002

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-28_23-09-01_674_9172062679713398655-1/-ext-10000

Loading data to table default.tbltest2

Table default.tbltest2 stats: [numFiles=1, numRows=14, totalSize=76, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 1.15 sec HDFS Read: 318 HDFS Write: 148 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 150 msec

OK

Time taken: 11.278 seconds


hive>

>

>

> dfs -ls /user/hive/warehouse/tbltest2;

Found 1 items

-rwxr-xr-x 1 grid supergroup 76 2017-06-28 23:09 /user/hive/warehouse/tbltest2/000000_0.deflate


hive>

> select * from tbltest2;

OK

1 Awyp

2 Azs

3 Als

4 Aww

5 Awyp2

6 Awyp3

7 Awyp4

8 Awyp5

9 Awyp6

10 Awyp7

11 Awyp8

12 Awyp5

13 Awyp9

14 Awyp20

Time taken: 0.035 seconds, Fetched: 14 row(s)


3. When the table is stored as ORC, data is compressed according to the ORC format, unaffected by mapred.output.compression.codec and hive.exec.compress.output.

hive> set hive.exec.compress.output=false;

hive> create table tbltest3( id int, name string )

> stored as orc tblproperties("orc.compress"="SNAPPY");

OK

Time taken: 0.08 seconds


hive> insert into table tbltest3 select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498660018952_0003, Tracking URL = http://hadoop1:8088/proxy/application_1498660018952_0003/

Kill Command = /opt/hadoop/bin/hadoop job -kill job_1498660018952_0003

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-28 23:30:29,865 Stage-1 map = 0%, reduce = 0%

2017-06-28 23:30:34,007 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.14 sec

MapReduce Total cumulative CPU time: 1 seconds 140 msec

Ended Job = job_1498660018952_0003

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-28_23-30-25_350_7458831371800658041-1/-ext-10000

Loading data to table default.tbltest3

Table default.tbltest3 stats: [numFiles=1, numRows=14, totalSize=365, rawDataSize=1288]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 1.14 sec HDFS Read: 318 HDFS Write: 439 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 140 msec

OK

Time taken: 9.963 seconds


hive> dfs -ls /user/hive/warehouse/tbltest3;

Found 1 items

-rwxr-xr-x 1 grid supergroup 365 2017-06-28 23:30 /user/hive/warehouse/tbltest3/000000_0


hive>

> dfs -cat /user/hive/warehouse/tbltest3/000000_0;

ORC
(binary ORC data, not human-readable)

hive>

> show create table tbltest3;

OK

CREATE TABLE `tbltest3`(
  `id` int,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hadoop1:9000/user/hive/warehouse/tbltest3'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'numRows'='14',
  'orc.compress'='SNAPPY',
  'rawDataSize'='1288',
  'totalSize'='365',
  'transient_lastDdlTime'='1498663835')

Time taken: 0.217 seconds, Fetched: 19 row(s)


hive>

> select * from tbltest3;

OK

1 Awyp

2 Azs

3 Als

4 Aww

5 Awyp2

6 Awyp3

7 Awyp4

8 Awyp5

9 Awyp6

10 Awyp7

11 Awyp8

12 Awyp5

13 Awyp9

14 Awyp20

Time taken: 0.689 seconds, Fetched: 14 row(s)


So with ORC, inserts are unaffected by the compression parameters, and the input and output formats are no longer the text ones.
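Unlike the suffix-tagged text files, an ORC file is self-describing: it begins with the magic bytes "ORC" (visible at the top of the dfs -cat output above), and its compression codec is recorded inside the file. A sketch of telling the two apart; classify is a made-up helper, not Hive's actual logic:

```python
def classify(path, header):
    """Guess the storage format of a warehouse file.

    ORC is detected from the file's own magic bytes; compressed text can
    only be detected from the file-name suffix.  (Illustrative helper.)
    """
    if header.startswith(b"ORC"):
        return "orc"
    if path.endswith((".deflate", ".gz", ".lzo", ".lzo_deflate")):
        return "compressed text"
    return "plain text"

print(classify("000000_0", b"ORC\x1d"))        # the tbltest3 file
print(classify("000000_0.deflate", b"x\x9c"))  # the tbldefault file
```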


III. Summary

1. Whether uncompressed, default-compressed, or LZO/LZOP-compressed, the data is text format as far as Hive is concerned; the compression is recognized automatically from the data file's suffix, and on write the session parameters decide whether to compress and with which codec.

2. ORC is a different format for Hive: regardless of the parameters, reads and writes follow the format specified in the CREATE TABLE statement.


This post originally appeared on the "大數據學習探索" ("Big Data Learning and Exploration") blog; please keep this attribution: http://bigdata1024.blog.51cto.com/6098731/1942877
