
Using Compression in Hive

Importing data from MySQL into Hive in Snappy-compressed format with Sqoop

hive (default)> create table product_info_snappy as select * from product_info where 1=2; (This creates a table in Hive with the same structure as product_info; the "where 1=2" predicate copies the schema without any rows. The source product_info table lives in the ruozedata5 database in MySQL.)

[[email protected] ~]$ sqoop import --connect jdbc:mysql://localhost:3306/ruozedata5 --username root --password 123456 --delete-target-dir --table product_info --fields-terminated-by '\t' --hive-import --hive-overwrite --hive-table product_info_snappy --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec -m 1

[[email protected] ~]$ hdfs dfs -ls /user/hive/warehouse/product_info_snappy
-rwxr-xr-x 1 hadoop supergroup 990 2018-12-05 20:05 /user/hive/warehouse/product_info_snappy/part-m-00000.snappy
(You can see that the file imported into Hive is already in the compressed format.)
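Snappy support comes from the Hadoop native libraries, so Hive reads the compressed file transparently and nothing changes on the query side. A minimal sanity check (a sketch of my own, not in the original transcript):

hive (default)> select count(1) from product_info_snappy;

The count should match the row count of product_info in MySQL.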

Using the LZO/lzop compression formats

Installing the lzop native library

Note: all of the following build steps are done on another machine and the artifacts are then scp'ed to the current machine. The build requires Maven, which I covered earlier, so I won't repeat that here.

[[email protected] opt]# yum -y install  lzo-devel  zlib-devel  gcc autoconf automake libtool
[[email protected] opt]# wget www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
[[email protected] opt]# tar -zxf lzo-2.06.tar.gz
[[email protected] opt]# cd lzo-2.06
[[email protected] lzo-2.06]# export CFLAGS=-m64
[[email protected] lzo-2.06]# ./configure --enable-shared --prefix=/usr/local/lzo-2.06
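The transcript above stops before the actual compile and install. Assuming configure succeeds, the build would be completed with something like (a step the original omits):

[[email protected] lzo-2.06]# make && make install

This installs the headers and shared libraries under /usr/local/lzo-2.06, which both the hadoop-lzo build and the Hadoop configuration below point at.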

Installing hadoop-lzo

[[email protected] opt]# wget https://github.com/twitter/hadoop-lzo/archive/master.zip
[[email protected] opt]# unzip master
[[email protected] opt]# vi hadoop-lzo-master/pom.xml 
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.current.version>2.7.4</hadoop.current.version> <!-- change this to match your Hadoop version -->
<hadoop.old.version>1.0.4</hadoop.old.version>
</properties>

[[email protected] opt]# cd hadoop-lzo-master/
[[email protected] hadoop-lzo-master]# export CFLAGS=-m64
[[email protected] hadoop-lzo-master]# export C_INCLUDE_PATH=/usr/local/lzo-2.06/include
[[email protected] hadoop-lzo-master]#  export LIBRARY_PATH=/usr/local/lzo-2.06/lib
[[email protected] hadoop-lzo-master]# mvn clean package -Dmaven.test.skip=true
[[email protected] hadoop-lzo-master]# pwd
/opt/hadoop-lzo-master
[[email protected] hadoop-lzo-master]#  cd target/native/Linux-amd64-64
[[email protected] Linux-amd64-64]#  tar -cBf - -C lib . | tar -xBvf - -C ~
[[email protected] Linux-amd64-64]#  cp ~/libgplcompression*  /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/native/
[[email protected] target]#   pwd
/opt/hadoop-lzo-master/target
[[email protected] target]# cp hadoop-lzo-0.4.21-SNAPSHOT.jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/

Copying the files over to the current machine

[[email protected] target]# cd /usr/local
[[email protected] local]# scp -r lzo-2.06 [email protected]:/opt/
[[email protected] local]#  scp /opt/hadoop-lzo-master/target/hadoop-lzo-0.4.21-SNAPSHOT.jar [email protected]:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/
Then copy the libgplcompression* files under native/ one by one to the native directory of the Hadoop installation on 192.168.2.65.
After copying, change the owner and group of those files under the Hadoop directory to hadoop.
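A sketch of those last two steps, assuming the same paths as above (the libgplcompression* files are still in ~ on the build machine):

[[email protected] local]# scp ~/libgplcompression* [email protected]:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/native/
Then, on 192.168.2.65 as root:
# chown hadoop:hadoop /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/native/libgplcompression*
# chown hadoop:hadoop /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar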

Configuring the 192.168.2.65 machine

vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export LD_LIBRARY_PATH=/usr/local/lzo-2.06/lib
vi $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
<name>io.compression.codec.lzo.class</name>
 <value>com.hadoop.compression.lzo.LzoCodec</value>
 </property>
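Once these codec entries are in place and Hadoop has been restarted, you can confirm they were picked up by echoing the property from the Hive CLI (a quick check of my own, not in the original steps):

hive (default)> set io.compression.codecs;

This should print back the full list configured above, including the two lzo entries.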
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>

<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>

<property>
<name>mapreduce.map.output.compress.codec</name>
 <value>com.hadoop.compression.lzo.LzoCodec</value>
 </property>
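These mapred-site.xml settings apply to every job on the cluster. The same pairing (LzoCodec for the intermediate map output, LzopCodec for the final output) can also be set per job on the command line via -D options; a sketch using the examples jar (/some/input and /some/output are placeholders):

hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=com.hadoop.compression.lzo.LzoCodec \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
  /some/input /some/output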

Note: # lzop -V (tests whether the installation succeeded)
# lzop -h (shows the help text)
# lzop access.log (compresses a log file)
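To verify a compressed file and get the original back (a quick sketch; access.log stands in for any file at hand):

# lzop -t access.log.lzo    (tests the archive's integrity)
# lzop -d access.log.lzo    (decompresses; note lzop keeps the .lzo file and fails if access.log already exists)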

An LZO example

Importing a MySQL table into Hive in compressed form via Sqoop

hive (default)> desc test_lzo;
id                   varchar(10)
first_name           varchar(10)
last_name            varchar(10)
sex                  varchar(5)
score                varchar(10)
copy_id              varchar(10)
hive (default)> select * from test_lzo;
Time taken: 0.441 seconds (empty; only the table structure exists)
[[email protected] ~]$ sqoop import --connect jdbc:mysql://localhost:3306/mysql --username root --password 123456 --delete-target-dir --table test --fields-terminated-by '\t' --hive-import --hive-overwrite --hive-table test_lzo -m 1
Note that no compression option is passed this time: the output still comes out as .lzo because the mapred-site.xml settings above enable job output compression with LzopCodec cluster-wide.
[[email protected] hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -ls /user/hive/warehouse/test_lzo
-rwxr-xr-x   1 hadoop supergroup   66407404 2018-12-07 17:03 /user/hive/warehouse/test_lzo/part-m-00000.lzo
[[email protected] hadoop-2.6.0-cdh5.7.0]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/test_lzo/part-m-00000.lzo (generates the index file)
[[email protected] hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -ls /user/hive/warehouse/test_lzo/
-rwxr-xr-x   1 hadoop supergroup   66407404 2018-12-07 17:03 /user/hive/warehouse/test_lzo/part-m-00000.lzo
-rw-r--r--   1 hadoop supergroup       4120 2018-12-07 17:45 /user/hive/warehouse/test_lzo/part-m-00000.lzo.index
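LzoIndexer does the indexing locally, which is fine for a single file. hadoop-lzo also ships com.hadoop.compression.lzo.DistributedLzoIndexer, which runs the indexing as a MapReduce job; a sketch for indexing everything under the table directory at once:

[[email protected] hadoop-2.6.0-cdh5.7.0]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/test_lzo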
[[email protected] mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /user/hive/warehouse/test_lzo  /data/test1 (18/12/07 18:51:00 INFO mapreduce.JobSubmitter: number of splits:2)
[[email protected] mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /user/hive/warehouse/test_lzo/part-m-00000.lzo /data/test2(18/12/07 21:39:52 INFO mapreduce.JobSubmitter: number of splits:1)
From the above: running the wordcount MR job on the single file /test_lzo/part-m-00000.lzo yields 1 split, while running it on the /test_lzo directory (which holds both the .lzo file and its .lzo.index) yields 2 splits.

Note: because an indexed .lzo file was always split into exactly two pieces regardless of its size, I followed another blog post and added one more parameter to mapred-site.xml, as follows:

<property>
<name>mapred.child.env</name>
<value>LD_LIBRARY_PATH=/usr/local/lzo-2.06/lib</value>
</property>

After making the change and restarting Hadoop, the problem remained: with the index present the split count was still 2.

Summary

If an index is built only for the .lzo file, the wordcount MR job mistakes the .index file for another input file, which is why the tests above kept reporting 2 splits. The fix is to pass a parameter at job submission time that selects the LZO-aware input format, as follows:

[[email protected] data1]$ hdfs dfs -ls /data/ice
-rw-r--r--   1 hadoop supergroup  733104209 2018-12-07 22:24 /data/ice/ice.txt.lzo
-rw-r--r--   1 hadoop supergroup      46464 2018-12-08 08:50 /data/ice/ice.txt.lzo.index
[[email protected] mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /data/ice/ice.txt.lzo /data/ice_out
[[email protected] data1]$ hdfs dfs -du -h /data/ice_out
0        0        /data/ice_out/_SUCCESS
689.9 M  689.9 M  /data/ice_out/part-r-00000.lzo (with that, LZO splitting finally works)
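The -D override above fixes plain MapReduce jobs. For Hive queries over .lzo data, the usual approach (not part of the steps above, so treat it as a sketch) is to declare the table with hadoop-lzo's split-aware input format, so that Hive honors the .lzo.index files instead of reading them as data:

hive (default)> create table test_lzo_split (id varchar(10), first_name varchar(10), last_name varchar(10), sex varchar(5), score varchar(10), copy_id varchar(10))
              > row format delimited fields terminated by '\t'
              > stored as inputformat 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
              > outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';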

References

Main reference
Official site
lzop / lzo
Use LzoCodec for the intermediate map output and LzopCodec for the reduce output
Explains why my .lzo file was split into two pieces even though it was under the 128 MB block size
On how to generate the test data