Hadoop Support for LZO Compression
1. Prerequisites
- Hadoop compiled and installed
- Java & Maven installed and configured
- Prerequisite libraries installed (a quick toolchain check follows below):
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
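Before starting, it may help to confirm the toolchain actually responds; a minimal sanity check, assuming java, mvn, gcc and hadoop are all on the PATH:
# confirm the build toolchain is in place
java -version
mvn -version
gcc --version
hadoop version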
2. Install lzo
2.1 Download
# Download
wget www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
# Extract
[hadoop@hadoop000 app]$ tar -zxvf lzo-2.06.tar.gz -C ../app
2.2 Set build parameters
[hadoop@hadoop000 app]$ cd lzo-2.06/
[hadoop@hadoop000 lzo-2.06]$ export CFLAGS=-m64
# Create a directory to hold the compiled lzo output
[hadoop@hadoop000 lzo-2.06]$ mkdir complie
# Point the build at the install location
[hadoop@hadoop000 lzo-2.06]$ ./configure --enable-shared --prefix=/home/hadoop/app/lzo-2.06/complie/
# Compile and install
[hadoop@hadoop000 lzo-2.06]$ make && make install
# Check whether the build succeeded; if you see the following, it worked
[hadoop@hadoop000 lzo-2.06]$ cd complie/
[hadoop@hadoop000 complie]$ ll
total 12
drwxrwxr-x 3 hadoop hadoop 4096 Dec 6 17:08 include
drwxrwxr-x 2 hadoop hadoop 4096 Dec 6 17:08 lib
drwxrwxr-x 3 hadoop hadoop 4096 Dec 6 17:08 share
[hadoop@hadoop000 complie]$
3. Install hadoop-lzo
3.1 Download & extract
# Download
[hadoop@hadoop000 soft]$ wget https://github.com/twitter/hadoop-lzo/archive/master.zip
# Extract
[hadoop@hadoop000 soft]$ unzip master
# If unzip is missing, install it with yum
[root@hadoop000 ~]# yum -y install unzip
3.2 Edit pom.xml under hadoop-lzo-master
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>2.6.0</hadoop.current.version> <!-- change this to your Hadoop version -->
    <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
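After editing, it may be worth confirming the value actually matches your cluster (the examples jar used later in this post suggests 2.6.0-cdh5.7.0); a quick check:
# confirm the Hadoop version set in the pom
grep 'hadoop.current.version' pom.xml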
3.3 Export the build environment
[hadoop@hadoop000 app]$ cd hadoop-lzo-master/
[hadoop@hadoop000 hadoop-lzo-master]$ export CFLAGS=-m64
[hadoop@hadoop000 hadoop-lzo-master]$ export CXXFLAGS=-m64
[hadoop@hadoop000 hadoop-lzo-master]$ export C_INCLUDE_PATH=/home/hadoop/app/lzo-2.06/complie/include/ # the include dir of the lzo built in step 2
[hadoop@hadoop000 hadoop-lzo-master]$ export LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib/ # the lib dir of the lzo built in step 2
3.4 Build
[hadoop@hadoop000 hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true
When BUILD SUCCESS appears, the build has succeeded.
3.5 Copy the build artifacts into Hadoop
# Inspect the build output
[hadoop@hadoop000 hadoop-lzo-master]$ ll
total 80
-rw-rw-r-- 1 hadoop hadoop 35147 Oct 13 2017 COPYING
-rw-rw-r-- 1 hadoop hadoop 19753 Dec 6 17:18 pom.xml
-rw-rw-r-- 1 hadoop hadoop 10170 Oct 13 2017 README.md
drwxrwxr-x 2 hadoop hadoop 4096 Oct 13 2017 scripts
drwxrwxr-x 4 hadoop hadoop 4096 Oct 13 2017 src
drwxrwxr-x 10 hadoop hadoop 4096 Dec 6 17:21 target
# Enter target/native/Linux-amd64-64 and run the following
[hadoop@hadoop000 hadoop-lzo-master]$ cd target/native/Linux-amd64-64
[hadoop@hadoop000 Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~
./
./libgplcompression.so
./libgplcompression.so.0
./libgplcompression.la
./libgplcompression.a
./libgplcompression.so.0.0.0
[hadoop@hadoop000 Linux-amd64-64]$ cp ~/libgplcompression* $HADOOP_HOME/lib/native/
# Important: hadoop-lzo-0.4.21-SNAPSHOT.jar must be copied into the Hadoop installation
[hadoop@hadoop000 hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
[hadoop@hadoop000 hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/mapreduce/lib
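Before moving on to the Hadoop config, a quick check that the native library and the jar landed where Hadoop looks for them; a sketch, assuming $HADOOP_HOME is set:
# confirm the native libs and the hadoop-lzo jar are in place
ls $HADOOP_HOME/lib/native/ | grep gplcompression
ls $HADOOP_HOME/share/hadoop/common/ | grep hadoop-lzo
ls $HADOOP_HOME/share/hadoop/mapreduce/lib/ | grep hadoop-lzo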
4. Configure Hadoop
4.1 Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# Add the lib dir of the compiled lzo
export LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib
4.2 Edit $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
4.3 Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
<name>mapred.child.env</name>
<value>LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib</value>
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
4.4 Restart Hadoop
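The new settings only take effect after a restart; a minimal sketch using the stock sbin scripts, assuming a single-node HDFS + YARN setup:
# restart HDFS and YARN to pick up the codec configuration
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh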
5. Using LZO in Hadoop
5.1 Prepare the data
I have a large data file ready; let's compress it with lzo.
# Original size
[hadoop@hadoop000 data]$ ls -lh
total 5.0G
-rw-r--r-- 1 hadoop hadoop 5.0G Dec 5 17:58 access.20161111.log
[hadoop@hadoop000 data]$
# Compress with lzop
lzop access.20161111.log
# Size after compression
[hadoop@hadoop000 data]$ ls -lh
total 5.9G
-rw-r--r-- 1 hadoop hadoop 5.0G Dec 5 17:58 access.20161111.log
-rw-r--r-- 1 hadoop hadoop 878M Dec 5 17:58 access.20161111.log.lzo
[hadoop@hadoop000 data]$
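Before uploading, lzop itself can verify the archive it just wrote; a quick integrity check:
# test the compressed file without decompressing it to disk
lzop -t access.20161111.log.lzo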
5.2 Upload the data to HDFS
# Upload
[hadoop@hadoop000 data]$ hdfs dfs -put access.20161111.log.lzo /data
# Verify the upload
[hadoop@hadoop000 data]$ hdfs dfs -ls /data
Found 1 items
-rw-r--r-- 1 hadoop supergroup 920128684 2018-12-06 18:36 /data/access.20161111.log.lzo
[hadoop@hadoop000 data]$
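Since the codecs are now registered in core-site.xml, hdfs dfs -text should decode the file transparently; if the codec setup is wrong this prints binary garbage instead of log lines:
# sanity-check the codec config by decoding a few lines
hdfs dfs -text /data/access.20161111.log.lzo | head -5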
5.3 Run the Hadoop wordcount example
[hadoop@hadoop000 mapreduce]$ hadoop jar \
hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
/data/access.20161111.log.lzo \
/out
Watching the job run, we see number of splits:1, which means Hadoop did not split my lzo file.
18/12/06 18:39:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/06 18:39:00 INFO input.FileInputFormat: Total input paths to process : 1
18/12/06 18:39:00 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/12/06 18:39:00 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
18/12/06 18:39:01 INFO mapreduce.JobSubmitter: number of splits:1
18/12/06 18:39:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1544089631050_0001
18/12/06 18:39:01 INFO impl.YarnClientImpl: Submitted application application_1544089631050_0001
18/12/06 18:39:01 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1544089631050_0001/
18/12/06 18:39:01 INFO mapreduce.Job: Running job: job_1544089631050_0001
5.4 Build an index for the lzo file
From earlier study I know that lzo-compressed input needs an lzo index file in order to be splittable, so next we generate one.
[hadoop@hadoop000 hadoop-2.6.0-cdh5.7.0]$ hadoop jar \
share/hadoop/mapreduce/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/data/access.20161111.log.lzo
[hadoop@hadoop000 mapreduce]$ hdfs dfs -ls /data
Found 2 items
-rw-r--r-- 1 hadoop supergroup 920128684 2018-12-06 18:36 /data/access.20161111.log.lzo
-rw-r--r-- 1 hadoop supergroup 163088 2018-12-06 18:42 /data/access.20161111.log.lzo.index
[hadoop@hadoop000 mapreduce]$
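DistributedLzoIndexer builds the index as a MapReduce job; for a single file, hadoop-lzo also ships a local in-process variant, com.hadoop.compression.lzo.LzoIndexer, invoked the same way:
# index the file locally, without launching a MapReduce job
hadoop jar share/hadoop/mapreduce/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.LzoIndexer \
/data/access.20161111.log.lzo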
The index file has now been generated (see the listing above), so I run wordcount again to see whether Hadoop can split the lzo file.
[hadoop@hadoop000 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /data/access.20161111.log.lzo /out1
18/12/06 18:45:01 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/06 18:45:02 INFO input.FileInputFormat: Total input paths to process : 1
18/12/06 18:45:02 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/12/06 18:45:02 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
18/12/06 18:45:02 INFO mapreduce.JobSubmitter: number of splits:1
18/12/06 18:45:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1544089631050_0003
18/12/06 18:45:03 INFO impl.YarnClientImpl: Submitted application application_1544089631050_0003
18/12/06 18:45:03 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1544089631050_0003/
18/12/06 18:45:03 INFO mapreduce.Job: Running job: job_1544089631050_0003
From the run above, Hadoop still cannot split my lzo file (still number of splits:1), so I kept digging through the docs…
5.5 The fix
Further reading showed that the index alone is not enough; the job being submitted also needs a change:
set the input format to LzoTextInputFormat, otherwise the index file is treated as just another input file and the job still runs a single map.
So I change the submission command to add -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat
[hadoop@hadoop000 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
-Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
/data/access.20161111.log.lzo \
/out3
18/12/06 18:48:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/06 18:48:40 INFO input.FileInputFormat: Total input paths to process : 1
18/12/06 18:48:40 INFO mapreduce.JobSubmitter: number of splits:7
18/12/06 18:48:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1544089631050_0005
18/12/06 18:48:41 INFO impl.YarnClientImpl: Submitted application application_1544089631050_0005
18/12/06 18:48:41 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1544089631050_0005/
18/12/06 18:48:41 INFO mapreduce.Job: Running job: job_1544089631050_0005
^C[hadoop@hadoop000 mapreduce]$
From the output above (number of splits:7), Hadoop is now automatically splitting my lzo file.
Success!
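To double-check the result, list and sample the job output; a sketch, assuming wordcount wrote to /out3 with the default single reducer:
# inspect the wordcount output
hdfs dfs -ls /out3
hdfs dfs -cat /out3/part-r-00000 | head -5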