Hadoop: A Complete Guide to Configuring LZO Compression
By 阿新 · Posted 2021-01-10
Tags: Hadoop, HDFS, big data, MapReduce
1 Preface
Beyond Gzip, DEFLATE, BZip2 and the like, Hadoop does not support LZO compression out of the box. If we want HDFS to support LZO (a splittable compression format, though with a relatively low compression ratio), we need to bring in Twitter's hadoop-lzo; see https://github.com/twitter/hadoop-lzo/ for reference.
2 Building and Configuring hadoop-lzo
2.1 Environment preparation
- maven (download and install it, set the environment variables, and switch to the Aliyun mirror)
- gcc-c++
- lzo-devel
- zlib-devel
- autoconf
- automake
- libtool
# apart from maven, the prerequisites can all be installed with yum
yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool
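Before building, it helps to confirm the toolchain is actually on the PATH. A minimal sketch; the helper name `check_tools` is ours, not part of any Hadoop tooling:

```shell
#!/bin/sh
# Report which build prerequisites are present on PATH and flag missing ones.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "MISSING: $tool" >&2
      missing=1
    fi
  done
  return $missing
}

# the build below needs at least these (mvn comes from the maven install):
check_tools mvn gcc make autoconf automake libtool || echo "install the missing tools before continuing"
```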
2.2 Download, install and build lzo
#step1
wget https://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
#step2
tar -zxvf lzo-2.10.tar.gz -C /opt/module
#step3
cd /opt/module/lzo-2.10
#step4
./configure --enable-shared --prefix=/usr/local/hadoop/lzo
#step5
make
make install
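After `make install`, the headers and shared library should sit under the prefix given to `configure`. A small hedged sanity check; the helper `lzo_installed` is ours:

```shell
#!/bin/sh
# Check that an lzo install prefix contains the header and shared library
# that the hadoop-lzo build will look for.
lzo_installed() {
  prefix=$1
  [ -f "$prefix/include/lzo/lzo1x.h" ] && ls "$prefix"/lib/liblzo2.* >/dev/null 2>&1
}

# e.g. lzo_installed /usr/local/hadoop/lzo && echo "lzo OK"
```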
2.3 Build the hadoop-lzo source
#step1 download the hadoop-lzo source
wget https://github.com/twitter/hadoop-lzo/archive/master.zip
#step2 unzip, then edit pom.xml to match your Hadoop version
<hadoop.current.version>2.7.7</hadoop.current.version>
#step3 declare two temporary environment variables so the build can find lzo
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include
export LIBRARY_PATH=/usr/local/hadoop/lzo/lib
#step4 build
cd /opt/module/hadoop-lzo-master
mvn package -Dmaven.test.skip=true
#step5 enter the target directory
#hadoop-lzo-0.4.21-SNAPSHOT.jar is the successfully built hadoop-lzo artifact
pwd
>>>>>
/opt/module/hadoop-lzo-master/target
ls -ahl
>>>>>
drwxr-xr-x. 2 root root 4096 Jan 9 15:19 antrun
drwxr-xr-x. 4 root root 4096 Jan 9 15:20 apidocs
drwxr-xr-x. 5 root root 77 Jan 9 15:19 classes
drwxr-xr-x. 3 root root 25 Jan 9 15:19 generated-sources
-rw-r--r--. 1 root root 188965 Jan 9 15:19 hadoop-lzo-0.4.21-SNAPSHOT.jar
-rw-r--r--. 1 root root 180845 Jan 9 15:20 hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar
-rw-r--r--. 1 root root 52042 Jan 9 15:19 hadoop-lzo-0.4.21-SNAPSHOT-sources.jar
drwxr-xr-x. 2 root root 71 Jan 9 15:20 javadoc-bundle-options
drwxr-xr-x. 2 root root 28 Jan 9 15:19 maven-archiver
drwxr-xr-x. 3 root root 28 Jan 9 15:19 native
drwxr-xr-x. 3 root root 18 Jan 9 15:19 test-classes
2.4 Copy the built jar into Hadoop's common directory
cp hadoop-lzo-0.4.21-SNAPSHOT.jar /opt/module/hadoop-2.7.7/share/hadoop/common/
2.5 Distribute the jar to the other nodes
cd /opt/module/hadoop-2.7.7/share/hadoop/common/
xsync hadoop-lzo-0.4.21-SNAPSHOT.jar
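Note that `xsync` is a site-local rsync wrapper common in Hadoop course setups, not a standard Hadoop tool. If your cluster does not have one, a minimal sketch looks like this; the hostnames hadoop103/hadoop104 are placeholders for your worker nodes:

```shell
#!/bin/sh
# xsync-style sketch: push a file to the same absolute path on each worker node.
xsync_sketch() {
  file=$1
  if [ ! -e "$file" ]; then
    echo "usage: xsync_sketch <existing-file>" >&2
    return 1
  fi
  dir=$(cd "$(dirname "$file")" && pwd)
  for host in hadoop103 hadoop104; do   # placeholder hostnames
    rsync -av "$dir/$(basename "$file")" "$host:$dir/"
  done
}
```

Run it from the common directory, e.g. `xsync_sketch hadoop-lzo-0.4.21-SNAPSHOT.jar`.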
2.6 Configure LZO in core-site.xml and distribute it
<!--vim core-site.xml-->
<!-- enable Hadoop's LZO compression support -->
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
xsync core-site.xml
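The core-site.xml entries above only register the codecs. If you also want intermediate (map-side) output compressed with LZO, the standard Hadoop 2.x properties below can be added to mapred-site.xml; treat this as an optional sketch that the verification in section 3 does not depend on:

```xml
<!-- optional: compress map output with LZO (mapred-site.xml) -->
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```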
3 Verifying that LZO is configured correctly
The verification flow is as follows.
#step1 create a local file test.txt
touch test.txt
echo hello hadoop hello lzo >> test.txt
#step2 upload it to HDFS
hadoop fs -mkdir /input
hadoop fs -put test.txt /input
#step3 run wordcount from the bundled MapReduce examples jar, writing the result to /output; output-side compression must be enabled with mapreduce.output.fileoutputformat.compress=true and mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec
hadoop jar /opt/module/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec /input /output
If the HDFS web UI then shows the /output files with a .lzo extension, the LZO setup succeeded. Keep in mind:
- what we usually call LZO compression uses com.hadoop.compression.lzo.LzopCodec, which produces xx.lzo files
- if com.hadoop.compression.lzo.LzoCodec is used instead, the generated files end in xx.lzo_deflate
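One more point on splittability: a .lzo file on HDFS only becomes splittable after an index is built for it, and hadoop-lzo ships an indexer class for that. A hedged sketch, with the jar path matching where we copied it in section 2.4 and the output filename assumed to be part-r-00000.lzo:

```shell
#!/bin/sh
# Build a .index file next to a .lzo file on HDFS so MapReduce can split it.
index_lzo() {
  hadoop jar /opt/module/hadoop-2.7.7/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer "$1"
}

# e.g. index_lzo /output/part-r-00000.lzo
```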