Configuring LZO Compression for Hadoop
1. Building hadoop-lzo
Hadoop does not support LZO compression out of the box, so the open-source hadoop-lzo component provided by Twitter is required. hadoop-lzo must be compiled against both Hadoop and LZO; the build steps are as follows.
- Prepare the environment
Maven (download and install it, configure the environment variables, and add the Aliyun mirror to settings.xml)
gcc-c++
zlib-devel
autoconf
automake
libtool
The build tools can all be installed via yum: yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool
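The "Aliyun mirror" mentioned above goes in the `<mirrors>` section of Maven's settings.xml (typically `~/.m2/settings.xml` or `$MAVEN_HOME/conf/settings.xml`). A typical entry looks like the following sketch; the `id` and `name` values are arbitrary labels, and the URL is Aliyun's public repository:

```xml
<!-- Inside the <mirrors> section of settings.xml -->
<mirror>
  <id>aliyun</id>
  <mirrorOf>central</mirrorOf>
  <name>Aliyun public mirror</name>
  <url>https://maven.aliyun.com/repository/public</url>
</mirror>
```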
- Download, build, and install LZO
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
tar -zxvf lzo-2.10.tar.gz
cd lzo-2.10
./configure --prefix=/usr/local/hadoop/lzo/
make
make install
- Build the hadoop-lzo source
2.1 Download the hadoop-lzo source from: https://github.com/twitter/hadoop-lzo/archive/master.zip
2.2 After unpacking, edit pom.xml and set the Hadoop version property to match your cluster, e.g.:
<hadoop.current.version>3.1.3</hadoop.current.version>
2.3 Declare two temporary environment variables:
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include
export LIBRARY_PATH=/usr/local/hadoop/lzo/lib
2.4 Build
Enter the hadoop-lzo-master directory and run the Maven build command: mvn package -Dmaven.test.skip=true
2.5 Enter the target directory; hadoop-lzo-0.4.21-SNAPSHOT.jar is the successfully built hadoop-lzo component. (The steps below use a jar named hadoop-lzo-0.4.20.jar; the version in the file name depends on the source you built.)
2. Hadoop Configuration
1) Copy the compiled hadoop-lzo-0.4.20.jar into hadoop-3.1.3/share/hadoop/common/
[user@hadoop102 common]$ pwd
/opt/module/hadoop-3.1.3/share/hadoop/common
[user@hadoop102 common]$ ls
hadoop-lzo-0.4.20.jar
2) Sync hadoop-lzo-0.4.20.jar to hadoop103 and hadoop104
[user@hadoop102 common]$ xsync hadoop-lzo-0.4.20.jar
3) Add the following to core-site.xml to enable LZO compression
<configuration>
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
</configuration>
4) Sync core-site.xml to hadoop103 and hadoop104
[user@hadoop102 hadoop]$ xsync core-site.xml
5) Start and inspect the cluster
[user@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[user@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
3. Project Experience: Creating LZO Indexes
- Create an index for the LZO file
The splittability of an LZO-compressed file depends on its index, so we must build the index manually. Without an index, an LZO file yields only a single input split.
hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer big_file.lzo
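The indexer writes a `.index` file next to the `.lzo` file, recording the byte offset of each LZO block; the input format can then align MapReduce splits to those block boundaries. The following is a toy sketch of that alignment idea only (hypothetical offsets and function name, not the actual LzoTextInputFormat code):

```python
def align_splits(file_size, desired_split, block_offsets):
    """Align split boundaries to known LZO block offsets (sketch).

    block_offsets: byte offsets where LZO blocks start, as recorded in
    the .index file. With no offsets, the file cannot be split.
    """
    splits = []
    start = 0
    while start < file_size:
        target = start + desired_split
        # End the split at the first block boundary at or past the target;
        # with no index there is no such boundary, so it runs to EOF.
        end = min(next((o for o in block_offsets if o >= target), file_size),
                  file_size)
        splits.append((start, end))
        start = end
    return splits

# Without an index: one split covering the whole file.
print(align_splits(200, 64, []))            # [(0, 200)]
# With indexed block offsets: boundary-aligned splits of roughly 64 units.
print(align_splits(200, 64, [0, 70, 140]))  # [(0, 70), (70, 140), (140, 200)]
```

This illustrates why the unindexed 200 MB test file below runs as a single map task, while the indexed run can fan out into multiple splits.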
- Test
(1) Upload bigtable.lzo (200 MB) to the root directory of the cluster:
[user@hadoop102 module]$ hadoop fs -mkdir /input
[user@hadoop102 module]$ hadoop fs -put bigtable.lzo /input
(2) Run the wordcount program
[user@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /input /output1
(3) Build an index for the uploaded LZO file
[user@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /input/bigtable.lzo
(4) Run the WordCount program again; with the index in place, the file can now be processed as multiple splits
[user@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /input /output2
Note: if any of the jobs above fails during execution with an exception like the following:
Container [pid=8468,containerID=container_1594198338753_0001_01_000002] is running 318740992B beyond the 'VIRTUAL'
memory limit. Current usage: 111.5 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used.
Killing container.
Dump of the process-tree for container_1594198338753_0001_01_000002 :
Solution: add the following configuration to /opt/module/hadoop-3.1.3/etc/hadoop/yarn-site.xml on hadoop102, distribute the file to hadoop103 and hadoop104, and restart the cluster.
<!-- Whether to run a thread that checks each task's physical memory usage and kills any task exceeding its allocation; default is true -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Whether to run a thread that checks each task's virtual memory usage and kills any task exceeding its allocation; default is true -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
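Rather than disabling the checks outright, a gentler option (my suggestion, not from the original write-up) is to raise the ratio of allowed virtual memory to allocated physical memory, which defaults to 2.1:

```xml
<!-- Allow each container up to 4x its physical-memory allocation
     in virtual memory (default is 2.1) -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
```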