
Hadoop LZO Compression Configuration

I. Building hadoop-lzo

Hadoop does not support LZO compression out of the box, so the open-source hadoop-lzo component from Twitter is used. hadoop-lzo must be compiled against both Hadoop and LZO; the build steps are as follows.

  1. Prepare the environment
    Maven (download and install it, configure the environment variables, and add the Aliyun mirror to settings.xml)
    gcc-c++
    zlib-devel
    autoconf
    automake
    libtool
    The native build tools can all be installed via yum: yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool

  2. Download, build, and install LZO

    wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz

    tar -zxvf lzo-2.10.tar.gz

    cd lzo-2.10

    ./configure --prefix=/usr/local/hadoop/lzo/

    make

    make install

  3. Build the hadoop-lzo source

    3.1 Download the hadoop-lzo source from: https://github.com/twitter/hadoop-lzo/archive/master.zip

    3.2 After unpacking, set the Hadoop version in pom.xml:

       <hadoop.current.version>3.1.3</hadoop.current.version>

    3.3 Declare two temporary environment variables

    export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include
    export LIBRARY_PATH=/usr/local/hadoop/lzo/lib
    

    3.4 Build
    Enter hadoop-lzo-master and run the Maven build:

    mvn package -Dmaven.test.skip=true
    

    3.5 In the target/ directory, hadoop-lzo-0.4.21-SNAPSHOT.jar is the successfully built hadoop-lzo component.

II. Hadoop Configuration

1) Place the built hadoop-lzo jar (named hadoop-lzo-0.4.20.jar here; the exact name depends on the version you built, e.g. hadoop-lzo-0.4.21-SNAPSHOT.jar) into hadoop-3.1.3/share/hadoop/common/

[user@hadoop102 common]$ pwd
/opt/module/hadoop-3.1.3/share/hadoop/common
[user@hadoop102 common]$ ls
hadoop-lzo-0.4.20.jar

2) Sync hadoop-lzo-0.4.20.jar to hadoop103 and hadoop104

[user@hadoop102 common]$ xsync hadoop-lzo-0.4.20.jar

3) Add the following configuration to core-site.xml to enable LZO compression

<configuration>
    <property>
        <name>io.compression.codecs</name>
        <value>
            org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            org.apache.hadoop.io.compress.SnappyCodec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec
        </value>
    </property>

    <property>
        <name>io.compression.codec.lzo.class</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
</configuration>
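Hadoop reads io.compression.codecs as a comma-separated list of codec class names and trims surrounding whitespace, which is why the multi-line layout above works. A rough Python sketch of that parsing (illustrative only, not Hadoop's actual code):

```python
# The property value as it appears in core-site.xml above.
raw = """
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.BZip2Codec,
    org.apache.hadoop.io.compress.SnappyCodec,
    com.hadoop.compression.lzo.LzoCodec,
    com.hadoop.compression.lzo.LzopCodec
"""

# Split on commas and strip whitespace/newlines, roughly as Hadoop's
# configuration parser does when loading this property.
codecs = [c.strip() for c in raw.split(",") if c.strip()]
print(len(codecs))    # 6 codec classes
print(codecs[-1])     # com.hadoop.compression.lzo.LzopCodec
```

Note that each class name must survive the trimming intact: a typo or a stray character inside a name will cause a ClassNotFoundException at job submission time.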

4) Sync core-site.xml to hadoop103 and hadoop104

[user@hadoop102 hadoop]$ xsync core-site.xml

5) Start and inspect the cluster

[user@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[user@hadoop102 hadoop-3.1.3]$ sbin/start-yarn.sh

III. Creating an Index for LZO Files

  1. Create an index for the LZO file. An LZO-compressed file is splittable only through its index, so the index must be created manually; without one, the LZO file yields a single split.

    hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer big_file.lzo
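The effect of the index can be sketched with some back-of-the-envelope arithmetic (a simplified model; the real split calculation also honors minimum split sizes and the exact block offsets recorded in the index):

```python
import math

def lzo_splits(file_mb: float, block_mb: float = 128.0, indexed: bool = False) -> int:
    """Simplified model of how many input splits an .lzo file yields.

    Without an index the whole file is one split, because an LZO stream
    cannot be decompressed from an arbitrary offset; with a .lzo.index
    file the splits roughly follow HDFS block boundaries.
    """
    if not indexed:
        return 1
    return math.ceil(file_mb / block_mb)

# The 200 MB bigtable.lzo used in the test below, with the default 128 MB block size:
print(lzo_splits(200))                # 1 split  -> one map task
print(lzo_splits(200, indexed=True))  # 2 splits -> two map tasks
```

This is exactly what the test below demonstrates: the same job runs with one map task before indexing and two afterwards.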
    
  2. Test
    (1) Upload bigtable.lzo (200 MB) to the root directory of the cluster

    [user@hadoop102 module]$ hadoop fs -mkdir /input
    [user@hadoop102 module]$ hadoop fs -put bigtable.lzo /input
    

    (2) Run the wordcount program

    [user@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /input /output1
    


    (3) Create an index for the uploaded LZO file

    [user@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /input/bigtable.lzo
    

    (4) Run the wordcount program again

    [user@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /input /output2
    


  3. Note: if any of the jobs above fails with an exception like the following:

    Container [pid=8468,containerID=container_1594198338753_0001_01_000002] is running 318740992B beyond the 'VIRTUAL' memory limit. Current usage: 111.5 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
    Dump of the process-tree for container_1594198338753_0001_01_000002 :

Solution: add the following configuration to /opt/module/hadoop-3.1.3/etc/hadoop/yarn-site.xml on hadoop102, distribute it to hadoop103 and hadoop104, and restart the cluster.

<!-- Whether to run a thread that checks the physical memory each task is using; if a task exceeds its allocation, it is killed immediately. Default: true -->
<property>
   <name>yarn.nodemanager.pmem-check-enabled</name>
   <value>false</value>
</property>

<!-- Whether to run a thread that checks the virtual memory each task is using; if a task exceeds its allocation, it is killed immediately. Default: true -->
<property>
   <name>yarn.nodemanager.vmem-check-enabled</name>
   <value>false</value>
</property>
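The numbers in the error message follow from YARN's defaults: the virtual-memory ceiling for a container is its physical-memory allocation multiplied by yarn.nodemanager.vmem-pmem-ratio (default 2.1). A quick sketch of that check (illustrative arithmetic only, not NodeManager code):

```python
def vmem_limit_gb(pmem_alloc_gb: float, vmem_pmem_ratio: float = 2.1) -> float:
    """Virtual-memory ceiling YARN enforces for a container:
    physical allocation x yarn.nodemanager.vmem-pmem-ratio (default 2.1)."""
    return pmem_alloc_gb * vmem_pmem_ratio

# The container from the error above: 1 GB physical allocation, 2.4 GB vmem used.
limit = vmem_limit_gb(1.0)   # 2.1 GB, matching "2.4 GB of 2.1 GB virtual memory used"
print(limit)
print(2.4 > limit)           # exceeding the ceiling is what gets the container killed
```

Disabling the checks, as above, is the bluntest fix; a gentler alternative is to raise yarn.nodemanager.vmem-pmem-ratio or increase the container's physical-memory allocation so the ceiling is not exceeded.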