
Hadoop Support for LZO Compression

1. Prerequisites

  • Hadoop built and installed

  • Java & Maven installed and configured

  • Install the prerequisite libraries

     yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
    

2. Install lzo

2.1 Download

  # Download
  wget www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
  
  # Unpack
  [hadoop@hadoop000 app]$ tar -zxvf lzo-2.06.tar.gz -C ../app

2.2 Set build flags

[hadoop@hadoop000 app]$ cd lzo-2.06/
[hadoop@hadoop000 lzo-2.06]$ export CFLAGS=-m64

# Create a directory to hold the compiled lzo
[hadoop@hadoop000 lzo-2.06]$ mkdir complie

# Point the build at that install location
[hadoop@hadoop000 lzo-2.06]$ ./configure --enable-shared --prefix=/home/hadoop/app/lzo-2.06/complie/

# Build and install
[hadoop@hadoop000 lzo-2.06]$ make && make install

# Check that the build succeeded; output like the following is enough
[hadoop@hadoop000 lzo-2.06]$ cd complie/
[hadoop@hadoop000 complie]$ ll
total 12
drwxrwxr-x 3 hadoop hadoop 4096 Dec 6 17:08 include
drwxrwxr-x 2 hadoop hadoop 4096 Dec 6 17:08 lib
drwxrwxr-x 3 hadoop hadoop 4096 Dec 6 17:08 share
[hadoop@hadoop000 complie]$

3. Install hadoop-lzo

3.1 Download & unpack

# Download
[hadoop@hadoop000 soft]$ wget https://github.com/twitter/hadoop-lzo/archive/master.zip

# Unpack
[hadoop@hadoop000 soft]$ unzip master

# If unzip is not found, install it with yum
[root@hadoop000 ~]# yum -y install unzip

3.2 Edit pom.xml under hadoop-lzo-master

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <!-- change this to match your Hadoop version -->
    <hadoop.current.version>2.6.0</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
  </properties>

3.3 Add build configuration

[hadoop@hadoop000 app]$ cd hadoop-lzo-master/
[hadoop@hadoop000 hadoop-lzo-master]$ export CFLAGS=-m64
[hadoop@hadoop000 hadoop-lzo-master]$ export CXXFLAGS=-m64
[hadoop@hadoop000 hadoop-lzo-master]$ export C_INCLUDE_PATH=/home/hadoop/app/lzo-2.06/complie/include/   # the include directory of the compiled lzo
[hadoop@hadoop000 hadoop-lzo-master]$ export LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib/         # the lib directory of the compiled lzo

3.4 Build

[hadoop@hadoop000 hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true

When BUILD SUCCESS appears, the build worked.

3.5 Run the following steps

# Inspect the build output
[hadoop@hadoop000 hadoop-lzo-master]$ ll
total 80
-rw-rw-r--  1 hadoop hadoop 35147 Oct 13  2017 COPYING
-rw-rw-r--  1 hadoop hadoop 19753 Dec  6 17:18 pom.xml
-rw-rw-r--  1 hadoop hadoop 10170 Oct 13  2017 README.md
drwxrwxr-x  2 hadoop hadoop  4096 Oct 13  2017 scripts
drwxrwxr-x  4 hadoop hadoop  4096 Oct 13  2017 src
drwxrwxr-x 10 hadoop hadoop  4096 Dec  6 17:21 target

# Enter target/native/Linux-amd64-64 and run the following
[hadoop@hadoop000 hadoop-lzo-master]$ cd target/native/Linux-amd64-64
[hadoop@hadoop000 Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~
./
./libgplcompression.so
./libgplcompression.so.0
./libgplcompression.la
./libgplcompression.a
./libgplcompression.so.0.0.
[hadoop@hadoop000 Linux-amd64-64]$ cp ~/libgplcompression* $HADOOP_HOME/lib/native/
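The `tar -cBf - -C lib . | tar -xBvf - -C ~` pipeline above is the classic "tar pipe" idiom for copying a directory tree while preserving permissions and symlinks, which matters here because several of the `libgplcompression.so*` names are symlinks to the real shared library. A minimal reproduction on a throwaway tree (the `src`/`dst`/`libdemo` names are hypothetical, for illustration only):

```shell
#!/bin/sh
# Demonstrate the tar-pipe copy idiom on a temporary directory tree.
set -e
work=$(mktemp -d)
mkdir -p "$work/src" "$work/dst"

# A regular file plus a symlink, mimicking libgplcompression.so -> .so.0.0.0
echo "native lib" > "$work/src/libdemo.so.0.0.0"
ln -s libdemo.so.0.0.0 "$work/src/libdemo.so"

# One tar packs src onto stdout; the other unpacks from stdin into dst.
# -C changes directory first, so only relative paths enter the stream.
tar -cf - -C "$work/src" . | tar -xf - -C "$work/dst"

ls -l "$work/dst"           # libdemo.so arrives still as a symlink
cat "$work/dst/libdemo.so"  # prints: native lib
rm -rf "$work"
```

The `-B` flag in the original command just reblocks the stream for reading from a pipe; the copy works the same way without it on GNU tar.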


# This step is important: copy hadoop-lzo-0.4.21-SNAPSHOT.jar into Hadoop
[hadoop@hadoop000 hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
[hadoop@hadoop000 hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/mapreduce/lib

4. Configure Hadoop

4.1 Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# Add the lib directory of the compiled lzo
export LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib

4.2 Edit $HADOOP_HOME/etc/hadoop/core-site.xml

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec
    </value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

4.3 Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
    <name>mapred.child.env</name>
    <value>LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib</value>
</property>
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

4.4 Restart Hadoop

5. Using LZO in Hadoop

5.1 Prepare the data

I have a large data file prepared; let's compress it with lzo.

# Original size
[hadoop@hadoop000 data]$ ls -lh
total 5.0G
-rw-r--r-- 1 hadoop hadoop 5.0G Dec  5 17:58 access.20161111.log

# Compress with lzo
[hadoop@hadoop000 data]$ lzop access.20161111.log

# Size after compression
[hadoop@hadoop000 data]$ ls -lh
total 5.9G
-rw-r--r-- 1 hadoop hadoop 5.0G Dec  5 17:58 access.20161111.log
-rw-r--r-- 1 hadoop hadoop 878M Dec  5 17:58 access.20161111.log.lzo
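A quick sanity check on the numbers in the listing: 878 MB from a 5.0 GB original means the file compressed to roughly 17% of its size, which is in the range you would expect from LZO on text logs (LZO trades compression ratio for speed). In shell arithmetic:

```shell
#!/bin/sh
# Rough compression-ratio check using the sizes from `ls -lh` above.
original_mb=$((5 * 1024))   # 5.0G original, in MB
compressed_mb=878           # the .lzo file, in MB
ratio=$((compressed_mb * 100 / original_mb))
echo "compressed to ${ratio}% of original size"   # → compressed to 17% of original size
```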

5.2 Upload the data to HDFS

# Upload
[hadoop@hadoop000 data]$ hdfs dfs -put access.20161111.log.lzo /data

# Verify the upload
[hadoop@hadoop000 data]$ hdfs dfs -ls /data
Found 1 items
-rw-r--r--   1 hadoop supergroup  920128684 2018-12-06 18:36 /data/access.20161111.log.lzo

5.3 Run the Hadoop wordcount application

[hadoop@hadoop000 mapreduce]$ hadoop jar \
hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
/data/access.20161111.log.lzo \
/out

Looking at the job output, you can see **number of splits:1**, which means Hadoop did not split my lzo file.


18/12/06 18:39:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/06 18:39:00 INFO input.FileInputFormat: Total input paths to process : 1
18/12/06 18:39:00 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/12/06 18:39:00 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
18/12/06 18:39:01 INFO mapreduce.JobSubmitter: number of splits:1
18/12/06 18:39:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1544089631050_0001
18/12/06 18:39:01 INFO impl.YarnClientImpl: Submitted application application_1544089631050_0001
18/12/06 18:39:01 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1544089631050_0001/
18/12/06 18:39:01 INFO mapreduce.Job: Running job: job_1544089631050_0001

5.4 Create an index for the lzo file

From earlier study I know that a splittable lzo file needs an lzo index file, so next let's generate the index.

[hadoop@hadoop000 hadoop-2.6.0-cdh5.7.0]$ hadoop jar \
share/hadoop/mapreduce/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/data/access.20161111.log.lzo

[hadoop@hadoop000 mapreduce]$ hdfs dfs -ls /data
Found 2 items
-rw-r--r--   1 hadoop supergroup  920128684 2018-12-06 18:36 /data/access.20161111.log.lzo
-rw-r--r--   1 hadoop supergroup     163088 2018-12-06 18:42 /data/access.20161111.log.lzo.index
[hadoop@hadoop000 mapreduce]$

The index file has now been generated, as shown above. Let's run wordcount again and see whether Hadoop can split the lzo file this time.

[hadoop@hadoop000 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /data/access.20161111.log.lzo /out1
18/12/06 18:45:01 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/06 18:45:02 INFO input.FileInputFormat: Total input paths to process : 1
18/12/06 18:45:02 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/12/06 18:45:02 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
18/12/06 18:45:02 INFO mapreduce.JobSubmitter: number of splits:1
18/12/06 18:45:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1544089631050_0003
18/12/06 18:45:03 INFO impl.YarnClientImpl: Submitted application application_1544089631050_0003
18/12/06 18:45:03 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1544089631050_0003/
18/12/06 18:45:03 INFO mapreduce.Job: Running job: job_1544089631050_0003

The run above still shows number of splits:1, so Hadoop is still not splitting my lzo file. Back to the documentation…

5.5 The fix

Further reading shows that the index alone is not enough: the job being submitted also has to be changed so that its input format is LzoTextInputFormat. Otherwise the index file is treated as just another input file, and a single map still processes everything.

So modify the job submission by adding -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat

[hadoop@hadoop000 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
-Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
/data/access.20161111.log.lzo \
/out3
18/12/06 18:48:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/06 18:48:40 INFO input.FileInputFormat: Total input paths to process : 1
18/12/06 18:48:40 INFO mapreduce.JobSubmitter: number of splits:7
18/12/06 18:48:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1544089631050_0005
18/12/06 18:48:41 INFO impl.YarnClientImpl: Submitted application application_1544089631050_0005
18/12/06 18:48:41 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1544089631050_0005/
18/12/06 18:48:41 INFO mapreduce.Job: Running job: job_1544089631050_0005

The output above shows that Hadoop now splits my lzo file automatically (number of splits:7).
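The split count also lines up with the file size: once the index exists, the lzo file is split near HDFS block boundaries, so for the 920,128,684-byte file and a 128 MB block size (an assumption here; check dfs.blocksize on your cluster) you would expect ceil(920128684 / 134217728) = 7 splits:

```shell
#!/bin/sh
# Expected split count = file size / HDFS block size, rounded up.
# 128 MB is an assumed default block size; verify dfs.blocksize locally.
file_bytes=920128684
block_bytes=$((128 * 1024 * 1024))
splits=$(( (file_bytes + block_bytes - 1) / block_bytes ))
echo "expected splits: $splits"   # → expected splits: 7
```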

Success!