Hadoop compression: snappy
After downloading Apache hadoop-1.2.1 (the bin.tar.gz distribution) and setting up a cluster, running the wordcount example produced the warning: WARN snappy.LoadSnappy: Snappy native library not loaded.
We want to add snappy compression support to the Hadoop cluster. Many Hadoop distributions, such as Cloudera CDH and Hortonworks HDP, already bundle snappy/lzo compression, but the Apache release packages mostly do not. (The Apache hadoop-1.2.1 RPM (hadoop-1.2.1-1.x86_64.rpm) ships with snappy support, but hadoop-1.2.1-bin.tar.gz has no compression support.)
1. Installing snappy
1. Install gcc/g++ on the OS:
CentOS:
yum -y update gcc
yum -y install gcc gcc-c++
Ubuntu:
apt-get update
apt-get install g++
2. Download the snappy source tarball and extract it. In the extracted directory, run in sequence:
1) ./configure
2) make
3) make check
4) make install
By default snappy installs to /usr/local/lib; running ls /usr/local/lib should show libsnappy.so and related files.
3. Copy the generated libsnappy.so into $HADOOP_HOME/lib/native/Linux-amd64-64 on every node (a copy command is sketched below), then restart the Hadoop cluster. Hadoop now has snappy compression support.
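A minimal sketch of this copy step, assuming the default snappy install location above and the 64-bit Linux native-library layout of hadoop-1.2.1 (run on each node):
cp /usr/local/lib/libsnappy.so* $HADOOP_HOME/lib/native/Linux-amd64-64/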
4. Before running wordcount, set LD_LIBRARY_PATH so that it includes the directory containing libsnappy.so:
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64:/usr/local/lib
5. Run wordcount again; the earlier warning no longer appears.
If the job output is set to snappy compression, the output directory on HDFS will contain a file such as part-r-00000.snappy (one way to enable this from the command line is sketched below).
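A minimal sketch, assuming the examples jar shipped with the hadoop-1.2.1 tarball and hypothetical input/output paths; the property names are the mrV1 ones listed in section 2:
hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
input output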
------ The above was verified on CentOS 6.6 64-bit minimal and Ubuntu 12.04 64-bit server.
2. Using compression in a Hadoop job:
Compression can be configured in mapred-site.xml, or per job in code (a Java sketch follows the property listings below).
Compressing intermediate job results (map output):
---mrV1:
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
---YARN:
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
Compressing the final job output:
---mrV1:
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description> For SequenceFile outputs, what type of compression should be used (NONE, RECORD, or BLOCK). BLOCK is recommended. </description>
</property>
---YARN:
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress.type</name>
<value>BLOCK</value>
<description>For SequenceFile outputs, what type of compression should be used (NONE, RECORD, or BLOCK). BLOCK is recommended.</description>
</property>
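For the per-job route, here is a minimal sketch using the new org.apache.hadoop.mapreduce API shipped with hadoop-1.2.1; the class name, job name, and input/output paths are placeholders, not part of the original setup:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SnappyJobSetup { // hypothetical class name
    public static Job configureJob(String input, String output) throws IOException {
        Configuration conf = new Configuration();
        // Compress intermediate (map) output; property names as in the mrV1 listing above.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = new Job(conf, "wordcount-snappy"); // hypothetical job name
        // Compress the final job output with snappy.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        return job;
    }
}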
3. Code example: reading snappy-compressed output from HDFS
The method needs the following imports (it is assumed to live in some enclosing class):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

/**
 * When running this program, LD_LIBRARY_PATH must include the directory
 * containing libsnappy.so. On Windows, PATH must include the directory
 * containing snappy.dll.
 *
 * @param file hdfs file, such as hdfs://hadoop-master-node:9000/user/hadoop/wordcount/output/part-r-00000.snappy
 * @throws Exception
 */
public void testReadOutput_Snappy2(String file) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://hadoop-master-node:9000");
    FileSystem fs = FileSystem.get(conf);
    // Pick the codec from the file extension (.snappy -> SnappyCodec).
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(new Path(file));
    if (codec == null) {
        System.out.println("Cannot find codec for file " + file);
        return;
    }
    CompressionInputStream in = codec.createInputStream(fs.open(new Path(file)));
    BufferedReader br = null;
    String line;
    try {
        br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } finally {
        // Close the reader if it was created; otherwise close the raw stream.
        if (br != null) {
            br.close();
        } else {
            in.close();
        }
    }
}