
Hadoop compression: snappy

After downloading Apache hadoop-1.2.1 (the bin.tar.gz release) and setting up a cluster, running the wordcount example produces the warning: WARN snappy.LoadSnappy: Snappy native library not loaded.

We want to add snappy compression support to the Hadoop cluster. Many Hadoop distributions ship with snappy/lzo compression built in, for example Cloudera CDH and Hortonworks HDP, but the Apache release tarballs mostly do not. (The Apache hadoop-1.2.1 RPM (hadoop-1.2.1-1.x86_64.rpm) does include snappy support, but hadoop-1.2.1-bin.tar.gz does not.)

1. Installing snappy

1. Install g++ on the OS:

centos:
yum -y install gcc gcc-c++

ubuntu:
apt-get update
apt-get install g++

2. Download and extract the snappy source, then run the following in the extracted directory, in order:

1) ./configure

2) make

3) make check

4) make install

snappy installs to /usr/local/lib by default; ls /usr/local/lib should show libsnappy.so and related files.

3. Copy the generated libsnappy.so into $HADOOP_HOME/lib/native/Linux-amd64-64 and restart the Hadoop cluster. Hadoop now has snappy compression support.

4. Before running wordcount, set LD_LIBRARY_PATH so that it includes the directory containing libsnappy.so ( export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64:/usr/local/lib ).

5. Run wordcount again; the earlier warning no longer appears.

If the job output is configured for snappy compression, the output directory on HDFS will contain a file such as part-r-00000.snappy.

------ The above was verified on CentOS 6.6 64-bit minimal and Ubuntu 12.04 64-bit server.

2. Using compression in a Hadoop job:

Compression can be configured in mapred-site.xml, or per job in code.
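For drivers that parse generic options via ToolRunner/GenericOptionsParser (the bundled wordcount example does), the same properties can also be passed per job on the command line with -D. A sketch; the examples jar name and HDFS paths are assumptions for a typical hadoop-1.2.1 install:

```shell
# Enable snappy compression of the final output for this run only.
hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
    /user/hadoop/wordcount/input /user/hadoop/wordcount/output
```

Command-line -D settings override mapred-site.xml for that job, which is convenient for testing before committing the change cluster-wide.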

Compress the job's intermediate results (map output):

---mrV1:

<property>
  <name>mapred.compress.map.output</name>  
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>  
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

---YARN:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

Compress the final job output:

---mrV1:

<property>     
  <name>mapred.output.compress</name>
  <value>true</value>   
</property>   

<property>     
   <name>mapred.output.compression.codec</name>
   <value>org.apache.hadoop.io.compress.SnappyCodec</value>   
</property>

<property>     
   <name>mapred.output.compression.type</name>
   <value>BLOCK</value>  
   <description> For SequenceFile outputs, what type of compression should be used (NONE, RECORD, or BLOCK). BLOCK is recommended. </description>
</property>

---YARN:

<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>

<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
  <description>For SequenceFile outputs, what type of compression should be used (NONE, RECORD, or BLOCK). BLOCK is recommended.</description>
</property>
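The same settings can be applied per job in the driver code instead of mapred-site.xml. A minimal sketch using the new (org.apache.hadoop.mapreduce) API; the job name is a placeholder, and the rest of the driver (mapper, reducer, input/output paths) is unchanged:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

Configuration conf = new Configuration();
// Intermediate (map output) compression -- same keys as the YARN XML above.
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

Job job = new Job(conf, "wordcount");
// Final output compression.
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
// Only meaningful when the job writes SequenceFile output.
SequenceFileOutputFormat.setOutputCompressionType(job,
        SequenceFile.CompressionType.BLOCK);
```

Note that map-output settings must go into the Configuration before the Job is constructed, since Job copies the configuration.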

3. Example code: reading snappy-compressed results from HDFS

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

/**
 * At run time LD_LIBRARY_PATH must include the directory containing
 * libsnappy.so; on Windows, PATH must include the directory containing
 * snappy.dll.
 *
 * @param file an HDFS file, such as hdfs://hadoop-master-node:9000/user/hadoop/wordcount/output/part-r-00000.snappy
 * @throws Exception
 */
public void testReadOutput_Snappy2(String file) throws Exception {
	Configuration conf = new Configuration();
	conf.set("fs.default.name", "hdfs://hadoop-master-node:9000");
	FileSystem fs = FileSystem.get(conf);
	// Pick the codec from the file extension (.snappy -> SnappyCodec).
	CompressionCodecFactory factory = new CompressionCodecFactory(conf);
	CompressionCodec codec = factory.getCodec(new Path(file));
	if (codec == null) {
		System.out.println("Cannot find codec for file " + file);
		return;
	}
	CompressionInputStream in = codec.createInputStream(fs.open(new Path(file)));
	BufferedReader br = null;
	try {
		br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
		String line;
		while ((line = br.readLine()) != null) {
			System.out.println(line);
		}
	} finally {
		// Closing the reader also closes the underlying compression stream.
		if (br != null) {
			br.close();
		} else {
			in.close();
		}
	}
}
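A hypothetical way to invoke the method above from a driver class (the class name SnappyReader and the HDFS path are assumptions, not from the original code):

```java
public class SnappyReader {
    // testReadOutput_Snappy2(String file) from above goes here.

    public static void main(String[] args) throws Exception {
        // LD_LIBRARY_PATH must contain the libsnappy.so directory before
        // the JVM starts (see section 1, step 4).
        new SnappyReader().testReadOutput_Snappy2(
            "hdfs://hadoop-master-node:9000/user/hadoop/wordcount/output/part-r-00000.snappy");
    }
}
```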