Hadoop MapReduce Development in Practice: Output Data Compression
By 阿新 · Published 2018-02-02
1. Hadoop output data compression
1.1 Why compress?
- When the output data is large, Hadoop's built-in compression mechanism can be used to compress it, and the compression codec can be specified. This reduces network bandwidth and storage consumption.
- Map output can be compressed (between map output and reduce input, this reduces the amount of data transferred over the network during the shuffle phase).
- Reduce output can be compressed (the data ultimately stored on HDFS; this mainly reduces HDFS storage usage).

Neither the mapper nor the reducer program needs to change; it is enough to pass the relevant parameters when launching the streaming job:
-jobconf "mapred.compress.map.output=true" \
-jobconf "mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
-jobconf "mapred.output.compress=true" \
-jobconf "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
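As a toy illustration of why this helps (this snippet is mine, not from the original post): word-count style intermediate records are highly repetitive, so gzip, the codec configured above, typically shrinks them dramatically, which is exactly what saves shuffle bandwidth and HDFS space.

```python
# Toy illustration (not part of the original job): gzip on repetitive
# "word \t count" records, the kind of data moved during shuffle.
import gzip

# Simulated map output: many identical tab-separated records.
records = ("the\t1\n" * 1000).encode("utf-8")

compressed = gzip.compress(records)

# The compressed form is far smaller, and decompression restores
# the data exactly.
print(len(records), len(compressed))
assert gzip.decompress(compressed) == records
```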
1.2 The run_streaming script
#!/bin/bash

HADOOP_CMD="/home/hadoop/app/hadoop/hadoop-2.6.0-cdh5.13.0/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/app/hadoop/hadoop-2.6.0-cdh5.13.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.13.0.jar"

INPUT_FILE_PATH="/input/The_Man_of_Property"
OUTPUT_FILE_PATH="/output/wordcount/CacheArchiveCompressFile"

$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_FILE_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH \
    -output $OUTPUT_FILE_PATH \
    -jobconf "mapred.job.name=wordcount_wordwhite_cacheArchivefile_demo" \
    -jobconf "mapred.compress.map.output=true" \
    -jobconf "mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
    -jobconf "mapred.output.compress=true" \
    -jobconf "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
    -mapper "python mapper.py WHF.gz" \
    -reducer "python reducer.py" \
    -cacheArchive "hdfs://localhost:9000/input/cachefile/wordwhite.tar.gz#WHF.gz" \
    -file "./mapper.py" \
    -file "./reducer.py"
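The post does not show mapper.py or reducer.py. The sketch below is my guess at the core logic of a whitelist word count like this one, written as plain functions so it is easy to test; in the real scripts the mapper would load its whitelist from the unpacked cacheArchive directory (passed as sys.argv[1], i.e. "WHF.gz") and both would stream stdin to stdout.

```python
# Hypothetical sketch of the mapper/reducer logic for the job above;
# the original post does not include mapper.py or reducer.py.

def map_words(lines, whitelist):
    # Emit "word \t 1" for every whitelisted word in the input stream.
    for line in lines:
        for word in line.split():
            if word in whitelist:
                yield "%s\t1" % word

def reduce_counts(lines):
    # Sum counts per word; streaming delivers reducer input sorted by key.
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)
```

In the real scripts, map_words would be fed sys.stdin plus the whitelist read from the archive directory, and reduce_counts would print each result line to stdout.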
1.3 Run the script
$ chmod +x run_streaming_compress.sh
$ ./run_streaming_compress.sh
... intermediate output omitted ...
18/02/02 10:51:50 INFO streaming.StreamJob: Output directory: /output/wordcount/CacheArchiveCompressFile
1.4 Check the results
$ hadoop fs -ls /output/wordcount/CacheArchiveCompressFile
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2018-02-02 10:51 /output/wordcount/CacheArchiveCompressFile/_SUCCESS
-rw-r--r--   1 hadoop supergroup         81 2018-02-02 10:51 /output/wordcount/CacheArchiveCompressFile/part-00000.gz

$ hadoop fs -get /output/wordcount/CacheArchiveCompressFile/part-00000.gz ./
$ gunzip part-00000.gz
$ cat part-00000
and	2573
had	1526
have	350
in	1694
or	253
the	5144
this	412
to	2782
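As an alternative to the gunzip/cat steps above, a fetched part file can be inspected directly with Python's gzip module (a small helper of my own, not from the original post):

```python
# Helper (not from the original post) to read a gzip-compressed part file,
# e.g. part-00000.gz fetched from HDFS, without unpacking it first.
import gzip

def read_gzip_text(path):
    with gzip.open(path, "rt") as f:  # "rt": decompress and decode as text
        return f.read()
```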
2. Hadoop streaming syntax reference
- http://blog.51cto.com/balich/2065419