MapReduce: Reading and Writing LZO-Compressed Files in Detail

阿新 • Published: 2018-11-02

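Problem:
When a MapReduce job is written in Java, an LZO file can be used as input just like plain text. However, the whole LZO file is assigned to a single map task. If the file is large and you want several map tasks to share it, adjusting mapred.min.split.size and mapred.max.split.size has no effect, because an LZO file without an index cannot be split.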

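Solution:
Build an index for the LZO file. The index file has the same name as the LZO file plus the suffix .index. It is created by running the indexer class shipped in hadoop-lzo-0.4.17.jar:
hadoop jar $HADOOP_HOME/lib/hadoop-lzo-0.4.17.jar com.hadoop.compression.lzo.LzoIndexer hdfs://inputpath (the job's input path, ending in .lzo)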
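
Before running the indexer it can help to know which files still lack an index. The following is only a minimal sketch of that check, based on the naming rule above (index name = LZO file name + .index). The class name LzoIndexCheck, its methods, and the sample listing are hypothetical, and the listing is a plain list of names; a real job would read it from HDFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: find .lzo files that have no sibling .index file yet. */
final class LzoIndexCheck {
    static final String LZO_SUFFIX = ".lzo";
    static final String INDEX_SUFFIX = ".index";

    /** Name of the index file that belongs to the given LZO file. */
    static String indexNameFor(String lzoName) {
        return lzoName + INDEX_SUFFIX;
    }

    /** Returns the .lzo names whose index file is missing, in listing order. */
    static List<String> missingIndexes(List<String> listing) {
        Set<String> present = new HashSet<>(listing);
        List<String> missing = new ArrayList<>();
        for (String name : listing) {
            if (name.endsWith(LZO_SUFFIX) && !present.contains(indexNameFor(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical directory listing, used only for illustration.
        List<String> listing = Arrays.asList(
                "part-00000.lzo", "part-00000.lzo.index", "part-00001.lzo", "_SUCCESS");
        System.out.println(missingIndexes(listing)); // prints [part-00001.lzo]
    }
}
```

Each name it reports is a file that would still be read by a single map task until the indexer has been run on it.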

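Set the job's input format. The default is TextInputFormat; change it with job.setInputFormatClass(LzoTextInputFormat.class).
With the index and this input format in place, mapred.min.split.size and mapred.max.split.size can again be used to adjust the number of map tasks.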

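Write the output as LZO-compressed files:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import com.hadoop.compression.lzo.LzoIndexer;
    import com.hadoop.compression.lzo.LzopCodec;

    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
    int result = job.waitForCompletion(true) ? 0 : 1;

    // Once the job above has finished, the final output files exist, so an LZO
    // index can be built for them (args[1] is assumed to be the output path).
    LzoIndexer lzoIndexer = new LzoIndexer(conf);
    lzoIndexer.index(new Path(args[1]));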

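Note: changing the number of parallel map tasks in MapReduce
Principle: change the number of input splits (blocks) by setting the minimum and maximum split size. Both values are set in the main method of the MapReduce program, before the Job is created from the configuration:

    System.out.println(Arrays.toString(args));
    Configuration config = new Configuration();
    config.setLong("mapred.min.split.size", 33554432);  // 32 MiB
    config.setLong("mapred.max.split.size", 67108864);  // 64 MiB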
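
To see what these two numbers do, here is a small sketch of the arithmetic. It uses the same clamping rule as Hadoop's FileInputFormat, max(minSize, min(maxSize, blockSize)). The class name SplitSizeSketch, the 128 MiB block size, and the 1 GiB file size are assumptions made only for this example. For indexed LZO input the real count can differ a little, since split boundaries are, as far as I understand, moved to the compressed block boundaries recorded in the index.

```java
/** Sketch of how the two split-size settings bound the size of each input split. */
final class SplitSizeSketch {
    /** Clamp the block size into the range [minSize, maxSize]. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Rough number of splits (and so map tasks) for one file; ceiling division. */
    static long estimateSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long minSize = 33554432L;     // mapred.min.split.size from the text (32 MiB)
        long maxSize = 67108864L;     // mapred.max.split.size from the text (64 MiB)
        long blockSize = 134217728L;  // assumed HDFS block size (128 MiB)
        long fileSize = 1073741824L;  // assumed input file size (1 GiB)

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        System.out.println(splitSize);                           // prints 67108864
        System.out.println(estimateSplits(fileSize, splitSize)); // prints 16
    }
}
```

Under these assumptions the maximum split size is the value that matters: lowering it produces more map tasks, raising it produces fewer.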
