How to use multiple outputs in MapReduce
Streaming supports multiple outputs (SuffixMultipleTextOutputFormat)
For example:
# -outputformat: set the output format to org.apache.hadoop.mapred.lib.SuffixMultipleTextOutputFormat.
# suffix.multiple.outputformat.separator: the separator between the value and the file name.
# The default is "#"; if the value itself contains "#", use this parameter to set a different separator.
hadoop streaming \
-input /home/mr/data/test_tab/ \
-output /home/mr/output/tab_test/out19 \
-outputformat org.apache.hadoop.mapred.lib.SuffixMultipleTextOutputFormat \
-jobconf suffix.multiple.outputformat.filesuffix=a,c,f,abc,cde \
-jobconf suffix.multiple.outputformat.separator="#" \
-mapper "cat" \
-reducer "sh reduce.sh" \
-file reduce.sh
Note: the parameters marked in red must be set; see the comments for what each one does.
In the mapper or reducer, append "#" followed by the file name to the value of every output record:
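#!/bin/bash
# reduce.sh: append "#<file name>" to the value of each record so that
# the record is written to the output file with the matching suffix.
while read -r line
do
    # Split the record into key (first field) and value (second field).
    key=$(echo "$line" | awk '{print $1}')
    value=$(echo "$line" | awk '{print $2}')
    # "=" is used instead of "==" so the script also works under plain sh.
    if [ "$key" = "a" ]
    then
        echo "$key $value#a"
    fi
    if [ "$key" = "c" ]
    then
        echo "$key $value#c"
    fi
    if [ "$key" = "f" ]
    then
        echo "$key $value#f"
    fi
    if [ "$key" = "abc" ]
    then
        echo "$key $value#abc"
    fi
    if [ "$key" = "cde" ]
    then
        echo "$key $value#cde"
    fi
done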
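
The tagging logic can be checked locally before the job is submitted. The following is only a minimal sketch of the same idea, not part of the original job: the function name tag_records and the SUFFIXES variable are hypothetical, and SUFFIXES is assumed to mirror the value passed to suffix.multiple.outputformat.filesuffix. Unlike reduce.sh, it keeps everything after the key as the value rather than only the second field.

```shell
#!/bin/bash
# Hypothetical helper (not from the original script): tag each record whose
# key is one of the configured suffixes by appending "#<suffix>" to its value.
# SUFFIXES is assumed to match suffix.multiple.outputformat.filesuffix.
SUFFIXES="a c f abc cde"

tag_records() {
    # read splits each line into the key and the rest of the line (the value).
    while read -r key value
    do
        for suffix in $SUFFIXES
        do
            if [ "$key" = "$suffix" ]
            then
                echo "$key $value#$suffix"
            fi
        done
    done
}

# Local check with sample records; no Hadoop cluster is involved here.
printf 'a 1\nabc 2\nzzz 3\n' | tag_records
```

With the sample input, the records with keys a and abc are printed with "#a" and "#abc" appended, and the zzz record is dropped because its key is not in the suffix list, which is also what the chain of if statements in reduce.sh does.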