python實現單詞計數的mapreduce
阿新 • • 發佈:2019-02-07
map函式
import sys
for line in sys.stdin:
line = line.strip()
words = line.split()
for word in words :
print "%s\t%s" % (word , 1)
reduce函式
import sys current_word=None current_count=0 for line in sys.stdin: line=line.strip() word=line.split("\t",1) if current_word==word[0]:#當前單詞如果為本次傳過來的單詞,則計數加一 current_count=current_count+1 if current_word==None:#第一次判斷當前單詞是否為空,若為空,賦值,計數為一 current_word=word[0] current_count=current_count+1 elif current_word!=word[0]:#當前單詞如果不為本次傳過來的,則先把當前的輸出,再賦值,計數 print "%s\t%s" %(current_word,current_count) current_count=1 current_word=word[0] print "%s\t%s" %(current_word,current_count)#列印迴圈結束後,最後一次的單詞
測試:
echo "hello word hello Hadoop map reduce" | ./mapper.py |sort -k1,1| ./reducer.py
Python只能對排好序的單詞進行計數,在Hadoop中會實現對單詞的排序
在Hadoop上執行:
bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file test/code/mapper.py -mapper test/code/mapper.py \
-file test/code/reducer.py -reducer test/code/reducer.py \
-input /user/rte/hdfs_in/* -output /user/rte/hdfs_out