1. 程式人生 > >python實現單詞計數的mapreduce

python實現單詞計數的mapreduce

map函式

import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words :
        print "%s\t%s" % (word , 1)

reduce函式
import sys
current_word=None
current_count=0


for line in sys.stdin:
    line=line.strip()
    word=line.split("\t",1)

    if current_word==word[0]:#當前單詞如果為本次傳過來的單詞,則計數加一
        current_count=current_count+1
        
    if current_word==None:#第一次判斷當前單詞是否為空,若為空,賦值,計數為一
        current_word=word[0]
        current_count=current_count+1

    elif current_word!=word[0]:#當前單詞如果不為本次傳過來的,則先把當前的輸出,再賦值,計數
        print "%s\t%s" %(current_word,current_count)
        current_count=1
        current_word=word[0]
print "%s\t%s" %(current_word,current_count)#列印迴圈結束後,最後一次的單詞

測試:

echo "hello word hello Hadoop map reduce" | ./mapper.py |sort -k1,1| ./reducer.py 

Python只能對排好序的單詞進行計數,在Hadoop中會實現對單詞的排序


在Hadoop上執行:

bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file test/code/mapper.py     -mapper test/code/mapper.py \
-file test/code/reducer.py    -reducer test/code/reducer.py \
-input /user/rte/hdfs_in/*    -output /user/rte/hdfs_out