MapReduce Example

阿新 • Published: 2019-01-18

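This article is based on http://michaelnielsen.org/blog/page/19/
We start with the classic MapReduce example: word count.
The input to a MapReduce job is a collection of (input_key, input_value) key-value pairs. Such a collection can be represented with a Python dictionary. In the word count example, input_key is a file name and input_value is the contents of that file.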

filenames = ['a.txt', 'b.txt', 'c.txt']
i = {}
for filename in filenames:
    # Map each file name to the full text of that file.
    with open(filename) as f:
        i[filename] = f.read()

The Python dictionary i holds all of the input to the MapReduce job. The contents of a.txt, b.txt and c.txt are as follows:

text\a.txt:

The quick brown fox jumped over the lazy grey dogs.

text\b.txt:

That's one small step for a man, one giant leap for mankind.

text\c.txt:

Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.

A MapReduce job has two phases: the map phase and the reduce phase. The map phase produces intermediate keys and values, which the reduce phase then processes.

In the map phase, a mapper function mapper(input_key, input_value) is applied to each (input_key, input_value) pair in the job's input dictionary i, and returns a list of intermediate keys and values. For example, mapper("a.txt", i["a.txt"]) returns:

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), 
 ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]

The mapper function is defined as follows:

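import string

def mapper(input_key, input_value):
  # Emit a (word, 1) pair for every word in the lowercased, punctuation-free text.
  return [(word, 1) for word in
          remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
  # Python 3: the three-argument str.maketrans builds a table that deletes punctuation.
  return s.translate(str.maketrans("", "", string.punctuation))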
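To see how punctuation stripping shapes the words that the mapper emits, here is a minimal sketch written for Python 3 using `str.maketrans`. The variable names `sentence` and `cleaned` are only for illustration; the sentence is the content of b.txt.

```python
import string

def remove_punctuation(s):
    # The three-argument form of str.maketrans maps every character in the
    # third argument to None, so translate() deletes those characters.
    return s.translate(str.maketrans("", "", string.punctuation))

sentence = "That's one small step for a man, one giant leap for mankind."
cleaned = remove_punctuation(sentence.lower())
print(cleaned)          # thats one small step for a man one giant leap for mankind
print(cleaned.split())  # the list of words the mapper would pair with 1
```

The apostrophe is deleted along with the comma and the full stop, which is why "That's" shows up as the key 'thats' in the results.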

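With mapper defined this way, the output of the map phase is the concatenation of the lists returned by calling mapper on every entry of the input dictionary i, that is, mapper("a.txt", i["a.txt"]), mapper("b.txt", i["b.txt"]) and mapper("c.txt", i["c.txt"]):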

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), 
 ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), 
 ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), 
 ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), 
 ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), 
 ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), 
 ('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1), 
 ('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1), 
 ('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1), 
 ('mankind', 1)]
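The map phase described above can be sketched as a simple loop over the input dictionary. This is a rough illustration in Python 3, and it assumes in-memory strings in place of the three files so that it runs without any files on disk. The pairs come out in the dictionary's iteration order, so their order may differ from the listing above.

```python
import string

def remove_punctuation(s):
    return s.translate(str.maketrans("", "", string.punctuation))

def mapper(input_key, input_value):
    return [(word, 1) for word in remove_punctuation(input_value.lower()).split()]

# In-memory stand-ins for the three input files (an assumption of this sketch).
i = {
    "a.txt": "The quick brown fox jumped over the lazy grey dogs.",
    "b.txt": "That's one small step for a man, one giant leap for mankind.",
    "c.txt": "Mary had a little lamb,\nIts fleece was white as snow;\n"
             "And everywhere that Mary went,\nThe lamb was sure to go.",
}

# Map phase: call the mapper once per input and concatenate the results.
intermediate = []
for input_key, input_value in i.items():
    intermediate.extend(mapper(input_key, input_value))

print(len(intermediate))               # 44 pairs in total
print(intermediate.count(("the", 1)))  # 3
```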

Next comes the reduce phase.
Before it starts, MapReduce does some preprocessing: it groups together the values in the intermediate list that share the same key, producing an intermediate dictionary:

{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 
 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 
 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 
 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 
 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 
 'that': [1], 'little': [1], 'small': [1], 'step': [1], 
 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 
 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 
 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
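One possible way to build such an intermediate dictionary is with `collections.defaultdict`. This is only a sketch of the grouping idea, not necessarily how any particular MapReduce framework does it, and the short `intermediate` list here is a hand-picked sample rather than the full map output.

```python
from collections import defaultdict

# A handful of intermediate pairs, as a mapper might emit them.
intermediate = [("the", 1), ("quick", 1), ("the", 1),
                ("lamb", 1), ("the", 1), ("lamb", 1)]

# Append each value to the list that belongs to its key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

print(dict(groups))  # {'the': [1, 1, 1], 'quick': [1], 'lamb': [1, 1]}
```

Unlike an approach based on sorting, this needs only a single pass over the pairs and does not require the keys to be comparable, only hashable.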

The reduce phase calls a reducer function, reducer(intermediate_key, intermediate_value_list), on each item in the intermediate dictionary. In the word count example, the reducer adds up the values in the intermediate_value_list associated with intermediate_key:

def reducer(intermediate_key,intermediate_value_list):
  return (intermediate_key,sum(intermediate_value_list))

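After the reduce phase has processed the intermediate dictionary, the output is as follows (the order of the pairs depends on the dictionary's iteration order and may differ between Python versions):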

[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), 
 ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), 
 ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), 
 ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), 
 ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), 
 ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), 
 ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), 
 ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
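As a small illustration of the reduce phase, the sketch below applies the reducer to a hand-picked slice of the intermediate dictionary shown earlier. The three-entry `groups` dictionary is a sample chosen for brevity, not the full data.

```python
def reducer(intermediate_key, intermediate_value_list):
    # Sum the 1s collected for this word.
    return (intermediate_key, sum(intermediate_value_list))

# A small slice of the intermediate dictionary.
groups = {"the": [1, 1, 1], "lamb": [1, 1], "fox": [1]}

# Reduce phase: one reducer call per (key, list of values) item.
output = [reducer(key, values) for key, values in groups.items()]
print(output)  # [('the', 3), ('lamb', 2), ('fox', 1)]
```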

Reference program:

# word_count.py

import string
import map_reduce

def mapper(input_key, input_value):
  # Emit (word, 1) for each word in the lowercased, punctuation-free text.
  return [(word, 1) for word in
          remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
  # Python 3: build a translation table that deletes every punctuation character.
  return s.translate(str.maketrans("", "", string.punctuation))

def reducer(intermediate_key, intermediate_value_list):
  # Sum the 1s collected for this word.
  return (intermediate_key, sum(intermediate_value_list))

filenames = ["text\\a.txt", "text\\b.txt", "text\\c.txt"]
i = {}
for filename in filenames:
  with open(filename) as f:
    i[filename] = f.read()

print(map_reduce.map_reduce(i, mapper, reducer))

The map_reduce module:

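# map_reduce.py

import itertools

def map_reduce(i, mapper, reducer):
  # Map phase: collect the intermediate (key, value) pairs from every input.
  intermediate = []
  for (key, value) in i.items():
    intermediate.extend(mapper(key, value))
  # Group values by key; groupby only merges adjacent items, hence the sort.
  groups = {}
  for key, group in itertools.groupby(sorted(intermediate),
                                      lambda x: x[0]):
    groups[key] = [y for x, y in group]
  # Reduce phase: apply the reducer to each key and its list of values.
  return [reducer(intermediate_key, groups[intermediate_key])
          for intermediate_key in groups]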
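To try the whole pipeline without creating any files, the sketch below puts the pieces into a single Python 3 script and cross-checks the result against `collections.Counter`. The in-memory input dictionary and the `expected` cross-check are assumptions added for illustration; they are not part of the reference program.

```python
import itertools
import string
from collections import Counter

def map_reduce(i, mapper, reducer):
    intermediate = []
    for key, value in i.items():
        intermediate.extend(mapper(key, value))
    groups = {}
    # groupby only merges adjacent items, so the pairs are sorted first.
    for key, group in itertools.groupby(sorted(intermediate),
                                        key=lambda pair: pair[0]):
        groups[key] = [value for _, value in group]
    return [reducer(key, groups[key]) for key in groups]

def remove_punctuation(s):
    return s.translate(str.maketrans("", "", string.punctuation))

def mapper(input_key, input_value):
    return [(word, 1) for word in remove_punctuation(input_value.lower()).split()]

def reducer(intermediate_key, intermediate_value_list):
    return (intermediate_key, sum(intermediate_value_list))

# In-memory stand-ins for text\a.txt, text\b.txt and text\c.txt.
i = {
    "a.txt": "The quick brown fox jumped over the lazy grey dogs.",
    "b.txt": "That's one small step for a man, one giant leap for mankind.",
    "c.txt": "Mary had a little lamb,\nIts fleece was white as snow;\n"
             "And everywhere that Mary went,\nThe lamb was sure to go.",
}

result = dict(map_reduce(i, mapper, reducer))

# Independent cross-check: count the same words directly with Counter.
expected = Counter()
for text in i.values():
    expected.update(remove_punctuation(text.lower()).split())

print(result["the"], result["mary"], len(result))  # 3 2 36
print(result == dict(expected))                    # True
```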
