
MIT 6.824 2018 MapReduce Part II: Single-worker word count


Part II: Single-worker word count

Now you will implement word count, a simple Map/Reduce example. Look in main/wc.go; you'll find empty mapF() and reduceF() functions. Your job is to insert code so that wc.go reports the number of occurrences of each word in its input. A word is any contiguous sequence of letters, as determined by unicode.IsLetter.

There are some input files with pathnames of the form pg-*.txt in ~/6.824/src/main, downloaded from Project Gutenberg. Here's how to run wc with the input files:

$ cd 6.824
$ export "GOPATH=$PWD"
$ cd "$GOPATH/src/main"
$ go run wc.go master sequential pg-*.txt
# command-line-arguments
./wc.go:14: missing return at end of function
./wc.go:21: missing return at end of function

The compilation fails because mapF() and reduceF() are not complete.

Review Section 2 of the MapReduce paper. Your mapF() and reduceF() functions will differ a bit from those in the paper's Section 2.1. Your mapF() will be passed the name of a file, as well as that file's contents; it should split the contents into words, and return a Go slice of mapreduce.KeyValue. While you can choose what to put in the keys and values for the mapF output, for word count it only makes sense to use words as the keys. Your reduceF() will be called once for each key, with a slice of all the values generated by mapF() for that key. It must return a string containing the total number of occurrences of the key.

You can test your solution using:

$ cd "$GOPATH/src/main"
$ time go run wc.go master sequential pg-*.txt
master: Starting Map/Reduce task wcseq
Merge: read mrtmp.wcseq-res-0
Merge: read mrtmp.wcseq-res-1
Merge: read mrtmp.wcseq-res-2
master: Map/Reduce task completed
2.59user 1.08system 0:02.81elapsed

The output will be in the file "mrtmp.wcseq". Your implementation is correct if the following command produces the output shown here:

$ sort -n -k2 mrtmp.wcseq | tail -10
that: 7871
it: 7987
in: 8415
was: 8578
a: 13382
of: 13536
I: 14296
to: 16079
and: 23612
the: 29748

You can remove the output file and all intermediate files with:

$ rm mrtmp.*

To make testing easy for you, run:

$ bash ./test-wc.sh

and it will report whether your solution is correct.

Let's read through this shell script:

#!/bin/bash
go run wc.go master sequential pg-*.txt
sort -n -k2 mrtmp.wcseq | tail -10 | diff - mr-testout.txt > diff.out
if [ -s diff.out ]
then
  echo "Failed test. Output should be as in mr-testout.txt. Your output differs as follows (from diff.out):" > /dev/stderr
  cat diff.out
else
  echo "Passed test" > /dev/stderr
fi

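This part mainly implements a single-threaded, sequential MapReduce. The map function splits English text into words (look up the relevant calls in the Go API documentation) and appends each word to a key/value slice with the value "1". The reduce function sums the values that share a key, which gives the real count for that word. The code is as follows: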

//
// The map function is called once for each file of input. The first
// argument is the name of the input file, and the second is the
// file's complete contents. You should ignore the input file name,
// and look only at the contents argument. The return value is a slice
// of key/value pairs.
//
func mapF(filename string, contents string) []mapreduce.KeyValue {
	// Your code here (Part II).
	// Needs "strings" and "unicode" in the import block of wc.go.
	// A rune separates words if it is not a letter.
	isSeparator := func(r rune) bool { return !unicode.IsLetter(r) }

	// Split contents into a slice of words.
	words := strings.FieldsFunc(contents, isSeparator)

	// Emit one (word, "1") pair per occurrence.
	kva := make([]mapreduce.KeyValue, 0, len(words))
	for _, w := range words {
		kva = append(kva, mapreduce.KeyValue{Key: w, Value: "1"})
	}
	return kva
}

//
// The reduce function is called once for each key generated by the
// map tasks, with a list of all the values created for that key by
// any map task.
//
func reduceF(key string, values []string) string {
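	// Your code here (Part II).
	// Needs "strconv" in the import block of wc.go.
	// Every value emitted by mapF is "1", so the sum of the values
	// equals the number of values.
	return strconv.Itoa(len(values))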
}
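The reduceF above is correct only because mapF always emits "1". If the values could ever be larger partial counts (for example, if a combiner step were added, which this lab does not do), the values would have to be parsed and summed. Here is a minimal sketch of that variant; reduceSum is a hypothetical name, and skipping malformed values is an assumption made here to keep the example short.

```go
package main

import (
	"fmt"
	"strconv"
)

// reduceSum is a hypothetical alternative to reduceF that parses each
// value as an integer and adds them up, instead of counting the values.
func reduceSum(key string, values []string) string {
	total := 0
	for _, v := range values {
		n, err := strconv.Atoi(v)
		if err != nil {
			// Assumption: ignore values that are not integers.
			continue
		}
		total += n
	}
	return strconv.Itoa(total)
}

func main() {
	fmt.Println(reduceSum("the", []string{"1", "1", "1"})) // prints 3
	fmt.Println(reduceSum("the", []string{"2", "5"}))      // prints 7
}
```

For the word-count output expected by test-wc.sh, both versions give the same result.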

That completes Part II.