MIT 6.824 2018 MapReduce Part II: Single-worker word count
Part II: Single-worker word count
Now you will implement word count, a simple Map/Reduce example. Look in main/wc.go; you'll find empty mapF() and reduceF() functions. Your job is to insert code so that wc.go reports the number of occurrences of each word in its input. A word is any contiguous sequence of letters, as determined by
There are some input files with pathnames of the form pg-*.txt in ~/6.824/src/main, downloaded from Project Gutenberg. Here's how to run wc with the input files:
$ cd 6.824
$ export "GOPATH=$PWD"
$ cd "$GOPATH/src/main"
$ go run wc.go master sequential pg-*.txt
# command-line-arguments
./wc.go:14: missing return at end of function
./wc.go:21: missing return at end of function
The compilation fails because mapF() and reduceF() are not complete.
Review Section 2 of the MapReduce paper. Your mapF() and reduceF() functions will differ a bit from those in the paper's Section 2.1. Your mapF() will be passed the name of a file, as well as that file's contents; it should split the contents into words, and return a Go slice of mapreduce.KeyValue. While you can choose what to put in the keys and values for the mapF output, for word count it only makes sense to use words as the keys. Your reduceF() will be called once for each key, with a slice of all the values generated by mapF() for that key. It must return a string containing the total number of occurrences of the key.
- a good read on Go strings is the Go Blog on strings.
- you can use strings.FieldsFunc to split a string into components.
- the strconv package (http://golang.org/pkg/strconv/) is handy to convert strings to integers etc.
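The two library hints above can be tried in isolation before touching wc.go. The following is a minimal sketch, not the lab's code: splitWords is a hypothetical helper name, and it assumes that "letter" means whatever unicode.IsLetter reports, which is the predicate the solution later in this post uses.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// splitWords is a hypothetical helper (not part of the lab code). It breaks
// text into maximal runs of letters and treats every other rune as a separator.
func splitWords(text string) []string {
	notLetter := func(r rune) bool { return !unicode.IsLetter(r) }
	return strings.FieldsFunc(text, notLetter)
}

func main() {
	// Punctuation, digits and the apostrophe all act as separators.
	fmt.Println(splitWords("Hello, world! It's 2018."))

	// strconv converts between the string values that Map/Reduce passes
	// around and the integers needed for counting.
	n, err := strconv.Atoi("41")
	if err != nil {
		fmt.Println("not a number:", err)
		return
	}
	fmt.Println(strconv.Itoa(n + 1))
}
```

Note that with this predicate "It's" splits into "It" and "s", and a token made only of digits produces no word at all.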
You can test your solution using:
$ cd "$GOPATH/src/main"
$ time go run wc.go master sequential pg-*.txt
master: Starting Map/Reduce task wcseq
Merge: read mrtmp.wcseq-res-0
Merge: read mrtmp.wcseq-res-1
Merge: read mrtmp.wcseq-res-2
master: Map/Reduce task completed
2.59user 1.08system 0:02.81elapsed
The output will be in the file "mrtmp.wcseq". Your implementation is correct if the following command produces the output shown here:
$ sort -n -k2 mrtmp.wcseq | tail -10
that: 7871
it: 7987
in: 8415
was: 8578
a: 13382
of: 13536
I: 14296
to: 16079
and: 23612
the: 29748
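For readers less familiar with sort(1): the check above sorts the lines numerically by their second field and keeps the last ten. A rough Go equivalent is sketched below. It assumes each line has the form "word: count" as in the sample output; entry and topN are hypothetical names, and ties are broken by word, which may not match how sort orders equal counts.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// entry is one parsed "word: count" line.
type entry struct {
	word  string
	count int
}

// topN parses lines of the form "word: count" and returns the n entries with
// the largest counts in ascending order, roughly like `sort -n -k2 | tail -n`.
// Lines that do not parse are skipped.
func topN(lines []string, n int) []entry {
	entries := make([]entry, 0, len(lines))
	for _, line := range lines {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		c, err := strconv.Atoi(fields[1])
		if err != nil {
			continue
		}
		entries = append(entries, entry{word: strings.TrimSuffix(fields[0], ":"), count: c})
	}
	sort.Slice(entries, func(i, j int) bool {
		if entries[i].count != entries[j].count {
			return entries[i].count < entries[j].count
		}
		return entries[i].word < entries[j].word // assumed tie-break
	})
	if len(entries) > n {
		entries = entries[len(entries)-n:]
	}
	return entries
}

func main() {
	lines := []string{"of: 13536", "the: 29748", "zebra: 3", "and: 23612"}
	for _, e := range topN(lines, 3) {
		fmt.Printf("%s: %d\n", e.word, e.count)
	}
}
```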
You can remove the output file and all intermediate files with:
$ rm mrtmp.*
To make testing easy for you, run:
$ bash ./test-wc.sh
and it will report if your solution is correct or not.
Let's read through this shell script:
#!/bin/bash
go run wc.go master sequential pg-*.txt
sort -n -k2 mrtmp.wcseq | tail -10 | diff - mr-testout.txt > diff.out
if [ -s diff.out ]
then
  echo "Failed test. Output should be as in mr-testout.txt. Your output differs as follows (from diff.out):" > /dev/stderr
  cat diff.out
else
  echo "Passed test" > /dev/stderr
fi
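This part mainly implements a single-threaded, sequential MapReduce. The map function splits the English text into words (check the Go API documentation for how) and appends each word to the key-value slice with the value "1". The reduce function adds up the values for each key to get the actual count; because every value is "1", this is the same as the number of values. The code is as follows: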
//
// The map function is called once for each file of input. The first
// argument is the name of the input file, and the second is the
// file's complete contents. You should ignore the input file name,
// and look only at the contents argument. The return value is a slice
// of key/value pairs.
//
func mapF(filename string, contents string) []mapreduce.KeyValue {
	// Your code here (Part II).
	// Requires "strings" and "unicode" in the import list of wc.go.
	// A rune separates words if it is not a letter.
	isSeparator := func(r rune) bool { return !unicode.IsLetter(r) }
	// Split contents into a slice of words.
	words := strings.FieldsFunc(contents, isSeparator)
	// Emit one {word, "1"} pair per occurrence.
	kva := make([]mapreduce.KeyValue, 0, len(words))
	for _, w := range words {
		kva = append(kva, mapreduce.KeyValue{Key: w, Value: "1"})
	}
	return kva
}
//
// The reduce function is called once for each key generated by the
// map tasks, with a list of all the values created for that key by
// any map task.
//
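func reduceF(key string, values []string) string {
	// Your code here (Part II).
	// Requires "strconv" in the import list of wc.go.
	// Every value emitted by mapF is "1", so the number of values is the count.
	return strconv.Itoa(len(values))
}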
That completes Part II.
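The reduceF above relies on every value being "1". A variant that actually sums the values, as the description of reduce suggests, is sketched below together with a tiny in-memory driver so it can be run without the lab framework. This is an illustration under assumptions, not the lab's implementation: KeyValue is a local stand-in for mapreduce.KeyValue, and wordCount is a hypothetical driver that replaces the framework's file-based map, group, and reduce steps.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is a local stand-in for mapreduce.KeyValue so that the sketch
// compiles without the lab's mapreduce package.
type KeyValue struct {
	Key   string
	Value string
}

// mapF emits one {word, "1"} pair per word in contents.
func mapF(filename string, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool { return !unicode.IsLetter(r) })
	kva := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kva = append(kva, KeyValue{Key: w, Value: "1"})
	}
	return kva
}

// reduceF sums the values for one key instead of counting them, so it stays
// correct even if a value is a partial count larger than 1.
func reduceF(key string, values []string) string {
	total := 0
	for _, v := range values {
		n, err := strconv.Atoi(v)
		if err != nil {
			continue // skip malformed values; a real job might fail here instead
		}
		total += n
	}
	return strconv.Itoa(total)
}

// wordCount is a hypothetical in-memory driver: map every input, group the
// values by key, then reduce each key.
func wordCount(inputs map[string]string) map[string]string {
	grouped := make(map[string][]string)
	for name, contents := range inputs {
		for _, kv := range mapF(name, contents) {
			grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
		}
	}
	result := make(map[string]string, len(grouped))
	for key, values := range grouped {
		result[key] = reduceF(key, values)
	}
	return result
}

func main() {
	counts := wordCount(map[string]string{
		"a.txt": "the cat and the hat",
		"b.txt": "The cat sat.",
	})
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %s\n", k, counts[k])
	}
}
```

Because no case folding is applied, "The" and "the" are counted as different words, which matches the expected output above where "I" appears capitalized.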