
Big Data - Hadoop MapReduce (2): WordCount


In one sentence: iterate over a large number of text files and count how many times each word appears.
Further reading: TF-IDF (term frequency, inverse document frequency).
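As context for that pointer: TF-IDF weights a word's raw count in one document (exactly what WordCount produces per file) against how common the word is across all documents. A minimal illustrative helper, with a name and signature of my own rather than anything from the original post or Hadoop:

	// tf    = occurrences of the term in one document (the WordCount output)
	// df    = number of documents containing the term (assumed > 0)
	// nDocs = total number of documents in the corpus
	static double tfIdf(int tf, int df, int nDocs) {
		return tf * Math.log((double) nDocs / df);
	}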

(WordCount) Word counting

Given the following text: a b c b b c c d c, the result should be: a 1 b 3 c 4 d 1. The job runs in three phases, traced below:
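Hand-traced on that sample input (worked by hand, not program output):

Map:     (a,1) (b,1) (c,1) (b,1) (b,1) (c,1) (c,1) (d,1) (c,1)
Shuffle: a -> [1]   b -> [1,1,1]   c -> [1,1,1,1]   d -> [1]
Reduce:  a 1   b 3   c 4   d 1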
1) Map: emit every word of every line with a count of 1, i.e. <word, 1>
	// Input arrives one line at a time: the LongWritable key is the byte offset
	// of the line within the file, and the Text value is the line itself.
	// Suppose this line is: b c d e e e e
	public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String lineStr = value.toString(); // the text of this line
			// split on whitespace (StringTokenizer's default delimiters)
			StringTokenizer words = new StringTokenizer(lineStr);
			while (words.hasMoreTokens()) {
				String word = words.nextToken(); // the next word
				// count this word once: emit <word, 1>
				context.write(new Text(word), new IntWritable(1));
			}
		}
	}
2) Shuffle: group the emitted pairs by word, giving <word, [1, 1, ...]>
3) Reduce: sum the counts for each word, e.g. b = 1 + 1 + 1
	// input:  <e, [1, 1, 1, 1]>
	// output: <e, 4>
	public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
		public void reduce(Text key, Iterable<IntWritable> values, Context context)
				throws IOException, InterruptedException {
			int count = 0;
			// sum all the 1s emitted for this word
			for (IntWritable intWritable : values) {
				count += intWritable.get();
			}
			// emit <word, total>
			context.write(key, new IntWritable(count));
		}
	}
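Because summing is associative and commutative, this same IntSumReducer could also be registered as a combiner (the setCombinerClass line is left commented out in the job setup below): each mapper would then pre-sum its own <word, 1> pairs locally, cutting the amount of data shuffled to the reducers without changing the final counts.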
4) Job: configure and run
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

	public static void main(String[] args) throws Exception {

		Configuration conf = new Configuration();
		String inputPath = "input/wordcount";
		String outputPath = "output/wordcount";
		// String[] otherArgs = (new GenericOptionsParser(conf,
		// args)).getRemainingArgs();
		String[] otherArgs = new String[] { inputPath, outputPath }; /* set the arguments directly */

		// delete any previous output so the job can be rerun
		Path outputDir = new Path(outputPath);
		outputDir.getFileSystem(conf).delete(outputDir, true);

		// run
		if (otherArgs.length < 2) {
			System.err.println("Usage: wordcount <in> [<in>...] <out>");
			System.exit(2);
		}

		Job job = Job.getInstance(conf, "word count");
		job.setJarByClass(WordCount.class);
		job.setMapperClass(WordCount.TokenizerMapper.class);
		// job.setCombinerClass(WordCount.IntSumReducer.class); // optional local pre-aggregation
		job.setReducerClass(WordCount.IntSumReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// job.setNumReduceTasks(1); // number of reduce tasks, and thus output files

		// every argument except the last is an input path
		for (int i = 0; i < otherArgs.length - 1; ++i) {
			FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
		}
		// the last argument is the output path
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
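With the sample text above saved under input/wordcount, running the class (from an IDE with the Hadoop dependencies on the classpath, or packaged into a jar and submitted with hadoop jar) should leave a part-r-00000 file under output/wordcount containing: a 1, b 3, c 4, d 1.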
Please credit the source when reposting. Thank you!