MapReduce: A Custom WordCount Example

阿新 • Published: 2018-12-16

Given a set of text files, count and output the total number of occurrences of each word.

1. Analysis

Mapper stage:

Convert the text content that the MapTask passes in to a String.
Split each line into words on spaces.
Emit each word in the form <word, 1>.
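The mapper steps above can be sketched in plain Java, without any Hadoop classes, to show what one call to map produces. The class name `MapStepSketch` and the method `mapLine` are hypothetical names used only for this illustration; a real mapper writes its pairs to the Hadoop context instead of returning a list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone illustration of the mapper stage (not Hadoop code).
public class MapStepSketch {

    // Takes one line, splits it on a single space, and emits <word, 1> per token.
    public static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Two sample lines; each is processed independently, one map call per line.
        for (String line : new String[] {"hello world", "atguigu atguigu"}) {
            System.out.println(mapLine(line));
        }
    }
}
```

Note that a repeated word in one line yields two separate <word, 1> pairs; adding them up is left to the reducer stage.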

Reducer stage:

Sum the values received for each key.
Emit the total for that key.
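As a rough illustration of the reducer stage, the following plain-Java sketch sums the values for each key. It assumes the map output has already been grouped by key, which the framework does between the two stages. The class `ReduceStepSketch`, the method `sum`, and the grouped sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone illustration of the reducer stage (not Hadoop code).
public class ReduceStepSketch {

    // For one key: add up every value received for that key.
    public static int sum(Iterable<Integer> values) {
        int count = 0;
        for (int value : values) {
            count += value;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed grouped map output: each key with the list of 1s emitted for it.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("atguigu", Arrays.asList(1, 1));
        grouped.put("hello", Arrays.asList(1));
        grouped.put("world", Arrays.asList(1));

        // One reduce call per key: print the key and its total.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + "\t" + sum(entry.getValue()));
        }
    }
}
```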

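Driver stage:

Get the configuration.
Specify the local path of the jar that contains this program.
Associate the mapper and reducer classes.
Specify the key/value types of the map output.
Specify the key/value types of the final output.
Specify the job's input location and output location.
Submit the job.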

2. Write the code

Define a mapper class

package com.atguigu.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
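 * KEYIN: type of the input key. With the default TextInputFormat this is the byte offset of the line in the file, not a line number.
 * VALUEIN: type of the input value, the text of one line.
 *
 * KEYOUT: type of the output key, the word.
 * VALUEOUT: type of the output value, the count for that word.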
 * @author Administrator
 *
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{

	// Sample input lines:
	// hello world
	// atguigu atguigu

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// 1. Get the current line
		String line = value.toString();

		// 2. Split the line into words
		String[] words = line.split(" ");

		// 3. Emit each word as <word, 1>
		for (String word : words) {
			context.write(new Text(word), new IntWritable(1));
		}
	}
}
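One caveat about splitting on a single space: in Java, `String.split(" ")` produces an empty string between two consecutive spaces, and a mapper that emits every token would then count "" as a word. The sketch below shows one possible alternative, assuming the input may contain repeated spaces or tabs. `TokenizeSketch` and `tokenize` are hypothetical names, and this is not part of the original mapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing a more tolerant way to split a line into words.
public class TokenizeSketch {

    // Splits on runs of whitespace and drops empty tokens, so repeated spaces,
    // tabs, and leading or trailing blanks do not produce "" as a word.
    public static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "hello  world"; // two spaces between the words
        System.out.println(line.split(" ").length); // 3 tokens: "hello", "", "world"
        System.out.println(tokenize(line));         // [hello, world]
    }
}
```

The `isEmpty` check matters for blank lines: splitting an empty string returns a single empty token rather than an empty array.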

Define a reducer class

package com.atguigu.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,
			Context context) throws IOException, InterruptedException {
		// 1. Sum the counts emitted for this word
		int count = 0;
		for (IntWritable value : values) {
			count += value.get();
		}
		// 2. Emit the word with its total count
		context.write(key, new IntWritable(count));
	}
}

Define a driver class

package com.atguigu.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


// Driver: configures and submits the job
public class WordcountDriver {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		// 1. Get the configuration and create the job
		Configuration configuration = new Configuration();
		Job job = Job.getInstance(configuration);

		// 2. Locate the jar through the driver class
		job.setJarByClass(WordcountDriver.class);

		// 3. Set the mapper and reducer classes
		job.setMapperClass(WordcountMapper.class);
		job.setReducerClass(WordcountReducer.class);

		// 4. Set the key/value types of the mapper output
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);

		// 5. Set the key/value types of the final output
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// 6. Set the input and output paths (args[0] = input, args[1] = output)
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		// 7. Submit the job and wait for it to finish
		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	}
}

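3. Run

Package the classes above into a jar and copy it to the Hadoop cluster.
Start the Hadoop cluster.
Run the WordCount program.

Open the HDFS web UI to check the result.

Download part-r-00000 and inspect it; the output meets the requirement.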
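Because the driver does not set an output format, the job should use the default TextOutputFormat, which writes each record as the key, a tab character, and the value. Under that assumption, a downloaded part-r-00000 could be read back with a small sketch like this. `ResultFileSketch`, `parse`, and the sample lines are hypothetical and are not taken from a real run.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading WordCount output lines of the form "<word>\t<count>".
public class ResultFileSketch {

    // Parses each line into a map entry, keeping the order of the lines.
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            counts.put(line.substring(0, tab), Integer.parseInt(line.substring(tab + 1)));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative lines only; in practice these would be read from part-r-00000.
        List<String> lines = Arrays.asList("atguigu\t2", "hello\t1", "world\t1");
        System.out.println(parse(lines));
    }
}
```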
