MapReduce example: implementing WordCount (word counting)

阿新 • Published: 2019-02-10

Input file contents (djt.txt, the input path used in the program below):

hadoop
hadoop
hadoop
dajiangtai
dajiangtai
dajiangtai
hsg
qq.com
hello you
hello me  her

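How MapReduce processes the data

Execution steps:
1. Map task
1.1 Read the input file and parse it into key/value pairs: each line of the input becomes one pair, and the map function is called once per pair.
1.2 Apply your own logic to the input key and value and emit new key/value pairs.
1.3 Partition the emitted key/value pairs.
1.4 Within each partition, sort and group the data by key, so that the values for the same key end up in one collection.
1.5 (Optional) Combine (locally reduce) the grouped data.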

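2. Reduce task
2.1 Copy the output of the map tasks over the network to the reduce nodes, partition by partition.
2.2 Merge and sort the outputs of the map tasks. Then apply the reduce function's own logic to the input key and values and emit new key/value pairs.
2.3 Save the reduce output to a file.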
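The partition, sort, and group steps (1.3, 1.4, and the merge in 2.2) can be pictured with a small plain-Java sketch. It uses no Hadoop classes: the class name `ShuffleSketch` and its methods are hypothetical, `String.hashCode` stands in for the hash of Hadoop's `Text`, and a `TreeMap` stands in for the framework's sort. Only the partition formula is meant to match Hadoop's default hash partitioner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: simulates partitioning and grouping in memory.
class ShuffleSketch {

    // Same formula as Hadoop's default HashPartitioner;
    // String.hashCode is an assumption standing in for Text.hashCode.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Takes the keys of the <word,1> pairs emitted by map, assigns each pair
    // to a partition, and groups the values by key in sorted key order.
    static List<Map<String, List<Long>>> shuffle(List<String> mapOutputKeys, int numReduceTasks) {
        List<Map<String, List<Long>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>()); // TreeMap keeps the keys sorted
        }
        for (String key : mapOutputKeys) {
            partitions.get(partitionFor(key, numReduceTasks))
                      .computeIfAbsent(key, k -> new ArrayList<>())
                      .add(1L);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("hello", "you", "hello", "me", "her");
        // With a single reduce task everything lands in partition 0.
        System.out.println(shuffle(keys, 1));
        // prints [{hello=[1, 1], her=[1], me=[1], you=[1]}]
    }
}
```

With more than one reduce task, each word still goes to exactly one partition, which is why a single reducer sees all the values for a given key.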

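Hadoop performs steps 1.3, 1.4, and 1.5 for us automatically (1.5 only when a combiner is configured).
What we write ourselves is the map function's output logic, step 1.2.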
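To make the optional step 1.5 concrete, here is a plain-Java sketch of what a combiner does for word counting: it pre-aggregates the `<word,1>` pairs of one map task so that fewer records cross the network. The class `CombinerSketch` is hypothetical and uses no Hadoop classes. In a real job a combiner is registered on the job with `job.setCombinerClass(...)`; the program in this post does not set one, which is consistent with the `Combine input records=0` counter in the run log further down.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: local pre-aggregation of one map task's output.
class CombinerSketch {

    // Collapses repeated <word,1> pairs into <word,partialCount>.
    static Map<String, Long> combine(List<String> mapOutputKeys) {
        Map<String, Long> partial = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("hadoop", "hadoop", "hadoop", "hello", "hello");
        System.out.println(keys.size() + " records before combining");
        System.out.println(combine(keys));
        // prints {hadoop=3, hello=2}
    }
}
```

Because addition is associative and commutative, summing partial counts in the reducer gives the same totals as summing the individual ones.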

Overriding the map function

1.2 Apply your own logic to the input key and value and emit new key/value pairs.

//Define the mapper
    //LongWritable, Text, Text, LongWritable: the first two type parameters are the map input types, the last two are the map output types

    //e.g. <0,hello you>,<10,hello me>  ---> <hello,1><you,1><hello,1><me,1>
    public static class myMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        //Reusable output key and value (k2, v2)
        Text k2 = new Text();
        LongWritable v2 = new LongWritable();

        //The input <key,value> pairs look like <0,hello you><10,hello me>;
        //the key (0 or 10) is the byte offset at which the line starts
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String word : words) {
                //Skip the empty tokens produced by consecutive spaces
                if (!word.trim().isEmpty()) {
                    //Each word of the current line becomes k2
                    k2.set(word);
                    //Before sorting and grouping every word counts once; the L suffix makes the literal a long
                    v2.set(1L);
                    //Emit <k2,v2>
                    context.write(k2, v2);
                }
            }
        }
    }
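The empty-token check in the mapper matters for input lines such as `hello me  her`, which contains two consecutive spaces. The following plain-Java sketch (the class `TokenizeSketch` is hypothetical and independent of Hadoop) mirrors the mapper's loop and shows what `split(" ")` returns before and after filtering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: the tokenizing logic of the mapper without Hadoop types.
class TokenizeSketch {

    // Splits on a single space and drops blank tokens, as the mapper does.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.trim().isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "hello me  her";
        // The double space yields an empty token between "me" and "her".
        System.out.println(Arrays.asList(line.split(" ")));
        // prints [hello, me, , her]
        System.out.println(tokenize(line));
        // prints [hello, me, her]
    }
}
```

Without the check, the empty string would be emitted as a word of its own and would show up in the final counts.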

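Overriding the reduce function

Hadoop handles step 2.1 and the merge and sort in step 2.2 for us; we only need to write the logic of the reduce function.
2.2 Merge and sort the outputs of the map tasks. Then apply the reduce function's own logic to the input key and values and emit new key/value pairs.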

//The myReducer class below emits <k3,v3>; its logic is ours to write
public static class myReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        //Reusable output value (v3)
        LongWritable v3 = new LongWritable();
        //The input <k2,v2s> looks like <hello,{1,1}><you,{1}> and becomes ---> <hello,2><you,1>
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> v2s, Context context)
                     throws IOException, InterruptedException {
            //Sum all the values collected for this word
            long count = 0L;
            for (LongWritable v2 : v2s) {
                count += v2.get();
            }
            v3.set(count);
            //k3 is the same word as k2
            context.write(k2, v3);
        }
    }
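As a rough illustration of the reducer's logic without Hadoop types, the sketch below sums the grouped values for one key. The class `ReduceSketch` and its method are hypothetical; `Iterable<Long>` stands in for `Iterable<LongWritable>`.

```java
import java.util.List;

// Illustration only: the summing logic of the reducer on plain Java types.
class ReduceSketch {

    // Adds up every value grouped under one key, e.g. <hello,{1,1}> gives 2.
    static long reduce(String key, Iterable<Long> values) {
        long count = 0L;
        for (long value : values) {
            count += value;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("hello\t" + reduce("hello", List.of(1L, 1L)));
        // prints hello and 2 separated by a tab
    }
}
```

Summing the values, rather than counting how many there are, keeps the result correct even if the incoming values are partial counts larger than 1.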

The complete program:

package wordcount;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class wordcount {    
    //Define the mapper
    //LongWritable, Text, Text, LongWritable: the first two type parameters are the map input types, the last two are the map output types
    //e.g. <0,hello you>,<10,hello me>  ---> <hello,1><you,1><hello,1><me,1>
    public static class myMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        //Reusable output key and value (k2, v2)
        Text k2 = new Text();
        LongWritable v2 = new LongWritable();
        //The input <key,value> pairs look like <0,hello you><10,hello me>;
        //the key (0 or 10) is the byte offset at which the line starts
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String word : words) {
                //Skip the empty tokens produced by consecutive spaces
                if (!word.trim().isEmpty()) {
                    //Each word of the current line becomes k2
                    k2.set(word);
                    //Before sorting and grouping every word counts once; the L suffix makes the literal a long
                    v2.set(1L);
                    //Emit <k2,v2>
                    context.write(k2, v2);
                }
            }
        }
    }
    public static class myReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        //Reusable output value (v3)
        LongWritable v3 = new LongWritable();
        //The input <k2,v2s> looks like <hello,{1,1}><you,{1}> and becomes ---> <hello,2><you,1>
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> v2s, Context context)
                     throws IOException, InterruptedException {
            //Sum all the values collected for this word
            long count = 0L;
            for (LongWritable v2 : v2s) {
                count += v2.get();
            }
            v3.set(count);
            //k3 is the same word as k2
            context.write(k2, v3);
        }
    }
    //Delete the output directory if it exists
    public static void deleteOutDir(Configuration conf, String out_dir)
               throws IOException, URISyntaxException {
        FileSystem fs = FileSystem.get(new URI(out_dir), conf);
        Path outPath = new Path(out_dir);
        if (fs.exists(outPath)) {
            //true = delete recursively
            fs.delete(outPath, true);
        }
    }
    public static void main(String[] args) throws Exception
    {
        //Load the Hadoop configuration
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, wordcount.class.getSimpleName());
        job.setJarByClass(wordcount.class);
        Path in_path = new Path("hdfs://192.168.145.180:8020/user/root/input/djt.txt");
        FileInputFormat.setInputPaths(job, in_path);
        //TextInputFormat turns the input into <k1,v1> pairs
        job.setInputFormatClass(TextInputFormat.class);
        //Register the mapper; each <k1,v1> pair is passed to myMapper's map function
        job.setMapperClass(myMapper.class);
        //Set the types of the map output <k2,v2>
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        //Register the reducer; it receives the partitioned, sorted, and grouped map output
        job.setReducerClass(myReducer.class);
        //Set the types of the final output <k3,v3>
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        //Set the output directory (output2)
        String OUT_DIR = "hdfs://192.168.145.180:8020/user/root/output2";
        FileOutputFormat.setOutputPath(job, new Path(OUT_DIR));
        job.setOutputFormatClass(TextOutputFormat.class);
        //Delete the output directory if it already exists; otherwise the job fails because the directory exists
        deleteOutDir(conf, OUT_DIR);
        //Submit the job and wait for it to finish; the <k3,v3> results are written to the output directory.
        //The process exit code reflects whether the job succeeded
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run method 1: run inside the Eclipse development environment

In the wordcount.java editor, right-click and choose Run As / Run on Hadoop. The job runs successfully.

Run method 2: build a jar and copy it to the Hadoop server

Building the jar:

In the Project Explorer tree, right-click wordcount.java and choose Export/Java/Runnable JAR file.
Options and inputs:
Launch configuration: wordcount-wordcount
Export destination: D:\應用集合\eclipse\eclipse-workspace\bin_jar\wordcount.jar
Library handling:
You can select: Extract required libraries into generated JAR
Click Finish.

Copy the generated jar to the Hadoop server and run it there.
Copy it into the /home/hadoop3/app/hadoop directory.

Run: hadoop jar wordcount.jar

The full run looks like this:

$ hadoop jar wordcount.jar
18/09/03 19:22:38 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/09/03 19:22:41 INFO input.FileInputFormat: Total input paths to process : 1
18/09/03 19:22:42 INFO mapreduce.JobSubmitter: number of splits:1
18/09/03 19:22:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1535940670011_0002
18/09/03 19:22:42 INFO impl.YarnClientImpl: Submitted application application_1535940670011_0002
18/09/03 19:22:42 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1535940670011_0002/
18/09/03 19:22:42 INFO mapreduce.Job: Running job: job_1535940670011_0002
18/09/03 19:22:51 INFO mapreduce.Job: Job job_1535940670011_0002 running in uber mode : false
18/09/03 19:22:51 INFO mapreduce.Job:  map 0% reduce 0%
18/09/03 19:23:01 INFO mapreduce.Job:  map 100% reduce 0%
18/09/03 19:23:12 INFO mapreduce.Job:  map 100% reduce 100%
18/09/03 19:23:12 INFO mapreduce.Job: Job job_1535940670011_0002 completed successfully
18/09/03 19:23:12 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=235
        FILE: Number of bytes written=255343
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=205
        HDFS: Number of bytes written=65
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=7766
        Total time spent by all reduces in occupied slots (ms)=6780
        Total time spent by all map tasks (ms)=7766
        Total time spent by all reduce tasks (ms)=6780
        Total vcore-milliseconds taken by all map tasks=7766
        Total vcore-milliseconds taken by all reduce tasks=6780
        Total megabyte-milliseconds taken by all map tasks=7952384
        Total megabyte-milliseconds taken by all reduce tasks=6942720
    Map-Reduce Framework
        Map input records=10
        Map output records=14
        Map output bytes=201
        Map output materialized bytes=235
        Input split bytes=116
        Combine input records=0
        Combine output records=0
        Reduce input groups=9
        Reduce shuffle bytes=235
        Reduce input records=14
        Reduce output records=9
        Spilled Records=28
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=169
        CPU time spent (ms)=2070
        Physical memory (bytes) snapshot=312471552
        Virtual memory (bytes) snapshot=4165705728
        Total committed heap usage (bytes)=152428544
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=89
    File Output Format Counters 
        Bytes Written=65

The output above shows that the job ran successfully.
The results are written to:
hdfs://192.168.145.180:8020/user/root/output2

dajiangtai  3
hadoop      3
hello       2
her         1
hsg         1
me          1
qq.com      1
you         1

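If the run fails with the error: RunJar jarFile [mainClass] args…
you probably built the jar with Export/Java/JAR file, which does not specify a main class.
Switch to Export/Java/Runnable JAR file to fix it; the only drawback is that the generated jar is larger.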
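One way to tell the two kinds of jar apart is to look for a `Main-Class` entry in the jar's manifest. The sketch below uses only the standard `java.util.jar` classes and builds two manifests in memory instead of opening a real file. The class `ManifestSketch` is hypothetical, and `wordcount.wordcount` is assumed to be the fully qualified main class, based on the `package wordcount;` and `public class wordcount` declarations above. For a real file, `new JarFile(path).getManifest()` should return the manifest to inspect (it can be null when the jar has none).

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Illustration only: checks whether a manifest names a main class.
class ManifestSketch {

    static boolean hasMainClass(Manifest manifest) {
        return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS) != null;
    }

    public static void main(String[] args) {
        // A plain jar export: manifest without a Main-Class entry.
        Manifest plain = new Manifest();
        plain.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");

        // A runnable jar export: manifest with a Main-Class entry.
        Manifest runnable = new Manifest();
        runnable.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        runnable.getMainAttributes().put(Attributes.Name.MAIN_CLASS, "wordcount.wordcount");

        System.out.println(hasMainClass(plain));    // prints false
        System.out.println(hasMainClass(runnable)); // prints true
    }
}
```

The usage string in the error message also lists an optional `[mainClass]` argument, so naming the class on the command line after the jar file is likely another way around the problem when the manifest has no main class.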
