004 A Brief Introduction to WordCount: Counting Word Occurrences in Text

阿新 • Published: 2018-09-05


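Introduction to MapReduce

MapReduce is a distributed computing model designed mainly for computation over massive data sets.
MR consists of two phases, Map and Reduce. The user only needs to implement two functions, map() and reduce(), to get a distributed computation.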

MapReduce diagrams

1. MR execution flow

[Figure: MR execution flow diagram]

2. MR schematic diagram

Let's get a basic understanding of MR by reading the code.

package com.lj.MR;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable>  {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
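        // Split the line on runs of whitespace; each token is one word.
        String[] arr = value.toString().trim().split("\\s+");
        Text keyOut = new Text();
        IntWritable valueOut = new IntWritable(1);
        for (String s : arr) {
            if (s.isEmpty()) {
                continue; // a blank line yields a single empty token
            }
            keyOut.set(s);
            // Emit (word, 1). The method already declares InterruptedException,
            // so let it propagate instead of swallowing it.
            context.write(keyOut, valueOut);
        }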
    }
}

package com.lj.MR;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.Text;

import java.io.IOException;

public class WCReducce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum every count that was emitted for this word.
        int count = 0;
        for (IntWritable iw : values) {
            count += iw.get();
        }
        context.write(key, new IntWritable(count));
    }
}

package com.lj.MR;

import org.apache.hadoop.conf.Configuration;
 
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.log4j.BasicConfigurator;


public class WCApp {
    public static void main(String[] args) {
        BasicConfigurator.configure();

        Configuration conf = new Configuration();
        // Uncomment for local testing
        // conf.set("fs.defaultFS","file:///D://ItTools");
        try {
            // Create a new job through the factory method
            Job job = Job.getInstance(conf);
            // Job name
            job.setJobName("WCApp");
            // Locate the jar that contains this class
            job.setJarByClass(WCApp.class);
            // Set the input format
            job.setInputFormatClass(TextInputFormat.class);


            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));


            job.setMapperClass(WCMapper.class);
            job.setReducerClass(WCReducce.class);


            job.setNumReduceTasks(1);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Block until the job finishes and exit non-zero if it failed
            System.exit(job.waitForCompletion(false) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

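A brief walkthrough of the code:

Following the execution flow diagram, we start with the Mapper and then move on to the Reducer. The Reducer's input key and value types must be the same as the Mapper's output key and value types; otherwise the types do not match and the job fails with an error.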
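To make the type-matching rule concrete, here is a minimal plain-Java sketch of a runtime check on emitted pairs. It is not Hadoop code: `TypeCheckSketch` and `CheckedCollector` are hypothetical names invented for illustration, and `String`/`Integer`/`Long` stand in for `Text`/`IntWritable`/`LongWritable`. It only assumes that the framework compares each emitted key and value against the classes declared on the job, in the spirit of `setMapOutputKeyClass` and `setMapOutputValueClass`.

```java
public class TypeCheckSketch {

    // Hypothetical collector: rejects pairs whose classes differ from the declared ones.
    public static final class CheckedCollector {
        private final Class<?> keyClass;
        private final Class<?> valueClass;
        private int written = 0;

        public CheckedCollector(Class<?> keyClass, Class<?> valueClass) {
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }

        public void write(Object key, Object value) {
            if (key.getClass() != keyClass) {
                throw new IllegalStateException("Type mismatch in key: expected "
                        + keyClass.getName() + ", received " + key.getClass().getName());
            }
            if (value.getClass() != valueClass) {
                throw new IllegalStateException("Type mismatch in value: expected "
                        + valueClass.getName() + ", received " + value.getClass().getName());
            }
            written++;
        }

        public int written() {
            return written;
        }
    }

    public static void main(String[] args) {
        // Declared output types, analogous to (Text, IntWritable) in the job above.
        CheckedCollector out = new CheckedCollector(String.class, Integer.class);
        out.write("hello", 1); // matches the declared types
        try {
            out.write(42L, 1); // the key is a Long, not a String
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the real job, the same idea explains why `WCMapper` declares `Text, IntWritable` as its last two type parameters and `WCReducce` declares them as its first two.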

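Map: converts the raw text into (k, v) pairs, where k is a word and v is its count; the mapper emits 1 for every occurrence.

Shuffle: groups identical keys k together.

Reduce: sums the values v that share the same k.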
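The three phases can be sketched in plain Java, in memory and without any Hadoop dependency. This is only an illustration of the data flow, not how Hadoop implements it: the class `WordCountSketch` and its methods are hypothetical names, `String` and `Integer` stand in for `Text` and `IntWritable`, and a sorted `TreeMap` stands in for the shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map: turn every line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle: collect the values of identical keys; TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(shuffle(map(lines)));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        System.out.println(wordCount(lines)); // prints {hadoop=1, hello=2, world=1}
    }
}
```

For the two input lines, the map step yields (hello,1), (world,1), (hello,1), (hadoop,1); the shuffle groups them into hadoop→[1], hello→[1,1], world→[1]; and the reduce step sums each list.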
