Distributed Processing Framework: MapReduce
阿新 · Published 2018-11-06
1 Advantages of MapReduce
- Offline processing of massive data sets
- Easy to develop and easy to run
2 The MapReduce Programming Model
- A job is split into a Map phase and a Reduce phase
- Map phase: Map Tasks
- Reduce phase: Reduce Tasks
2.1 The wordcount Example
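The canonical example: the map phase splits each line into words and emits a (word, 1) pair per word; the reduce phase sums the counts for each word. The full source code appears in section 4.1.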
2.2 Core Concepts
- Split: the slice of data handed to a MapReduce job for processing; the smallest unit of computation in MapReduce. (Compare HDFS, where the block is the smallest unit of storage, 128 MB by default.) By default a Split corresponds one-to-one to a block; the relationship can also be set manually, which is not recommended. See the sketch after this list.
- InputFormat: defines how input data is split and read into records
- OutputFormat: defines how results are written out
- Combiner: see section 5
- Partitioner: see section 6
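A minimal sketch of how these knobs appear in job setup (the class name and the exact byte values are illustrative only; the 128 MB figure and the default one-to-one split/block pairing come from the notes above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SplitConfigSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-config-sketch");
        job.setInputFormatClass(TextInputFormat.class);   // how input files are split and parsed into records
        job.setOutputFormatClass(TextOutputFormat.class); // how (key, value) results are written out
        // Pin splits to the 128 MB HDFS block size (already the default; shown only for illustration)
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}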
3 MapReduce Architecture
3.1 Version 1.x
- JobTracker (JT): the job manager. Splits a job into a set of tasks (MapTask, ReduceTask); assigns those tasks to TaskTrackers for execution; monitors jobs and handles faults (if a task dies, there is a mechanism to restart it). If the JT receives no heartbeat from a TT within a certain interval, the TT is presumed dead and the tasks assigned to it may be reassigned to other TTs.
- TaskTracker (TT): the task executor. Runs Tasks (MapTask, ReduceTask) on its node; interacts with the JT to run, start, and stop tasks; sends heartbeat messages to the JT.
- MapTask: runs the map logic we write ourselves; writes the map output to local disk.
- ReduceTask: reads the data output by the Map Tasks; groups the data by key and passes each group to the reduce method we write.
3.2 MapReduce 2.x
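In 2.x the JobTracker's responsibilities are split across YARN components: a global ResourceManager schedules cluster resources, a per-application ApplicationMaster manages each job's tasks and fault tolerance, and NodeManagers replace TaskTrackers as the per-node workers.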
4 The wordcount Example
4.1 Source Code
package com.bzt.cn.mapreduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/*
 * MapReduce version of wordcount
 * */
public class WordCountApp {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        LongWritable one = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Split each input line on spaces and emit a (word, 1) pair per word
            String line = value.toString();
            String[] words = line.split(" ");
            for (String word : words) {
                context.write(new Text(word), one);
            }
        }
    }
    /*
     * Reducer: the merge/aggregation step
     * */
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // sum must be a local variable: as an instance field it would keep
            // accumulating across keys and produce wrong counts
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Create the job
        Job job = Job.getInstance(conf, "WC");
        // Set the class containing the job's code
        job.setJarByClass(WordCountApp.class);
        // Input path for the job
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Map-side parameters (note: setMapOutputKeyClass, not setOutputKeyClass, for the map output key)
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Reduce-side parameters
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Output path for the job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
4.2 Package into a jar with Maven and Upload to the Cluster
Run:
[hadoop@node1 ~]$ hadoop jar wordcount.jar com.bzt.cn.mapreduce.WordCountApp hdfs://node1:8020/hello.txt hdfs://node1:8020/wcout
[hadoop@node1 ~]$ hadoop fs -ls /wcout
18/10/30 09:38:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2018-10-30 09:38 /wcout/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 30 2018-10-30 09:38 /wcout/part-r-00000
[hadoop@node1 ~]$ hadoop fs -text /wcout/part-r-00000
18/10/30 09:39:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hello 4
jerry 5
tom 7
world 8
[hadoop@node1 ~]$
4.3 Enhanced Version
FileOutputFormat throws an exception if the output directory already exists, so the enhanced main() deletes it up front (this also requires import org.apache.hadoop.fs.FileSystem;):
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Delete the output directory if it already exists
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true); // true = recursive; the one-argument delete() is deprecated
            System.out.println("output file deleted!");
        }
        // Create the job
        Job job = Job.getInstance(conf, "WC");
        // Set the class containing the job's code
        job.setJarByClass(WordCountApp.class);
        // Input path for the job
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Map-side parameters
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Reduce-side parameters
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Output path for the job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
5 Combiner
- A "local reducer" that runs on the map side
- Cuts down the data volume the Map Tasks emit, and therefore the amount shipped across the network
- Suitable scenarios: sums and counts (operations that stay correct no matter how many times the combiner runs); see the one-line sketch below
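For wordcount, the reduce logic is a plain sum, so the reducer class itself can serve as the combiner. A minimal sketch, assuming the job setup from section 4.1:

// Inside main(), after job.setReducerClass(MyReducer.class):
// run the reduce logic on each map task's local output before the shuffle
job.setCombinerClass(MyReducer.class);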
6 Partitioner
- The Partitioner decides which ReduceTask processes each key/value pair a MapTask outputs
- Default implementation: the key's hash value, modulo the number of Reduce Tasks (sketched below)
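A minimal sketch of that default, mirroring Hadoop's built-in HashPartitioner:

import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the modulo result is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}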
6.1 Test Data
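The original notes do not include the file contents. A plausible animal.txt, consistent with the per-partition sums shown in 6.3 below (each line is an animal name and a count, separated by a space):

dog 3
dog 4
cat 2
cat 4
duck 7
lion 13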
6.2 Source Code
package com.bzt.cn.mapreduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class PartitionerApp {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line is "animal count"; emit (animal, count)
            String line = value.toString();
            String[] words = line.split(" ");
            context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));
        }
    }
    /*
     * Reducer: the merge/aggregation step
     * */
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // sum must be a local variable: as an instance field it would keep
            // accumulating across keys and produce wrong counts
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }
    public static class MyPartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            // Route each known animal to its own reducer; everything else goes to partition 3
            if (key.toString().equals("dog")) {
                return 0;
            }
            if (key.toString().equals("cat")) {
                return 1;
            }
            if (key.toString().equals("duck")) {
                return 2;
            }
            return 3;
        }
    }
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Delete the output directory if it already exists
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true); // true = recursive; the one-argument delete() is deprecated
            System.out.println("output file deleted!");
        }
        // Create the job
        Job job = Job.getInstance(conf, "WC");
        // Set the class containing the job's code
        job.setJarByClass(PartitionerApp.class);
        // Input path for the job
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Map-side parameters
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Reduce-side parameters
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Set the job's partitioner
        job.setPartitionerClass(MyPartitioner.class);
        // Four reducers, one per partition
        job.setNumReduceTasks(4);
        // Output path for the job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
6.3 Package with Maven and Run on the Cluster
[hadoop@node1 ~]$ hadoop jar part.jar com.bzt.cn.mapreduce.PartitionerApp hdfs://node1:8020/animal.txt hdfs://node1:8020/partionerout
[hadoop@node1 ~]$ hadoop fs -ls /partionerout
18/10/30 10:39:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 5 items
-rw-r--r-- 1 hadoop supergroup 0 2018-10-30 10:38 /partionerout/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 6 2018-10-30 10:38 /partionerout/part-r-00000
-rw-r--r-- 1 hadoop supergroup 6 2018-10-30 10:38 /partionerout/part-r-00001
-rw-r--r-- 1 hadoop supergroup 7 2018-10-30 10:38 /partionerout/part-r-00002
-rw-r--r-- 1 hadoop supergroup 8 2018-10-30 10:38 /partionerout/part-r-00003
[hadoop@node1 ~]$ hadoop fs -cat /partionerout/part-r-00000
18/10/30 10:40:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
dog 7
[hadoop@node1 ~]$ hadoop fs -cat /partionerout/part-r-00001
18/10/30 10:40:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
cat 6
[hadoop@node1 ~]$ hadoop fs -cat /partionerout/part-r-00002
18/10/30 10:40:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
duck 7
[hadoop@node1 ~]$ hadoop fs -cat /partionerout/part-r-00003
18/10/30 10:40:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
lion 13
[hadoop@node1 ~]$
7 JobHistory
- Records information about completed MapReduce jobs in a specified HDFS directory
- Disabled by default
7.1 Configure JobHistory
Edit mapred-site.xml in /home/hadoop/apps/hadoop-2.6.0-cdh5.7.0/etc/hadoop:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node1:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node1:19888</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>/history/done</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>/history/done_intermediate</value>
    </property>
</configuration>
7.2 Start the History Server
Restart YARN first (stop-yarn.sh, then start-yarn.sh), then start the history server:
[hadoop@node1 ~]$ mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/hadoop/apps/hadoop-2.6.0-cdh5.7.0/logs/mapred-hadoop-historyserver-node1.out
[hadoop@node1 ~]$ jps
6704 JobHistoryServer
6738 Jps
1395 DataNode
6245 ResourceManager
1271 NameNode
1559 SecondaryNameNode
6346 NodeManager
[hadoop@node1 ~]$
7.3 Test
[hadoop@node1 ~]$ cd /home/hadoop/apps/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce2
[hadoop@node1 mapreduce2]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar pi 2 3
Visit http://node1:19888/jobhistory and click into a job to view its logs. The logs page reports that log aggregation is not enabled, so enable it in yarn-site.xml:
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
</configuration>
Restart YARN (stop-yarn.sh, then start-yarn.sh) and run the pi example again; the job's logs should now be viewable through the JobHistory web UI.