【MapReduce Example】Data Deduplication
一、Example Description
Data deduplication applies parallel processing to filter a dataset down to its meaningful, distinct records. Seemingly massive tasks, such as counting the number of distinct record types in a large dataset or computing distinct visits from website logs, all come down to deduplication.
For example, the input file file1.txt has the following contents:
2017-12-9 a
2017-12-10 b
2017-12-11 c
2017-12-12 d
2017-12-13 a
2017-12-14 b
2017-12-15 c
2017-12-11 c
file2.txt has the following contents:
2017-12-9 b
2017-12-10 a
2017-12-11 b
2017-12-12 d
2017-12-13 a
2017-12-14 c
2017-12-15 d
2017-12-11 c
For the sample input above, the deduplicated output contains the following records (note that an actual run orders them by the lexicographic byte order of the Text keys, which places the 2017-12-9 lines after 2017-12-15):
2017-12-9 a
2017-12-9 b
2017-12-10 a
2017-12-10 b
2017-12-11 b
2017-12-11 c
2017-12-12 d
2017-12-13 a
2017-12-14 b
2017-12-14 c
2017-12-15 c
2017-12-15 d
二、Design Approach
Since duplicate records must be removed, we can use an entire input line as the key that both the Map and Reduce functions operate on: the shuffle stage groups identical keys together, so each distinct line reaches Reduce exactly once.
1. Job processing flow: read a line → Map emits <line, ""> → shuffle groups identical lines under one key → Reduce writes each key once.
(1) Map function design
The Map function implements the mapping:
<1, "2017-12-9 a"> → <"2017-12-9 a", "">
Each input line as a whole becomes the output key (the input key is just the line's byte offset and is ignored), and the value is an empty Text. The Map function is therefore designed as follows:
public static class DedupCleanMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text EMPTY = new Text("");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The whole line becomes the output key; the value carries no information.
        context.write(value, EMPTY);
    }
}
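Since the value carries no information at all, a common variant (not used in this article's code) replaces the empty Text with NullWritable, which serializes no bytes for the value. A minimal sketch; DedupNullMapper is a hypothetical name, and the driver would also need job.setOutputValueClass(NullWritable.class) to match:

import org.apache.hadoop.io.NullWritable;

public static class DedupNullMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // NullWritable is a singleton and writes nothing for the value
        context.write(value, NullWritable.get());
    }
}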
(2) Reduce function design
Since duplicates must be removed, identical keys need no aggregation: Reduce simply writes each key once and discards the values. The Reduce function is therefore designed as follows:
public static class DedupCleanReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // All duplicates of a line arrive grouped under one key; emit the key once.
        context.write(key, new Text(""));
    }
}
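Because this reducer's input and output key/value types match (Text, Text), it can also double as a combiner, deduplicating each mapper's local output before the shuffle and reducing network traffic. This is one extra line in the driver, not part of the original code:

// Optional: run the dedup logic map-side before the shuffle
job.setCombinerClass(DedupCleanReducer.class);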
三、Complete Code
package com.walker.mrdemo;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupClean {

    /*
     * Map function: emit each input line as the key, with an empty value.
     */
    public static class DedupCleanMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text EMPTY = new Text("");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, EMPTY);
        }
    }

    /*
     * Reduce function: identical lines arrive grouped under one key; write the key once.
     */
    public static class DedupCleanReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    // Input and output path settings
    private static final String FILE_IN_PATH = "hdfs://192.168.50.130:9000/mrdemo/DedupClean/input";
    private static final String FILE_OUT_PATH = "hdfs://192.168.50.130:9000/mrdemo/DedupClean/output";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Delete the output directory if it already exists, so the job can be rerun
        FileSystem fileSystem = FileSystem.get(new URI(FILE_OUT_PATH), conf);
        if (fileSystem.exists(new Path(FILE_OUT_PATH))) {
            fileSystem.delete(new Path(FILE_OUT_PATH), true);
        }

        Job job = Job.getInstance(conf, "DedupClean");
        job.setJarByClass(DedupClean.class);
        job.setMapperClass(DedupCleanMapper.class);
        job.setReducerClass(DedupCleanReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(FILE_IN_PATH));
        FileOutputFormat.setOutputPath(job, new Path(FILE_OUT_PATH));

        // Exit with a nonzero status if the job fails
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
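To try the job, a typical workflow is to package the class into a jar, upload the input files to HDFS, and submit it with the hadoop CLI. The jar name mrdemo.jar below is an assumption for illustration; the paths match FILE_IN_PATH and FILE_OUT_PATH above:

# mrdemo.jar is an assumed jar name
hdfs dfs -mkdir -p /mrdemo/DedupClean/input
hdfs dfs -put file1.txt file2.txt /mrdemo/DedupClean/input
hadoop jar mrdemo.jar com.walker.mrdemo.DedupClean
hdfs dfs -cat /mrdemo/DedupClean/output/part-r-00000

The output file should contain the twelve distinct date-letter pairs listed in the sample output above.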