MapReduce案例3——求簡單資料去重
阿新 • 發佈:2019-01-01
資料去重源資料:
2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-7 c
2012-3-3 c
2012-3-1 b
2012-3-2 a
2012-3-3 b
2012-3-4 d
2012-3-5 a
2012-3-6 c
2012-3-7 d
2012-3-3 c
最終結果:
2012-3-1 a
2012-3-1 b
2012-3-2 a
2012-3-2 b
2012-3-3 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-6 c
2012-3-7 c
2012-3-7 d
思路:將兩列資料作為key值,在reduce階段,每次輸出一個,就可以達到去重的目的
/**
 * @author: lpj
 * @date: 2018年3月16日 下午7:16:47
 * @Description: MapReduce job that de-duplicates "date value" records. Each
 *               whole record is emitted as the map-output key, so the shuffle
 *               groups identical records and the reducer writes each one once.
 */
package lpj.reduceWork;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistinctNumMR {

    /**
     * Configures and submits the deduplication job against hard-coded local
     * paths, then exits with 0 on success and 1 on failure.
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // conf.addResource("hdfs-site.xml"); // use a cluster configuration file
        // System.setProperty("HADOOP_USER_NAME", "hadoop"); // run against the cluster
        FileSystem fs = FileSystem.get(conf); // defaults to the local file system

        Job job = Job.getInstance(conf);
        job.setJarByClass(DistinctNumMR.class);
        job.setMapperClass(DistinctNumMR_Mapper.class);
        job.setReducerClass(DistinctNumMR_Reducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // String inputpath = args[0];
        // String outpath = args[1];
        Path inputPath = new Path("d:/a/homework3.txt");
        Path outputPath = new Path("d:/a/homework3");
        // BUG FIX: the original tested fs.exists(inputPath) before deleting the
        // OUTPUT path, so a stale output directory was not removed when the
        // input file was absent and the job then failed with
        // "output directory already exists". Check the path actually deleted.
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        boolean isdone = job.waitForCompletion(true);
        System.exit(isdone ? 0 : 1);
    }

    /**
     * Emits every input record, re-joined as "date\tvalue", as the map-output
     * key with a NullWritable value so identical records collapse in the
     * shuffle/sort phase.
     */
    public static class DistinctNumMR_Mapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private final Text kout = new Text(); // reused to avoid per-record allocation

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input line, e.g. "2012-3-3 c". Split on any whitespace run so
            // tab- or multi-space-delimited input also works.
            String[] reads = value.toString().trim().split("\\s+");
            if (reads.length < 2) {
                return; // skip blank or malformed lines instead of throwing AIOOBE
            }
            kout.set(reads[0] + "\t" + reads[1]);
            context.write(kout, NullWritable.get());
        }
    }

    /**
     * Writes each distinct key exactly once; the grouped NullWritable values
     * are ignored, which is what performs the deduplication.
     */
    public static class DistinctNumMR_Reducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}
執行結果:
2012-3-1 a
2012-3-1 b
2012-3-2 a
2012-3-2 b
2012-3-3 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-6 c
2012-3-7 c
2012-3-7 d