MapReduce Case 9: Sorting the Data of Multiple Number Files and Adding a Rank Index (with a Parallelizable Method)
Problem:
Sort the numbers and prepend a rank index to each. Source data:
2
32
654
32
15
756
65223
5956
22
650
92
26
54
6
Expected final result:
1 2
2 6
3 15
4 22
5 26
6 32
7 32
8 54
9 92
10 650
11 654
12 756
13 5956
14 65223
Be sure to consider whether your approach still works once the data volume grows large.
Solution idea: several unordered files must be sorted, with an index prepended to each number, so the first version leans on MapReduce's default sorting. The mapper emits each number as an IntWritable key and the framework's shuffle sorts the keys; the reducer keeps a global counter for the index plus a temporary variable holding the previous value. If the current value is greater than the previous one the counter is incremented, otherwise it stays unchanged, and each (rank, value) pair is written out. Note this produces a dense ranking in which duplicates share an index, whereas the sample output above numbers the two 32s separately.
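Stripped of the Hadoop boilerplate, the ranking step is just a dense-rank counter over already-sorted values. A minimal standalone sketch of that logic (an illustration added here, not code from the original solution; the class name is arbitrary):

public class DenseRankSketch {
    public static void main(String[] args) {
        // Already-sorted input, mirroring what the shuffle guarantees the reducer.
        int[] sorted = {2, 6, 15, 22, 26, 32, 32, 54};
        int rank = 0;     // the global counter ("countnum" in the job code)
        int previous = 0; // the last value seen ("temNum" in the job code); assumes positive inputs
        for (int n : sorted) {
            if (n > previous) { // a new, larger value opens a new rank
                rank++;
                previous = n;
            }
            System.out.println(rank + "\t" + n); // duplicates (the two 32s) share a rank
        }
    }
}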
The code is as follows:
/**
 * @author: lpj
 * @date: March 16, 2018, 7:16:47 PM
 * @Description:
 */
package lpj.reduceWork;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class BigNumFileSortMR {
private static int countnum = 0; // global rank counter, shared across reduce() calls
private static int temNum = 0;   // last value seen
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// conf.addResource("hdfs-site.xml"); // use config files
// System.setProperty("HADOOP_USER_NAME", "hadoop"); // use the cluster
FileSystem fs = FileSystem.get(conf); // local filesystem by default
Job job = Job.getInstance(conf);
job.setJarByClass(BigNumFileSortMR.class);
job.setMapperClass(BigNumFileSortMR_Mapper.class);
job.setReducerClass(BigNumFileSortMR_Reducer.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
Path inputPath = new Path("d:/a/homework9/input/");   // read multiple files
Path outputPath = new Path("d:/a/homework9/output/"); // write one result file
if (fs.exists(outputPath)) { // delete the output directory if it already exists
fs.delete(outputPath, true);
}
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);
boolean isdone = job.waitForCompletion(true);
System.exit(isdone ? 0 : 1);
}
public static class BigNumFileSortMR_Mapper extends Mapper<LongWritable, Text, IntWritable, NullWritable>{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Emit each number as the key; the framework sorts IntWritable keys during the shuffle.
int num = Integer.parseInt(value.toString());
context.write(new IntWritable(num), NullWritable.get());
}
}
public static class BigNumFileSortMR_Reducer extends Reducer<IntWritable, NullWritable, Text, NullWritable>{
Text kout = new Text();
@Override
protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
for (NullWritable niv : values) {
if (key.get() > temNum) {
countnum++;         // new, larger value: open a new rank
temNum = key.get(); // remember the value just ranked
}
String kk = countnum + "\t" + key.toString();
kout.set(kk);
context.write(kout, NullWritable.get());
}
}
}
}
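Note that countnum and temNum are static fields shared across reduce() calls. The ranking is only globally consistent because the job keeps the default single reduce task, which sees every key in sorted order inside one JVM; this is exactly the bottleneck the multi-reducer version further down removes.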
Test: the numbers given above were copied three times into the input directory as three input files. The result of the run (observe that duplicates share a rank, unlike the problem statement's sample output, which numbers every line):
1 2
1 2
1 2
2 6
2 6
2 6
3 15
3 15
3 15
4 22
4 22
4 22
5 26
5 26
5 26
6 32
6 32
6 32
6 32
6 32
6 32
7 54
7 54
7 54
8 92
8 92
8 92
9 650
9 650
9 650
10 654
10 654
10 654
11 756
11 756
11 756
12 5956
12 5956
12 5956
13 65223
13 65223
13 65223
When the input consists of large files, or of many files, a single reduce task is under too much output pressure, so a multi-reducer approach is used instead. The work is split into three chained jobs, outlined below.
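The data flow, reconstructed from the code that follows:

input files -> job 1: range-partition across 4 reducers, sort within each -> output1 (4 globally ordered files)
output1 -> job 2: count distinct values per partition file, derive each file's start index -> output2 (filename, count, start index)
output1 -> job 3: map-only; loads output2 from the distributed cache and numbers each file independently -> output3 (final indexed result)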
The code is as follows:
/**
* @author: lpj
* @date: March 16, 2018, 7:16:47 PM
* @Description:
*/
package lpj.reduceWork;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import lpj.reduceWorkbean.MyPatitionerBigFileSum;
public class BigNumFileSortMR2 extends Configured implements Tool{
public static void main(String[] args) throws Exception {
int run = ToolRunner.run(new BigNumFileSortMR2(), args);
System.exit(run);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://hadoop02:9000");
System.setProperty("HADOOP_USER_NAME", "hadoop");
//---------------- Job 1: range-partition the numbers and sort within each partition ----------------
FileSystem fs = FileSystem.get(conf); // HDFS, per fs.defaultFS above
Job job = Job.getInstance(conf);
job.setJarByClass(BigNumFileSortMR2.class);
job.setMapperClass(BigNumFileSortMR_Mapper.class);
job.setReducerClass(BigNumFileSortMR_Reducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(NullWritable.class);
job.setPartitionerClass(MyPatitionerBigFileSum.class);
job.setNumReduceTasks(4);
Path inputPath = new Path("/a/homework9/input");    // multiple input files
Path outputPath = new Path("/a/homework9/output1"); // one sorted file per reducer (4 files)
if (fs.exists(outputPath)) {
fs.delete(outputPath, true);
}
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);
//---------------- Job 2: count each partition file's distinct elements, keyed by file name ----------------
FileSystem fs2 = FileSystem.get(conf);
Job job2 = Job.getInstance(conf);
job2.setJarByClass(BigNumFileSortMR2.class);
job2.setMapperClass(BigNumFileSortMR2_Mapper.class);
job2.setReducerClass(BigNumFileSortMR2_Reducer.class);
job2.setMapOutputKeyClass(Text.class);
job2.setMapOutputValueClass(IntWritable.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
Path inputPath2 = new Path("/a/homework9/output1"); // job 1's sorted partition files
Path outputPath2 = new Path("/a/homework9/output2"); // per-file counts and start indices
if (fs2.exists(outputPath2)) {
fs2.delete(outputPath2, true);
}
FileInputFormat.setInputPaths(job2, inputPath2);
FileOutputFormat.setOutputPath(job2, outputPath2);
//---------------- Job 3: add the rank index (map-only) ----------------
FileSystem fs3 = FileSystem.get(conf);
Job job3 = Job.getInstance(conf);
job3.setJarByClass(BigNumFileSortMR2.class);
job3.setMapperClass(BigNumFileSortMR3_Mapper.class);
job3.setNumReduceTasks(0);
job3.setMapOutputKeyClass(Text.class);
job3.setMapOutputValueClass(NullWritable.class);
URI uri = new URI("/a/homework9/output2/part-r-00000"); // job 2 uses the default single reducer, so all counts land in this one file
job3.addCacheFile(uri);
Path inputPath3 = new Path("/a/homework9/output1"); // number job 1's sorted files
Path outputPath3 = new Path("/a/homework9/output3"); // final indexed output
if (fs3.exists(outputPath3)) {
fs3.delete(outputPath3, true);
}
FileInputFormat.setInputPaths(job3, inputPath3);
FileOutputFormat.setOutputPath(job3, outputPath3);
//--------------------------------------
ControlledJob aJob = new ControlledJob(job.getConfiguration());
ControlledJob bJob = new ControlledJob(job2.getConfiguration());
ControlledJob cJob = new ControlledJob(job3.getConfiguration());
aJob.setJob(job);
bJob.setJob(job2);
cJob.setJob(job3);
JobControl jc = new JobControl("jc");
jc.addJob(aJob);
jc.addJob(bJob);
jc.addJob(cJob);
bJob.addDependingJob(aJob); // job 2 runs after job 1
cJob.addDependingJob(bJob); // job 3 runs after job 2
Thread thread = new Thread(jc);
thread.start();
while (!jc.allFinished()) {
Thread.sleep(1000); // poll until all three jobs have finished
}
jc.stop();
return 0;
}
//---------------- Job 1's mapper and reducer: partition and sort ----------------
public static class BigNumFileSortMR_Mapper extends Mapper<LongWritable, Text, IntWritable, NullWritable>{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Emit each number as the key; the shuffle sorts keys within each partition.
int num = Integer.parseInt(value.toString());
context.write(new IntWritable(num), NullWritable.get());
}
}
public static class BigNumFileSortMR_Reducer extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable>{
@Override
protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
// Write each occurrence back out; within a partition the keys arrive sorted.
for (NullWritable niv : values) {
context.write(key, NullWritable.get());
}
}
}
//---------------- Job 2's mapper and reducer: count distinct elements per partition file ----------------
public static class BigNumFileSortMR2_Mapper extends Mapper<LongWritable, Text, Text, IntWritable>{
Set<Integer> numset = new HashSet<>(); // distinct values seen so far in this file
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
int num = Integer.parseInt(value.toString());
FileSplit fSplit = (FileSplit) context.getInputSplit();
String filename = fSplit.getPath().getName(); // name of the partition file being read
// The partitioner has already routed the ranges [0,100), [100,500), [500,1000)
// and [1000,...) to separate reducers, so every value is treated the same here.
numset.add(num);
// Emit the cumulative distinct count; the reducer keeps the largest value per file.
context.write(new Text(filename), new IntWritable(numset.size()));
}
}
public static class BigNumFileSortMR2_Reducer extends Reducer<Text, IntWritable, Text, Text>{
int index = 1; // running start index; valid because the single reducer sees file names in sorted order
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
// The mapper emitted a growing cumulative count per record, and the shuffle does not
// guarantee value order, so the file's distinct count is the maximum of the group.
int count = 0;
for (IntWritable niv : values) {
count = Math.max(count, niv.get());
}
// Output line: filename \t distinct count \t start index
context.write(key, new Text(count + "\t" + index));
index += count;
}
}
//---------------- Job 3's mapper: load the per-file start indices into memory, then assign ranks ----------------
public static class BigNumFileSortMR3_Mapper extends Mapper<LongWritable, Text, Text, NullWritable>{
Text kout = new Text();
Map<String, Integer> filecount = new HashMap<>(); // file name -> start index from job 2
int prevnum = 0; // previous value read from this file
int index = 0;   // current rank
int count = 0;   // records seen so far in this file
@SuppressWarnings("deprecation")
@Override
protected void setup(Context context) throws IOException, InterruptedException {
// Read job 2's output (filename \t count \t start index) from the distributed cache.
Path[] paths = context.getLocalCacheFiles();
String str = paths[0].toUri().toString();
BufferedReader bf = new BufferedReader(new FileReader(new File(str)));
String readline = null;
while ((readline = bf.readLine()) != null) {
String[] split = readline.split("\t");
filecount.put(split[0], Integer.parseInt(split[2]));
}
IOUtils.closeStream(bf);
// Look up this file's start index. Note the assumption that each partition
// file is handled by a single map task; otherwise the numbering would break.
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String name = fileSplit.getPath().getName();
if (fileSplit.getLength() > 0) {
index = filecount.get(name);
}
}
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
int num = Integer.parseInt(value.toString());
count++;
// Advance the rank only when a new value appears; duplicates keep the same rank.
if (count > 1 && num != prevnum) {
index++;
}
prevnum = num;
kout.set(index + "\t" + num);
context.write(kout, NullWritable.get());
}
}
// Unused: job 3 runs with zero reduce tasks (map-only), so this reducer is never invoked.
public static class BigNumFileSortMR3_Reducer extends Reducer<Text, IntWritable, Text, Text>{
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
}
}
}
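One piece is missing from the listing above: the custom partitioner lpj.reduceWorkbean.MyPatitionerBigFileSum. Its source is not included in the post, but the value ranges noted in BigNumFileSortMR2_Mapper (0-100, 100-500, 500-1000, and 1000 upward) suggest it routes keys to the four reducers by range. A minimal sketch under that assumption; the cut points are inferred, not the author's actual code:

package lpj.reduceWorkbean;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;
// Hypothetical range partitioner: keeps the partition files globally ordered,
// i.e. every number in part-r-00000 is smaller than every number in part-r-00001, etc.
public class MyPatitionerBigFileSum extends Partitioner<IntWritable, NullWritable> {
@Override
public int getPartition(IntWritable key, NullWritable value, int numPartitions) {
int num = key.get();
if (num < 100) return 0;  // [0, 100)
if (num < 500) return 1;  // [100, 500)
if (num < 1000) return 2; // [500, 1000)
return 3;                 // [1000, ...)
}
}

Range partitioning is what makes the start-index bookkeeping possible: since the partition files are globally ordered, each file's start index is one plus the sum of the distinct counts of all files before it.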
Summary: problems of this kind need not be solved in a single MapReduce job; once decomposed, they become simple and clear. For a global sort over small files with little data, the first method is the right choice; for large-scale data, the second method is the more general one.
The difficulty in parallelizing lies in recording, for each sorted partition file, its number of distinct elements and its starting index, after which every file can be numbered independently.
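A hypothetical example of that bookkeeping: if the four sorted partition files contain 8, 3, 2 and 1 distinct values respectively, job 2 assigns them start indices 1, 1+8=9, 9+3=12 and 12+2=14, and job 3's map tasks can then number all four files independently, in parallel, from those offsets.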