Implementing Secondary Sort with Hadoop and with Spark
0. Preface (note: this part has little to do with the title; feel free to skip straight to Part I)
If it is useless, why have a preface at all? Because I don't want to open a separate post just to jot down a few thoughts and plans, so I'll put them here:
A while ago I bought a few books, intending to study big data, but internships, exams, and my graduation project left me no time. Now that winter break has started, I can finally settle down and learn something useful.
The algorithm in this post comes from the book Data Algorithms: Recipes for Scaling Up with Hadoop and Spark. The book's code has the right idea but is incomplete, and it only provides a Java version, so I added a Scala version on top of it, written, naturally, for the Spark part.
Enough chatter; on to the main topic.
I. Input, Expected Output, and Approach
The input is SecondarySort.txt, with the following contents:
2000,12,04,10
2000,11,01,20
2000,12,02,-20
2000,11,07,30
2000,11,24,-40
2012,12,21,30
2012,12,22,-20
2012,12,23,60
2012,12,24,70
2012,12,25,10
2013,01,23,90
2013,01,24,70
2013,01,20,-10
The fields mean:
year,month,day,temperature
Expected output:
2013-01 90,70,-10
2012-12 70,60,30,10,-20
2000-12 10,-20
2000-11 30,20,-40
The fields mean:
year-month temperature1,temperature2,temperature3,...
with the year-month keys sorted in descending order from top to bottom,
and the temperatures sorted in descending order from left to right.
Approach:
Discard the day field, which is not needed in the output.
Use the year and month as a composite key, compare the keys, and sort them in descending order.
For each year-month key, sort the corresponding temperature values in descending order and concatenate them.
The tricky part in MapReduce is the last step: the framework never sorts the values of a key, so the temperature is folded into the composite key as well (the DateTemperaturePair below), and a custom partitioner plus a grouping comparator make sure that all records of the same year-month still arrive at a single reduce() call, with their temperatures already in descending order. For example, the record 2000,11,07,30 becomes the intermediate key (2000-11, 30) with value 30.
II. Implementing Secondary Sort in Java with MapReduce
The classes to implement are:
besides the usual SecondarySortingMapper, SecondarySortingReducer, and SecondarySortDriver,
two additional plug-in classes (DateTemperatureGroupingComparator and DateTemperaturePartitioner) and one custom key type (DateTemperaturePair).
The code follows (note that I removed the package declaration from every file, so add your own if you want to use it):
SecondarySortDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SecondarySortDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration configuration = getConf();
        Job job = Job.getInstance(configuration, "SecondarySort");
        job.setJarByClass(SecondarySortDriver.class);
        job.setJobName("SecondarySort");

        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);
        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        // Map output key/value types
        job.setMapOutputKeyClass(DateTemperaturePair.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Reduce output key/value types (the reducer emits a Text value, not an IntWritable)
        job.setOutputKeyClass(DateTemperaturePair.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(SecondarySortingMapper.class);
        job.setReducerClass(SecondarySortingReducer.class);
        job.setPartitionerClass(DateTemperaturePartitioner.class);
        job.setGroupingComparatorClass(DateTemperatureGroupingComparator.class);

        boolean status = job.waitForCompletion(true);
        return status ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                    "Usage: SecondarySortDriver <input-path> <output-path>");
        }
        int returnStatus = ToolRunner.run(new SecondarySortDriver(), args);
        System.exit(returnStatus);
    }
}
DateTemperaturePair.java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class DateTemperaturePair implements WritableComparable<DateTemperaturePair> {

    private String yearMonth;
    private String day;
    protected Integer temperature;

    // Natural ordering of the composite key: year-month first, then temperature,
    // negated so that the shuffle sorts in descending order.
    @Override
    public int compareTo(DateTemperaturePair o) {
        int compareValue = this.yearMonth.compareTo(o.getYearMonth());
        if (compareValue == 0) {
            compareValue = temperature.compareTo(o.getTemperature());
        }
        return -1 * compareValue;
    }

    // Only year-month and temperature are serialized; the day is not needed
    // after the map phase, so it is not written across the shuffle.
    public void write(DataOutput dataOutput) throws IOException {
        Text.writeString(dataOutput, yearMonth);
        dataOutput.writeInt(temperature);
    }

    public void readFields(DataInput dataInput) throws IOException {
        this.yearMonth = Text.readString(dataInput);
        this.temperature = dataInput.readInt();
    }

    @Override
    public String toString() {
        return yearMonth;
    }

    public String getYearMonth() {
        return yearMonth;
    }

    public void setYearMonth(String text) {
        this.yearMonth = text;
    }

    public String getDay() {
        return day;
    }

    public void setDay(String day) {
        this.day = day;
    }

    public Integer getTemperature() {
        return temperature;
    }

    public void setTemperature(Integer temperature) {
        this.temperature = temperature;
    }
}
SecondarySortingMapper.java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class SecondarySortingMapper extends
        Mapper<LongWritable, Text, DateTemperaturePair, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line: YYYY,MM,DD,temperature
        String[] tokens = value.toString().split(",");
        String yearMonth = tokens[0] + "-" + tokens[1];
        String day = tokens[2];
        int temperature = Integer.parseInt(tokens[3]);

        // Composite key: (year-month, temperature); the temperature is also emitted as the value.
        DateTemperaturePair reduceKey = new DateTemperaturePair();
        reduceKey.setYearMonth(yearMonth);
        reduceKey.setDay(day);
        reduceKey.setTemperature(temperature);
        context.write(reduceKey, new IntWritable(temperature));
    }
}
DateTemperaturePartitioner.java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class DateTemperaturePartitioner extends
        Partitioner<DateTemperaturePair, IntWritable> {

    @Override
    public int getPartition(DateTemperaturePair dateTemperaturePair, IntWritable temperature,
                            int numberOfPartitions) {
        // Partition by year-month only, so that all records of the same month
        // go to the same reducer regardless of their temperature.
        return Math.abs(dateTemperaturePair.getYearMonth().hashCode() % numberOfPartitions);
    }
}
DateTemperatureGroupingComparator.java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DateTemperatureGroupingComparator extends WritableComparator {

    public DateTemperatureGroupingComparator() {
        super(DateTemperaturePair.class, true);
    }

    // Group keys by year-month only, ignoring the temperature, so that one
    // reduce() call receives all temperatures of a month.
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        DateTemperaturePair pair1 = (DateTemperaturePair) a;
        DateTemperaturePair pair2 = (DateTemperaturePair) b;
        return pair1.getYearMonth().compareTo(pair2.getYearMonth());
    }
}
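How the plug-in classes work together: the composite keys are sorted by DateTemperaturePair.compareTo (year-month descending, then temperature descending), the partitioner sends every record of a given month to the same reducer, and the grouping comparator makes the framework treat all keys sharing a year-month as a single group. The net effect is that each reduce() call below receives the temperatures of one month already in descending order and only needs to concatenate them.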
SecondarySortingReducer.java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class SecondarySortingReducer extends
        Reducer<DateTemperaturePair, IntWritable, DateTemperaturePair, Text> {

    @Override
    protected void reduce(DateTemperaturePair key, Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
        // The values arrive already sorted in descending order; just join them with commas.
        StringBuilder sortedTemperatureList = new StringBuilder();
        for (IntWritable temperature : values) {
            sortedTemperatureList.append(temperature);
            sortedTemperatureList.append(",");
        }
        sortedTemperatureList.deleteCharAt(sortedTemperatureList.length() - 1);
        context.write(key, new Text(sortedTemperatureList.toString()));
    }
}
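To run the Java version, package the classes above into a jar (with your own package declaration added) and submit it with the standard hadoop jar command, passing the input and output paths as the two arguments SecondarySortDriver expects, e.g. hadoop jar secondary-sort.jar SecondarySortDriver /input/SecondarySort.txt /output (the jar name and paths here are only placeholders).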
III. Implementing Secondary Sort with Spark in Scala
This version is, as you would expect, much more concise. Here it is:
SecondarySort.scala
package spark

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD.rddToOrderedRDDFunctions
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions

object SecondarySort {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SecondarySort").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    //val file = sc.textFile("hdfs://localhost:9000/Spark/SecondarySort/Input/SecondarySort2.txt")
    val file = sc.textFile("e:\\SecondarySort.txt")

    // Key: (year, month); value: temperature parsed as Int so it sorts numerically.
    val rdd = file.map(line => line.split(","))
      .map(x => ((x(0), x(1)), x(3).toInt))
      .groupByKey()
      .sortByKey(false)                                                  // year-month descending
      .map(x => (x._1._1 + "-" + x._1._2, x._2.toList.sortWith(_ > _))) // temperatures descending

    rdd.foreach { x =>
      val buf = new StringBuilder()
      for (a <- x._2) {
        buf.append(a)
        buf.append(",")
      }
      buf.deleteCharAt(buf.length() - 1)
      println(x._1 + " " + buf.toString())
    }
    sc.stop()
  }
}
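A final note, plus a rough alternative sketch that is my own addition rather than code from the listing above: groupByKey has to hold all temperatures of a month in memory at once, which is fine for this toy input but does not scale to large groups. Spark can instead sort a composite (yearMonth, temperature) key during the shuffle with repartitionAndSortWithinPartitions, much like the MapReduce version does. The file name SecondarySortShuffle.scala, the class YearMonthPartitioner, and the reuse of the local input path are assumptions of this sketch, and the result is printed one record per line rather than one comma-joined line per month.
SecondarySortShuffle.scala
package spark

import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD.rddToOrderedRDDFunctions

// Routes a composite key (yearMonth, temperature) by yearMonth only,
// mirroring DateTemperaturePartitioner in the MapReduce version.
class YearMonthPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (yearMonth: String, _) => (yearMonth.hashCode & Int.MaxValue) % partitions
    case _ => 0
  }
}

object SecondarySortShuffle {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SecondarySortShuffle").setMaster("local"))
    val file = sc.textFile("e:\\SecondarySort.txt") // same sample input as above

    // Composite key (yearMonth, temperature) with a dummy value; the shuffle
    // itself sorts the keys, instead of sorting lists in memory after groupByKey.
    val pairs = file.map(_.split(","))
      .map(x => ((x(0) + "-" + x(1), x(3).toInt), 1))

    // Descending order on both parts of the composite key.
    implicit val ord: Ordering[(String, Int)] =
      Ordering.Tuple2(Ordering.String.reverse, Ordering.Int.reverse)

    val sorted = pairs.repartitionAndSortWithinPartitions(new YearMonthPartitioner(1))

    // Records are now in their final order; printed one per line.
    sorted.map { case ((yearMonth, temperature), _) => yearMonth + " " + temperature }
      .foreach(println)

    sc.stop()
  }
}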