Implementing Secondary Sort in Hadoop

阿新 • Published: 2019-01-18

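Business scenario: in an MR job, the keys that reach the Reduce phase are normally sorted according to a specified rule. With a single key this all works naturally, until the data has to be ordered by more than the key alone. Take the following data as an example: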

Key -> value:

100 -> 2017-02-27 19:21:31,45,67,68

50 -> 2017-02-27 19:22:04,89,90,56

90 -> 2017-02-27 19:22:27,90,89,99

50 -> 2017-02-27 19:20:42,88,45,89

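The requirement is to group the result data: records with the same key form one group, and the records inside each group must be ordered.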

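Approach:

Principle: build on the sorting that MR already performs to provide an extensible secondary sort.

Flow:

Map -> (composite key, value) -> custom partitioner -> Reduce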
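
The idea behind this flow can be tried without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: it sorts the four sample records by a composite of key and value (the value starts with the timestamp, so string order is time order) and then collects each key's values in that order. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SecondarySortSketch {

    /** One input record: the natural key plus its value line. */
    static final class Rec {
        final int key;
        final String value;

        Rec(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** The four sample records from the article. */
    static List<Rec> sample() {
        return Arrays.asList(
            new Rec(100, "2017-02-27 19:21:31,45,67,68"),
            new Rec(50, "2017-02-27 19:22:04,89,90,56"),
            new Rec(90, "2017-02-27 19:22:27,90,89,99"),
            new Rec(50, "2017-02-27 19:20:42,88,45,89"));
    }

    /** Sorts by (key, value), then collects the values of each key in that order. */
    static Map<Integer, List<String>> sortAndGroup(List<Rec> input) {
        List<Rec> sorted = new ArrayList<>(input);
        // Composite ordering: primary on the key, secondary on the value.
        sorted.sort((a, b) -> {
            int c = Integer.compare(a.key, b.key);
            return c != 0 ? c : a.value.compareTo(b.value);
        });
        // Grouping looks at the key only, so the secondary order survives inside each group.
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (Rec r : sorted) {
            groups.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r.value);
        }
        return groups;
    }

    public static void main(String[] args) {
        sortAndGroup(sample()).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

In the MR job, the sort is done by the framework on the composite key and the grouping by a grouping comparator; the sketch only mirrors the resulting order.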

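Example:

In the data directory, files ending in .loz are the actual data files. A marker file named after a data file, with the suffix .writed, records the write state of that .loz file; only .loz files that have a .writed marker are valid and readable.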

Data format: VERSION=1.0,PASSTIME=2016-11-30 00:00:39 000,CARSTATE=1,CARPLATE=無,PLATETYPE=99,SPEED=0,PLATECOLOR=4,LOCATIONID=-1,DEVICEID=-1,DRIVEWAY=2,DRIVEDIR=4,CAPTUREDIR=1,CARCOLOR=10,CARBRAND=99,CARBRANDZW=其它,TGSID=1125,PLATECOORD=0,0,0,0,CABCOORD=0,0,0,0,IMGID1=http://11.110.248.59:9099/image/dhdfs/2016-11-30/archivefile-2016-11-30-000040-00677B0200000001:5750848/308059.jpg,IMGID2=,IMGID3=,

Data volume: >= 2 GB

Requirement: across all the data, list the checkpoints each vehicle passed in chronological order (the checkpoint is the TGSID field of the source record). Output format: file name = license plate + auxiliary field; content = pass time (PASSTIME)-checkpoint ID (TGSID), ...

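Implementation:

1. CarOrder.class: the custom composite key. It relies on MR's key sorting to order the map output by plate and then by time. It implements the two serialization interfaces, Writable and WritableComparable (the latter already extends the former).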

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

// Composite key: license plate plus pass time.
class CarOrder implements Writable, WritableComparable<CarOrder> {
    // License plate
    private Text carPlate;
    // Pass time
    private Text day;

    // Hadoop needs the no-arg constructor to create the key before readFields().
    public CarOrder() {
        carPlate = new Text();
        day = new Text();
    }

    public CarOrder(Text carPlate, Text day) {
        this.carPlate = carPlate;
        this.day = day;
    }

    public Text getCarPlate() {
        return carPlate;
    }

    public void setCarPlate(Text carPlate) {
        this.carPlate = carPlate;
    }

    public Text getDay() {
        return day;
    }

    public void setDay(Text day) {
        this.day = day;
    }

    // Sort by plate first; for equal plates, sort by pass time.
    @Override
    public int compareTo(CarOrder co) {
        int compareValue = this.carPlate.compareTo(co.carPlate);
        if (compareValue == 0) {
            compareValue = this.day.compareTo(co.day);
        }
        return compareValue;
    }

    // Fields must be read back in the same order they are written.
    @Override
    public void write(DataOutput out) throws IOException {
        this.carPlate.write(out);
        this.day.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.carPlate.readFields(in);
        this.day.readFields(in);
    }

    @Override
    public String toString() {
        return "CarOrder[carPlate=" + carPlate + ", day=" + day + "]";
    }
}
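
The two contracts a composite key has to honor, a two-level compareTo and a symmetric write/readFields pair, can be checked in isolation. Below is a rough stand-alone sketch that uses only java.io instead of Hadoop's Text and Writable types; PlateTimeKey, roundTrip and the plate value TEST123 are hypothetical names chosen for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class PlateTimeKey implements Comparable<PlateTimeKey> {
    private String plate = "";
    private String time = "";

    PlateTimeKey() {
    }

    PlateTimeKey(String plate, String time) {
        this.plate = plate;
        this.time = time;
    }

    String plate() {
        return plate;
    }

    String time() {
        return time;
    }

    /** Primary order on the plate, secondary order on the time. */
    @Override
    public int compareTo(PlateTimeKey other) {
        int c = plate.compareTo(other.plate);
        return c != 0 ? c : time.compareTo(other.time);
    }

    /** Writes the fields in a fixed order. */
    void write(DataOutput out) throws IOException {
        out.writeUTF(plate);
        out.writeUTF(time);
    }

    /** Reads the fields back in exactly the order write() produced them. */
    void readFields(DataInput in) throws IOException {
        plate = in.readUTF();
        time = in.readUTF();
    }

    /** Serializes a key to bytes and rebuilds it from an empty instance. */
    static PlateTimeKey roundTrip(PlateTimeKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            PlateTimeKey copy = new PlateTimeKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PlateTimeKey copy = roundTrip(new PlateTimeKey("TEST123", "2016-11-30 00:00:39 000"));
        System.out.println(copy.plate() + " @ " + copy.time());
    }
}
```

If the read order and the write order ever drift apart, the key comes back with its fields swapped or corrupted, which is why both methods touch the fields in the same sequence.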

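2. CarComparator.java: the grouping comparator. It decides how records are grouped during the MR shuffle; records with the same plate but different times must fall into the same group.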

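import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

class CarComparator extends WritableComparator {
    public CarComparator() {
        // Register the key class; true asks the comparator to create key instances.
        super(CarOrder.class, true);
    }

    // Compare on the plate only, so every time value of one plate lands in the same group.
    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CarOrder car1 = (CarOrder) a;
        CarOrder car2 = (CarOrder) b;
        return car1.getCarPlate().compareTo(car2.getCarPlate());
    }
}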
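
What a grouping comparator does to an already sorted key stream can be pictured as a single pass that opens a new group whenever the comparator reports a difference from the previous key. The sketch below is plain Java rather than Hadoop internals; the (plate, time) pairs and the names GroupingSketch, BY_PLATE and group are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class GroupingSketch {

    /** Grouping rule: two composite keys belong together when their plates match. */
    static final Comparator<String[]> BY_PLATE = (a, b) -> a[0].compareTo(b[0]);

    /** Hypothetical (plate, time) pairs, already sorted by plate and then by time. */
    static List<String[]> sample() {
        return Arrays.asList(
            new String[] {"A1", "2016-11-30 00:00:39"},
            new String[] {"A1", "2016-11-30 08:15:00"},
            new String[] {"B2", "2016-11-30 07:30:10"});
    }

    /** Splits a sorted list into runs that the grouping comparator treats as equal. */
    static <T> List<List<T>> group(List<T> sorted, Comparator<T> grouping) {
        List<List<T>> groups = new ArrayList<>();
        List<T> current = null;
        T previous = null;
        for (T item : sorted) {
            // A change reported by the comparator starts the next group.
            if (current == null || grouping.compare(previous, item) != 0) {
                current = new ArrayList<>();
                groups.add(current);
            }
            current.add(item);
            previous = item;
        }
        return groups;
    }

    public static void main(String[] args) {
        for (List<String[]> g : group(sample(), BY_PLATE)) {
            StringBuilder line = new StringBuilder(g.get(0)[0]).append(':');
            for (String[] key : g) {
                line.append(' ').append(key[1]);
            }
            System.out.println(line);
        }
    }
}
```

Because the pass only compares neighbors, it relies on the sort order placing all keys of one plate next to each other, which is exactly what the composite key's compareTo provides.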

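3. CarPartitioner.class: the custom partitioner. It decides how the Mapper's intermediate results are split across reducers. Partitioning on the plate alone sends every record of a vehicle to the same reducer while still spreading the data evenly.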

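import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

class CarPartitioner extends Partitioner<CarOrder, Text> {
    @Override
    public int getPartition(CarOrder key, Text value, int numPartitions) {
        // Mask the sign bit instead of calling Math.abs, which stays negative for Integer.MIN_VALUE.
        return (key.getCarPlate().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}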
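
One edge case is worth knowing for any hash-based partitioner: Math.abs(Integer.MIN_VALUE) is still negative in Java, so taking the absolute value before the modulo can yield a negative partition number for that one hash value, whereas masking off the sign bit cannot. A small stand-alone sketch (the class and method names are illustrative):

```java
final class PartitionSketch {

    /** Maps any hash code to a partition in [0, numPartitions) by clearing the sign bit. */
    static int partition(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int worst = Integer.MIN_VALUE;
        // Math.abs overflows for this value and returns it unchanged.
        System.out.println(Math.abs(worst) % 3);  // -2
        // The masked variant stays inside the valid range.
        System.out.println(partition(worst, 3));  // 0
    }
}
```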

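4. CarMap.java: the Mapper. It only does simple record parsing (file filtering is handled in the driver) and wraps each record in the composite key CarOrder, emitting (CarOrder, "time-checkpoint ID").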

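import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class CarMap extends Mapper<LongWritable, Text, CarOrder, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String temp = value.toString();
        if (temp.length() > 13) {
            // Drop the leading "VERSION=1.0," (12 characters).
            temp = temp.substring(12);
            String[] items = temp.split(",");
            // items[14] (TGSID) is read below, so at least 15 fields are required.
            if (items.length > 14) {
                // Skip records without a plate; the plate becomes the key.
                if (!items[2].endsWith("無牌")) {
                    try {
                        // items[0] = "PASSTIME=...", items[2] = "CARPLATE=...": strip the 9-character prefixes.
                        CarOrder co = new CarOrder(new Text(items[2].substring(9)), new Text(items[0].substring(9)));
                        // Value: time + "-" + TGSID (items[14] = "TGSID=...")
                        context.write(co, new Text(items[0].substring(9) + "-" + items[14].substring(6)));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }
}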
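
The mapper picks fields by position, which ties it to the exact column order of the record. As an alternative for comparison, here is a stand-alone sketch that looks the three fields up by name instead. It is not the article's implementation: the class, the helper names and the sample line are assumptions, and the sample is a shortened copy of the record format with a placeholder plate (TEST123) and a placeholder brand value (X).

```java
final class RecordParseSketch {

    /** Shortened sample record; the plate and brand values are placeholders. */
    static final String SAMPLE =
        "VERSION=1.0,PASSTIME=2016-11-30 00:00:39 000,CARSTATE=1,CARPLATE=TEST123,"
        + "PLATETYPE=99,SPEED=0,CARBRANDZW=X,TGSID=1125,PLATECOORD=0,0,0,0";

    /** Returns the value of the first "name=value" item, or null if the field is absent. */
    static String field(String[] items, String name) {
        String prefix = name + "=";
        for (String item : items) {
            if (item.startsWith(prefix)) {
                return item.substring(prefix.length());
            }
        }
        return null;
    }

    /** Returns {plate, "time-tgsid"}, or null when a required field is missing. */
    static String[] parse(String line) {
        String[] items = line.split(",");
        String plate = field(items, "CARPLATE");
        String time = field(items, "PASSTIME");
        String tgsId = field(items, "TGSID");
        if (plate == null || time == null || tgsId == null) {
            return null;
        }
        return new String[] {plate, time + "-" + tgsId};
    }

    public static void main(String[] args) {
        String[] kv = parse(SAMPLE);
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

The lookup costs a scan per field, so for a fixed, trusted format the positional version in the mapper is cheaper; the name-based version only buys tolerance for reordered or missing columns.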

5. CarCombine.java: the map-side data can be large, so MR's Combiner feature is used to pre-process the map results. This reduces the Mapper output and the I/O, which improves performance.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class CarCombine extends Reducer<CarOrder, Text, CarOrder, Text> {
    @Override
    protected void reduce(CarOrder co, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder buf = new StringBuilder();
        String before = null;
        String current = null;
        for (Text text : values) {
            current = text.toString();
            // Skip a value that repeats the one immediately before it.
            if (current.equals(before)) {
                continue;
            }
            buf.append(current);
            buf.append(',');
            before = current;
        }
        // Nothing to emit for an empty group.
        if (buf.length() == 0) {
            return;
        }
        context.write(co, new Text(buf.toString()));
    }
}
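
The core of the combiner is a single pass that drops a value when it equals the one right before it and joins the rest with commas. A minimal plain-Java sketch of that pass, with hypothetical names and without the trailing comma:

```java
import java.util.Arrays;
import java.util.List;

final class CombineSketch {

    /** Joins values with commas, skipping any value equal to its immediate predecessor. */
    static String joinDistinctRuns(List<String> values) {
        StringBuilder buf = new StringBuilder();
        String before = null;
        for (String current : values) {
            if (current.equals(before)) {
                continue;
            }
            if (buf.length() > 0) {
                buf.append(',');
            }
            buf.append(current);
            before = current;
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Only adjacent repeats are removed, so the last "a" is kept.
        System.out.println(joinDistinctRuns(Arrays.asList("a", "a", "b", "a")));  // a,b,a
    }
}
```

Comparing only neighbors is enough here because the values arrive sorted, so identical records sit next to each other.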

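6. OutputByCarPlate.java: the Reducer. The results were supposed to be split by license plate, but testing showed that writing a large number of output files from MR is a bad idea. With 500,000 vehicles, for example, up to 500,000 files could be needed; the overhead is large enough to make the job appear to hang or even run out of memory. To avoid this, multiple-file output is still used but scaled back: the data is split into one file per day, and each vehicle is stored as one line.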

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

class OutputByCarPlate extends Reducer<CarOrder, Text, NullWritable, Text> {

    MultipleOutputs<NullWritable, Text> mo;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        mo = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(CarOrder key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Values arrive in time order; join them into one line for this plate.
        StringBuilder buf = new StringBuilder();
        for (Text text : values) {
            buf.append(text.toString());
            buf.append(',');
        }
        String value = buf.toString();
        String[] flows = value.split(",");
        // Only keep vehicles with at least three records.
        if (flows.length >= 3) {
            // Hadoop reuses the key object while iterating, so key now holds the group's last (latest) time.
            String prefix = key.getDay().toString().replaceAll("[-\\s:]", "");
            // Output file name = yyyyMMdd; the trailing comma is dropped from the line.
            mo.write(NullWritable.get(), new Text(value.substring(0, value.length() - 1)), prefix.substring(0, 8));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mo.close();
    }
}
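
The per-day file name comes from stripping the separators out of the pass time and keeping the first eight characters. A tiny stand-alone sketch of just that step (the class and method names are made up for the example):

```java
final class DayFileSketch {

    /** Turns "2016-11-30 00:00:39 000" into "20161130" by removing '-', whitespace and ':'. */
    static String dayPrefix(String passTime) {
        return passTime.replaceAll("[-\\s:]", "").substring(0, 8);
    }

    public static void main(String[] args) {
        System.out.println(dayPrefix("2016-11-30 00:00:39 000"));  // 20161130
    }
}
```

The substring call assumes the time string carries at least a full date; a shorter value would throw StringIndexOutOfBoundsException, so malformed timestamps need to be filtered before this point.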

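7. Main.java: the driver. It assembles the MR job and pre-processes the input files. Because there are many small files, CombineTextInputFormat is used to read the text files. (The source data uses the LOZ compression algorithm, and testing showed that the maximum input split size must be set explicitly or the files are not combined: CombineTextInputFormat.setMaxInputSplitSize(job, 67108864);.) TextInputFormat.setInputPathFilter is limited: it can only filter files under directories that have already been resolved, and it cannot be changed dynamically. The filtering is therefore done outside the Mapper, and each accepted file is added with CombineTextInputFormat.addInputPath(job, temp); so that only the files meeting the requirements are read.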

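import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.reduce.memory.mb", "4096");
        Job job = Job.getInstance(conf, "TRACK_BY_TIME_TGSID");
        // Combine small files into larger splits
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setJarByClass(cn.com.zjf.MR_04.Car1.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapOutputKeyClass(CarOrder.class);
        job.setMapOutputValueClass(Text.class);
        job.setMapperClass(CarMap.class);
        job.setReducerClass(OutputByCarPlate.class);
        // Sort rule for the composite key (unused: CarOrder.compareTo already defines it)
        //job.setSortComparatorClass(ResultComparator.class);
        // Custom partitioner
        job.setPartitionerClass(CarPartitioner.class);
        // Custom grouping comparator: the basis of the secondary sort
        job.setGroupingComparatorClass(CarComparator.class);
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path(args[0]);
        // Pre-process the input: read only files that have a .writed marker and are larger than 0 bytes.
        final String marker = ".writed";
        FileStatus[] childs = fs.globStatus(input, new PathFilter() {
            public boolean accept(Path path) {
                return path.toString().endsWith(marker);
            }
        });
        for (FileStatus file : childs) {
            // Strip the marker suffix literally; replaceAll(".writed", "") would treat "." as a regex wildcard.
            String markerPath = file.getPath().toString();
            Path temp = new Path(markerPath.substring(0, markerPath.length() - marker.length()));
            if (fs.listStatus(temp)[0].getLen() > 0) {
                CombineTextInputFormat.addInputPath(job, temp);
            }
        }
        // 64 MB per combined split; without this the small files are not merged.
        CombineTextInputFormat.setMaxInputSplitSize(job, 67108864);

        // Remove a stale output directory so the job can start.
        Path output = new Path(args[1]);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}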
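
Deriving the data file name from its .writed marker is a plain suffix strip, and it is safer done literally than with a regular expression: in String.replaceAll the pattern ".writed" means "any character followed by writed", so it can also hit other parts of a path. A small stand-alone sketch with hypothetical paths and names:

```java
final class MarkerFileSketch {

    static final String SUFFIX = ".writed";

    /** Returns the data file path for a marker path, or null if the path is not a marker. */
    static String dataFileFor(String markerPath) {
        if (!markerPath.endsWith(SUFFIX)) {
            return null;
        }
        return markerPath.substring(0, markerPath.length() - SUFFIX.length());
    }

    public static void main(String[] args) {
        // Hypothetical path whose directory name happens to contain "writed".
        String marker = "/data/xwrited/a.loz.writed";
        // The regex version also removes "xwrited" from the directory part.
        System.out.println(marker.replaceAll(".writed", ""));  // /data//a.loz
        // The literal suffix strip touches only the end of the path.
        System.out.println(dataFileFor(marker));               // /data/xwrited/a.loz
    }
}
```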
