Hadoop/MapReduce Order Inversion: Controlling the Order of Reducer Values
Example: compute the relative frequency of words in a given set of documents. The goal is to build an N*N matrix M, where N is the number of distinct words in the documents, and each cell Mij holds the number of times word Wi co-occurs with word Wj within a given context. For simplicity, this context is defined as the neighborhood of Wi. For example, given the following words: W1, W2, W3, W4, W5, W6
If a word's neighborhood is defined as the two words before it and the two words after it, the neighborhoods of these six words are as follows:
Word    Neighborhood (±2)
W1 W2,W3
W2 W1,W3,W4
W3 W1,W2,W4,W5
W4 W2,W3,W5,W6
W5 W3,W4,W6
W6 W4,W5
MapReduce/Hadoop implementation of order inversion:
We need to generate two data sequences. The first is each word's total neighborhood count (the total number of times the word appears next to any neighbor), represented by the composite key (W,*), where W is the word.
The second is the number of times the word co-occurs with each specific neighbor, represented by a composite key such as (W,W2).
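For order inversion to work, the reducer must receive the (W,*) key before any (W,Wj) key for the same word W. With the PairOfWords sort order shown later (compare the left element first, then the right element), this holds because the character * (ASCII 42) compares less than letters and digits in String.compareTo(). Below is a minimal sketch of that assumption; the SortOrderCheck class is only an illustration, not part of the job, and it uses the PairOfWords class defined further down.

package fanzhuanpaixu;

public class SortOrderCheck {
    public static void main(String[] args) {
        PairOfWords total = new PairOfWords("man", "*");
        PairOfWords pair = new PairOfWords("man", "tall");
        // "*" compares less than 't', so the total-count key (man,*) sorts first
        System.out.println(total.compareTo(pair) < 0); // prints true
    }
}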
Implementing a custom partitioner in Hadoop
The custom partitioner must ensure that all pairs with the same left word (the natural key) are sent to the same reducer. For example, the composite keys {(man,tall), (man,strong), (man,moon), ...} must all go to the same reducer. To achieve this, we define a custom partitioner that considers only the left word (man in this example).
The relative frequency mapper
Taking W1, W2, W3, W4, W5, W6 as an example, the mapper emits:
Key    Value
(W1,W2) 1
(W1,W3) 1
(W1,*) 2
(W2,W1) 1
(W2,W3) 1
(W2,W4) 1
(W2,*) 3
(W3,W1) 1
(W3,W2) 1
(W3,W4) 1
(W3,W5) 1
(W3,*) 4
(W4,W2) 1
(W4,W3) 1
(W4,W5) 1
(W4,W6) 1
(W4,*) 4
(W5,W3) 1
(W5,W4) 1
(W5,W6) 1
(W5,*) 3
(W6,W4) 1
(W6,W5) 1
(W6,*) 2
The relative frequency reducer
Reducer input:
Key                                          Values
(W1,*), (W1,W2), (W1,W3)                     2, 1, 1
(W2,*),(W2,W1),(W2,W3),(W2,W4) 3,1,1,1
....
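With this input order, the reducer first sees (W,*) and records the word's total neighborhood count, then divides every following pair count by that total. For the sample above it would therefore emit:

(W1,W2)    1/2 = 0.5
(W1,W3)    1/2 = 0.5
(W2,W1)    1/3 ≈ 0.33
(W2,W3)    1/3 ≈ 0.33
(W2,W4)    1/3 ≈ 0.33
....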
Code implementation:
Custom composite key
package fanzhuanpaixu;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
//
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

/**
 * Adapted from Cloud9: A Hadoop toolkit for working with big data
 * edu.umd.cloud9.io.pair.PairOfStrings
 *
 * WritableComparable representing a pair of Strings.
 * The elements in the pair are referred to as the left (word)
 * and right (neighbor) elements. The natural sort order is:
 * first by the left element, and then by the right element.
 *
 * @author Mahmoud Parsian (added few new convenient methods)
 *
 */
public class PairOfWords implements WritableComparable<PairOfWords> {

    private String leftElement;
    private String rightElement;

    /**
     * Creates a pair.
     */
    public PairOfWords() {
    }

    /**
     * Creates a pair.
     *
     * @param left the left element
     * @param right the right element
     */
    public PairOfWords(String left, String right) {
        set(left, right);
    }

    /**
     * Deserializes the pair.
     *
     * @param in source for raw byte representation
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        leftElement = Text.readString(in);
        rightElement = Text.readString(in);
    }

    /**
     * Serializes this pair.
     *
     * @param out where to write the raw byte representation
     */
    @Override
    public void write(DataOutput out) throws IOException {
        Text.writeString(out, leftElement);
        Text.writeString(out, rightElement);
    }

    public void setLeftElement(String leftElement) {
        this.leftElement = leftElement;
    }

    public void setWord(String leftElement) {
        setLeftElement(leftElement);
    }

    /**
     * Returns the left element.
     *
     * @return the left element
     */
    public String getWord() {
        return leftElement;
    }

    /**
     * Returns the left element.
     *
     * @return the left element
     */
    public String getLeftElement() {
        return leftElement;
    }

    public void setRightElement(String rightElement) {
        this.rightElement = rightElement;
    }

    public void setNeighbor(String rightElement) {
        setRightElement(rightElement);
    }

    /**
     * Returns the right element.
     *
     * @return the right element
     */
    public String getRightElement() {
        return rightElement;
    }

    public String getNeighbor() {
        return rightElement;
    }

    /**
     * Returns the key (left element).
     *
     * @return the key
     */
    public String getKey() {
        return leftElement;
    }

    /**
     * Returns the value (right element).
     *
     * @return the value
     */
    public String getValue() {
        return rightElement;
    }

    /**
     * Sets the right and left elements of this pair.
     *
     * @param left the left element
     * @param right the right element
     */
    public void set(String left, String right) {
        leftElement = left;
        rightElement = right;
    }

    /**
     * Checks two pairs for equality.
     *
     * @param obj object for comparison
     * @return <code>true</code> if <code>obj</code> is equal to this object, <code>false</code> otherwise
     */
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        //
        if (!(obj instanceof PairOfWords)) {
            return false;
        }
        //
        PairOfWords pair = (PairOfWords) obj;
        return leftElement.equals(pair.getLeftElement())
                && rightElement.equals(pair.getRightElement());
    }

    /**
     * Defines a natural sort order for pairs. Pairs are sorted first by the left element,
     * and then by the right element.
     *
     * @return a value less than zero, a value greater than zero, or zero if this pair should
     * be sorted before, sorted after, or is equal to <code>obj</code>.
     */
    @Override
    public int compareTo(PairOfWords pair) {
        String pl = pair.getLeftElement();
        String pr = pair.getRightElement();
        if (leftElement.equals(pl)) {
            return rightElement.compareTo(pr);
        }
        return leftElement.compareTo(pl);
    }

    /**
     * Returns a hash code value for the pair.
     *
     * @return hash code for the pair
     */
    @Override
    public int hashCode() {
        return leftElement.hashCode() + rightElement.hashCode();
    }

    /**
     * Generates human-readable String representation of this pair.
     *
     * @return human-readable String representation of this pair
     */
    @Override
    public String toString() {
        return "(" + leftElement + ", " + rightElement + ")";
    }

    /**
     * Clones this object.
     *
     * @return clone of this object
     */
    @Override
    public PairOfWords clone() {
        return new PairOfWords(this.leftElement, this.rightElement);
    }

    /**
     * Comparator optimized for <code>PairOfWords</code>.
     */
    public static class Comparator extends WritableComparator {

        /**
         * Creates a new Comparator optimized for <code>PairOfWords</code>.
         */
        public Comparator() {
            super(PairOfWords.class);
        }

        /**
         * Optimization hook.
         */
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            try {
                int firstVIntL1 = WritableUtils.decodeVIntSize(b1[s1]);
                int firstVIntL2 = WritableUtils.decodeVIntSize(b2[s2]);
                int firstStrL1 = readVInt(b1, s1);
                int firstStrL2 = readVInt(b2, s2);
                int cmp = compareBytes(b1, s1 + firstVIntL1, firstStrL1,
                        b2, s2 + firstVIntL2, firstStrL2);
                if (cmp != 0) {
                    return cmp;
                }

                int secondVIntL1 = WritableUtils.decodeVIntSize(b1[s1 + firstVIntL1 + firstStrL1]);
                int secondVIntL2 = WritableUtils.decodeVIntSize(b2[s2 + firstVIntL2 + firstStrL2]);
                int secondStrL1 = readVInt(b1, s1 + firstVIntL1 + firstStrL1);
                int secondStrL2 = readVInt(b2, s2 + firstVIntL2 + firstStrL2);
                return compareBytes(b1, s1 + firstVIntL1 + firstStrL1 + secondVIntL1, secondStrL1,
                        b2, s2 + firstVIntL2 + firstStrL2 + secondVIntL2, secondStrL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }
        }
    }

    static { // register this comparator
        WritableComparator.define(PairOfWords.class, new Comparator());
    }
}
The mapper
package fanzhuanpaixu;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
//
import org.apache.commons.lang.StringUtils;
//
import java.io.IOException;

/**
 * RelativeFrequencyMapper implements the map() function for Relative Frequency of words.
 *
 * @author Mahmoud Parsian
 *
 */
public class RelativeFrequencyMapper
        extends Mapper<LongWritable, Text, PairOfWords, IntWritable> {

    private int neighborWindow = 2;
    // pair = (leftElement, rightElement)
    private final PairOfWords pair = new PairOfWords();
    private final IntWritable totalCount = new IntWritable();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void setup(Context context) {
        this.neighborWindow = context.getConfiguration().getInt("neighbor.window", 2);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = StringUtils.split(value.toString(), " ");
        //String[] tokens = StringUtils.split(value.toString(), "\\s+");
        if ((tokens == null) || (tokens.length < 2)) {
            return;
        }

        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokens[i].replaceAll("\\W+", "");
            if (tokens[i].equals("")) {
                continue;
            }

            pair.setWord(tokens[i]);

            int start = (i - neighborWindow < 0) ? 0 : i - neighborWindow;
            int end = (i + neighborWindow >= tokens.length) ? tokens.length - 1 : i + neighborWindow;
            for (int j = start; j <= end; j++) {
                if (j == i) {
                    continue;
                }
                pair.setNeighbor(tokens[j].replaceAll("\\W", ""));
                context.write(pair, ONE);
            }
            //
            pair.setNeighbor("*");
            totalCount.set(end - start);
            context.write(pair, totalCount);
        }
    }
}
The partitioner
package fanzhuanpaixu;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * This is a plug-in class.
 * The OrderInversionPartitioner class indicates how to partition data
 * based on the word only (the natural key).
 *
 * @author Mahmoud Parsian
 *
 */
public class OrderInversionPartitioner
        extends Partitioner<PairOfWords, IntWritable> {

    @Override
    public int getPartition(PairOfWords key,
                            IntWritable value,
                            int numberOfPartitions) {
        // key = (leftWord, rightWord) = (word, neighbor)
        String leftWord = key.getLeftElement();
        return Math.abs(((int) hash(leftWord)) % numberOfPartitions);
    }

    /**
     * Return a hashCode() of a given String object.
     * This is adapted from String.hashCode()
     *
     * @param str a string object
     *
     */
    private static long hash(String str) {
        long h = 1125899906842597L; // prime
        int length = str.length();
        for (int i = 0; i < length; i++) {
            h = 31 * h + str.charAt(i);
        }
        return h;
    }
}
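Because getPartition() hashes only the left word and ignores the neighbor entirely, every key that shares a left word lands on the same reducer. A small illustrative check (hypothetical test code, not part of the job):

package fanzhuanpaixu;

import org.apache.hadoop.io.IntWritable;

public class PartitionerCheck {
    public static void main(String[] args) {
        OrderInversionPartitioner p = new OrderInversionPartitioner();
        int reducers = 3;
        int a = p.getPartition(new PairOfWords("man", "*"), new IntWritable(2), reducers);
        int b = p.getPartition(new PairOfWords("man", "tall"), new IntWritable(1), reducers);
        int c = p.getPartition(new PairOfWords("man", "moon"), new IntWritable(1), reducers);
        // all three keys go to the same partition because only the left word is hashed
        System.out.println(a == b && b == c); // prints true
    }
}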
The combiner
package fanzhuanpaixu;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * RelativeFrequencyCombiner implements the combine() [in Hadoop,
 * we use reduce() for implementing the combine() function] function
 * for MapReduce/Hadoop.
 *
 * @author Mahmoud Parsian
 *
 */
public class RelativeFrequencyCombiner
        extends Reducer<PairOfWords, IntWritable, PairOfWords, IntWritable> {

    @Override
    protected void reduce(PairOfWords key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        //
        int partialSum = 0;
        for (IntWritable value : values) {
            partialSum += value.get();
        }
        //
        context.write(key, new IntWritable(partialSum));
    }
}
The reducer
package fanzhuanpaixu;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * RelativeFrequencyReducer implements the reduce() function for Relative Frequency of words.
 *
 * @author Mahmoud Parsian
 *
 */
public class RelativeFrequencyReducer
        extends Reducer<PairOfWords, IntWritable, PairOfWords, DoubleWritable> {

    private double totalCount = 0;
    private final DoubleWritable relativeCount = new DoubleWritable();
    private String currentWord = "NOT_DEFINED";

    @Override
    protected void reduce(PairOfWords key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        if (key.getNeighbor().equals("*")) {
            // the (word,*) key sorts before any (word,neighbor) key,
            // so the total is known before any relative frequency is computed
            if (key.getWord().equals(currentWord)) {
                totalCount += getTotalCount(values);
            } else {
                currentWord = key.getWord();
                totalCount = getTotalCount(values);
            }
        } else {
            int count = getTotalCount(values);
            relativeCount.set((double) count / totalCount);
            context.write(key, relativeCount);
        }
    }

    private int getTotalCount(Iterable<IntWritable> values) {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        return sum;
    }
}
The driver
package fanzhuanpaixu;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
//
import org.apache.log4j.Logger;

/**
 * RelativeFrequencyDriver is the driver class for computing the relative frequency of words.
 *
 * @author Mahmoud Parsian
 *
 */
public class RelativeFrequencyDriver
        extends Configured implements Tool {

    private static final Logger THE_LOGGER = Logger.getLogger(RelativeFrequencyDriver.class);

    /**
     * Dispatches command-line arguments to the tool by the ToolRunner.
     */
    public static void main(String[] args) throws Exception {
        // NOTE: the neighbor window is hard-coded to 2 for local testing;
        // normally it would be taken from the command line
        args = new String[1];
        args[0] = "2";
        int status = ToolRunner.run(new RelativeFrequencyDriver(), args);
        System.exit(status);
    }

    @Override
    public int run(String[] args) throws Exception {
        int neighborWindow = Integer.parseInt(args[0]);
        Path inputPath = new Path("input/paper.txt");
        Path outputPath = new Path("output/fanzhuanpaixu");

        Job job = new Job(new Configuration(), "RelativeFrequencyDriver");
        job.setJarByClass(RelativeFrequencyDriver.class);
        job.setJobName("RelativeFrequencyDriver");

        // Delete the output directory if it exists already
        FileSystem.get(getConf()).delete(outputPath, true);

        job.getConfiguration().setInt("neighbor.window", neighborWindow);

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        // (key,value) generated by map()
        job.setMapOutputKeyClass(PairOfWords.class);
        job.setMapOutputValueClass(IntWritable.class);

        // (key,value) generated by reduce()
        job.setOutputKeyClass(PairOfWords.class);
        job.setOutputValueClass(DoubleWritable.class);

        job.setMapperClass(RelativeFrequencyMapper.class);
        job.setReducerClass(RelativeFrequencyReducer.class);
        job.setCombinerClass(RelativeFrequencyCombiner.class);
        job.setPartitionerClass(OrderInversionPartitioner.class);
        job.setNumReduceTasks(3);

        long startTime = System.currentTimeMillis();
        job.waitForCompletion(true);
        THE_LOGGER.info("Job Finished in milliseconds: " + (System.currentTimeMillis() - startTime));
        return 0;
    }
}
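To run the job on a cluster rather than from the IDE, the window size would normally be passed on the command line instead of being hard-coded in main(). A sketch of such an invocation, assuming the classes are packed into a jar (the jar name relativefrequency.jar is only a placeholder; input and output paths are the ones hard-coded in run() above):

hadoop jar relativefrequency.jar fanzhuanpaixu.RelativeFrequencyDriver 2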
Execution results: