Partitioner in Hadoop MapReduce

阿新 • Published: 2019-02-19


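A Partitioner splits the intermediate results produced by the Mapper into partitions so that all records belonging to the same group are handed to the same Reduce task. The Partitioner therefore directly affects load balancing in the Reduce phase.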

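MapReduce provides two Partitioner implementations: HashPartitioner and TotalOrderPartitioner.
HashPartitioner is the default implementation. It partitions records by the hash value of the key, using the following code: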

public int getPartition(K2 key, V2 value, int numReduceTasks) {
    // Mask off the sign bit so a negative hashCode() cannot yield a negative partition index.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
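To illustrate why the formula masks the hash code with Integer.MAX_VALUE, here is a minimal standalone sketch. It does not depend on Hadoop: the class name HashPartitionSketch and the method naivePartition are hypothetical names introduced only for this comparison, and the masked method simply repeats the arithmetic quoted above.

```java
// Standalone sketch: repeats the HashPartitioner arithmetic without any Hadoop dependency.
// HashPartitionSketch and naivePartition are hypothetical names used only for illustration.
class HashPartitionSketch {

    // Same arithmetic as the getPartition method quoted in the article.
    static int getPartition(Object key, int numReduceTasks) {
        // Clearing the sign bit keeps the result in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Variant without the mask; Java's % keeps the sign of the dividend,
    // so a negative hash code produces a negative (invalid) partition index.
    static int naivePartition(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        Integer negativeKey = -7; // Integer.hashCode() returns the int value itself

        System.out.println("naive:  " + naivePartition(negativeKey, reducers)); // prints "naive:  -3"
        System.out.println("masked: " + getPartition(negativeKey, reducers));   // prints "masked: 1"
    }
}
```

Because the result depends only on the key's hash code and the number of Reduce Tasks, equal keys always land in the same partition, which is what lets one Reduce task see every record of a group.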

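TotalOrderPartitioner provides range-based partitioning and is typically used for total ordering of data.
In a MapReduce environment, the obvious approach to a total sort is merge sort: in the Map phase each Map Task sorts its own data locally, and in the Reduce phase a single Reduce Task performs the global sort. Because such a job can have only one Reduce Task, the Reduce phase becomes the bottleneck of the job.
TotalOrderPartitioner divides the data by key order into a number of ranges (partitions) and guarantees that every record in a later range is greater than every record in the preceding range. A total sort then proceeds in the following steps: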

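1. Data sampling. The client samples the input to obtain the split points between partitions. Hadoop ships with several samplers, such as IntervalSampler, RandomSampler, and SplitSampler.

2. Map phase. This phase involves two components, the Mapper and the Partitioner. The Mapper can be IdentityMapper, which emits its input unchanged, but the Partitioner must be TotalOrderPartitioner. It stores the split points obtained in step 1 in a trie so that the range containing any given record can be located quickly. As a result, each Map Task produces R ranges (R being the number of Reduce Tasks), and the ranges are ordered. TotalOrderPartitioner looks up the Reduce Task number for each record in the trie.
3. Reduce phase. Each Reducer locally sorts the range of data assigned to it, which yields the fully sorted output.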
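The lookup in the Map phase can be sketched as follows. This is a simplified stand-in, not the Hadoop implementation: it replaces the trie with a plain binary search over sorted split points, and the class name RangePartitionSketch and the sample split points are assumptions made for illustration.

```java
import java.util.Arrays;

// Simplified stand-in for range partitioning: binary search instead of a trie.
// RangePartitionSketch and the split points below are hypothetical.
class RangePartitionSketch {

    // splitPoints must be sorted ascending; R Reduce Tasks need R - 1 split points.
    // Partition 0 holds keys below splitPoints[0]; partition i holds keys in
    // [splitPoints[i - 1], splitPoints[i]); the last partition holds the rest.
    static int getPartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is not a split point.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"g", "n", "t"}; // 3 split points -> 4 Reduce Tasks
        for (String key : new String[] {"apple", "g", "kiwi", "pear", "zebra"}) {
            System.out.println(key + " -> " + getPartition(key, splitPoints));
        }
        // prints: apple -> 0, g -> 1, kiwi -> 1, pear -> 2, zebra -> 3 (one per line)
    }
}
```

Since every key in partition i sorts before every key in partition i + 1, concatenating the locally sorted Reducer outputs in partition order gives a globally sorted result.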

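The efficiency of a TotalOrderPartitioner-based total sort depends directly on the key distribution and on the sampling algorithm: the more uniform the keys and the more representative the sample, the more evenly the Reduce Tasks are loaded and the more efficient the total sort.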
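The effect of sampling on load balance can be illustrated with a small experiment. The sketch below is an assumption-laden toy, not Hadoop's sampler: SplitPointSketch, chooseSplitPoints, and countPerPartition are hypothetical names, the keys are synthetic integers skewed toward zero, and split points are taken at evenly spaced positions in a sorted random sample.

```java
import java.util.Arrays;
import java.util.Random;

// Toy experiment: split points derived from a sample versus fixed, evenly spaced ones.
// All names and data here are hypothetical; this is not Hadoop's InputSampler.
class SplitPointSketch {

    // Picks numPartitions - 1 split points at evenly spaced positions in the sorted sample.
    static int[] chooseSplitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return splits;
    }

    // Counts how many keys fall into each range defined by the sorted split points.
    static int[] countPerPartition(int[] keys, int[] splits) {
        int[] counts = new int[splits.length + 1];
        for (int key : keys) {
            int pos = Arrays.binarySearch(splits, key);
            counts[pos >= 0 ? pos + 1 : -pos - 1]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int[] keys = new int[100_000];
        for (int i = 0; i < keys.length; i++) {
            double u = rng.nextDouble();
            keys[i] = (int) (u * u * 1_000_000); // squaring crowds keys toward zero
        }
        int[] sample = new int[1_000];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = keys[rng.nextInt(keys.length)];
        }
        int[] sampledSplits = chooseSplitPoints(sample, 4);
        int[] fixedSplits = {250_000, 500_000, 750_000}; // ignores the key distribution

        System.out.println("sampled splits: " + Arrays.toString(countPerPartition(keys, sampledSplits)));
        System.out.println("fixed splits:   " + Arrays.toString(countPerPartition(keys, fixedSplits)));
    }
}
```

With skewed keys, split points that follow the sample should give each partition a roughly similar share of records, while the fixed split points overload the first partition. A less representative sample would move the result toward the unbalanced case.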
