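Spark RDD: reduceByKey vs groupByKey
Spark has two similar APIs, reduceByKey and groupByKey. They do similar jobs, but their underlying implementations differ. Why were they designed this way? Let's look at it from the source code.
First, the call chain of each (both using the default partitioner, i.e. defaultPartitioner).
Spark version used: Spark 2.1.0
reduceByKey first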
Step1
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope { reduceByKey(defaultPartitioner(self), func) }
Step2
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
Step3
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
Leaving the details of the method body aside, all we need to know is that the call ends up in combineByKeyWithClassTag. Two of its parameters deserve a closer look:
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)
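The first is the partitioner parameter, which determines how the resulting RDD is partitioned. Spark ships with HashPartitioner and RangePartitioner, and users can also define their own partitioner. defaultPartitioner is not a separate partitioner class but a helper that picks one: it reuses a partitioner already set on the input RDD if there is one, and otherwise falls back to a HashPartitioner. The source also shows that when the key type is an array, using a HashPartitioner throws an exception ("HashPartitioner cannot partition array keys.").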
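To make the array-key check a little more concrete, here is a minimal plain-Scala sketch of hash partitioning. It does not use Spark; HashPartitionSketch, nonNegativeMod and partitionFor are names made up for this illustration, and the assumption is that a hash partitioner assigns a key to the partition hashCode mod numPartitions. JVM arrays compare by reference and hash by identity rather than by content, so two arrays holding the same elements are not treated as the same key.

```scala
object HashPartitionSketch {
  // Non-negative modulo, so negative hash codes still map to a valid partition index.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hypothetical stand-in for a hash-based partitioner (an assumption, not Spark's code):
  // the partition index is the key's hashCode modulo the number of partitions.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val a = Array(1, 2, 3)
    val b = Array(1, 2, 3)
    // Same contents, but arrays are compared by reference, not by content.
    println(a.sameElements(b)) // true
    println(a == b)            // false
    // Strings hash by content, so equal keys always land in the same partition.
    println(partitionFor("spark", 4) == partitionFor(new String("spark"), 4)) // true
  }
}
```

With content-based keys such as String, equal keys always get the same partition index. With array keys there is no such guarantee, which is presumably why the check above rejects the combination.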
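The second is the mapSideCombine parameter. This is where reduceByKey and groupByKey differ the most: it decides whether a combine step runs on the map side, before the shuffle. It is discussed in more detail further down.
Now groupByKey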
Step1
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}
Step2
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
Step3
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
Together with the reduceByKey call chain above, we can see that both end up calling combineByKeyWithClassTag, only with different arguments.
The call from reduceByKey:
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
The call from groupByKey:
combineByKeyWithClassTag[CompactBuffer[V]](
  createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
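These two different calls are what make the two methods differ. Let's go through the differences one at a time.
The type parameter for reduceByKey is [V], while for groupByKey it is [CompactBuffer[V]]. This is why the return types differ: reduceByKey returns RDD[(K, V)], whereas groupByKey returns RDD[(K, Iterable[V])] (the CompactBuffer result is cast to Iterable[V] in the last line of groupByKey).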
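The way one generic method yields two different result types can be imitated on plain Scala collections. The sketch below is not Spark code: CombineByKeySketch, combineByKey, reduceLike and groupLike are hypothetical names, each inner Seq stands in for one partition, and Vector stands in for CompactBuffer. It also ignores the mapSideCombine flag and only illustrates how the choice of C and of the three functions shapes the result.

```scala
object CombineByKeySketch {
  // Hypothetical single-machine stand-in for combineByKeyWithClassTag:
  // each inner Seq plays the role of one partition.
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Fold one partition into a per-key combiner map.
    def combinePartition(part: Seq[(K, V)]): Map[K, C] =
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case Some(c) => acc.updated(k, mergeValue(c, v))
          case None    => acc.updated(k, createCombiner(v))
        }
      }
    // Merge the per-partition maps with mergeCombiners.
    partitions.map(combinePartition).foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.get(k) match {
          case Some(existing) => a.updated(k, mergeCombiners(existing, c))
          case None           => a.updated(k, c)
        }
      }
    }
  }

  // reduceByKey shape: C = V, and the same function merges values and combiners.
  def reduceLike[K, V](parts: Seq[Seq[(K, V)]], func: (V, V) => V): Map[K, V] =
    combineByKey[K, V, V](parts, (v: V) => v, func, func)

  // groupByKey shape: C is a buffer of values (Vector stands in for CompactBuffer).
  def groupLike[K, V](parts: Seq[Seq[(K, V)]]): Map[K, Vector[V]] =
    combineByKey[K, V, Vector[V]](
      parts,
      (v: V) => Vector(v),
      (buf: Vector[V], v: V) => buf :+ v,
      (b1: Vector[V], b2: Vector[V]) => b1 ++ b2)

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq("a" -> 1, "b" -> 2, "a" -> 3), Seq("a" -> 4, "b" -> 5))
    println(reduceLike[String, Int](parts, _ + _)) // values: a -> 8, b -> 7
    println(groupLike(parts))                      // values: a -> Vector(1, 3, 4), b -> Vector(2, 5)
  }
}
```

In this sketch the order of values inside each group follows the order of the input partitions. A real shuffle would not necessarily preserve such an order.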
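Then there is mapSideCombine = false. This parameter defaults to true, which is the value reduceByKey ends up with because it does not pass the argument. As mentioned above, it controls whether a preliminary merge (combine) is done on the map side.
Functionally, reduceByKey first merges the values for each key within each map-side partition and only then shuffles the partial results, whereas groupByKey shuffles every record without merging.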
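The effect on shuffle volume can be estimated with a small plain-Scala model. This is only a sketch under simplifying assumptions: MapSideCombineSketch and its methods are hypothetical names, each inner Seq models one map-side partition, and "shuffled records" is simply a count of the key/value pairs that would leave each partition. The last method mirrors the source comment in groupByKey: grouping on the map side still has to ship every value, so nothing is saved.

```scala
object MapSideCombineSketch {
  type Partition = Seq[(String, Int)]

  // Without map-side combine, every input record is written to the shuffle as-is.
  def shuffledRecordsWithoutCombine(parts: Seq[Partition]): Int =
    parts.map(_.size).sum

  // With map-side combine, each partition is reduced locally first,
  // so it emits at most one record per distinct key.
  def shuffledRecordsWithCombine(parts: Seq[Partition], func: (Int, Int) => Int): Int =
    parts.map { part =>
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }.size
    }.sum

  // Grouping on the map side only collects values into buffers; the total
  // number of values that must cross the shuffle stays the same.
  def shuffledValuesWhenGroupingOnMapSide(parts: Seq[Partition]): Int =
    parts.map { part =>
      part.groupBy(_._1).values.map(_.size).sum
    }.sum

  def main(args: Array[String]): Unit = {
    val parts: Seq[Partition] = Seq(
      Seq("a" -> 1, "a" -> 1, "a" -> 1, "b" -> 1),
      Seq("a" -> 1, "b" -> 1, "b" -> 1))
    println(shuffledRecordsWithoutCombine(parts))       // 7
    println(shuffledRecordsWithCombine(parts, _ + _))   // 4
    println(shuffledValuesWhenGroupingOnMapSide(parts)) // 7
  }
}
```

The saving grows with the number of repeated keys per partition. If every key appears only once in each partition, combining on the map side saves nothing in this model.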
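Seen this way, reduceByKey should usually perform much better than groupByKey, because part of the work is already done before the shuffle. So why does groupByKey exist, and what is its use case? I am not sure either. If any reader of this article knows, please share it in the comments. Thank you very much!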