
Spark Source Code Reading 4: SortByKey

Environment for this Spark source-code reading series: spark-2.0.1 (snapshot downloaded from GitHub on 2016-11-03)

1. Understanding

1.1 Motivation

sortByKey comes up all the time in Spark. After a wordCount, for example, you usually want the result sorted: you can call sortBy directly, or map first and then call sortByKey. sortBy itself also calls sortByKey.
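A minimal sketch of both routes (sc and the input path are placeholders, not from the original post):

  // Word count, then sort by count; both routes end up in sortByKey.
  val counts = sc.textFile("input.txt")            // hypothetical input path
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  // Route 1: sortBy with a key function (internally calls sortByKey).
  val sorted1 = counts.sortBy(_._2, ascending = false)

  // Route 2: map the count into the key position, then sortByKey.
  val sorted2 = counts.map { case (w, c) => (c, w) }
    .sortByKey(ascending = false)
    .map { case (c, w) => (w, c) }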

1.2 Source code

sortByKey: org.apache.spark.rdd.OrderedRDDFunctions#sortByKey

  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }
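Both parameters have defaults, so typical calls look like this (a quick usage sketch; pairs is an assumed RDD[(K, V)]):

  pairs.sortByKey()          // ascending, keeps the current partition count
  pairs.sortByKey(false)     // descending
  pairs.sortByKey(true, 4)   // ascending, repartitioned into 4 key ranges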

sortBy: org.apache.spark.rdd.RDD#sortBy

  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }
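Reading the body, sortBy is just keyBy + sortByKey + values, so the following two expressions are equivalent (a small sketch assuming an RDD[String] named data):

  val a = data.sortBy(_.length)

  val b = data.keyBy(_.length)   // RDD[(Int, String)]: key generated by f
              .sortByKey()       // sort by that key
              .values            // drop the key, back to RDD[String]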

1.3 Analysis

1.3.1 sortByKey's partitioner: RangePartitioner

sortByKey uses RangePartitioner, which was already examined in the earlier post "Spark Source Code Reading 1: Partitioner" in this series. Unlike hash partitioning, RangePartitioner largely avoids skewed data distribution across partitions.

RangePartitioner sorts the sampled boundary candidates in determineBounds using scala.collection.SeqLike#sorted, which in turn calls java.util.Arrays#sort(T[], java.util.Comparator), i.e. TimSort.
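Once the sorted bounds exist, mapping a key to a partition is a simple search over them. A simplified sketch of the idea (the real RangePartitioner works with a generic Ordering, handles descending order, and switches to binary search when there are many partitions):

  // Simplified range partitioning: keys <= bounds(0) go to partition 0,
  // keys in (bounds(0), bounds(1)] go to partition 1, and so on.
  def getPartitionSketch(key: Int, bounds: Array[Int]): Int = {
    var p = 0
    while (p < bounds.length && key > bounds(p)) p += 1
    p   // numPartitions = bounds.length + 1
  }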

1.3.2 ShuffledRDD

A new ShuffledRDD is constructed and returned; that RDD is the sorted result of sortByKey.

1.3.2.1 The partitions_ field

The new ShuffledRDD's partition metadata is stored in the partitions_ field inherited from the parent class RDD. The final method partitions populates partitions_ by calling getPartitions and then checks each partition's index with zipWithIndex.

Source of partitions:

  final def partitions: Array[Partition] = {
    checkpointRDD.map(_.partitions).getOrElse {
      if (partitions_ == null) {
        partitions_ = getPartitions
        partitions_.zipWithIndex.foreach { case (partition, index) =>
          require(partition.index == index,
            s"partitions($index).partition == ${partition.index}, but it should equal $index")
        }
      }
      partitions_
    }
  }

1.3.2.2 The getPartitions method

ShuffledRDD overrides getPartitions:

  override def getPartitions: Array[Partition] = {
    Array.tabulate[Partition](part.numPartitions)(i => new ShuffledRDDPartition(i))
  }

part.numPartitions comes from the RangePartitioner passed in at 1.3.1:

def numPartitions: Int = rangeBounds.length + 1

rangeBounds is the array of partition boundaries, built from a sample drawn with the reservoir sampling algorithm. In this debug run it held two bounds:

rangeBounds = {int[2]@5390}
 0 = -841013005
 1 = 632826677
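With these two bounds there are rangeBounds.length + 1 = 3 partitions. Plugging keys from the debug dump below into the sketch above:

  val bounds = Array(-841013005, 632826677)
  getPartitionSketch(-1813557161, bounds)   // 0: key <= bounds(0)
  getPartitionSketch(-12508600, bounds)     // 1: bounds(0) < key <= bounds(1)
  getPartitionSketch(933490971, bounds)     // 2: key > bounds(1)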

So getPartitions only fixes the RDD's partition count and partition indices.

The data itself is only distributed into the values of each partition in partitions_ when an action runs, e.g. top(k) followed by printing the result.
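In other words (a sketch of the behavior observed while debugging; the sampling job that RangePartitioner runs at definition time is the one exception to this laziness):

  val sorted = pairs.sortByKey()   // builds the ShuffledRDD; RangePartitioner samples here
  // partitions_ only holds empty ShuffledRDDPartition stubs at this point
  sorted.top(3)                    // action: triggers the shuffle and materializes the data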

Copy of the debug data:

"WrappedArray$ofRef" size = 333
values = {[email protected]} "WrappedArray$ofRef" size = 333
 0 = {[email protected]} "(-1813557161,-1212512531)"
 1 = {[email protected]} "(-1144323740,933490971)"
 2 = {[email protected]} "(-12508600,-329995331)"
 3 = {[email protected]} "(-1570574142,-743284380)"
 5 = {[email protected]} "(-532362478,1106605038)"
 4 = {[email protected]} "(249668146,-1487774671)"
 6 = {[email protected]} "(-146176592,666226908)"

The code for the local debug session:

  test("large array") {
    val rand = new scala.util.Random()
    val pairArr = Array.fill(1000) {
      (rand.nextInt(), rand.nextInt())
    }
    val pairs = sc.parallelize(pairArr, 3)
    val sorted = pairs.sortByKey()
    sorted.count()
    sorted.top(3).foreach(println)
    assert(sorted.partitions.size === 3)
    assert(sorted.collect() === pairArr.sortBy(_._1))
  }

Following the ShuffledRDD's dependency chain, its parent RDD's partitions_ holds three ParallelCollectionPartitions. The parent ParallelCollectionRDD overrides getPartitions:

  override def getPartitions: Array[Partition] = {
    val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
    slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
  }
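For the debug run above (1000 elements, 3 slices) the slice boundaries follow a simple start/end formula. A sketch of the position computation (the actual ParallelCollectionRDD.slice uses an equivalent nested positions helper):

  // Sketch: slice i covers [i * length / numSlices, (i + 1) * length / numSlices).
  def positions(length: Int, numSlices: Int): Seq[(Int, Int)] =
    (0 until numSlices).map { i =>
      (i * length / numSlices, (i + 1) * length / numSlices)
    }

  positions(1000, 3)   // Vector((0,333), (333,666), (666,1000)): sizes 333, 333, 334

The first slice's size of 333 matches the size = 333 of the WrappedArray seen in the debug dump.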

I have not yet found where the records inside each partition get sorted; that logic is still unclear to me and needs further study of the RDD, DAG, and Stage code.

2. Code:

sortByKey usage: org.apache.spark.rdd.SortingSuite

  test("sortByKey") {
    val pairs = sc.parallelize(Array((1, 0), (2, 0), (0, 0), (3, 0)), 2)
    assert(pairs.sortByKey().collect() === Array((0, 0), (1, 0), (2, 0), (3, 0)))
  }

sortBy: org.apache.spark.rdd.RDDSuite

 test("sortBy") {
    val data = sc.parallelize(Seq("5|50|A", "4|60|C", "6|40|B"))

    val col1 = Array("4|60|C", "5|50|A", "6|40|B")
    val col2 = Array("6|40|B", "5|50|A", "4|60|C")
    val col3 = Array("5|50|A", "6|40|B", "4|60|C")

    assert(data.sortBy(_.split("\\|")(0)).collect() === col1)
    assert(data.sortBy(_.split("\\|")(1)).collect() === col2)
    assert(data.sortBy(_.split("\\|")(2)).collect() === col3)
  }

3. Findings:

3.1 TimSort still needs further study.
3.2 RangePartitioner only determines numPartitions and getPartition(key: Any); where the records inside each partition get sorted was not found here.
