
Spark 2.0 Internals: RDD Lineage (Logical Execution Plan)

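RDD Lineage (also called the RDD operator graph or the RDD dependency graph) is the graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to an RDD, and it is that RDD's logical execution plan.
Note: the execution DAG, or physical execution plan, is something different: it is the DAG of stages.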
[Figure: RDD lineage graph]

The figure above shows the RDD lineage that results from executing the following statements:

val r00 = sc.parallelize(0 to 9)
val r01 = sc.parallelize(0 to 90 by 10)
val r10 = r00 cartesian r01
val r11 = r00.map(n => (n, n))
val r12 = r00 zip r01
val r13 = r01.keyBy(_ / 20)
val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)

We can call toDebugString to print an RDD's lineage:

scala> r00.toDebugString
res5: String = (20) ParallelCollectionRDD[0] at parallelize at <console>:27 []

scala> r01.toDebugString
res6: String = (20) ParallelCollectionRDD[1] at parallelize at <console>:27 []

scala> r12.toDebugString
res9: String =
(20) ZippedPartitionsRDD2[4] at zip at <console>:31 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:27 []
 |  ParallelCollectionRDD[1] at parallelize at <console>:27 []

scala> r13.toDebugString
res10: String =
(20) MapPartitionsRDD[5] at keyBy at <console>:29 []
 |  ParallelCollectionRDD[1] at parallelize at <console>:27 []

scala> r20.toDebugString
res11: String =
(460) UnionRDD[8] at union at <console>:39 []
 |  UnionRDD[7] at union at <console>:39 []
 |  UnionRDD[6] at union at <console>:39 []
 |  CartesianRDD[2] at cartesian at <console>:31 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:27 []
 |  ParallelCollectionRDD[1] at parallelize at <console>:27 []
 |  MapPartitionsRDD[3] at map at <console>:29 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:27 []
 |  ZippedPartitionsRDD2[4] at zip at <console>:31 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:27 []
 |  ParallelCollectionRDD[1] at parallelize at <console>:27 []
 |  MapPartitionsRDD[5] at keyBy at <console>:29 []
 |  ParallelCollectionRDD[1] at parallelize at <console>:27 []

As the output above shows, the RDD lineage graph is the graph of transformations that have to be executed once an action is called.
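The number in parentheses at the start of each toDebugString result is the RDD's partition count, and the (460) reported for r20 can be checked by hand. The sketch below is plain Scala arithmetic rather than Spark code. It assumes that every RDD created by sc.parallelize in this session has 20 partitions, as the (20) prefixes suggest; the variable names are only illustrative.

```scala
// Assumption: each parallelized RDD has 20 partitions, as the "(20)"
// prefixes in the toDebugString output suggest.
val base = 20

// r10 = r00 cartesian r01: one partition per pair of parent partitions.
val cartesianParts = base * base

// r11 (map), r12 (zip) and r13 (keyBy) keep the partition count of their parents.
val mapParts = base
val zipParts = base
val keyByParts = base

// r20 folds union over the four RDDs; a union of RDDs without a shared
// partitioner simply concatenates the partitions of its inputs.
val unionParts = Seq(mapParts, zipParts, keyByParts).foldLeft(cartesianParts)(_ + _)

println(unionParts) // 460
```

The 400 partitions contributed by the cartesian product dominate the total, which is why r20 ends up so much larger than any of its inputs.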

toDebugString

Signature:

def toDebugString: String

This method returns a printable description of the RDD's lineage.
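The same information can also be reached programmatically through the public dependencies method of an RDD. The following is a minimal sketch, not Spark's own implementation of toDebugString: lineageLines is a hypothetical helper introduced here for illustration, and it assumes it is run somewhere Spark is on the classpath (for example in spark-shell).

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): returns one line per RDD in the
// lineage, indented by its distance from the starting RDD.
def lineageLines(rdd: RDD[_], depth: Int = 0): Seq[String] = {
  val self = ("  " * depth) + s"${rdd.getClass.getSimpleName}[${rdd.id}]"
  // Each dependency points at one parent RDD; recurse into every parent.
  self +: rdd.dependencies.flatMap(dep => lineageLines(dep.rdd, depth + 1))
}
```

Calling lineageLines(r12).foreach(println) with the r12 defined earlier should list the zipped RDD first, followed by its two parallelized parents. A parent that is shared by several children appears once per path, just as it does in the toDebugString output.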

Enabling lineage logging

Parameter: spark.logLineage
Default: false
When set to true, the RDD's lineage is printed while a job is being executed.
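A minimal sketch of switching the setting on programmatically, assuming a standalone local application. The application name and master URL are placeholders. To the best of my knowledge the lineage is written through Spark's logger at INFO level, so it only becomes visible if the logging configuration lets INFO messages through.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name and master; adjust them to your environment.
val conf = new SparkConf()
  .setAppName("lineage-logging-demo")
  .setMaster("local[*]")
  .set("spark.logLineage", "true") // the default is false

val sc = new SparkContext(conf)

// Running an action triggers a job; with the setting enabled, the lineage
// of the RDD being computed is logged before the job runs.
sc.parallelize(0 to 9).map(n => (n, n)).count()

sc.stop()
```

In spark-shell the sc value already exists, so the setting has to be supplied at startup instead, for example with --conf spark.logLineage=true.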

Logical Execution Plan

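The logical execution plan starts with the earliest RDDs (the root nodes, which have no parents: RDDs that do not depend on any other RDD, or that reference cached data) and ends with the RDD that produces the result of the action being executed.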
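To make "starts at the roots, ends at the final RDD" concrete, here is a toy model in plain Scala. It is only a sketch: Node is a hypothetical stand-in for an RDD rather than a Spark class, and the final node named result is invented for the example.

```scala
// Toy model only: Node is a hypothetical stand-in for an RDD.
final case class Node(name: String, parents: Seq[Node] = Nil)

// Root nodes have no parents; the logical plan starts from them.
def roots(n: Node): Set[String] =
  if (n.parents.isEmpty) Set(n.name) else n.parents.flatMap(roots).toSet

// Lists nodes so that every parent appears before its children;
// the node the action is called on comes last.
def plan(n: Node): Seq[String] =
  (n.parents.flatMap(plan) :+ n.name).distinct

val r00 = Node("r00")
val r01 = Node("r01")
val r12 = Node("r12", Seq(r00, r01)) // like r00 zip r01
val r13 = Node("r13", Seq(r01))      // like r01.keyBy(...)
val result = Node("result", Seq(r12, r13))

println(roots(result))
println(plan(result)) // List(r00, r01, r12, r13, result)
```

The two parallelized collections are the roots, and the node the action is invoked on closes the plan. Spark itself tracks these relationships through each RDD's dependencies rather than through a structure like this one.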