Big Data Development: Is join a Wide or Narrow Dependency? A Look at the cogroup Implementation
Published: 2021-02-13
A previous article, "Big Data Development: Spark Join Principles Explained", covered the theory; this article looks at how join is implemented on top of cogroup from a source-code perspective.
# 1. Analyzing the code below
```Scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    val random = scala.util.Random
    val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
    val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"),
      (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))
    val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
    val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

    // join without pre-partitioning
    val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
    println(rdd3.dependencies)

    // join after both sides are partitioned by the same HashPartitioner(3)
    val rdd4: RDD[(Int, (String, String))] =
      rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
    println(rdd4.dependencies)

    sc.stop()
  }
}
```
Looking at the code above: what do the two println calls output, is each join a wide or a narrow dependency, and why?
# 2. Checking the execution in the Spark UI
As section 2.1.3 of the previous article (how to tell wide from narrow dependencies) explains, stage boundaries correspond to wide dependencies, so the stage DAGs of rdd3 and rdd4 make the distinction visible. For rdd3, the join carves out a new stage, so producing rdd3 is a wide dependency. `rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))` has a separate DAG in which no new stage appears after partitionBy, so that join is a narrow dependency.
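One caveat: because join finishes with a `flatMapValues`, `rdd3.dependencies` only shows the final MapPartitionsRDD's immediate (one-to-one) dependency, so the println alone does not reveal the shuffle. A small helper of my own (not part of the original code; paste it into the demo's main or a spark-shell after rdd3/rdd4 are defined) walks the lineage and prints each RDD's direct dependency types:

```Scala
import org.apache.spark.rdd.RDD

// Recursively print each RDD in the lineage together with the types of its direct
// dependencies, so you can see at which step a ShuffleDependency appears.
def printLineage(rdd: RDD[_], indent: String = ""): Unit = {
  println(s"$indent${rdd.getClass.getSimpleName} -> " +
    rdd.dependencies.map(_.getClass.getSimpleName).mkString(", "))
  rdd.dependencies.foreach(dep => printLineage(dep.rdd, indent + "  "))
}

// printLineage(rdd3)  // the CoGroupedRDD level shows ShuffleDependency on both parents
// printLineage(rdd4)  // the CoGroupedRDD level shows OneToOneDependency on both parents
```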
![file](http://jeff.spring4all.com/FhBTWpUB7pu_DyRGp4VZ9IMKx1MW)
![file](http://jeff.spring4all.com/Ft9l6DJcMGv4Z_pGWtxmnN9FSv6f)
# 3. The join implementation in the source code
The conclusion above came from the UI; now let's see how join is actually implemented (based on Spark 2.4.5).
Start with the entry method. `withScope` can be thought of as a decorator: to surface more information in the Spark UI, every RDD-creating method is wrapped in it, and `RDDOperationScope` records each RDD's operation history and relationships, which is enough to achieve that goal.
```Scala
/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Performs a hash join across the cluster.
 */
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
  join(other, defaultPartitioner(self, other))
}
```
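For intuition, here is a rough conceptual sketch of that decorator pattern (my own illustration, not the real implementation; the real one lives in `RDDOperationScope.withScope` and stores scope information in the SparkContext's local properties rather than printing anything):

```Scala
object ScopeSketch {
  // Run an RDD-building body inside a named scope so its result can be attributed
  // to that operation (real Spark records the scope instead of printing it).
  def withScopeLike[T](scopeName: String)(body: => T): T = {
    println(s"open scope: $scopeName")
    try body
    finally println(s"close scope: $scopeName")
  }

  def main(args: Array[String]): Unit = {
    val result = withScopeLike("join") { 1 + 1 } // stand-in for code that builds an RDD
    println(result)
  }
}
```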
Next, look at `defaultPartitioner`. Its job is to choose between any partitioner already set on the input RDDs and a default one, returning the existing partitioner when it is eligible or has more partitions than the default.
```Scala
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = (Seq(rdd) ++ others)
  // collect the RDDs that already have a partitioner set
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

  // among those, pick the one with the largest number of partitions
  val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
    Some(hasPartitioner.maxBy(_.partitions.length))
  } else {
    None
  }

  // if spark.default.parallelism is set, use it; otherwise use the largest partition count
  val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
    rdd.context.defaultParallelism
  } else {
    rdds.map(_.partitions.length).max
  }

  // If the existing max partitioner is an eligible one, or its partitions number is larger
  // than the default number of partitions, use the existing partitioner.
  // In other words: reuse an input RDD's partitioner when it is eligible or has more
  // partitions than the default; otherwise build a new HashPartitioner with the default count.
  if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
      defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
    hasMaxPartitioner.get.partitioner.get
  } else {
    new HashPartitioner(defaultNumPartitions)
  }
}

// an existing partitioner is "eligible" when its partition count is within one order of
// magnitude of the largest partition count among the input RDDs
private def isEligiblePartitioner(
    hasMaxPartitioner: RDD[_],
    rdds: Seq[RDD[_]]): Boolean = {
  val maxPartitions = rdds.map(_.partitions.length).max
  log10(maxPartitions) - log10(hasMaxPartitioner.getNumPartitions) < 1
}
```
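Applied to the demo in section 1, the outcome can be checked by inspecting the joined RDD's partitioner. A hedged trace of my own, reusing `rdd1` and `rdd2` from the demo's main (the exact default partition count depends on the local[*] core count and on spark.default.parallelism):

```Scala
// Neither rdd1 nor rdd2 carries a partitioner, so defaultPartitioner builds a new
// HashPartitioner with the default number of partitions for the cogroup.
val p3 = rdd1.join(rdd2).partitioner.get
println(p3.numPartitions) // max of the inputs' partition counts (or spark.default.parallelism)

// Both pre-partitioned inputs carry a HashPartitioner(3), so that partitioner is reused.
val p4 = rdd1.partitionBy(new HashPartitioner(3))
  .join(rdd2.partitionBy(new HashPartitioner(3))).partitioner.get
println(p4.numPartitions) // 3
```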
Now step into the overloaded join method, which builds `new CoGroupedRDD[K](Seq(self, other), partitioner)`.
```Scala
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  // partitioner is the one chosen by defaultPartitioner above; what matters most is its partition count
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}

/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Performs a hash join across the cluster.
 */
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))] = self.withScope {
  join(other, new HashPartitioner(numPartitions))
}
```
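To see what that cogroup-plus-flatMapValues composition actually does to the data, here is a tiny self-contained example of my own (not from the article) showing that it reproduces inner-join semantics:

```Scala
import org.apache.spark.{SparkConf, SparkContext}

object CogroupJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CogroupJoinSketch").setMaster("local[*]"))
    val left  = sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
    val right = sc.makeRDD(Seq((1, "X"), (3, "Y")))

    // For each key, pair every left value with every right value; keys with an empty
    // side on either RDD produce nothing, which is exactly inner-join behaviour.
    val joined = left.cogroup(right).flatMapValues { case (vs, ws) =>
      for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
    }

    joined.collect().foreach(println) // prints (1,(a,X)) and (1,(b,X)); keys 2 and 3 vanish
    sc.stop()
  }
}
```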
Finally, CoGroupedRDD is where wide versus narrow is decided: for each input RDD, if its partitioner is the same as the partitioner chosen above, the dependency is one-to-one (narrow); otherwise it is a shuffle dependency (wide).
```Scala
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
```
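Note that the check is a plain equality comparison. `HashPartitioner` defines `equals` in terms of the number of partitions, so two independently constructed `HashPartitioner(3)` instances compare equal, which is why the pre-partitioned join in section 1 lands in the one-to-one branch. A small standalone check of my own:

```Scala
import org.apache.spark.HashPartitioner

object PartitionerEqualitySketch {
  def main(args: Array[String]): Unit = {
    // equal partition counts -> equal partitioners -> OneToOneDependency (narrow)
    println(new HashPartitioner(3) == new HashPartitioner(3)) // true
    // different partition counts -> not equal -> ShuffleDependency (wide)
    println(new HashPartitioner(3) == new HashPartitioner(4)) // false
  }
}
```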
To summarize: you can specify the partitioning when joining. If the RDDs on both sides of the join share the **same partitioning scheme and the same number of partitions**, no shuffle is produced and the join is a narrow dependency; otherwise a shuffle occurs and it is a wide dependency. Both the scheme and the partition count are embodied by the partitioner.
吳邪 (小三爺), a rookie knocking around in backend, big data, and AI.