Spark: Spark Core source code analysis of RangePartitioner

阿新 • Published: 2018-12-12

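Partitioning process overview

How RangePartitioner partitioning works:

Compute the overall sample size sampleSize. The rule is 20 samples per output partition, capped at 1M (10^6) samples in total.

From sampleSize and the number of partitions of the RDD, compute sampleSizePerPartition, the maximum number of samples taken from each partition.
Run reservoir sampling on each partition with that limit. This returns the total record count of the RDD and, for each partition, its ID, its record count and its sampled keys.
Partitions found to hold far more data than the average are sampled again through RDD.sample.
From the sampled array candidates: ArrayBuffer[(K, weight)], compute the array of partition boundaries (rangeBounds in the code).
On lookup, if there are at most 128 bounds the array is scanned linearly; with more than 128, binary search finds the interval the key falls into. The resulting index into the bounds array is the partition ID.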
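The first two steps are plain arithmetic, so they can be tried without Spark. The following is a minimal sketch: the object name SampleSizeSketch and the parameter names outputPartitions and inputPartitions are made up for illustration, and only the two formulas come from the source.

```scala
object SampleSizeSketch {
  // Total number of keys to sample: 20 per output partition, capped at 1e6.
  def sampleSize(outputPartitions: Int): Double =
    math.min(20.0 * outputPartitions, 1e6)

  // Cap on the samples taken from each input partition (3x over-sampling).
  def sampleSizePerPartition(outputPartitions: Int, inputPartitions: Int): Int =
    math.ceil(3.0 * sampleSize(outputPartitions) / inputPartitions).toInt

  def main(args: Array[String]): Unit = {
    // 10 output partitions, RDD with 10 input partitions
    println(sampleSize(10))                 // 200.0
    println(sampleSizePerPartition(10, 10)) // 60
  }
}
```

With 10 output partitions and 10 input partitions, each input partition contributes at most 60 samples, which is the cap used in the numeric example further down.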

RangePartitioner

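// Condensed outline of how rangeBounds is computed (not compilable as-is; the full code follows below)
class RangePartitioner(partitions, rdd) {
  // 1. Total sample size: 20 samples per output partition, capped at 1e6
  val sampleSize = math.min(20.0 * partitions, 1e6)
  // 2. Maximum number of samples per input partition (over-sampled 3x)
  val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
  // 3. Sample the keys; returns the total record count and, per partition, (id, record count, samples)
  val (numItems, sketched) = RangePartitioner.sketch(
    rdd.map(_._1), sampleSizePerPartition)
  // 4. Overall sampling fraction, at most 1.0
  val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
  // 5. Buffer of sampled keys, each paired with its weight
  val candidates = ArrayBuffer.empty[(K, Float)]
  // 6. Ids of the partitions that need a second sampling pass
  val imbalancedPartitions = mutable.Set.empty[Int]
  sketched.foreach { case (idx, n, sample) =>
    if (fraction * n > sampleSizePerPartition) {
      // Record the id of the partition to re-sample
      imbalancedPartitions += idx
    } else {
      // Weight: number of records each sampled key stands for
      val weight = (n.toDouble / sample.length).toFloat
      for (key <- sample) {
        // Record the sampled key together with its weight
        candidates += ((key, weight))
      }
    }
  }
  // 7. Re-sample the imbalanced partitions
  if (imbalancedPartitions.nonEmpty) {
    val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
    val seed = byteswap32(-rdd.id - 1)
    // Use the RDD sample API
    val reSampled = imbalanced.sample(
      withReplacement = false, fraction, seed).collect()
    candidates ++= reSampled.map(x => (x, (1.0 / fraction).toFloat))
  }
  // 8. Build the array of bounds
  RangePartitioner.determineBounds(candidates, partitions)
}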

rangeBounds

 // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      //  This is the sample size we need to have roughly balanced output partitions, capped at 1M.
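      //  That is, 20 samples per output partition, capped at 10^6 samples in total.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      // Data may be skewed across the RDD's partitions. Multiplying by 3 lets small partitions contribute enough samples; partitions that turn out to be much larger are sampled a second time.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      // Sample the keys of the rdd. Returns (total record count, Array[(partition id, record count of that partition, keys sampled from that partition)])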
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        // The total record count is 0 (the RDD is empty), so return an empty array
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
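        // Decide whether re-sampling is needed: a partition holding far more records than the average is sampled again.
        // Sampling fraction: the target sample size divided by the total record count, at most 1.0
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        //  Buffer for the sampled keys, each paired with its weight
        val candidates = ArrayBuffer.empty[(K, Float)]
        // Ids of the imbalanced partitions (those whose expected sample count exceeds sampleSizePerPartition)
        val imbalancedPartitions = mutable.Set.empty[Int]
        // Iterate over the sampled data of each partition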
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
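            // fraction * n is the number of samples this partition should contribute. If it exceeds sampleSizePerPartition,
            // the partition is large and the first pass drew too few samples from it, so it must be sampled again.
            // e.g. with a cap of 60: 0.2 * 100 = 20 < 60 is fine, but 0.2 * 20000 = 4000 > 60 is imbalanced.
            imbalancedPartitions += idx // record the id of the partition to re-sample
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat // records represented by each sample; the larger the partition, the larger the weight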
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        // Sample the imbalanced partitions of the RDD again
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          // Use the RDD sample API to draw the samples
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
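        // Compute the array of partition bounds from the final candidates and return it. The bounds are key values of the RDD.
        // e.g. Array(10, 20, 30) means keys <= 10 go to partition 0, keys in (10, 20] to partition 1,
        // keys in (20, 30] to partition 2 and keys > 30 to partition 3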
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }
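The decision between "keep the first-pass samples" and "sample this partition again" can be reproduced on plain collections. This is a minimal sketch, not Spark code: the object ImbalanceSketch, the classify method and the Int key type are assumptions made for illustration, and the sketched input is hand-written instead of coming from RangePartitioner.sketch.

```scala
object ImbalanceSketch {
  // (partition id, record count, sampled keys), shaped like one entry of `sketched`
  type Sketch = (Int, Long, Seq[Int])

  // Splits the sketched partitions into weighted candidates and the ids to re-sample.
  def classify(
      sketched: Seq[Sketch],
      sampleSize: Double,
      sampleSizePerPartition: Int): (Seq[(Int, Float)], Set[Int]) = {
    val numItems = sketched.map(_._2).sum
    val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
    // A partition is imbalanced when its expected sample count exceeds the per-partition cap.
    val imbalanced = sketched.collect {
      case (idx, n, _) if fraction * n > sampleSizePerPartition => idx
    }.toSet
    // Every other partition keeps its samples, weighted by records per sample.
    val candidates = for {
      (idx, n, sample) <- sketched
      if !imbalanced.contains(idx)
      key <- sample
    } yield (key, (n.toDouble / sample.length).toFloat)
    (candidates, imbalanced)
  }
}
```

For example, with sampleSize = 40 and a cap of 30, four partitions holding 10, 10, 10 and 970 records give fraction = 0.04; only the last one has 0.04 * 970 = 38.8 > 30, so it alone is marked for re-sampling.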

sketch

  def sketch[K : ClassTag](
      rdd: RDD[K],
      sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {
    val shift = rdd.id
    // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
    val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
      val seed = byteswap32(idx ^ (shift << 16))
      /* Reservoir sampling: returns the sampled keys and the record count of this partition */
      val (sample, n) = SamplingUtils.reservoirSampleAndCount(
        iter, sampleSizePerPartition, seed)
      Iterator((idx, n, sample))
    }.collect()
    // Total record count across all partitions of the RDD
    val numItems = sketched.map(_._2).sum
    (numItems, sketched)
  }
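Reservoir sampling is what lets sketch draw a fixed-size sample and count the records in a single pass over each partition. Below is a simplified sketch of the technique (Algorithm R). It is not the body of SamplingUtils.reservoirSampleAndCount: the object name ReservoirSketch is hypothetical and scala.util.Random is used here as an assumed stand-in for whatever random generator Spark uses internally.

```scala
import scala.reflect.ClassTag
import scala.util.Random

object ReservoirSketch {
  // Returns up to k uniformly sampled items plus the number of items seen.
  def reservoirSampleAndCount[T: ClassTag](
      input: Iterator[T], k: Int, seed: Long): (Array[T], Long) = {
    val reservoir = new Array[T](k)
    var i = 0
    // Fill the reservoir with the first k items.
    while (i < k && input.hasNext) {
      reservoir(i) = input.next()
      i += 1
    }
    if (i < k) {
      // Fewer than k items in total: return only the filled prefix.
      (reservoir.take(i), i.toLong)
    } else {
      var seen = i.toLong
      val rand = new Random(seed)
      while (input.hasNext) {
        val item = input.next()
        seen += 1
        // Keep the new item with probability k / seen, overwriting a random slot.
        val slot = (rand.nextDouble() * seen).toLong
        if (slot < k) reservoir(slot.toInt) = item
      }
      (reservoir, seen)
    }
  }
}
```

A partition with fewer than k records is returned whole, which is why small partitions get a weight of exactly 1.0 in rangeBounds.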

determineBounds

 /**
   *  Determines the bounds for range partitioning from candidates with weights indicating how many
   *  items each represents. Usually this is 1 over the probability used to sample this candidate.
   *  Returns an array of at most partitions - 1 keys; the i-th element is the upper bound of the keys in the i-th partition.
   *
   * @param candidates unordered candidates with weights (the sampled keys, each with its weight)
   * @param partitions number of partitions
   * @return selected bounds
   */
  def determineBounds[K : Ordering : ClassTag](
      candidates: ArrayBuffer[(K, Float)],
      partitions: Int): Array[K] = {
    val ordering = implicitly[Ordering[K]]
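    // Sort by key in ascending order, so that the resulting ranges are ordered across partitions
    val ordered = candidates.sortBy(_._1)
    // Number of sampled keys
    val numCandidates = ordered.size
    // Sum of all weights
    val sumWeights = ordered.map(_._2.toDouble).sum
    // Step size: the sum of weights estimates the total record count, so dividing it by the number of partitions gives the target number of records per partition
    val step = sumWeights / partitions
    var cumWeight = 0.0
    // The first target is one step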
    var target = step
    val bounds = ArrayBuffer.empty[K]
    var i = 0
    var j = 0
    var previousBound = Option.empty[K]
    // Iterate over the sampled keys
    while ((i < numCandidates) && (j < partitions - 1)) {
      val (key, weight) = ordered(i)
      // Accumulate the weight; once it reaches the target, record the key as a bound
      cumWeight += weight
      if (cumWeight >= target) {
        // Skip duplicate values. Equal keys must land in the same partition, so only a strictly greater key can become a new bound
        if (previousBound.isEmpty || ordering.gt(key, previousBound.get)) {
          bounds += key // record the bound
          target += step
          j += 1
          previousBound = Some(key)
        }
      }
      i += 1
    }
    bounds.toArray
  }
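To see what determineBounds produces, it helps to run the same loop on a handful of keys. The following sketch restates the logic above for Int keys only; the object name BoundsSketch and the simplified signature are assumptions for illustration, not Spark's API.

```scala
import scala.collection.mutable.ArrayBuffer

object BoundsSketch {
  // Same cumulative-weight walk as above, specialised to Int keys.
  def determineBounds(candidates: Seq[(Int, Float)], partitions: Int): Array[Int] = {
    val ordered = candidates.sortBy(_._1)
    val step = ordered.map(_._2.toDouble).sum / partitions
    var cumWeight = 0.0
    var target = step
    val bounds = ArrayBuffer.empty[Int]
    var previous = Option.empty[Int]
    val it = ordered.iterator
    while (it.hasNext && bounds.size < partitions - 1) {
      val (key, weight) = it.next()
      cumWeight += weight
      // A key becomes a bound once the target is reached and it is not a duplicate.
      if (cumWeight >= target && previous.forall(key > _)) {
        bounds += key
        target += step
        previous = Some(key)
      }
    }
    bounds.toArray
  }

  def main(args: Array[String]): Unit = {
    // Keys 1..9 with weight 1.0 each, split into 3 partitions
    val uniform = (1 to 9).map(k => (k, 1.0f))
    println(determineBounds(uniform, 3).mkString(", ")) // 3, 6
  }
}
```

With nine keys of weight 1.0 and three partitions the step is 3.0, so the bounds are 3 and 6. With the keys 1, 1, 1, 1, 2, 3 the duplicate rule matters: the fourth 1 reaches the second target but is skipped, and 2 becomes the second bound.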

getPartition

// Returns the partition id for the given key of the RDD, starting from 0
  def getPartition(key: Any): Int = {
    // Cast the key to K, the original key type of the RDD
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have at most 128 bounds, use a naive search:
      // loop locally over rangeBounds to find the index of the partition that k belongs to
      // ordering.gt(x, y) returns true if x > y
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
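      // With more than 128 bounds, use binary search to find the index for k.
      // If k does not occur in rangeBounds, the result is a negative number that encodes the insertion point; an insertion point of rangeBounds.length means k is greater than every bound and belongs to the last partition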
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    //  Return the index according to the sort order, ascending or descending; ascending by default
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }
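The lookup can be exercised on its own with a fixed bounds array. This is a minimal sketch under stated assumptions: the keys are Int, java.util.Arrays.binarySearch stands in for the binarySearch helper used by Spark, and the object name LookupSketch is hypothetical.

```scala
import java.util.Arrays

object LookupSketch {
  // Maps key k to a partition id given the sorted upper bounds.
  def getPartition(rangeBounds: Array[Int], k: Int, ascending: Boolean = true): Int = {
    var partition = 0
    if (rangeBounds.length <= 128) {
      // Linear scan: stop at the first bound that is >= k.
      while (partition < rangeBounds.length && k > rangeBounds(partition)) partition += 1
    } else {
      // binarySearch returns the match index, or -(insertion point) - 1 if k is absent.
      partition = Arrays.binarySearch(rangeBounds, k)
      if (partition < 0) partition = -partition - 1
      if (partition > rangeBounds.length) partition = rangeBounds.length
    }
    if (ascending) partition else rangeBounds.length - partition
  }

  def main(args: Array[String]): Unit = {
    val bounds = Array(10, 20, 30)
    println(getPartition(bounds, 10)) // 0
    println(getPartition(bounds, 11)) // 1
    println(getPartition(bounds, 31)) // 3
  }
}
```

Three bounds define four partitions: a key equal to a bound stays in the partition that bound closes, and a key above every bound goes to the last partition. In descending mode the index is mirrored, so key 31 maps to partition 0.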
