Spark BlockManager原始碼分析

阿新 • • 發佈：2018-12-26

Spark BlockManager原始碼分析

前置條件

Hadoop版本: Hadoop 2.6.0-cdh5.15.0
Spark版本: SPARK 1.6.0-cdh5.15.0

JDK.1.8.0_191
scala2.10.7

技能標籤

Spark Block 折分local 和 remote blocks
local Block讀取的資料是 FileSegmentManagedBuffer物件包含//(ShuffleBlockId,Block資料檔案偏移量，長度，資料檔案位置)
remote blocks 往變數results 放物件new SuccessFetchResult(BlockId(blockId), address, sizeMap(blockId), buf)

描述

ShuffleRDD讀取資料

ShuffleRDD讀到的資料是上一個RDD輸出資料，即ShuffleMapTask的輸出結果(一個分割槽資料檔案和一個分割槽對應的偏移量檔案)
分割槽資料檔案是有序的，先按分割槽數從小到大排序，再按(key,value)中的key從小到大進行排序
通過BlockStoreShuffleReader.read()對Block資料進行讀取
通過 MapOutputTracker得到[(BlockManagerId,[(ShuffleBlockId,長度)])],相當於有多少臺機器進行executor運算，就有多少個BlockManagerId(即有多少個BlockManager進行管理),每個BlockManager管理著多個數據分割槽(即多個ShuffleBlockId)

通過 ShuffleBlockFetcherIterator()進行本地Block和遠端Block資料進行處理，會得到一個 (BlockID, InputStream)這樣元素的迭代器，然後進行flatMap()操作，返序列化流物件並轉化為KeyValueIterator迭代器，通過兩個迭代器就可以遍歷當前分割槽的所有輸入資料，由於Block資料塊預設的大小是128m，也就是每個BlockID最大資料大小是128m，而且是一次只處理一個BlockID資料，所以此時是不會記憶體益處的
然後對迭代器中的(key,value)元素進行遍歷，並插入ExternalAppendOnlyMap物件，如果資料滿足插入臨時檔案的條件，會把記憶體中的資料插入到臨時檔案，並儲存DiskMapIterator物件到變數spilledMaps中，就是防止資料太大，記憶體溢位

本地Block

本地Block是BlockManager負責拉取
即ShuffleMapTask和ResultTask在同一臺機器上執行時，這個時候只需要通過

遠端Block

遠和Block是ShuffleClient負責呼叫，BlockTransferService負責拉取

ResultTask讀取ShuffleMapTask輸出的資料，ResultTask讀取的RDD分割槽檔案分兩種，一種中本臺機器上的資料檔案(也就是ResultTask和ShuffleMapTask在同一臺機器上),這時能過BlockManager就可以讀到對應分割槽的資料檔案

ShuffleRDD

ShuffleRDD.compute()方法


  override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    //SortShuffleManager.getReader()方法反回BlockStoreShuffleReader
    //呼叫BlockStoreShuffleReader.read()方法
    SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
      .read()
      .asInstanceOf[Iterator[(K, C)]]
  }

BlockStoreShuffleReader.read()方法
讀到當前reduce task 的當前分割槽的需要合併的key-values

/** Read the combined key-values for this reduce task */
  override def read(): Iterator[Product2[K, C]] = {
    val streamWrapper: (BlockId, InputStream) => InputStream = { (blockId, in) =>
      blockManager.wrapForCompression(blockId,
        CryptoStreamUtils.wrapForEncryption(in, blockManager.conf))
    }

    //通過 ShuffleBlockFetcherIterator()進行本地Block和遠端Block資料進行處理，會得到一個 (BlockID, InputStream)這樣元素的迭代器，然後進行flatMap()操作，返序列化流物件並轉化為KeyValueIterator迭代器，通過兩個迭代器就可以遍歷當前分割槽的所有輸入資料，由於Block資料塊預設的大小是128m，也就是每個BlockID最大資料大小是128m，而且是一次只處理一個BlockID資料，所以此時是不會記憶體益處的
    val wrappedStreams = new ShuffleBlockFetcherIterator(
      context,
      blockManager.shuffleClient,
      blockManager,
      //通過 MapOutputTracker得到[(BlockManagerId,[(ShuffleBlockId,長度)])],相當於有多少臺機器進行executor運算，就有多少個BlockManagerId(即有多少個BlockManager進行管理),每個BlockManager管理著多個數據分割槽(即多個ShuffleBlockId)
      mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
      streamWrapper,
      // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
      SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024,
      SparkEnv.get.conf.getBoolean("spark.shuffle.detectCorrupt", true))

    val ser = Serializer.getSerializer(dep.serializer)
    val serializerInstance = ser.newInstance()

    // Create a key/value iterator for each stream
    val recordIter = wrappedStreams.flatMap { case (blockId, wrappedStream) =>
      // Note: the asKeyValueIterator below wraps a key/value iterator inside of a
      // NextIterator. The NextIterator makes sure that close() is called on the
      // underlying InputStream when all records have been read.
      serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
    }

    // Update the context task metrics for each record read.
    val readMetrics = context.taskMetrics.createShuffleReadMetricsForDependency()
    val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
      recordIter.map(record => {
        readMetrics.incRecordsRead(1)
        record
      }),
      context.taskMetrics().updateShuffleReadMetrics())

    // An interruptible iterator must be used here in order to support task cancellation
    val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)

    val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        // We are reading values that are already combined
        val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
        //然後對迭代器中的(key,value)元素進行遍歷，並插入ExternalAppendOnlyMap物件，如果資料滿足插入臨時檔案的條件，會把記憶體中的資料插入到臨時檔案，並儲存DiskMapIterator物件到變數spilledMaps中，就是防止資料太大，記憶體溢位
        dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
      } else {
        // We don't know the value type, but also don't care -- the dependency *should*
        // have made sure its compatible w/ this aggregator, which will convert the value
        // type to the combined type C
        val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
        dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
      }
    } else {
      require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
      interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
    }

    // Sort the output if there is a sort ordering defined.
    //如查定義的排序就進行排序操作，如果沒有就不排序
    dep.keyOrdering match {
      case Some(keyOrd: Ordering[K]) =>
        // Create an ExternalSorter to sort the data. Note that if spark.shuffle.spill is disabled,
        // the ExternalSorter won't spill to disk.
        val sorter =
          new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = Some(ser))
        sorter.insertAll(aggregatedIter)
        context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
        context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
        context.internalMetricsToAccumulators(
          InternalAccumulator.PEAK_EXECUTION_MEMORY).add(sorter.peakMemoryUsedBytes)
        CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
      case None =>
        aggregatedIter
    }
  }

ShuffleBlockFetcherIterator.initialize()


 private[this] def initialize(): Unit = {
    // Add a task completion callback (called in both success case and failure case) to cleanup.
    context.addTaskCompletionListener(_ => cleanup())

    // Split local and remote blocks.
    //拆分本地Block 和 遠端Block
    val remoteRequests = splitLocalRemoteBlocks()
    // Add the remote requests into our queue in a random order
    
    //隨機取遠端的
    fetchRequests ++= Utils.randomize(remoteRequests)

    // Send out initial requests for blocks, up to our maxBytesInFlight
    fetchUpToMaxBytes()

    val numFetches = remoteRequests.size - fetchRequests.size
    logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

    // Get Local Blocks
    //處理本地Block
    fetchLocalBlocks()
    logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
  }

本地Block處理

ShuffleBlockFetcherIterator.fetchLocalBlocks()
拉取本地Block
對localBlocks進行迭代，

  /**
   * Fetch the local blocks while we are fetching remote blocks. This is ok because
   * [[ManagedBuffer]]'s memory is allocated lazily when we create the input stream, so all we
   * track in-memory are the ManagedBuffer references themselves.
   */
  private[this] def fetchLocalBlocks() {
    val iter = localBlocks.iterator
    while (iter.hasNext) {
      val blockId = iter.next()
      try {
      //BlockManager.getBlockData 
      //buf等於  (ShuffleBlockId,Block資料檔案偏移量，長度，資料檔案位置)
        val buf = blockManager.getBlockData(blockId)
        shuffleMetrics.incLocalBlocksFetched(1)
        shuffleMetrics.incLocalBytesRead(buf.size)
        buf.retain()
        results.put(new SuccessFetchResult(blockId, blockManager.blockManagerId, 0, buf))
      } catch {
        case e: Exception =>
          // If we see an exception, stop immediately.
          logError(s"Error occurred while fetching local blocks", e)
          results.put(new FailureFetchResult(blockId, blockManager.blockManagerId, e))
          return
      }
    }
  }

BlockManager.getBlockData



 /**
   * Interface to get local block data. Throws an exception if the block cannot be found or
   * cannot be read successfully.
   */
  override def getBlockData(blockId: BlockId): ManagedBuffer = {
    if (blockId.isShuffle) {
    //shuffleManager.shuffleBlockResolver得到 IndexShuffleBlockResolver
    //呼叫IndexShuffleBlockResolver.getBlockData
    //得到 FileSegmentManagedBuffer的物件，該物件存放在當前(ShuffleBlockId,Block資料檔案偏移量，長度，資料檔案位置)
      shuffleManager.shuffleBlockResolver.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
    } else {
      val blockBytesOpt = doGetLocal(blockId, asBlockResult = false)
        .asInstanceOf[Option[ByteBuffer]]
      if (blockBytesOpt.isDefined) {
        val buffer = blockBytesOpt.get
        new NioManagedBuffer(buffer)
      } else {
        throw new BlockNotFoundException(blockId.toString)
      }
    }
  }

IndexShuffleBlockResolver.getBlockData


  override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
    // The block is actually going to be a range of a single map output file for this map, so
    // find out the consolidated file, then the offset within that from our index
    val indexFile = getIndexFile(blockId.shuffleId, blockId.mapId)

    val in = new DataInputStream(new FileInputStream(indexFile))
    try {
      ByteStreams.skipFully(in, blockId.reduceId * 8)
      val offset = in.readLong()
      val nextOffset = in.readLong()
      //(ShuffleBlockId,Block資料檔案偏移量，長度，資料檔案位置)
      new FileSegmentManagedBuffer(
        transportConf,
        getDataFile(blockId.shuffleId, blockId.mapId),
        offset,
        nextOffset - offset)
    } finally {
      in.close()
    }
  }

遠端Block處理

ShuffleBlockFetcherIterator.initialize


  private[this] def initialize(): Unit = {
    // Add a task completion callback (called in both success case and failure case) to cleanup.
    context.addTaskCompletionListener(_ => cleanup())

    // Split local and remote blocks.
    // 折分本地Block和遠端Block
    val remoteRequests = splitLocalRemoteBlocks()
    // Add the remote requests into our queue in a random order
    fetchRequests ++= Utils.randomize(remoteRequests)

    // Send out initial requests for blocks, up to our maxBytesInFlight
    fetchUpToMaxBytes()

    val numFetches = remoteRequests.size - fetchRequests.size
    logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

    // Get Local Blocks
    fetchLocalBlocks()
    logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
  }

ShuffleBlockFetcherIterator.splitLocalRemoteBlocks


private[this] def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
    // Make remote requests at most maxBytesInFlight / 5 in length; the reason to keep them
    // smaller than maxBytesInFlight is to allow multiple, parallel fetches from up to 5
    // nodes, rather than blocking on reading output from one node.
    val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
    logDebug("maxBytesInFlight: " + maxBytesInFlight + ", targetRequestSize: " + targetRequestSize)

    // Split local and remote blocks. Remote blocks are further split into FetchRequests of size
    // at most maxBytesInFlight in order to limit the amount of data in flight.
    val remoteRequests = new ArrayBuffer[FetchRequest]

    // Tracks total number of blocks (including zero sized blocks)
    var totalBlocks = 0
    //blocksByAddress是mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition)的返回值[(BlockManagerId,[(ShuffleBlockId,長度)])]
    
    for ((address, blockInfos) <- blocksByAddress) {
      totalBlocks += blockInfos.size
      if (address.executorId == blockManager.blockManagerId.executorId) {
        // Filter out zero-sized blocks
        //把本地Block,放到變數localBlocks   過濾，長度不為空的，就是有資料的Block,元素的表現為:   (ShuffleBlockId,長度)])
        localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1)
        numBlocksToFetch += localBlocks.size
      } else {
        val iterator = blockInfos.iterator
        var curRequestSize = 0L
        var curBlocks = new ArrayBuffer[(BlockId, Long)]
        while (iterator.hasNext) {
          val (blockId, size) = iterator.next()
          // Skip empty blocks
          if (size > 0) {
            curBlocks += ((blockId, size))
            remoteBlocks += blockId
            numBlocksToFetch += 1
            curRequestSize += size
          } else if (size < 0) {
            throw new BlockException(blockId, "Negative block size " + size)
          }
          if (curRequestSize >= targetRequestSize) {
            // Add this FetchRequest
            remoteRequests += new FetchRequest(address, curBlocks)
            curBlocks = new ArrayBuffer[(BlockId, Long)]
            logDebug(s"Creating fetch request of $curRequestSize at $address")
            curRequestSize = 0
          }
        }
        // Add in the final request
        if (curBlocks.nonEmpty) {
         //把遠端Block,放到變數remoteRequests
          // var curBlocks = new ArrayBuffer[(BlockId, Long)]
          remoteRequests += new FetchRequest(address, curBlocks)
        }
      }
    }
    logInfo(s"Getting $numBlocksToFetch non-empty blocks out of $totalBlocks blocks")
    remoteRequests
  }

ShuffleBlockFetcherIterator.fetchUpToMaxBytes


  private def fetchUpToMaxBytes(): Unit = {
    // Send fetch requests up to maxBytesInFlight
    while (fetchRequests.nonEmpty &&
      (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
      //把fetchRequests.中的第一個元素，作為請求引數呼叫sendRequest()傳送過去
      //元素型別為FetchRequest(address, Array((blockId, size)))
      sendRequest(fetchRequests.dequeue())
    }
  }

ShuffleBlockFetcherIterator.sendRequest(),成功後呼叫方法onBlockFetchSuccess(),並把結果放到 results


private[this] def sendRequest(req: FetchRequest) {
    logDebug("Sending request for %d blocks (%s) from %s".format(
      req.blocks.size, Utils.bytesToString(req.size), req.address.hostPort))
    bytesInFlight += req.size

    // so we can look up the size of each blockID
    val sizeMap = req.blocks.map { case (blockId, size) => (blockId.toString, size) }.toMap
    val blockIds = req.blocks.map(_._1.toString)

    val address = req.address
    shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
      new BlockFetchingListener {
        override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
          // Only add the buffer to results queue if the iterator is not zombie,
          // i.e. cleanup() has not been called yet.
          if (!isZombie) {
            // Increment the ref count because we need to pass this to a different thread.
            // This needs to be released after use.
            buf.retain()
            results.put(new SuccessFetchResult(BlockId(blockId), address, sizeMap(blockId), buf))
            shuffleMetrics.incRemoteBytesRead(buf.size)
            shuffleMetrics.incRemoteBlocksFetched(1)
          }
          logTrace("Got remote block " + blockId + " after " + Utils.getUsedTimeMs(startTime))
        }

        override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
          logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
          results.put(new FailureFetchResult(BlockId(blockId), address, e))
        }
      }
    )
  }

ShuffleBlockFetcherIterator.next()方法,返回 (currentResult.blockId, new BufferReleasingInputStream(input, this))


 override def next(): (BlockId, InputStream) = {
    numBlocksProcessed += 1

    var result: FetchResult = null
    var input: InputStream = null
    // Take the next fetched result and try to decompress it to detect data corruption,
    // then fetch it one more time if it's corrupt, throw FailureFetchResult if the second fetch
    // is also corrupt, so the previous stage could be retried.
    // For local shuffle block, throw FailureFetchResult for the first IOException.
    while (result == null) {
      val startFetchWait = System.currentTimeMillis()
      result = results.take()
      val stopFetchWait = System.currentTimeMillis()
      shuffleMetrics.incFetchWaitTime(stopFetchWait - startFetchWait)

      result match {
        case r @ SuccessFetchResult(blockId, address, size, buf) =>
          if (address != blockManager.blockManagerId) {
            shuffleMetrics.incRemoteBytesRead(buf.size)
            shuffleMetrics.incRemoteBlocksFetched(1)
          }
          bytesInFlight -= size

          val in = try {
            buf.createInputStream()
          } catch {
            // The exception could only be throwed by local shuffle block
            case e: IOException =>
              assert(buf.isInstanceOf[FileSegmentManagedBuffer])
              logError("Failed to create input stream from local block", e)
              buf.release()
              throwFetchFailedException(blockId, address, e)
          }

          input = streamWrapper(blockId, in)
          // Only copy the stream if it's wrapped by compression or encryption, also the size of
          // block is small (the decompressed block is smaller than maxBytesInFlight)
          if (detectCorrupt && !input.eq(in) && size < maxBytesInFlight / 3) {
            val originalInput = input
            val out = new ChunkedByteBufferOutputStream(64 * 1024, ByteBuffer.allocate)
            try {
              // Decompress the whole block at once to detect any corruption, which could increase
              // the memory usage tne potential increase the chance of OOM.
              // TODO: manage the memory used here, and spill it into disk in case of OOM.
              Utils.copyStream(input, out)
              out.close()
              input = out.toChunkedByteBuffer.toInputStream(dispose = true)
            } catch {
              case e: IOException =>
                buf.release()
                if (buf.isInstanceOf[FileSegmentManagedBuffer]
                  || corruptedBlocks.contains(blockId)) {
                  throwFetchFailedException(blockId, address, e)
                } else {
                  logWarning(s"got an corrupted block $blockId from $address, fetch again", e)
                  corruptedBlocks += blockId
                  fetchRequests += FetchRequest(address, Array((blockId, size)))
                  result = null
                }
            } finally {
              // TODO: release the buf here to free memory earlier
              originalInput.close()
              in.close()
            }
          }

        case FailureFetchResult(blockId, address, e) =>
          throwFetchFailedException(blockId, address, e)
      }

      // Send fetch requests up to maxBytesInFlight
      fetchUpToMaxBytes()
    }

    currentResult = result.asInstanceOf[SuccessFetchResult]
    (currentResult.blockId, new BufferReleasingInputStream(input, this))
  }

end

Spark BlockManager原始碼分析

Spark BlockManager原始碼分析更多資源分享 SPARK 原始碼分析技術分享(視訊彙總套裝視訊): https://www.bilibili.com/video/av37442139/ github: https://github.com/opensou

spark的儲存系統--BlockManager原始碼分析

spark的儲存系統--BlockManager原始碼分析根據之前的一系列分析，我們對spark作業從建立到排程分發，到執行，最後結果回傳driver的過程有了一個大概的瞭解。但是在分析原始碼的過程中也留下了大量的問題，最主要的就是涉及到的spark中重要的幾個基礎模組，我們對這些基礎設施的內部細節並不是很

Spark core原始碼分析之spark叢集的啟動（二）

2.2 Worker的啟動 org.apache.spark.deploy.worker 1 從Worker的伴生物件的main方法進入在main方法中首先是得到一個SparkConf例項conf，然後將conf和啟動Worker傳入的引數封裝得到Wor

《Spark核心原始碼分析與開發實戰》讀書筆記之一

第1章 Spark系統概述 1.1 Spark是什麼 1. Spark比Hadoop快在哪裡（1）Spark使用記憶體計算，而Hadoop使用IO （2）Hadoop的計算是按部就班一步一步進行的，而Spark則是提前生成了DAG，優化了運算路徑 1.2 Sp

Spark MapOutputTracker原始碼分析

Spark MapOutputTracker原始碼分析更多資源分享 SPARK 原始碼分析技術分享(視訊彙總套裝視訊): https://www.bilibili.com/video/av37442139/ github: https://github.com/opensourceteams

Spark SQL 原始碼分析之Physical Plan 到 RDD的具體實現

我們都知道一段sql，真正的執行是當你呼叫它的collect()方法才會執行Spark Job，最後計算得到RDD。 lazy val toRdd: RDD[Row] = executedPlan.execute() Spark Plan基本包含4種操作型別，即Bas

Spark MLlib原始碼分析—Word2Vec原始碼詳解

以下程式碼是我依據SparkMLlib(版本1.6)中Word2Vec原始碼改寫而來，基本算是照搬。此版Word2Vec是基於Hierarchical Softmax的Skip-gram模型的實現。在決定讀懂原始碼前，博主建議讀者先看一下《Word2Vec_

spark mllib原始碼分析之二分類邏輯迴歸evaluation

在邏輯迴歸分類中，我們評價分類器好壞的主要指標有精準率（precision），召回率（recall），F-measure，AUC等，其中最常用的是AUC，它可以綜合評價分類器效能，其他的指標主要偏重一些方面。我們介紹下spark中實現的這些評價指標，便於使用sp

Spark-ThriftServer原始碼分析

Spark1.1之後的版本引入了ThriftServer和CLI，使得Hive使用者和RDBMS使用者可以直接通過JDBC方式提交SQL至Spark執行而無需編寫sparksql程式碼，下面對spark-thriftserver的原始碼進行簡單分析。執行${SPARK_H

spark mllib原始碼分析之隨機森林(Random Forest)（二）

4. 特徵處理這部分主要在DecisionTree.scala的findSplitsBins函式，將所有特徵封裝成Split，然後裝箱Bin。首先對split和bin的結構進行說明 4.1. 資料結構 4.1.1. Split cl

spark mllib原始碼分析之DecisionTree與GBDT

我們在前面的文章講過，在spark的實現中，樹模型的依賴鏈是GBDT-> Decision Tree-> Random Forest，前面介紹了最基礎的Random Forest的實現，在此基礎上我們介紹Decision Tree和GBDT的實現

spark mllib原始碼分析之L-BFGS（一）

1. 使用 spark給出的example中涉及到LBFGS有兩個，分別是LBFGSExample.scala和LogisticRegressionWithLBFGSExample.scala，第一個是直接使用LBFGS直接訓練，需要指定一系列優化引數，優

spark core原始碼分析15 Shuffle詳解－寫流程

Shuffle是一個比較複雜的過程，有必要詳細剖析一下內部寫的邏輯 ShuffleManager分為SortShuffleManager和HashShuffleManager 一、SortShu

spark mllib原始碼分析之隨機森林(Random Forest)（三）

6. 隨機森林訓練 6.1. 資料結構 6.1.1. Node 樹中的每個節點是一個Node結構 class Node @Since("1.2.0") ( @Since("1.0.0") val id: Int, @S

spark mllib原始碼分析之邏輯迴歸彈性網路ElasticNet（一）

spark在ml包中將邏輯迴歸封裝了下，同時在演算法中引入了L1和L2正則化，通過elasticNetParam來調節兩種正則化的係數，同時根據選擇的正則化，決定使用L-BFGS還是OWLQN優化，是謂Elastic Net。 1. 輔助類我們首先介紹

Spark MLlib原始碼分析—TFIDF原始碼詳解

以下程式碼是我依據SparkMLlib(版本1.6) 1、HashingTF 是使用雜湊表來儲存分詞，並計算分詞頻數（TF），生成HashMap表。在Map中，K為分詞對應索引號，V為分詞的頻數。在宣告HashingTF 時，需要設定numFeatures，該

Spark:BlockManager原理剖析與原始碼分析

BlockManager是Spark的分散式儲存系統，與我們平常說的分散式儲存系統是有區別的，區別就是這個分散式儲存系統只會管理Block塊資料，它執行在所有節點上。BlockManager的結構是Maser-Slave架構，Master就是Driver上的BlockManagerMaste

Spark BlockManager原理與原始碼分析

1、BlockManager原理示意圖①Driver上的BlockManagerMaster管理各個節點上BlockManager的元資料資訊和維護block的狀態資訊。②每個節點上BlockManager的每個元件： DiskStore：負責磁碟上的資料讀寫

Spark原始碼分析之Spark Shell（上）

https://www.cnblogs.com/xing901022/p/6412619.html 文中分析的spark版本為apache的spark-2.1.0-bin-hadoop2.7。 bin目錄結構： -rwxr-xr-x. 1 bigdata bigdata 1089 Dec

大資料之Spark（三）--- Spark核心API，Spark術語，Spark三級排程流程原始碼分析

一、Spark核心API ----------------------------------------------- [SparkContext] 連線到spark叢集,入口點. [HadoopRDD] extends RDD 讀取hadoop

Spark BlockManager原始碼分析

Spark BlockManager原始碼分析

更多資源分享

前置條件

技能標籤

描述

ShuffleRDD讀取資料

本地Block

遠端Block

ShuffleRDD

本地Block處理

遠端Block處理

相關推薦