Spark Core Source Code Deep Dive: SparkContext Principles and Source Code Analysis
阿新 • Published: 2019-01-22
1. SparkContext Principles
1. Diagram:
2. SparkContext Source Code Analysis
1. Creating the TaskScheduler:
SparkContext.scala
// Create and start the scheduler
private[spark] var (schedulerBackend, taskScheduler) =
  SparkContext.createTaskScheduler(this, master)

// A different TaskScheduler is created depending on the deploy mode in the master URL
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  master match {
    // standalone mode
    case SPARK_REGEX(sparkUrl) =>
      // TaskSchedulerImpl schedules tasks by driving a SchedulerBackend for the different
      // kinds of clusters (standalone, YARN and Mesos). It can also run in local mode by
      // using a LocalBackend and setting the isLocal parameter to true.
      // It handles common logic, such as deciding the scheduling order across jobs and
      // launching speculative execution of tasks.
      // Clients should first call initialize() and start(), and then submit task sets
      // through the runTasks() method. (A usage sketch follows this snippet.)
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)
}
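For reference, the master URL passed to SparkContext is what drives this pattern match. A minimal sketch, with hypothetical host names, of a standalone URL that matches SPARK_REGEX; the comma-separated list is split into one master address per entry, as shown above:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone masters (an HA pair); createTaskScheduler() splits the
// captured list on "," and prefixes each entry with "spark://" again.
val conf = new SparkConf()
  .setAppName("master-url-demo")
  .setMaster("spark://node1:7077,node2:7077")
val sc = new SparkContext(conf)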
TaskSchedulerImpl.scala
def initialize(backend: SchedulerBackend) {
  this.backend = backend
  // temporarily set rootPool name to empty
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
    }
  }
  schedulableBuilder.buildPools()
}
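The schedulingMode branch above is controlled by the spark.scheduler.mode setting (FIFO by default). A minimal sketch, assuming a hypothetical pool name "etl", of switching to the FAIR builder and assigning jobs from the current thread to that pool:

import org.apache.spark.{SparkConf, SparkContext}

// With spark.scheduler.mode=FAIR, initialize() builds a FairSchedulableBuilder,
// and buildPools() loads pool definitions (e.g. from fairscheduler.xml, if present).
val conf = new SparkConf()
  .setAppName("fair-scheduling-demo")
  .setMaster("local[2]")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Jobs submitted from this thread go into the hypothetical pool "etl"
sc.setLocalProperty("spark.scheduler.pool", "etl")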
Starting the TaskScheduler:
TaskSchedulerImpl.scala
override def start() {
  // The key point here is the call to SparkDeploySchedulerBackend.start()
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    import sc.env.actorSystem.dispatcher
    sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL milliseconds,
          SPECULATION_INTERVAL milliseconds) {
      Utils.tryOrExit { checkSpeculatableTasks() }
    }
  }
}
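The speculative-execution thread above is only started when the master is not local and spark.speculation is enabled. A minimal sketch of turning it on, with a hypothetical standalone master:

import org.apache.spark.{SparkConf, SparkContext}

// spark.speculation=true makes TaskSchedulerImpl.start() schedule the periodic
// checkSpeculatableTasks() call shown above (skipped entirely for local masters).
val conf = new SparkConf()
  .setAppName("speculation-demo")
  .setMaster("spark://master-host:7077")   // hypothetical master URL
  .set("spark.speculation", "true")
val sc = new SparkContext(conf)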
SparkDeploySchedulerBackend.scala
override def start() {
  super.start()

  // The endpoint for executors to talk to us
  val driverUrl = AkkaUtils.address(
    AkkaUtils.protocol(actorSystem),
    SparkEnv.driverActorSystemName,
    conf.get("spark.driver.host"),
    conf.get("spark.driver.port"),
    CoarseGrainedSchedulerBackend.ACTOR_NAME)
  val args = Seq(
    "--driver-url", driverUrl,
    "--executor-id", "{{EXECUTOR_ID}}",
    "--hostname", "{{HOSTNAME}}",
    "--cores", "{{CORES}}",
    "--app-id", "{{APP_ID}}",
    "--worker-url", "{{WORKER_URL}}")
  val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
    .map(Utils.splitCommandString).getOrElse(Seq.empty)
  val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath")
    .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
  val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath")
    .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)

  // When testing, expose the parent class path to the child. This is processed by
  // compute-classpath.{cmd,sh} and makes all needed jars available to child processes
  // when the assembly is built with the "*-provided" profiles enabled.
  val testingClassPath =
    if (sys.props.contains("spark.testing")) {
      sys.props("java.class.path").split(java.io.File.pathSeparator).toSeq
    } else {
      Nil
    }

  // Start executors with a few necessary configs for registering with the scheduler
  val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
  val javaOpts = sparkJavaOpts ++ extraJavaOpts
  val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
  val appUIAddress = sc.ui.map(_.appUIAddress).getOrElse("")

  // ApplicationDescription is very important: it describes everything about the current
  // application, including the maximum number of CPU cores it needs and how much memory
  // is required on each slave.
  val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
    appUIAddress, sc.eventLogDir, sc.eventLogCodec)

  // Create the AppClient.
  // AppClient is the interface through which the application talks to the Spark cluster.
  // It takes a Spark master URL, an application description, and a listener for cluster
  // events, and invokes the listener's callbacks when those events occur.
  client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
  client.start()

  waitForRegistration()
}
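The spark.executor.* options read above come straight from the driver-side configuration. A small sketch with hypothetical values for the keys that end up in the CoarseGrainedExecutorBackend launch command:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical values for the exact keys SparkDeploySchedulerBackend.start() reads
// when it assembles the executor launch Command above.
val conf = new SparkConf()
  .setAppName("executor-opts-demo")
  .setMaster("spark://master-host:7077")                           // hypothetical master
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")          // -> extraJavaOpts
  .set("spark.executor.extraClassPath", "/opt/app/lib/extra.jar")  // -> classPathEntries (hypothetical path)
  .set("spark.executor.extraLibraryPath", "/opt/app/native")       // -> libraryPathEntries (hypothetical path)
val sc = new SparkContext(conf)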
2. Creating the DAGScheduler:
SparkContext.scala
@volatile private[spark] var dagScheduler: DAGScheduler = _
try {
  // DAGScheduler implements the high-level, stage-oriented scheduling layer. It computes
  // a DAG (directed acyclic graph) of stages for each job, tracks which RDD and stage
  // outputs have been materialized (i.e. written to disk, memory, etc.), and finds a
  // minimum-cost schedule to run the job. It submits stages as TaskSets to the underlying
  // TaskSchedulerImpl, which runs the tasks on the cluster.
  // Besides building the DAG of stages, it also determines the preferred locations for
  // running each task, based on the current cache state, and passes them down to the
  // TaskSchedulerImpl. In addition, it handles failures caused by lost shuffle output
  // files, in which case old stages may be resubmitted. Failures within a stage that are
  // not caused by lost shuffle files are handled by the TaskScheduler, which retries each
  // task a number of times before finally cancelling the whole stage.
  // (A small example follows this snippet.)
  dagScheduler = new DAGScheduler(this)
} catch {
  case e: Exception => {
    try {
      stop()
    } finally {
      throw new SparkException("Error while constructing DAGScheduler", e)
    }
  }
}
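To make the stage splitting concrete, here is a small word-count job, assuming an existing SparkContext sc and a hypothetical input path: reduceByKey introduces a shuffle dependency, so the DAGScheduler cuts the lineage into a ShuffleMapStage and a ResultStage and submits each one to the TaskScheduler as a TaskSet.

// Hypothetical input path; the shuffle at reduceByKey is where the DAGScheduler
// places the stage boundary.
val counts = sc.textFile("hdfs:///tmp/words.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // shuffle dependency => new stage
counts.collect()        // triggers the job; both stages are submitted as TaskSets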
3. Creating the SparkUI:
SparkContext.scala
// Initialize the Spark UI
private[spark] val ui: Option[SparkUI] =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(SparkUI.createLiveUI(this, conf, listenerBus, jobProgressListener,
      env.securityManager, appName))
  } else {
    // For tests, do not enable the UI
    None
  }
SparkUI.scala
// Default port
val DEFAULT_PORT = 4040

def createLiveUI(
    sc: SparkContext,
    conf: SparkConf,
    listenerBus: SparkListenerBus,
    jobProgressListener: JobProgressListener,
    securityManager: SecurityManager,
    appName: String): SparkUI = {
  create(Some(sc), conf, listenerBus, securityManager, appName,
    jobProgressListener = Some(jobProgressListener))
}
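As the snippet shows, the UI is only created when spark.ui.enabled is true, and SparkUI binds to port 4040 by default. Both are ordinary configuration settings; a minimal sketch that keeps the UI but moves it to another port:

import org.apache.spark.{SparkConf, SparkContext}

// spark.ui.enabled and spark.ui.port are the settings consulted above;
// set spark.ui.enabled to "false" to skip SparkUI.createLiveUI entirely.
val conf = new SparkConf()
  .setAppName("ui-port-demo")
  .setMaster("local[2]")
  .set("spark.ui.enabled", "true")
  .set("spark.ui.port", "4041")
val sc = new SparkContext(conf)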