Spark 2.4.0 Source Code: SparkContext
Before diving into SparkContext, a quick review of Scala syntax. A Scala class has one primary constructor and may have several auxiliary constructors. An auxiliary constructor is defined with the keyword `def this`; any code in the class body that sits outside methods and auxiliary constructors belongs to the primary constructor, and the primary constructor runs every time an object is instantiated. For example:
```scala
class Person {
  var name: String = _
  var age: Int = 0
  println("primary constructor called")

  // auxiliary constructor
  def this(name: String, age: Int) {
    this() // must call the primary constructor first
    this.name = name
    this.age = age
  }

  def introduce(): Unit = {
    println("name :" + name + "-age :" + age)
  }
}

val jack = new Person("jack", 2)
jack.introduce()
```
Output:
```
primary constructor called
name :jack-age :2
```
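As a side note (a Scala detail, not something from the Spark source): auxiliary constructors can also delegate to one another, as long as each first calls a previously defined constructor, so every instantiation path eventually reaches the primary constructor. A minimal sketch:

```scala
class Point {
  var x: Int = 0
  var y: Int = 0
  println("primary constructor called")

  def this(x: Int) {
    this()  // reaches the primary constructor
    this.x = x
  }

  def this(x: Int, y: Int) {
    this(x) // delegates to the one-argument auxiliary constructor
    this.y = y
  }
}

val p = new Point(1, 2) // prints "primary constructor called" exactly once
```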
Now to the main topic: the more important parts of SparkContext's primary constructor:
```scala
try {
  ...
  // Create the Spark execution environment (cache, map output tracker, etc)
  _env = createSparkEnv(_conf, isLocal, listenerBus)
  SparkEnv.set(_env)
  ...
  // We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
  // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
  _heartbeatReceiver = env.rpcEnv.setupEndpoint(
    HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

  // Create and start the scheduler
  val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
  _schedulerBackend = sched
  _taskScheduler = ts
  _dagScheduler = new DAGScheduler(this)
  _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

  // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
  // constructor
  _taskScheduler.start()
}
```
First:
```scala
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
```
Here SparkContext creates the RpcEnv (inside the SparkEnv it just built) and, via setupEndpoint, registers a heartbeat Endpoint, the HeartbeatReceiver, with that RpcEnv.
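To make the Endpoint concept concrete, here is a minimal sketch of the pattern HeartbeatReceiver follows. Note that org.apache.spark.rpc.RpcEndpoint is private[spark], so this only compiles inside an org.apache.spark subpackage, and the IsSchedulerReady message is made up for illustration:

```scala
package org.apache.spark.demo // inside org.apache.spark to see private[spark] APIs

import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEnv}

// Hypothetical message, not part of Spark; stands in for e.g. TaskSchedulerIsSet.
case object IsSchedulerReady

class DemoEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {
  // One-way messages (sent with RpcEndpointRef.send) arrive here.
  override def receive: PartialFunction[Any, Unit] = {
    case msg => println(s"got fire-and-forget message: $msg")
  }

  // Request/reply messages (sent with RpcEndpointRef.ask) arrive here,
  // mirroring how HeartbeatReceiver answers _heartbeatReceiver.ask[Boolean](...).
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case IsSchedulerReady => context.reply(true)
  }
}
```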
Next:
```scala
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
```
This calls SparkContext's own method to create the task scheduler; it returns a (SchedulerBackend, TaskScheduler) tuple:
```scala
private def createTaskScheduler(
    sc: SparkContext,
    master: String,
    deployMode: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    // ...
    // standalone deploy mode
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      // call the initialize method
      scheduler.initialize(backend)
      (backend, scheduler)
    // ...
  }
}
```
Inside the method, the master argument is pattern-matched to determine the deploy mode, and a matching (SchedulerBackend, TaskScheduler) pair is created. Taking standalone mode as the example: it builds a TaskSchedulerImpl and a StandaloneSchedulerBackend from the arguments, calls TaskSchedulerImpl's initialize method, and finally returns the pair as a tuple.
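The dispatch on the master string is plain regex pattern matching against patterns from SparkMasterRegex. The following self-contained sketch imitates that logic (the two regexes match those in Spark's source; everything else is demo scaffolding):

```scala
// Self-contained imitation of createTaskScheduler's master-string dispatch.
object MasterDispatchDemo {
  // Same patterns as SparkMasterRegex.LOCAL_N_REGEX / SPARK_REGEX in Spark.
  val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
  val SPARK_REGEX = """spark://(.*)""".r

  def describe(master: String): String = master match {
    case "local"                => "local mode, single thread"
    case LOCAL_N_REGEX(threads) => s"local mode, $threads threads"
    case SPARK_REGEX(sparkUrl)  =>
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      s"standalone mode, masters: ${masterUrls.mkString(", ")}"
    case other                  => s"some other cluster manager: $other"
  }

  def main(args: Array[String]): Unit = {
    println(describe("local[*]"))                      // local mode, * threads
    println(describe("spark://host1:7077,host2:7077")) // standalone mode, ...
  }
}
```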
scheduler.initialize(backend) simply creates a SchedulableBuilder according to the configured schedulingMode. The SchedulableBuilder is a wrapper around the schedulable tree (the pool hierarchy) and is responsible for scheduling TaskSets:
```scala
def initialize(backend: SchedulerBackend) {
  this.backend = backend
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
      case _ =>
        throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
    }
  }
  schedulableBuilder.buildPools()
}
```
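The schedulingMode here comes from the spark.scheduler.mode configuration key (FIFO by default). A minimal usage example for switching to fair scheduling; the master and app name are placeholder assumptions, the key point is the config setting that selects FairSchedulableBuilder above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")                // placeholder master
  .setAppName("fair-scheduling-demo")   // placeholder app name
  .set("spark.scheduler.mode", "FAIR")  // default is FIFO

val sc = new SparkContext(conf)
```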
Next, the following two lines:
```scala
_dagScheduler = new DAGScheduler(this)
```
This creates the DAGScheduler: the high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG (directed acyclic graph) of stages for each job, keeps track of which RDDs and stage outputs are materialized (written to disk or memory), and finds a minimum-cost schedule to run the job. It then submits stages as TaskSets to the underlying TaskSchedulerImpl to run them on the cluster. Besides building the stage DAG, it also determines the preferred locations to run each task on, based on the current cache status, and passes them to the low-level TaskSchedulerImpl. Furthermore, it handles failures caused by lost shuffle output files, in which case old stages may need to be resubmitted. Failures within a stage that are not caused by shuffle file loss are handled by the TaskScheduler itself, which retries each task a few times and only cancels the whole stage if that still fails.
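To make "stages" concrete, a classic word count shows where the DAGScheduler cuts the job: reduceByKey requires a shuffle, so everything before it becomes one stage and the reduce side becomes another (the input file name and master are placeholder assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("stage-demo"))

val counts = sc.textFile("input.txt")   // assumed input file
  .flatMap(_.split(" "))                // still stage 0: narrow dependencies
  .map((_, 1))                          // still stage 0
  .reduceByKey(_ + _)                   // shuffle boundary: stage 1 starts here

counts.collect().foreach(println)       // collect() triggers the job; the
                                        // DAGScheduler submits both stages as TaskSets
sc.stop()
```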
```scala
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
```
With the TaskScheduler and SchedulerBackend created above, this ask tells the HeartbeatReceiver endpoint that the TaskScheduler is now set.
Finally:
```scala
_taskScheduler.start()
```
TaskSchedulerImpl's start() essentially delegates to the SchedulerBackend's start(). Paraphrased from the 2.4 source (slightly simplified):
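```scala
// Paraphrased from TaskSchedulerImpl.start() in Spark 2.4:
override def start() {
  backend.start() // in standalone mode: StandaloneSchedulerBackend.start()

  // If speculation is enabled (and we're not in local mode), start a timer
  // thread that periodically looks for speculatable tasks.
  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}
```

So in standalone mode, backend.start() runs StandaloneSchedulerBackend.start(), which executes this: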
```scala
override def start() {
  super.start()

  // SPARK-21159. The scheduler backend should only try to connect to the launcher when in client
  // mode. In cluster mode, the code that submits the application to the Master needs to connect
  // to the launcher instead.
  if (sc.deployMode == "client") {
    launcherBackend.connect()
  }

  // parameter setup
  val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
    webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
  client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  client.start()
  launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
  waitForRegistration()
  launcherBackend.setState(SparkAppHandle.State.RUNNING)
}
```
Two objects are created here: ApplicationDescription and StandaloneAppClient. ApplicationDescription is, as the name suggests, a description of the application, such as the resources it requires; StandaloneAppClient is responsible for the application's communication with the Spark cluster. SchedulerBackend's start() ultimately calls StandaloneAppClient's start(), shown below:
```scala
def start() {
  // Just launch an rpcEndpoint; it will call back into the listener.
  endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
}
```
This just launches an RpcEndpoint (the ClientEndpoint), which calls back into the listener. For details of the RPC mechanism, see https://www.cnblogs.com/superhedantou/p/7570692.html
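For context, here is the registration handshake that ClientEndpoint performs, paraphrased and simplified from StandaloneAppClient in the 2.4 source: when it starts, it asynchronously sends RegisterApplication (carrying the ApplicationDescription) to each configured master and considers itself registered once a RegisteredApplication reply arrives:

```scala
// Paraphrased/simplified from StandaloneAppClient.ClientEndpoint in Spark 2.4.
override def onStart(): Unit = {
  try {
    registerWithMaster(1) // try each configured master, with retries
  } catch {
    case e: Exception =>
      logWarning("Failed to connect to master", e)
      markDisconnected()
      stop()
  }
}

override def receive: PartialFunction[Any, Unit] = {
  case RegisteredApplication(appId_, masterRef) =>
    // The master accepted the application: remember the appId and master ref.
    appId.set(appId_)
    registered.set(true)
    master = Some(masterRef)
    listener.connected(appId.get)
  // ... other messages (ExecutorAdded, MasterChanged, ...) elided
}
```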
To wrap up, the overall flow in brief: SparkContext creates SparkEnv (and its RpcEnv) → registers the HeartbeatReceiver endpoint → createTaskScheduler returns (StandaloneSchedulerBackend, TaskSchedulerImpl) → DAGScheduler is created → taskScheduler.start() calls backend.start() → StandaloneAppClient registers the application with the Master.