Spark Job Submission

阿新 • Published: 2018-12-04

When a job is submitted, the driver side has to complete the following work:

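Analyze the RDD dependencies to build the DAG
Split the job into multiple stages according to the DAG
Once the stages are determined, generate the corresponding tasks and dispatch them to the executors for execution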
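To make the stage-splitting step concrete, here is a minimal sketch in plain Scala. It assumes the usual rule that narrow dependencies stay in the same stage while a shuffle dependency starts a new parent stage. `Node`, `Narrow`, `Shuffle` and `splitIntoStages` are hypothetical names for illustration only, not Spark classes, and the real DAGScheduler does considerably more (stage reuse, caching, job tracking).

```scala
// Simplified, hypothetical model of an RDD lineage; none of these are Spark classes.
object StageSplitSketch {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep   // e.g. map, filter
  final case class Shuffle(parent: Node) extends Dep  // e.g. reduceByKey

  final case class Node(name: String, deps: List[Dep] = Nil)

  // Walk the lineage backwards from the final node. Narrow dependencies are
  // merged into the current stage; each shuffle dependency closes off a
  // separate parent stage. Stages are returned parents first.
  def splitIntoStages(last: Node): List[List[String]] = {
    // Returns (members of the stage containing `node`, parent stages).
    def visit(node: Node): (List[String], List[List[String]]) =
      node.deps.foldLeft((List(node.name), List.empty[List[String]])) {
        case ((members, parents), Narrow(p)) =>
          val (m, ps) = visit(p)
          (m ++ members, parents ++ ps)
        case ((members, parents), Shuffle(p)) =>
          val (m, ps) = visit(p)
          (members, (parents ++ ps) :+ m)
      }

    val (finalStage, parents) = visit(last)
    parents :+ finalStage
  }
}
```

For a lineage textFile, map, reduceByKey, filter with the only shuffle sitting in front of reduceByKey, the sketch produces two stages: one holding textFile and map, and a final one holding reduceByKey and filter.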

The entry point for submission is in SparkContext.scala

  /**
   * Run a job on all partitions in an RDD and return the results in an array.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @return in-memory collection with a result of the job (each collection element will contain
   * a result from one partition)
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

This calls the next runJob overload

  /**
   * Run a function on a given set of partitions in an RDD and return the results as an array.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   * partitions of the target RDD, e.g. for operations like `first()`
   * @return in-memory collection with a result of the job (each collection element will contain
   * a result from one partition)
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }

Which in turn calls the next runJob overload

  /**
   * Run a function on a given set of partitions in an RDD and return the results as an array.
   * The function that is run against each partition additionally takes `TaskContext` argument.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   * partitions of the target RDD, e.g. for operations like `first()`
   * @return in-memory collection with a result of the job (each collection element will contain
   * a result from one partition)
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
    results
  }
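The overloads above form a simple delegation chain: each one fills in a default (all partitions, a result array) and hands over to the next. The sketch below reproduces that pattern in plain Scala without Spark. `Partitions` is a hypothetical stand-in for an RDD, and the innermost method runs everything locally instead of going through a scheduler, so treat it as an illustration of the calling pattern only.

```scala
import scala.reflect.ClassTag

object RunJobChainSketch {
  // Hypothetical stand-in for an RDD: each inner Seq is one partition.
  type Partitions[T] = Vector[Seq[T]]

  // Innermost form: computes each requested partition and passes the result
  // to a callback, keyed by the position in `partitions` (not the partition id).
  def runJob[T, U](
      data: Partitions[T],
      func: Iterator[T] => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit =
    partitions.zipWithIndex.foreach { case (p, index) =>
      resultHandler(index, func(data(p).iterator))
    }

  // Array-returning form: allocates the result array and lets the callback fill it.
  def runJob[T, U: ClassTag](
      data: Partitions[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](data, func, partitions, (index: Int, res: U) => results(index) = res)
    results
  }

  // Convenience form: run on every partition.
  def runJob[T, U: ClassTag](data: Partitions[T], func: Iterator[T] => U): Array[U] =
    runJob(data, func, 0 until data.length)
}
```

The `ClassTag` context bound is what allows `new Array[U]` to be created for a generic `U`, which is presumably why the array-returning overloads in SparkContext carry it as well.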

The next runJob overload calls dagScheduler.runJob

  /**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   * partitions of the target RDD, e.g. for operations like `first()`
   * @param resultHandler callback to pass each result to
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

Now step into dagScheduler.runJob

DAGScheduler.scala

  /**
   * Run an action job on the given RDD and pass all the results to the resultHandler function as
   * they arrive.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   *   partitions of the target RDD, e.g. for operations like first()
   * @param callSite where in the user program this job was called
   * @param resultHandler callback to pass each result to
   * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
   *
   * @note Throws `Exception` when the job fails
   */
  def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
    ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
    waiter.completionFuture.value.get match {
      case scala.util.Success(_) =>
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      case scala.util.Failure(exception) =>
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
        val callerStackTrace = Thread.currentThread().getStackTrace.tail
        exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
        throw exception
    }
  }
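The blocking behaviour of runJob can be isolated into a few lines. The sketch below uses only the Scala standard library: it waits on a `Future[Unit]` with `Await.ready` (which does not throw when the future fails), then inspects the outcome and appends the caller's stack frames before rethrowing. `awaitJob` is a hypothetical helper name, and the standard `Await.ready` stands in for Spark's internal `ThreadUtils.awaitReady`.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.util.{Failure, Success}

object BlockingRunSketch {
  // Hypothetical helper: blocks until `completion` finishes, then either returns
  // normally or rethrows the failure with the caller's stack frames appended.
  def awaitJob(completion: Future[Unit]): Unit = {
    val start = System.nanoTime
    // Await.ready only waits; a failed future does not throw here.
    Await.ready(completion, Duration.Inf)
    completion.value.get match {
      case Success(_) =>
        println(f"job finished, took ${(System.nanoTime - start) / 1e9}%f s")
      case Failure(exception) =>
        // Same idea as the SPARK-8644 comment above: make the user's call
        // site visible in the exception raised on the scheduler side.
        val callerStackTrace = Thread.currentThread().getStackTrace.tail
        exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
        throw exception
    }
  }
}
```

Because the wait uses `Duration.Inf`, the calling thread stays blocked until the job completes or fails; there is no timeout at this level.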

submitJob submits a job to the job scheduler
DAGScheduler.scala

  /**
   * Submit an action job to the scheduler.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   *   partitions of the target RDD, e.g. for operations like first()
   * @param callSite where in the user program this job was called
   * @param resultHandler callback to pass each result to
   * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
   *
   * @return a JobWaiter object that can be used to block until the job finishes executing
   *         or can be used to cancel the job.
   *
   * @throws IllegalArgumentException when partitions ids are illegal
   */
  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }

    val jobId = nextJobId.getAndIncrement()
    if (partitions.size == 0) {
      // Return immediately if the job is running 0 tasks
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
    waiter
  }
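submitJob returns a JobWaiter whose completionFuture is what runJob blocks on. As a rough mental model, a waiter only needs to count finished tasks and complete a promise once all of them have reported in. The sketch below is a simplified, hypothetical `Waiter` written against the standard library. It is not Spark's JobWaiter (it omits cancellation and the reference back to the scheduler, among other things).

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Future, Promise}

object JobWaiterSketch {
  // Simplified, hypothetical waiter: one instance per submitted job.
  final class Waiter[U](totalTasks: Int, resultHandler: (Int, U) => Unit) {
    private val finishedTasks = new AtomicInteger(0)

    // A job with zero tasks is complete from the start, which mirrors the
    // early return for an empty partition list in submitJob.
    private val promise: Promise[Unit] =
      if (totalTasks == 0) Promise.successful(()) else Promise[Unit]()

    def completionFuture: Future[Unit] = promise.future

    // Called once per finished task; the last one completes the future.
    def taskSucceeded(index: Int, result: U): Unit = {
      resultHandler(index, result)
      if (finishedTasks.incrementAndGet() == totalTasks) promise.success(())
    }

    // Fails the job; ignored if the job has already completed.
    def jobFailed(cause: Exception): Unit = {
      promise.tryFailure(cause)
      ()
    }
  }
}
```

In this model the submitting thread never computes anything itself: it hands out the waiter, and whoever processes task completions drives the future to its final state.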

eventProcessLoop is an instance of DAGSchedulerEventProcessLoop, the class that processes the scheduler's message events.

Inside eventProcessLoop, doOnReceive is called to handle each incoming event
DAGScheduler.scala

  private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

    case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
      dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)

    case StageCancelled(stageId, reason) =>
      dagScheduler.handleStageCancellation(stageId, reason)

    case JobCancelled(jobId, reason) =>
      dagScheduler.handleJobCancellation(jobId, reason)

    case JobGroupCancelled(groupId) =>
      dagScheduler.handleJobGroupCancelled(groupId)

    case AllJobsCancelled =>
      dagScheduler.doCancelAllJobs()

    case ExecutorAdded(execId, host) =>
      dagScheduler.handleExecutorAdded(execId, host)

    case ExecutorLost(execId, reason) =>
      val workerLost = reason match {
        case SlaveLost(_, true) => true
        case _ => false
      }
      dagScheduler.handleExecutorLost(execId, workerLost)

    case WorkerRemoved(workerId, host, message) =>
      dagScheduler.handleWorkerRemoved(workerId, host, message)

    case BeginEvent(task, taskInfo) =>
      dagScheduler.handleBeginEvent(task, taskInfo)

    case SpeculativeTaskSubmitted(task) =>
      dagScheduler.handleSpeculativeTaskSubmitted(task)

    case GettingResultEvent(taskInfo) =>
      dagScheduler.handleGetTaskResult(taskInfo)

    case completion: CompletionEvent =>
      dagScheduler.handleTaskCompletion(completion)

    case TaskSetFailed(taskSet, reason, exception) =>
      dagScheduler.handleTaskSetFailed(taskSet, reason, exception)

    case ResubmitFailedStages =>
      dagScheduler.resubmitFailedStages()
  }
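The post-and-dispatch pattern can be sketched with a blocking queue and a single consumer thread. The version below is a hypothetical, stripped-down event loop in plain Scala: `post` only enqueues and returns, while a daemon thread takes events one at a time and pattern-matches on them. The event types and the `Stop` sentinel are invented for the example and are not Spark's DAGSchedulerEvent hierarchy.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingDeque, TimeUnit}

object EventLoopSketch {
  // Hypothetical event types, loosely modelled on the cases matched above.
  sealed trait Event
  final case class JobSubmitted(jobId: Int) extends Event
  final case class JobCancelled(jobId: Int, reason: String) extends Event
  case object Stop extends Event // sentinel that ends the loop

  final class Loop(handle: Event => Unit) {
    private val queue = new LinkedBlockingDeque[Event]()
    private val stopped = new CountDownLatch(1)

    // Single consumer thread: events are handled strictly in posting order.
    private val thread = new Thread(new Runnable {
      def run(): Unit = {
        var running = true
        while (running) {
          queue.take() match {
            case Stop  => running = false
            case event => handle(event)
          }
        }
        stopped.countDown()
      }
    }, "sketch-event-loop")
    thread.setDaemon(true)

    def start(): Unit = thread.start()

    // Enqueue and return immediately; the caller never runs the handler.
    def post(event: Event): Unit = queue.put(event)

    // Ask the loop to stop and wait up to five seconds for it to drain.
    def stopAndWait(): Boolean = {
      post(Stop)
      stopped.await(5, TimeUnit.SECONDS)
    }
  }
}
```

Funnelling every event through one thread means the handlers do not need to synchronize with each other, at the cost of a slow handler delaying everything queued behind it.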

What happens next in handleJobSubmitted is covered in this post of mine:
https://blog.csdn.net/zhixingheyi_tian/article/details/83379938
