How Spark Triggers Job Submission

阿新 • Published: 2018-12-05

YouTube video walkthrough

How Spark triggers job submission (YouTube video): https://youtu.be/X49RIqz2AjM

bilibili video walkthrough

How Spark triggers job submission (bilibili video): https://www.bilibili.com/video/av37445008/

Client source code

GitHub: https://github.com/opensourceteams/spark-scala-maven
BaseScalaSparkContext.scala

package com.opensource.bigdata.spark.standalone.base


import org.apache.spark.{SparkConf, SparkContext}

class BaseScalaSparkContext {

  var appName = "standalone"
  var master = "spark://standalone.com:7077" // local mode: local; standalone mode: spark://master:7077


  def sparkContext(): SparkContext = {
    val conf = new SparkConf().setAppName(appName).setMaster(master)
    conf.set("spark.eventLog.enabled","true")
    // conf.set("spark.ui.port","10002")
    conf.set("spark.history.fs.logDirectory","hdfs://standalone.com:9000/spark/log/historyEventLog")
    conf.set("spark.eventLog.dir","hdfs://standalone.com:9000/spark/log/eventLog")
    // Executor debug options; they are read at the place where the job is submitted
    //conf.set("spark.executor.extraJavaOptions","-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=10002")
    conf.setJars(Array("/opt/n_001_workspaces/bigdata/spark-scala-maven/target/spark-scala-maven-1.0-SNAPSHOT.jar"))

    val sc = new SparkContext(conf)

    // Set the log level
    //sc.setLogLevel("ERROR")
    sc
  }
}

WorldCount.scala

package com.opensource.bigdata.spark.standalone.wordcount

import com.opensource.bigdata.spark.standalone.base.BaseScalaSparkContext

object WorldCount extends BaseScalaSparkContext{


  def main(args: Array[String]): Unit = {

    val startTime = System.currentTimeMillis()

    appName = "HelloWorld-standalone"
    //master="spark://10.211.55.2:7077"
    val sc = sparkContext()

    println("SparkContext loaded")


    val distFile:org.apache.spark.rdd.RDD[String] = sc.textFile("hdfs://standalone.com:9000/opt/data/a.txt")
    println(distFile)

    val result = distFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
    println(s"Result:${result.collect().mkString}")

    val threadName = Thread.currentThread().getId + Thread.currentThread().getName

    println(s"${threadName}===================Result: elapsed milliseconds:${System.currentTimeMillis() - startTime}")


    sc.stop()

  }
}

Source code analysis

WorldCount.scala

Dependencies between RDDs

val distFile:org.apache.spark.rdd.RDD[String] = sc.textFile("hdfs://standalone.com:9000/opt/data/a.txt")
val result = distFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
-----------------------------------------------------------------
val rdd0 =  HadoopRDD
val rdd1 =  distFile = sc.textFile, which internally applies one map: hadoopRDD.map(pair => pair._2.toString)
val rdd2 =  distFile.flatMap(_.split(" "))
val rdd3 =  distFile.flatMap(_.split(" ")).map((_,1))
val rdd4 =  distFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)

-----------------------------------------------------------------
ShuffledRDD[4]        ->  ShuffleDependency                     ->  rdd4
MapPartitionsRDD[3]   ->  OneToOneDependency(NarrowDependency)  ->  rdd3
MapPartitionsRDD[2]   ->  OneToOneDependency(NarrowDependency)  ->  rdd2
MapPartitionsRDD[1]   ->  OneToOneDependency(NarrowDependency)  ->  rdd1
HadoopRDD[0]          ->  Nil                                   ->  rdd0
-----------------------------------------------------------------
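
The dependency chain listed above can be inspected at runtime through the public `RDD.dependencies` member. The following is a minimal sketch, not code from the article's repository: the object name `LineageWalk` and the helper `lineage` are made up for illustration, and it assumes Spark is on the classpath and runs in `local[2]` mode. Because it builds its input with `sc.parallelize` instead of `sc.textFile`, the root should be a `ParallelCollectionRDD`, and the chain is one RDD shorter than the HadoopRDD-based chain above.

```scala
import org.apache.spark.{Dependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageWalk {

  // Walk from the final RDD back to the root, producing one
  // "RDDClass[id] -> DependencyClass" line per RDD (hypothetical helper).
  def lineage(rdd: RDD[_]): List[String] = {
    val deps: Seq[Dependency[_]] = rdd.dependencies
    val depNames =
      if (deps.isEmpty) "Nil"
      else deps.map(_.getClass.getSimpleName).mkString(",")
    val line = s"${rdd.getClass.getSimpleName}[${rdd.id}] -> $depNames"
    line :: deps.toList.flatMap(d => lineage(d.rdd))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so no cluster or HDFS is needed.
    val conf = new SparkConf().setAppName("lineage-walk").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("a b", "b c"))
      val result = words.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      lineage(result).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Setting `spark.logLineage` to `true` is another way to see the lineage: `SparkContext.runJob` then logs `rdd.toDebugString` before handing the job to the scheduler.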

runJob calls in SparkContext

The RDD collect method calls SparkContext's runJob method.

  /**
   * Return an array that contains all of the elements in this RDD.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }
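
To make the shape of `collect` concrete without a cluster, here is a small Spark-free sketch. It is only a model: partitions are represented as plain `Seq`s, and the `runJob` in this block is a hypothetical stand-in that applies the function to each partition's iterator in order. What it shows is the same two steps as the real method, one array per partition followed by `Array.concat`.

```scala
import scala.reflect.ClassTag

object CollectSketch {

  // Hypothetical stand-in for sc.runJob: apply func to every partition's iterator
  // and return one result per partition, in partition order.
  def runJob[T, U: ClassTag](partitions: Seq[Seq[T]], func: Iterator[T] => U): Array[U] =
    partitions.map(p => func(p.iterator)).toArray

  // Same shape as RDD.collect: one Array[T] per partition, then concatenation.
  def collect[T: ClassTag](partitions: Seq[Seq[T]]): Array[T] = {
    val results: Array[Array[T]] = runJob(partitions, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2), Seq.empty[Int], Seq(3))
    println(collect(partitions).mkString(",")) // prints 1,2,3
  }
}
```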

The runJob call chain in SparkContext

  /**
   * Run a job on all partitions in an RDD and return the results in an array.
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

  /**
   * Run a job on a given set of partitions of an RDD, but take a function of type
   * `Iterator[T] => U` instead of `(TaskContext, Iterator[T]) => U`.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }

  /**
   * Run a function on a given set of partitions in an RDD and return the results as an array.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
    results
  }

  /**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }
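
The four overloads above form a delegation chain: each one adds a default (all partitions, a wrapped function, a result array) and forwards to the next, until the handler-based variant does the real work. The sketch below models that chain in plain Scala. `TaskCtx` and the in-memory `data` parameter are assumptions made for illustration; the real methods take an `RDD` and a `TaskContext` and run the function on executors. Note that the index passed to the handler is the position within `partitions`, not the partition id, which is why a two-partition job fills a two-element array.

```scala
import scala.reflect.ClassTag

object RunJobSketch {

  // Hypothetical stand-in for TaskContext: only carries the partition id.
  final case class TaskCtx(partitionId: Int)

  // Handler-based variant: runs func on the selected partitions and passes
  // (outputIndex, result) to the handler. Here it runs sequentially on the caller.
  def runJob[T, U](
      data: IndexedSeq[Seq[T]],
      func: (TaskCtx, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    partitions.zipWithIndex.foreach { case (partitionId, outputIndex) =>
      resultHandler(outputIndex, func(TaskCtx(partitionId), data(partitionId).iterator))
    }
  }

  // Array-returning variant: supplies a handler that stores each result by index.
  def runJob[T, U: ClassTag](
      data: IndexedSeq[Seq[T]],
      func: (TaskCtx, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](data, func, partitions, (index: Int, res: U) => results(index) = res)
    results
  }

  // Simplest variant: all partitions, function without a task context.
  def runJob[T, U: ClassTag](data: IndexedSeq[Seq[T]], func: Iterator[T] => U): Array[U] =
    runJob(data, (ctx: TaskCtx, it: Iterator[T]) => func(it), 0 until data.length)

  def main(args: Array[String]): Unit = {
    val data = Vector(Seq(1, 2, 3), Seq(4, 5), Seq(10))
    println(runJob(data, (it: Iterator[Int]) => it.sum).mkString(","))  // prints 6,9,10
    println(runJob(data, (ctx: TaskCtx, it: Iterator[Int]) => it.sum, Seq(0, 2)).mkString(",")) // prints 6,10
  }
}
```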

DAGScheduler method calls

The runJob method in DAGScheduler

  /**
   * Run an action job on the given RDD and pass all the results to the resultHandler function as
   * they arrive.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   *   partitions of the target RDD, e.g. for operations like first()
   * @param callSite where in the user program this job was called
   * @param resultHandler callback to pass each result to
   * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
   *
   * @throws Exception when the job fails
   */
  def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
    waiter.awaitResult() match {
      case JobSucceeded =>
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      case JobFailed(exception: Exception) =>
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
        val callerStackTrace = Thread.currentThread().getStackTrace.tail
        exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
        throw exception
    }
  }
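
`runJob` above blocks the calling thread in `waiter.awaitResult()` until every task has reported a result or the job has failed. The following sketch models that blocking behaviour with a plain monitor (`wait` and `notifyAll`). `SimpleJobWaiter` is a hypothetical class written for this example, not Spark's `JobWaiter`, and a single worker thread stands in for the scheduler delivering task results.

```scala
object WaiterSketch {

  sealed trait JobResult
  case object JobSucceeded extends JobResult
  final case class JobFailed(exception: Exception) extends JobResult

  // Hypothetical, simplified waiter: counts finished tasks and wakes up
  // the blocked caller once the job has succeeded or failed.
  final class SimpleJobWaiter[U](totalTasks: Int, resultHandler: (Int, U) => Unit) {
    private var finishedTasks = 0
    // A job with zero tasks is finished immediately.
    private var result: Option[JobResult] =
      if (totalTasks == 0) Some(JobSucceeded) else None

    def taskSucceeded(index: Int, value: U): Unit = synchronized {
      resultHandler(index, value)
      finishedTasks += 1
      if (finishedTasks == totalTasks) {
        result = Some(JobSucceeded)
        notifyAll()
      }
    }

    def jobFailed(exception: Exception): Unit = synchronized {
      result = Some(JobFailed(exception))
      notifyAll()
    }

    // Blocks until the job has a final result.
    def awaitResult(): JobResult = synchronized {
      while (result.isEmpty) wait()
      result.get
    }
  }

  def demo(): Array[Int] = {
    val results = new Array[Int](3)
    val waiter = new SimpleJobWaiter[Int](3, (i: Int, v: Int) => results(i) = v)
    // The worker thread plays the role of the scheduler reporting task results.
    val worker = new Thread(new Runnable {
      def run(): Unit = (0 until 3).foreach(i => waiter.taskSucceeded(i, i * 10))
    })
    worker.start()
    waiter.awaitResult() match {
      case JobSucceeded => results
      case JobFailed(e) => throw e
    }
  }

  def main(args: Array[String]): Unit =
    println(demo().mkString(",")) // prints 0,10,20
}
```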

The submitJob method in DAGScheduler

  /**
   * Submit an action job to the scheduler.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   *   partitions of the target RDD, e.g. for operations like first()
   * @param callSite where in the user program this job was called
   * @param resultHandler callback to pass each result to
   * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
   *
   * @return a JobWaiter object that can be used to block until the job finishes executing
   *         or can be used to cancel the job.
   *
   * @throws IllegalArgumentException when partitions ids are illegal
   */
  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }

    val jobId = nextJobId.getAndIncrement()
    if (partitions.size == 0) {
      // Return immediately if the job is running 0 tasks
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
    waiter
  }

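Posting the JobSubmitted event to DAGSchedulerEventProcessLoop

submitJob does not run the job itself. It posts a JobSubmitted event to the DAGScheduler event loop, and the loop's own thread later picks the event up and passes it to DAGScheduler.handleJobSubmitted: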

    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
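
The pattern behind `eventProcessLoop.post` is a single-threaded event loop: callers enqueue events and return immediately, while one dedicated thread dequeues and handles them in order. Below is a minimal sketch of that pattern, assuming a `LinkedBlockingDeque` as the queue. `SimpleEventLoop` and the two-field `JobSubmitted` case class are simplified stand-ins invented for this example; the real event carries the RDD, the function, the call site, the waiter and the properties shown above.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingDeque, TimeUnit}
import scala.collection.mutable.ListBuffer

object EventLoopSketch {

  sealed trait SchedulerEvent
  // Simplified stand-in for the real JobSubmitted event.
  final case class JobSubmitted(jobId: Int, partitions: Array[Int]) extends SchedulerEvent

  // Hypothetical minimal event loop: post() enqueues, one daemon thread dispatches.
  abstract class SimpleEventLoop[E](name: String) {
    private val queue = new LinkedBlockingDeque[E]()
    @volatile private var stopped = false

    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit = {
        try {
          while (!stopped) onReceive(queue.take())
        } catch {
          case _: InterruptedException => () // stop() interrupts a blocked take()
        }
      }
    }

    def start(): Unit = thread.start()
    def stop(): Unit = { stopped = true; thread.interrupt() }
    def post(event: E): Unit = queue.put(event) // returns without waiting for handling
    protected def onReceive(event: E): Unit
  }

  def demo(): List[Int] = {
    val handled = ListBuffer.empty[Int] // only touched by the loop thread until the latch opens
    val done = new CountDownLatch(2)
    val loop = new SimpleEventLoop[SchedulerEvent]("event-loop-sketch") {
      override protected def onReceive(event: SchedulerEvent): Unit = event match {
        case JobSubmitted(jobId, _) =>
          handled += jobId
          done.countDown()
      }
    }
    loop.start()
    loop.post(JobSubmitted(0, Array(0, 1)))
    loop.post(JobSubmitted(1, Array(0)))
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    handled.toList
  }

  def main(args: Array[String]): Unit =
    println(demo()) // prints List(0, 1)
}
```

Decoupling submission from handling this way keeps all scheduler state changes on one thread, so the handler code does not need its own locking.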
