
How Spark Jobs Are Triggered

A reliable sign that an RDD operation is an action is that its implementation calls

sc.runJob
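
For instance, here is a minimal sketch (not Spark source; it assumes a live SparkContext named sc, e.g. in spark-shell) of a hand-rolled "action". Like the built-in ones, it only counts as an action because its body calls sc.runJob:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: counts how many partitions are non-empty.
// The sc.runJob call is what makes this an action and schedules a job.
def countNonEmptyPartitions[T](sc: SparkContext, rdd: RDD[T]): Int =
  sc.runJob(rdd, (iter: Iterator[T]) => if (iter.hasNext) 1 else 0).sum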

How an RDD triggers a job

Take rdd.count as an example.

RDD.scala

/**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
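
A quick way to see the difference between a transformation and this action (a sketch, assuming spark-shell so that sc already exists):

// Only count() schedules a job; the map alone does not.
val nums = sc.parallelize(1 to 1000, 4)
val doubled = nums.map(_ * 2)   // transformation: no job yet
doubled.count()                 // action: ends up in sc.runJob, visible in the Jobs tab of the Spark UI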

SparkContext.scala

/**
   * Run a job on all partitions in an RDD and return the results in an array.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @return in-memory collection with a result of the job (each collection element will contain
   * a result from one partition)
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }
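
count is just a thin wrapper over this overload. The same call can be issued by hand, which makes the per-partition nature of Utils.getIteratorSize easy to see (a sketch, again assuming sc from spark-shell):

// Hand-rolled equivalent of rdd.count: one Long per partition, summed on the driver.
val rdd = sc.parallelize(1 to 100, 4)
val perPartition: Array[Long] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong)
perPartition.sum   // 100, the same value rdd.count() returns

Each element of the returned array holds one partition's result, which count then sums on the driver.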

How a DataFrame triggers a job

Take df.count as an example. The call chain is considerably more verbose than the RDD one.

Dataset.scala

/**
   * Returns the number of rows in the Dataset.
   * @group action
   * @since 1.6.0
   */
  def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0)
  }
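
As a usage sketch (it assumes a SparkSession named spark, e.g. in spark-shell; the sample data is made up):

import spark.implicits._

// df.count() goes withAction -> executeCollect -> RDD.collect -> sc.runJob,
// so it also shows up as a job in the Spark UI.
val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")
df.count()   // returns 3L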

SparkPlan.scala

/**
   * Runs this query returning the result as an array.
   */
  def executeCollect(): Array[InternalRow] = {
    val byteArrayRdd = getByteArrayRdd()

    val results = ArrayBuffer[InternalRow]()
    byteArrayRdd.collect().foreach { countAndBytes =>
      decodeUnsafeRows(countAndBytes._2).foreach(results.+=)
    }
    results.toArray
  }

RDD.scala

/**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

SparkContext.scala

/**
   * Run a job on all partitions in an RDD and return the results in an array.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @return in-memory collection with a result of the job (each collection element will contain
   * a result from one partition)
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

At this point the two paths converge: whether you start from rdd.count or df.count, execution ends up in SparkContext.runJob, which submits the job to the DAGScheduler.

spark.read.parquet can also trigger a job

val df: DataFrame = spark.read.parquet("/tmp/spark-975aa02f-fa2e-4b03-a3bd-7d57e3927787/" +
      "part-00000-61784fcc-e6cc-4ea8-bb14-7405f680681d.snappy.parquet")

DataFrameReader.scala

def parquet(path: String): DataFrame = {
    // This method ensures that calls that explicit need single argument works, see SPARK-16009
    parquet(Seq(path): _*)
  }

DataFrameReader.scala

def parquet(paths: String*): DataFrame = {
    format("parquet").load(paths: _*)
  }
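
In other words, the parquet shortcut is just sugar over the generic source API; the two reads below are equivalent (a sketch; spark is a SparkSession and the path is illustrative):

val viaShortcut = spark.read.parquet("/tmp/some-parquet-dir")
val viaLoad     = spark.read.format("parquet").load("/tmp/some-parquet-dir")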

DataFrameReader.scala

def load(paths: String*): DataFrame = {
    sparkSession.baseRelationToDataFrame(
      DataSource.apply(
        sparkSession,
        paths = paths,
        userSpecifiedSchema = userSpecifiedSchema,
        className = source,
        options = extraOptions.toMap).resolveRelation())
  }

DataSource.scala

def resolveRelation() {
  ...
  format.inferSchema(
    sparkSession,
    caseInsensitiveOptions,
    tempFileIndex.allFiles())
  ...
}

ParquetFileFormat.scala

override def inferSchema() {
  ...
  ParquetFileFormat.mergeSchemasInParallel(filesToTouch, sparkSession)
  ...
}

ParquetFileFormat.scala

def mergeSchemasInParallel() {
...
// Issues a Spark job to read Parquet schema in parallel.
    val partiallyMergedSchemas =
      sparkSession
        .sparkContext
        .parallelize(partialFileStatusInfo, numParallelism)
        .mapPartitions { iterator =>
          // Resembles fake `FileStatus`es with serialized path and length information.
          val fakeFileStatuses = iterator.map { case (path, length) =>
            new FileStatus(length, false, 0, 0, 0, 0, null, null, null, new Path(path))
          }.toSeq

          // Reads footers in multi-threaded manner within each task
          val footers =
            ParquetFileFormat.readParquetFootersInParallel(
              serializedConf.value, fakeFileStatuses, ignoreCorruptFiles)

          // Converter used to convert Parquet `MessageType` to Spark SQL `StructType`
          val converter = new ParquetToSparkSchemaConverter(
            assumeBinaryIsString = assumeBinaryIsString,
            assumeInt96IsTimestamp = assumeInt96IsTimestamp)
          if (footers.isEmpty) {
            Iterator.empty
          } else {
            var mergedSchema = ParquetFileFormat.readSchemaFromFooter(footers.head, converter)
            footers.tail.foreach { footer =>
              val schema = ParquetFileFormat.readSchemaFromFooter(footer, converter)
              try {
                mergedSchema = mergedSchema.merge(schema)
              } catch { case cause: SparkException =>
                throw new SparkException(
                  s"Failed merging schema of file ${footer.getFile}:\n${schema.treeString}", cause)
              }
            }
            Iterator.single(mergedSchema)
          }
        }.collect()
        ...
}
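
How many footers this job has to read depends on whether schema merging is requested: broadly speaking, with merging on every part file is touched, otherwise Spark settles for a single summary or data file. A sketch of turning merging on for one read (the path is illustrative):

// Per-read option; the global equivalent is spark.sql.parquet.mergeSchema.
val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/some-parquet-dir")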

RDD.scala

/**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }
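
So the schema-inference path also bottoms out in RDD.collect and therefore sc.runJob, which is why a bare spark.read.parquet can already show up as a job in the UI. If that job is unwanted, one way around it (a sketch; the path and columns are illustrative and must match the real file) is to supply the schema up front, so that resolveRelation has nothing to infer:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// With a user-specified schema, inferSchema (and its footer-reading job) is skipped.
val df = spark.read.schema(schema).parquet("/tmp/some-parquet-dir")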