
Spark Deploy Modes and Launched Processes

Cluster Mode Overview

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).


Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

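The connection step described above can be sketched as three spark-submit invocations that differ only in the --master URL. The jar name, class name, and host addresses below are placeholders for illustration, not part of the original text:

```shell
#!/bin/sh
# Hypothetical application jar and entry class (placeholders).
APP_JAR="my-app.jar"
MAIN_CLASS="com.example.MyApp"

# The same application, pointed at each of the three cluster managers.
STANDALONE_CMD="spark-submit --class $MAIN_CLASS --master spark://master-host:7077 $APP_JAR"
MESOS_CMD="spark-submit --class $MAIN_CLASS --master mesos://master-host:5050 $APP_JAR"
YARN_CMD="spark-submit --class $MAIN_CLASS --master yarn $APP_JAR"

# Print rather than execute: no live cluster is assumed here.
echo "$STANDALONE_CMD"
echo "$MESOS_CMD"
echo "$YARN_CMD"
```

Everything after the master URL (executors acquired, code shipped, tasks dispatched) then proceeds the same way regardless of which cluster manager was chosen.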

Spark Standalone Mode

Launching Spark Applications

The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster.
For standalone clusters, Spark currently supports two deploy modes.

  • In client mode, the driver is launched in the same process as the client that submits the application. (The client and the driver share a single process.)
  • In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application, without waiting for the application to finish. (The driver is started by a Worker, and the client exits quickly.)
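As a minimal sketch of the two standalone deploy modes, only the --deploy-mode flag changes between the two bullets above (jar and host names are placeholders):

```shell
#!/bin/sh
APP_JAR="my-app.jar"                 # placeholder application jar
MASTER="spark://master-host:7077"    # placeholder standalone master URL

# client mode: the driver runs inside this spark-submit process.
CLIENT_CMD="spark-submit --master $MASTER --deploy-mode client $APP_JAR"

# cluster mode: the driver is launched on one of the Workers;
# spark-submit returns as soon as the application is submitted.
CLUSTER_CMD="spark-submit --master $MASTER --deploy-mode cluster $APP_JAR"

echo "$CLIENT_CMD"
echo "$CLUSTER_CMD"
```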

Jar Distribution

If your application is launched through spark-submit, then the application jar is automatically distributed to all worker nodes. For any additional jars that your application depends on, you should specify them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2).
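A quick sketch of the --jars flag from the paragraph above; the dependency paths are hypothetical:

```shell
#!/bin/sh
# The application jar itself is shipped automatically; additional
# dependencies go through --jars as a comma-delimited list.
EXTRA_JARS="libs/dep1.jar,libs/dep2.jar"   # placeholder dependency jars
CMD="spark-submit --master spark://master-host:7077 --jars $EXTRA_JARS my-app.jar"
echo "$CMD"
```

Note the delimiter is a comma with no spaces; a space would make the shell treat the second jar as a separate argument.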

Launching Spark on YARN

There are two deploy modes that can be used to launch Spark applications on YARN.

  • In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.
  • In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

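The two YARN modes above can be sketched as follows. The resource flags (--num-executors, --executor-memory) are standard spark-submit options added here for illustration, with placeholder values:

```shell
#!/bin/sh
APP_JAR="my-app.jar"   # placeholder application jar

# cluster mode: the driver runs inside the YARN ApplicationMaster,
# so the client may disconnect after submission.
YARN_CLUSTER_CMD="spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-memory 2g $APP_JAR"

# client mode: the driver stays in this process; the ApplicationMaster
# only requests resources from YARN.
YARN_CLIENT_CMD="spark-submit --master yarn --deploy-mode client --num-executors 4 --executor-memory 2g $APP_JAR"

echo "$YARN_CLUSTER_CMD"
echo "$YARN_CLIENT_CMD"
```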

spark-submit (submitting an application)

In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster.

Glossary

  • Application: User program built on Spark. Consists of a driver program and executors on the cluster.
  • Driver program: The process running the main() function of the application and creating the SparkContext
  • Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
  • Deploy mode: Distinguishes where the driver process runs. In “cluster” mode, the framework launches the driver inside of the cluster. In “client” mode, the submitter launches the driver outside of the cluster.
  • Worker node: Any node that can run application code in the cluster
  • Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
  • Task: A unit of work that will be sent to one executor
  • Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you’ll see this term used in the driver’s logs.
  • Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you’ll see this term used in the driver’s logs.

Process Mapping

Each executor corresponds to one CoarseGrainedExecutorBackend process.

spark-submit --master yarn --deploy-mode client

The ResourceManager starts an ExecutorLauncher process on one of the cluster's NodeManagers to act as the ApplicationMaster. In addition, CoarseGrainedExecutorBackend processes are spawned on multiple NodeManagers to run the application concurrently.

spark-submit --master yarn --deploy-mode cluster

The ResourceManager runs the ApplicationMaster on one of the cluster's NodeManagers, and that ApplicationMaster also runs the driver program. CoarseGrainedExecutorBackend processes are then launched on the NodeManagers to run the application concurrently.
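One hedged way to verify the process mapping above is to list JVM processes with jps on the relevant hosts. The process names below come from the text itself; the jps check is an illustrative sketch that assumes a JDK is installed on each node, so here we only print the names rather than query a live cluster:

```shell
#!/bin/sh
# Process names taken from the text above.
AM_CLIENT="ExecutorLauncher"              # AM process in yarn client mode
AM_CLUSTER="ApplicationMaster"            # AM (and driver) in yarn cluster mode
EXECUTOR="CoarseGrainedExecutorBackend"   # one per executor, on NodeManager hosts

# On a NodeManager host you would run, for example:
#   jps | grep "$EXECUTOR"
# to confirm an executor backend is present there.
echo "$AM_CLIENT $AM_CLUSTER $EXECUTOR"
```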