
7. spark-submit, the production application submission script for Spark


I. Run spark-submit --help to see which options are available when submitting a job.

Options (description and notes):

--master MASTER_URL
    spark://host:port, mesos://host:port, yarn,
    k8s://https://host:port, or local (Default: local[*]).
    Note: the most commonly used values are local (local mode) and yarn (YARN cluster mode).

--deploy-mode DEPLOY_MODE
    Whether to launch the driver program locally ("client") or on one of the
    worker machines inside the cluster ("cluster") (Default: client).
    Note: this decides whether the driver starts in the local client process or on a
    worker node inside the cluster.
    In cluster mode, the YARN cluster manages the driver process; once the application
    has been created, the client can exit.
    In client mode, the driver runs inside the client process, and YARN only allocates
    resources for the executors; it does not manage the driver.

--class CLASS_NAME
    Your application's main class (for Java / Scala apps).

--name NAME
    A name of your application.

--jars JARS
    Comma-separated list of jars to include on the driver and executor classpaths.

--packages
    Comma-separated list of maven coordinates of jars to include on the driver and
    executor classpaths. Will search the local maven repo, then maven central and any
    additional remote repositories given by --repositories. The format for the
    coordinates should be groupId:artifactId:version.

--exclude-packages
    Comma-separated list of groupId:artifactId, to exclude while resolving the
    dependencies provided in --packages to avoid dependency conflicts.

--repositories
    Comma-separated list of additional remote repositories to search for the maven
    coordinates given with --packages.

--py-files PY_FILES
    Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for
    Python apps.

--files FILES
    Comma-separated list of files to be placed in the working directory of each
    executor. File paths of these files in executors can be accessed via
    SparkFiles.get(fileName).

--conf, -c PROP=VALUE
    Arbitrary Spark configuration property.

--properties-file FILE
    Path to a file from which to load extra properties. If not specified, this will
    look for conf/spark-defaults.conf.

--driver-memory MEM
    Memory for driver (e.g. 1000M, 2G) (Default: 1024M).

--driver-java-options
    Extra Java options to pass to the driver.

--driver-library-path
    Extra library path entries to pass to the driver.

--driver-class-path
    Extra class path entries to pass to the driver. Note that jars added with --jars
    are automatically included in the classpath.

--executor-memory MEM
    Memory per executor (e.g. 1000M, 2G) (Default: 1G).

--proxy-user NAME
    User to impersonate when submitting the application. This argument does not work
    with --principal / --keytab.

--help, -h
    Show this help message and exit.

--verbose, -v
    Print additional debug output.

--version
    Print the version of current Spark.
    Note: run spark-submit --version to print the version in use.

Cluster deploy mode only:

--driver-cores NUM
    Number of cores used by the driver, only in cluster mode (Default: 1).

Spark standalone or Mesos with cluster deploy mode only:

--supervise
    If given, restarts the driver on failure.

Spark standalone, Mesos or K8s with cluster deploy mode only:

--kill SUBMISSION_ID
    If given, kills the driver specified.

--status SUBMISSION_ID
    If given, requests the status of the driver specified.

Spark standalone, Mesos and Kubernetes only:

--total-executor-cores NUM
    Total cores for all executors.

Spark standalone, YARN and Kubernetes only:

--executor-cores NUM
    Number of cores used by each executor. (Default: 1 in YARN and K8S modes, or all
    available cores on the worker in standalone mode).

Spark on YARN and Kubernetes only:

--num-executors NUM
    Number of executors to launch (Default: 2). If dynamic allocation is enabled, the
    initial number of executors will be at least NUM.

--principal PRINCIPAL
    Principal to be used to login to KDC.

--keytab KEYTAB
    The full path to the file that contains the keytab for the principal specified
    above.

Spark on YARN only:

--queue QUEUE_NAME
    The YARN queue to submit to (Default: "default").

--archives ARCHIVES
    Comma separated list of archives to be extracted into the working directory of
    each executor.

II. Submitting Scala applications with spark-submit

1. YARN (cluster) mode

1.1 spark-submit command template

spark-submit --class TestClass \
--master yarn \
--queue ${queue_name} \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--driver-cores 2 \
--num-executors 4 \
--executor-cores 4 \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.shuffle.partitions=6400 \
--conf spark.default.parallelism=6400 \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--conf spark.hadoop.hive.exec.orc.split.strategy=ETL \
--name scala_test \
AtestSparkApplication.jar

1.2 Example Scala object
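
A minimal sketch of what the TestClass object referenced by the command above might look like. The Hive table, date range, and query are illustrative placeholders borrowed from the PySpark example later in this post, not a definitive implementation; cluster resources are supplied entirely by the spark-submit options.

import org.apache.spark.sql.SparkSession

object TestClass {
  def main(args: Array[String]): Unit = {
    // Master, queue, memory, etc. come from spark-submit, so they are not set here.
    val spark = SparkSession.builder()
      .appName("scala_test")
      .enableHiveSupport()
      .getOrCreate()

    // Illustrative workload: count the distinct purchasing users in a Hive table.
    val df = spark.sql(
      """
        |SELECT COUNT(a.user_id)
        |FROM (
        |  SELECT user_id
        |  FROM app.app_purchase_table
        |  WHERE dt >= '2019-01-01' AND dt <= '2020-12-31'
        |  GROUP BY user_id
        |) a
      """.stripMargin)
    df.show()

    spark.stop()
  }
}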

2. Local mode

2.1 spark-submit command template

spark-submit --class TestClass \
--master local \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--name scala_test \
AtestSparkApplication.jar

2.2 Example Scala object
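
A minimal sketch of a TestClass object suited to local-mode testing, mirroring the toy workload of the local PySpark example further below; names and values are illustrative only.

import org.apache.spark.sql.SparkSession

object TestClass {
  def main(args: Array[String]): Unit = {
    // --master local is passed by spark-submit; nothing cluster-specific is needed here.
    val spark = SparkSession.builder()
      .appName("scala_test")
      .getOrCreate()

    // Toy workload: sum of the ids greater than 500 in a generated range.
    spark.range(5000).where("id > 500").selectExpr("sum(id)").show()

    spark.stop()
  }
}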

III. Submitting Python scripts with spark-submit

1. YARN (cluster) mode

1.1 spark-submit command templates

(1) A single Python script with no other dependency files

spark-submit \
 --master yarn \
 --queue ${queue_name} \
 --deploy-mode client \
 --driver-memory 4G \
 --driver-cores 4 \
 --executor-memory 8G \
 --executor-cores 4 \
 --num-executors 100 \
 --conf spark.default.parallelism=1600 \
 --name "spark_demo_yarn" \
 pyspark_example_yarn.py 

(2) A Python script plus one or more txt data files
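
A hedged template for this case, assuming two data files named data1.txt and data2.txt (hypothetical names): --files ships them to the working directory of each executor, and the script can then locate them via SparkFiles.get("data1.txt"), as described in the options list above.

spark-submit \
 --master yarn \
 --queue ${queue_name} \
 --deploy-mode client \
 --driver-memory 4G \
 --executor-memory 8G \
 --num-executors 100 \
 --files data1.txt,data2.txt \
 --name "spark_demo_yarn_files" \
 pyspark_example_yarn.py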

(3) A Python script plus one or more dependent Python scripts
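
A hedged template for this case, assuming the main script imports helpers from utils.py and a zipped package libs.zip (hypothetical names): --py-files places them on the PYTHONPATH of the driver and executors so those imports resolve at runtime.

spark-submit \
 --master yarn \
 --queue ${queue_name} \
 --deploy-mode client \
 --driver-memory 4G \
 --executor-memory 8G \
 --num-executors 100 \
 --py-files utils.py,libs.zip \
 --name "spark_demo_yarn_pyfiles" \
 pyspark_example_yarn.py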

1.2 Example script: pyspark_example_yarn.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function


if __name__ == '__main__':

    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
                .appName("Word Count") \
                .config("spark.some.config.option", "some-value") \
                .enableHiveSupport() \
                .getOrCreate()

    df = spark.sql("""
            SELECT
                COUNT(a.user_id)
            FROM
                (
                    SELECT
                        user_id
                    FROM
                        app.app_purchase_table
                    WHERE
                        dt >= "2019-01-01"
                        AND dt <= "2020-12-31"
                        AND sku_code IN (700052, 721057)
                    GROUP BY
                        user_id
                ) a
            """)
    df.show()

2. Local mode

2.1 spark-submit command template

spark-submit \
 --master local \
 --deploy-mode client \
 --name "spark_demo_local" \
 pyspark_example_local.py 

2.2 Example script: pyspark_example_local.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function


if __name__ == '__main__':

    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
                .appName("Word Count") \
                .config("spark.some.config.option", "some-value") \
                .enableHiveSupport() \
                .getOrCreate()

    print(spark.range(5000).where("id > 500").selectExpr("sum(id)").collect())
    spark.range(500).where("id > 400").show()