7. spark-submit, Spark's production application submission script
I. Run spark-submit --help to see which options are available when submitting a job.
| Option | Description | Notes |
| --- | --- | --- |
| --master MASTER_URL | spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). | Most commonly local (local mode) or yarn (cluster mode). |
| --deploy-mode DEPLOY_MODE | Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). | Whether the driver starts in the local client ("client") or on a worker node inside the cluster ("cluster"). In cluster mode, YARN manages the driver process, so the client can exit once the application has been created. In client mode, the driver runs inside the client process; YARN only provides resources for the executors and does not manage the driver. |
| --class CLASS_NAME | Your application's main class (for Java / Scala apps). | Main class of a Java/Scala application. |
| --name NAME | A name of your application. | Name given to the application. |
| --jars JARS | Comma-separated list of jars to include on the driver and executor classpaths. | Comma-separated list of jars added to the driver and executor classpaths. |
| --packages | Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version. | Comma-separated list of Maven coordinates added to the driver and executor classpaths; resolved from the local Maven repository or from remote repositories. |
| --exclude-packages | Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts. | Comma-separated list of packages to skip during dependency resolution, to avoid conflicts. |
| --repositories | Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages. | |
| --py-files PY_FILES | Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps. | Comma-separated list of .zip, .egg, or .py files. |
| --files FILES | Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName). | Comma-separated list of files placed in each executor's working directory; they can be read via SparkFiles.get(fileName). |
| --conf, -c PROP=VALUE | Arbitrary Spark configuration property. | Configuration property. |
| --properties-file FILE | Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf. | Properties file. |
| --driver-memory MEM | Memory for driver (e.g. 1000M, 2G) (Default: 1024M). | Driver memory. |
| --driver-java-options | Extra Java options to pass to the driver. | |
| --driver-library-path | Extra library path entries to pass to the driver. | |
| --driver-class-path | Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. | |
| --executor-memory MEM | Memory per executor (e.g. 1000M, 2G) (Default: 1G). | Executor memory. |
| --proxy-user NAME | User to impersonate when submitting the application. This argument does not work with --principal / --keytab. | |
| --help, -h | Show this help message and exit. | Run spark-submit --help for this help text. |
| --verbose, -v | Print additional debug output. | |
| --version | Print the version of current Spark. | Run spark-submit --version to check the current version. |
| Cluster deploy mode only: | | |
| --driver-cores NUM | Number of cores used by the driver, only in cluster mode (Default: 1). | |
| Spark standalone or Mesos with cluster deploy mode only: | | |
| --supervise | If given, restarts the driver on failure. | |
| Spark standalone, Mesos or K8s with cluster deploy mode only: | | |
| --kill SUBMISSION_ID | If given, kills the driver specified. | |
| --status SUBMISSION_ID | If given, requests the status of the driver specified. | |
| Spark standalone, Mesos and Kubernetes only: | | |
| --total-executor-cores NUM | Total cores for all executors. | |
| Spark standalone, YARN and Kubernetes only: | | |
| --executor-cores NUM | Number of cores used by each executor. (Default: 1 in YARN and K8S modes, or all available cores on the worker in standalone mode). | |
| Spark on YARN and Kubernetes only: | | |
| --num-executors NUM | Number of executors to launch (Default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM. | Number of executors. |
| --principal PRINCIPAL | Principal to be used to login to KDC. | |
| --keytab KEYTAB | The full path to the file that contains the keytab for the principal specified above. | |
| Spark on YARN only: | | |
| --queue QUEUE_NAME | The YARN queue to submit to (Default: "default"). | Queue name. |
| --archives ARCHIVES | Comma separated list of archives to be extracted into the working directory of each executor. | |
II. spark-submit for Scala applications
1. YARN cluster mode
1.1 spark-submit command template
spark-submit --class TestClass \
--master yarn \
--queue ${your_queue_name} \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--driver-cores 2 \
--num-executors 4 \
--executor-cores 4 \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.shuffle.partitions=6400 \
--conf spark.default.parallelism=6400 \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--conf spark.hadoop.hive.exec.orc.split.strategy=ETL \
--name scala_test \
AtestSparkApplication.jar
1.2 Example Scala object
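A minimal sketch of what the TestClass object referenced by --class above might look like; the Hive table name is reused from the PySpark example in section III and should be replaced with your own:

import org.apache.spark.sql.SparkSession

object TestClass {
  def main(args: Array[String]): Unit = {
    // Master, queue and resource settings are supplied by spark-submit,
    // so nothing cluster-specific is hard-coded here.
    val spark = SparkSession.builder()
      .appName("scala_test")
      .enableHiveSupport()
      .getOrCreate()

    // Count the distinct purchasing users, mirroring the PySpark example in section III.
    val df = spark.sql(
      """
        |SELECT COUNT(a.user_id)
        |FROM (
        |  SELECT user_id
        |  FROM app.app_purchase_table
        |  WHERE dt >= "2019-01-01" AND dt <= "2020-12-31"
        |    AND sku_code IN (700052, 721057)
        |  GROUP BY user_id
        |) a
      """.stripMargin)
    df.show()

    spark.stop()
  }
}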
2. Local mode
2.1 spark-submit command template
spark-submit --class TestClass \
--master local \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--name scala_test \
AtestSparkApplication.jar
2.2 Example Scala object
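A minimal local-mode sketch for the same TestClass entry point; it simply mirrors the local PySpark example in section III:

import org.apache.spark.sql.SparkSession

object TestClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scala_test")
      .getOrCreate()

    // Sum all ids greater than 500 out of a generated range of 5000 rows.
    spark.range(5000).where("id > 500").selectExpr("sum(id)").show()

    spark.stop()
  }
}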
III. spark-submit for Python applications
1. YARN cluster mode
1.1 spark-submit command templates
(1) A single Python script with no other dependency files
spark-submit \
--master yarn \
--queue ${your_queue_name} \
--deploy-mode client \
--driver-memory 4G \
--driver-cores 4 \
--executor-memory 8G \
--executor-cores 4 \
--num-executors 100 \
--conf spark.default.parallelism=1600 \
--name "spark_demo_yarn" \
pyspark_example_yarn.py
(2) A Python script plus one or more text files
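One possible template for this case, assuming two hypothetical text files config1.txt and config2.txt: it is the same command as in (1) with a --files option added. The files end up in each executor's working directory and can be located with SparkFiles.get(fileName).

spark-submit \
--master yarn \
--queue ${your_queue_name} \
--deploy-mode client \
--driver-memory 4G \
--executor-memory 8G \
--num-executors 100 \
--files config1.txt,config2.txt \
--name "spark_demo_yarn" \
pyspark_example_yarn.py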
(3) A Python script plus one or more dependent Python scripts
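Similarly, when the main script imports other Python modules, they can be shipped with --py-files (a comma-separated list of .py, .zip, or .egg files placed on the PYTHONPATH, as described in the options table); dependencies.zip and utils.py below are hypothetical names:

spark-submit \
--master yarn \
--queue ${your_queue_name} \
--deploy-mode client \
--driver-memory 4G \
--executor-memory 8G \
--num-executors 100 \
--py-files dependencies.zip,utils.py \
--name "spark_demo_yarn" \
pyspark_example_yarn.py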
1.2 Example script: pyspark_example_yarn.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

if __name__ == '__main__':
    from pyspark.sql import SparkSession

    # Build a Hive-enabled SparkSession; master and resources come from spark-submit.
    spark = SparkSession.builder \
        .appName("Word Count") \
        .config("spark.some.config.option", "some-value") \
        .enableHiveSupport() \
        .getOrCreate()

    # Count the distinct purchasing users over the two-year window.
    df = spark.sql("""
        SELECT
            COUNT(a.user_id)
        FROM
        (
            SELECT
                user_id
            FROM
                app.app_purchase_table
            WHERE
                dt >= "2019-01-01"
                AND dt <= "2020-12-31"
                AND sku_code IN (700052, 721057)
            GROUP BY
                user_id
        ) a
    """)
    df.show()
2. Local mode
2.1 spark-submit command template
spark-submit \
--master local \
--deploy-mode client \
--name "spark_demo_local" \
pyspark_example_local.py
2.2 Example script: pyspark_example_local.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

if __name__ == '__main__':
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("Word Count") \
        .config("spark.some.config.option", "some-value") \
        .enableHiveSupport() \
        .getOrCreate()

    # Sum of ids greater than 500, then a small DataFrame preview.
    print(spark.range(5000).where("id > 500").selectExpr("sum(id)").collect())
    spark.range(500).where("id > 400").show()