
7. spark-submit, the production application submission script for Spark


I. Run spark-submit --help to see which options are available when submitting a job.

Options (description and notes):

--master MASTER_URL
    spark://host:port, mesos://host:port, yarn,
    k8s://https://host:port, or local (Default: local[*]).
    Note: the most commonly used values are local (local mode) and yarn (YARN cluster mode).

--deploy-mode DEPLOY_MODE
    Whether to launch the driver program locally ("client") or on one of the
    worker machines inside the cluster ("cluster") (Default: client).
    Note: this decides whether the driver starts in the local client process or on a
    worker node inside the cluster.
    In cluster mode, the YARN cluster manages the driver process; once the application
    has been created, the client can exit.
    In client mode, the driver runs inside the client process, and YARN only allocates
    resources for the executors; it does not manage the driver.

--class CLASS_NAME
    Your application's main class (for Java / Scala apps).

--name NAME
    A name of your application.

--jars JARS
    Comma-separated list of jars to include on the driver and executor classpaths.

--packages
    Comma-separated list of maven coordinates of jars to include on the driver and
    executor classpaths. Will search the local maven repo, then maven central and any
    additional remote repositories given by --repositories. The format for the
    coordinates should be groupId:artifactId:version.

--exclude-packages
    Comma-separated list of groupId:artifactId, to exclude while resolving the
    dependencies provided in --packages to avoid dependency conflicts.

--repositories
    Comma-separated list of additional remote repositories to search for the maven
    coordinates given with --packages.

--py-files PY_FILES
    Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for
    Python apps.

--files FILES
    Comma-separated list of files to be placed in the working directory of each
    executor. File paths of these files in executors can be accessed via
    SparkFiles.get(fileName).

--conf, -c PROP=VALUE
    Arbitrary Spark configuration property.

--properties-file FILE
    Path to a file from which to load extra properties. If not specified, this will
    look for conf/spark-defaults.conf.

--driver-memory MEM
    Memory for driver (e.g. 1000M, 2G) (Default: 1024M).

--driver-java-options
    Extra Java options to pass to the driver.

--driver-library-path
    Extra library path entries to pass to the driver.

--driver-class-path
    Extra class path entries to pass to the driver. Note that jars added with --jars
    are automatically included in the classpath.

--executor-memory MEM
    Memory per executor (e.g. 1000M, 2G) (Default: 1G).

--proxy-user NAME
    User to impersonate when submitting the application. This argument does not work
    with --principal / --keytab.

--help, -h
    Show this help message and exit.

--verbose, -v
    Print additional debug output.

--version
    Print the version of current Spark.
    Note: run spark-submit --version to print the version in use.

Cluster deploy mode only:

--driver-cores NUM
    Number of cores used by the driver, only in cluster mode (Default: 1).

Spark standalone or Mesos with cluster deploy mode only:

--supervise
    If given, restarts the driver on failure.

Spark standalone, Mesos or K8s with cluster deploy mode only:

--kill SUBMISSION_ID
    If given, kills the driver specified.

--status SUBMISSION_ID
    If given, requests the status of the driver specified.

Spark standalone, Mesos and Kubernetes only:

--total-executor-cores NUM
    Total cores for all executors.

Spark standalone, YARN and Kubernetes only:

--executor-cores NUM
    Number of cores used by each executor. (Default: 1 in YARN and K8S modes, or all
    available cores on the worker in standalone mode).

Spark on YARN and Kubernetes only:

--num-executors NUM
    Number of executors to launch (Default: 2). If dynamic allocation is enabled, the
    initial number of executors will be at least NUM.

--principal PRINCIPAL
    Principal to be used to login to KDC.

--keytab KEYTAB
    The full path to the file that contains the keytab for the principal specified
    above.

Spark on YARN only:

--queue QUEUE_NAME
    The YARN queue to submit to (Default: "default").

--archives ARCHIVES
    Comma separated list of archives to be extracted into the working directory of
    each executor.

II. Submitting Scala applications with spark-submit

1. YARN (cluster) mode

1.1 spark-submit command template

spark-submit --class TestClass \
--master yarn \
--queue ${queue_name} \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--driver-cores 2 \
--num-executors 4 \
--executor-cores 4 \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.shuffle.partitions=6400 \
--conf spark.default.parallelism=6400 \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--conf spark.hadoop.hive.exec.orc.split.strategy=ETL \
--name scala_test \
AtestSparkApplication.jar

1.2 Example Scala object
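
A minimal sketch of what the TestClass object referenced by the command above might look like. The Hive table, date range, and query are illustrative placeholders borrowed from the PySpark example later in this post, not a definitive implementation; cluster resources are supplied entirely by the spark-submit options.

import org.apache.spark.sql.SparkSession

object TestClass {
  def main(args: Array[String]): Unit = {
    // Master, queue, memory, etc. come from spark-submit, so they are not set here.
    val spark = SparkSession.builder()
      .appName("scala_test")
      .enableHiveSupport()
      .getOrCreate()

    // Illustrative workload: count the distinct purchasing users in a Hive table.
    val df = spark.sql(
      """
        |SELECT COUNT(a.user_id)
        |FROM (
        |  SELECT user_id
        |  FROM app.app_purchase_table
        |  WHERE dt >= '2019-01-01' AND dt <= '2020-12-31'
        |  GROUP BY user_id
        |) a
      """.stripMargin)
    df.show()

    spark.stop()
  }
}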

2. Local mode

2.1 spark-submit command template

spark-submit --class TestClass \
--master local \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--name scala_test \
AtestSparkApplication.jar

2.2 Example Scala object
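
A minimal sketch of a TestClass object suited to local-mode testing, mirroring the toy workload of the local PySpark example further below; names and values are illustrative only.

import org.apache.spark.sql.SparkSession

object TestClass {
  def main(args: Array[String]): Unit = {
    // --master local is passed by spark-submit; nothing cluster-specific is needed here.
    val spark = SparkSession.builder()
      .appName("scala_test")
      .getOrCreate()

    // Toy workload: sum of the ids greater than 500 in a generated range.
    spark.range(5000).where("id > 500").selectExpr("sum(id)").show()

    spark.stop()
  }
}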

III. Submitting Python scripts with spark-submit

1. YARN (cluster) mode

1.1 spark-submit command templates

(1) A single Python script with no other dependency files

spark-submit \
 --master yarn \
 --queue ${queue_name} \
 --deploy-mode client \
 --driver-memory 4G \
 --driver-cores 4 \
 --executor-memory 8G \
 --executor-cores 4 \
 --num-executors 100 \
 --conf spark.default.parallelism=1600 \
 --name "spark_demo_yarn" \
 pyspark_example_yarn.py 

(2) A Python script plus one or more txt data files
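
A hedged template for this case, assuming two data files named data1.txt and data2.txt (hypothetical names): --files ships them to the working directory of each executor, and the script can then locate them via SparkFiles.get("data1.txt"), as described in the options list above.

spark-submit \
 --master yarn \
 --queue ${queue_name} \
 --deploy-mode client \
 --driver-memory 4G \
 --executor-memory 8G \
 --num-executors 100 \
 --files data1.txt,data2.txt \
 --name "spark_demo_yarn_files" \
 pyspark_example_yarn.py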

(3) A Python script plus one or more dependent Python scripts
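
A hedged template for this case, assuming the main script imports helpers from utils.py and a zipped package libs.zip (hypothetical names): --py-files places them on the PYTHONPATH of the driver and executors so those imports resolve at runtime.

spark-submit \
 --master yarn \
 --queue ${queue_name} \
 --deploy-mode client \
 --driver-memory 4G \
 --executor-memory 8G \
 --num-executors 100 \
 --py-files utils.py,libs.zip \
 --name "spark_demo_yarn_pyfiles" \
 pyspark_example_yarn.py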

1.2 Example script: pyspark_example_yarn.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function


if __name__ == '__main__':

    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
                .appName("Word Count") \
                .config("spark.some.config.option", "some-value") \
                .enableHiveSupport() \
                .getOrCreate()

    df = spark.sql("""
            SELECT
                COUNT(a.user_id)
            FROM
                (
                    SELECT
                        user_id
                    FROM
                        app.app_purchase_table
                    WHERE
                        dt >= "2019-01-01"
                        AND dt <= "2020-12-31"
                        AND sku_code IN (700052, 721057)
                    GROUP BY
                        user_id
                ) a
            """)
    df.show()

2. Local mode

2.1 spark-submit command template

spark-submit \
 --master local \
 --deploy-mode client \
 --name "spark_demo_local" \
 pyspark_example_local.py 

2.2 Example script: pyspark_example_local.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function


if __name__ == '__main__':

    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
                .appName("Word Count") \
                .config("spark.some.config.option", "some-value") \
                .enableHiveSupport() \
                .getOrCreate()

    print(spark.range(5000).where("id > 500").selectExpr("sum(id)").collect())
    spark.range(500).where("id > 400").show()