Spark from Getting Started to Giving Up: Running a jar Package on the Cluster
The Scala code is as follows.

Note: the Eclipse build path needs not only spark-assembly-1.5.0-cdh5.5.4-hadoop2.6.0-cdh5.5.4.jar from the Spark lib directory, but also the jars under Hadoop's share/hadoop directory. I am not sure exactly which of those are required, so I simply added them all.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

/**
 * Count word occurrences
 */
object WordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val line = sc.textFile(args(0))
    line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}
```
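Before submitting to the cluster, the same pipeline can be sanity-checked with plain Scala collections, which share flatMap and map with RDDs; reduceByKey has no direct collection equivalent, so groupBy plus a sum stands in for it here. A minimal local sketch (the object name LocalWordCount is made up for illustration and is not part of the job above):

```scala
object LocalWordCount {
  def main(args: Array[String]): Unit = {
    // The same content as the HDFS test file, held as an in-memory collection
    val lines = Seq("張三 張四 張三 張五 李三 李三 李四 李四 李四 王二 老王 老王")
    val counts = lines
      .flatMap(_.split(" "))                          // same shape as the RDD flatMap
      .map((_, 1))                                    // same shape as the RDD map
      .groupBy(_._1)                                  // stands in for reduceByKey's shuffle
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    counts.foreach(println)
  }
}
```

Running this in a plain Scala REPL produces the same (word, count) pairs as the cluster job, which helps separate logic bugs from packaging or submission problems.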
Package it into a jar with Eclipse.
Note: a Scala object's name does not have to match the file name, unlike in Java. For example, my object is named WordCount, but the file is WC.scala (hence the WC.scala references in the log output below).
Upload the jar to the server.
Check the contents of the test file on the server:
-bash-4.1$ hadoop fs -cat /user/hdfs/test.txt
張三 張四 張三 張五 李三 李三 李四 李四 李四 王二 老王 老王
Run the spark-submit command to submit the jar:
-bash-4.1$ spark-submit --class "WordCount" wc.jar /user/hdfs/test.txt
16/08/22 15:54:17 INFO SparkContext: Running Spark version 1.5.0-cdh5.5.4
16/08/22 15:54:18 INFO SecurityManager: Changing view acls to: hdfs
16/08/22 15:54:18 INFO SecurityManager: Changing modify acls to: hdfs
16/08/22 15:54:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs); users with modify permissions: Set(hdfs)
16/08/22 15:54:19 INFO Slf4jLogger: Slf4jLogger started
16/08/22 15:54:19 INFO Remoting: Starting remoting
16/08/22 15:54:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.56.201:55886]
16/08/22 15:54:19 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@192.168.56.201:55886]
16/08/22 15:54:19 INFO Utils: Successfully started service 'sparkDriver' on port 55886.
16/08/22 15:54:19 INFO SparkEnv: Registering MapOutputTracker
16/08/22 15:54:19 INFO SparkEnv: Registering BlockManagerMaster
16/08/22 15:54:19 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-eccd9ad6-6296-4508-a9c8-a22b5a36ecbe
16/08/22 15:54:20 INFO MemoryStore: MemoryStore started with capacity 534.5 MB
16/08/22 15:54:20 INFO HttpFileServer: HTTP File server directory is /tmp/spark-bbf694e7-32e2-40b6-88a3-4d97a1d1aab9/httpd-72a45554-b57b-4a5d-af2f-24f198e6300b
16/08/22 15:54:20 INFO HttpServer: Starting HTTP Server
16/08/22 15:54:20 INFO Utils: Successfully started service 'HTTP file server' on port 59636.
16/08/22 15:54:20 INFO SparkEnv: Registering OutputCommitCoordinator
16/08/22 15:54:41 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/08/22 15:54:41 INFO SparkUI: Started SparkUI at http://192.168.56.201:4040
16/08/22 15:54:41 INFO SparkContext: Added JAR file:/var/lib/hadoop-hdfs/wc.jar at http://192.168.56.201:59636/jars/wc.jar with timestamp 1471852481181
16/08/22 15:54:41 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/08/22 15:54:41 INFO RMProxy: Connecting to ResourceManager at hadoop01/192.168.56.201:8032
16/08/22 15:54:41 INFO Client: Requesting a new application from cluster with 2 NodeManagers
16/08/22 15:54:41 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (1536 MB per container)
16/08/22 15:54:41 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/08/22 15:54:41 INFO Client: Setting up container launch context for our AM
16/08/22 15:54:41 INFO Client: Setting up the launch environment for our AM container
16/08/22 15:54:41 INFO Client: Preparing resources for our AM container
16/08/22 15:54:42 INFO Client: Uploading resource file:/tmp/spark-bbf694e7-32e2-40b6-88a3-4d97a1d1aab9/__spark_conf__5421268438919389977.zip -> hdfs://hadoop01:8020/user/hdfs/.sparkStaging/application_1471848612199_0005/__spark_conf__5421268438919389977.zip
16/08/22 15:54:43 INFO SecurityManager: Changing view acls to: hdfs
16/08/22 15:54:43 INFO SecurityManager: Changing modify acls to: hdfs
16/08/22 15:54:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs); users with modify permissions: Set(hdfs)
16/08/22 15:54:43 INFO Client: Submitting application 5 to ResourceManager
16/08/22 15:54:43 INFO YarnClientImpl: Submitted application application_1471848612199_0005
16/08/22 15:54:44 INFO Client: Application report for application_1471848612199_0005 (state: ACCEPTED)
16/08/22 15:54:44 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.hdfs
start time: 1471852483082
final status: UNDEFINED
tracking URL: http://hadoop01:8088/proxy/application_1471848612199_0005/
user: hdfs
16/08/22 15:54:45 INFO Client: Application report for application_1471848612199_0005 (state: ACCEPTED)
16/08/22 15:54:46 INFO Client: Application report for application_1471848612199_0005 (state: ACCEPTED)
16/08/22 15:54:47 INFO Client: Application report for application_1471848612199_0005 (state: ACCEPTED)
16/08/22 15:54:48 INFO Client: Application report for application_1471848612199_0005 (state: ACCEPTED)
16/08/22 15:54:49 INFO Client: Application report for application_1471848612199_0005 (state: ACCEPTED)
16/08/22 15:54:49 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@192.168.56.206:46225/user/YarnAM#289706976])
16/08/22 15:54:49 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> hadoop01, PROXY_URI_BASES -> http://hadoop01:8088/proxy/application_1471848612199_0005), /proxy/application_1471848612199_0005
16/08/22 15:54:49 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/08/22 15:54:50 INFO Client: Application report for application_1471848612199_0005 (state: RUNNING)
16/08/22 15:54:50 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 192.168.56.206
ApplicationMaster RPC port: 0
queue: root.hdfs
start time: 1471852483082
final status: UNDEFINED
tracking URL: http://hadoop01:8088/proxy/application_1471848612199_0005/
user: hdfs
16/08/22 15:54:50 INFO YarnClientSchedulerBackend: Application application_1471848612199_0005 has started running.
16/08/22 15:54:50 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38391.
16/08/22 15:54:50 INFO NettyBlockTransferService: Server created on 38391
16/08/22 15:54:50 INFO BlockManager: external shuffle service port = 7337
16/08/22 15:54:50 INFO BlockManagerMaster: Trying to register BlockManager
16/08/22 15:54:50 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.56.201:38391 with 534.5 MB RAM, BlockManagerId(driver, 192.168.56.201, 38391)
16/08/22 15:54:50 INFO BlockManagerMaster: Registered BlockManager
16/08/22 15:54:51 INFO EventLoggingListener: Logging events to hdfs://hadoop01:8020/user/spark/applicationHistory/application_1471848612199_0005
16/08/22 15:54:51 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
16/08/22 15:54:52 INFO MemoryStore: ensureFreeSpace(195280) called with curMem=0, maxMem=560497950
16/08/22 15:54:52 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 190.7 KB, free 534.3 MB)
16/08/22 15:54:52 INFO MemoryStore: ensureFreeSpace(22784) called with curMem=195280, maxMem=560497950
16/08/22 15:54:52 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.3 KB, free 534.3 MB)
16/08/22 15:54:52 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.56.201:38391 (size: 22.3 KB, free: 534.5 MB)
16/08/22 15:54:52 INFO SparkContext: Created broadcast 0 from textFile at WC.scala:17
16/08/22 15:54:52 INFO FileInputFormat: Total input paths to process : 1
16/08/22 15:54:52 INFO SparkContext: Starting job: collect at WC.scala:19
16/08/22 15:54:52 INFO DAGScheduler: Registering RDD 3 (map at WC.scala:19)
16/08/22 15:54:52 INFO DAGScheduler: Got job 0 (collect at WC.scala:19) with 2 output partitions
16/08/22 15:54:52 INFO DAGScheduler: Final stage: ResultStage 1(collect at WC.scala:19)
16/08/22 15:54:52 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/08/22 15:54:52 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/08/22 15:54:52 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WC.scala:19), which has no missing parents
16/08/22 15:54:52 INFO MemoryStore: ensureFreeSpace(4024) called with curMem=218064, maxMem=560497950
16/08/22 15:54:52 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.9 KB, free 534.3 MB)
16/08/22 15:54:52 INFO MemoryStore: ensureFreeSpace(2281) called with curMem=222088, maxMem=560497950
16/08/22 15:54:52 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 534.3 MB)
16/08/22 15:54:52 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.56.201:38391 (size: 2.2 KB, free: 534.5 MB)
16/08/22 15:54:52 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
16/08/22 15:54:52 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WC.scala:19)
16/08/22 15:54:52 INFO YarnScheduler: Adding task set 0.0 with 2 tasks
16/08/22 15:54:53 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
16/08/22 15:54:54 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 2)
16/08/22 15:54:59 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@hadoop05:59707/user/Executor#729574503]) with ID 1
16/08/22 15:54:59 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
16/08/22 15:54:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, hadoop05, partition 0,NODE_LOCAL, 2186 bytes)
16/08/22 15:54:59 INFO BlockManagerMasterEndpoint: Registering block manager hadoop05:53273 with 534.5 MB RAM, BlockManagerId(1, hadoop05, 53273)
16/08/22 15:55:00 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop05:53273 (size: 2.2 KB, free: 534.5 MB)
16/08/22 15:55:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop05:53273 (size: 22.3 KB, free: 534.5 MB)
16/08/22 15:55:03 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, hadoop05, partition 1,NODE_LOCAL, 2186 bytes)
16/08/22 15:55:03 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3733 ms on hadoop05 (1/2)
16/08/22 15:55:03 INFO DAGScheduler: ShuffleMapStage 0 (map at WC.scala:19) finished in 10.621 s
16/08/22 15:55:03 INFO DAGScheduler: looking for newly runnable stages
16/08/22 15:55:03 INFO DAGScheduler: running: Set()
16/08/22 15:55:03 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/08/22 15:55:03 INFO DAGScheduler: failed: Set()
16/08/22 15:55:03 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 150 ms on hadoop05 (2/2)
16/08/22 15:55:03 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/08/22 15:55:03 INFO DAGScheduler: Missing parents for ResultStage 1: List()
16/08/22 15:55:03 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WC.scala:19), which is now runnable
16/08/22 15:55:03 INFO MemoryStore: ensureFreeSpace(2288) called with curMem=224369, maxMem=560497950
16/08/22 15:55:03 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.2 KB, free 534.3 MB)
16/08/22 15:55:03 INFO MemoryStore: ensureFreeSpace(1363) called with curMem=226657, maxMem=560497950
16/08/22 15:55:03 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1363.0 B, free 534.3 MB)
16/08/22 15:55:03 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.56.201:38391 (size: 1363.0 B, free: 534.5 MB)
16/08/22 15:55:03 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:861
16/08/22 15:55:03 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WC.scala:19)
16/08/22 15:55:03 INFO YarnScheduler: Adding task set 1.0 with 2 tasks
16/08/22 15:55:03 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, hadoop05, partition 0,PROCESS_LOCAL, 1950 bytes)
16/08/22 15:55:03 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on hadoop05:53273 (size: 1363.0 B, free: 534.5 MB)
16/08/22 15:55:03 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to hadoop05:59707
16/08/22 15:55:03 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 148 bytes
16/08/22 15:55:03 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, hadoop05, partition 1,PROCESS_LOCAL, 1950 bytes)
16/08/22 15:55:03 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 155 ms on hadoop05 (1/2)
16/08/22 15:55:03 INFO DAGScheduler: ResultStage 1 (collect at WC.scala:19) finished in 0.193 s
16/08/22 15:55:03 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 53 ms on hadoop05 (2/2)
16/08/22 15:55:03 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/08/22 15:55:03 INFO DAGScheduler: Job 0 finished: collect at WC.scala:19, took 11.041942 s
(張五,1)
(老王,2)
(張三,2)
(張四,1)
(王二,1)
(李四,3)
(李三,2)
16/08/22 15:55:03 INFO SparkUI: Stopped Spark web UI at http://192.168.56.201:4040
16/08/22 15:55:03 INFO DAGScheduler: Stopping DAGScheduler
16/08/22 15:55:03 INFO YarnClientSchedulerBackend: Interrupting monitor thread
16/08/22 15:55:03 INFO YarnClientSchedulerBackend: Shutting down all executors
16/08/22 15:55:03 INFO YarnClientSchedulerBackend: Asking each executor to shut down
16/08/22 15:55:03 INFO YarnClientSchedulerBackend: Stopped
16/08/22 15:55:03 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/08/22 15:55:03 INFO MemoryStore: MemoryStore cleared
16/08/22 15:55:03 INFO BlockManager: BlockManager stopped
16/08/22 15:55:03 INFO BlockManagerMaster: BlockManagerMaster stopped
16/08/22 15:55:03 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/08/22 15:55:03 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/08/22 15:55:03 INFO SparkContext: Successfully stopped SparkContext
16/08/22 15:55:03 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/08/22 15:55:03 INFO ShutdownHookManager: Shutdown hook called
16/08/22 15:55:03 INFO ShutdownHookManager: Deleting directory /tmp/spark-bbf694e7-32e2-40b6-88a3-4d97a1d1aab9
The job ran successfully.
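One thing worth noting about the output: the printed pairs are in no particular order, because reduceByKey hash-partitions keys and collect() just concatenates the partitions. If sorted output is wanted, the collected pairs can be sorted on the driver; a sketch of that post-processing in plain Scala (the sample data below is the result of the run above; on a large dataset you would instead sort the RDD itself, e.g. with sortBy, before collecting):

```scala
object SortCounts {
  def main(args: Array[String]): Unit = {
    // Pairs as collected from the job above
    val counts = Seq(("張五", 1), ("老王", 2), ("張三", 2), ("張四", 1),
                     ("王二", 1), ("李四", 3), ("李三", 2))
    // Sort by count descending, breaking ties by word, then print
    counts.sortBy { case (word, count) => (-count, word) }.foreach(println)
  }
}
```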