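Submitting Your First Spark Job to a Cluster
阿新 • Published: 2019-01-04
Preface
I have been working with Spark for a while, but I had never actually run code I wrote myself on a cluster. Today I write a simple WordCount program in Scala on my local machine, then package it and submit it to the cluster to run.
I developed the program locally in IDEA. Since it is quite simple, I will just give the code here.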
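import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // The master URL is not set here; it is supplied by spark-submit --master.
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // Read the input file as an RDD of lines.
    val input = sc.textFile("/home/hadoop/data/test1.txt")
    // Split each line on single spaces to get an RDD of words.
    val words = input.flatMap(line => line.split(" "))
    // Pair each word with 1, then sum the counts per word.
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
    // Write one part file per partition; the output directory must not already exist.
    counts.saveAsTextFile("/home/hadoop/data/output")
    sc.stop()
  }
}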
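To try the transformation chain without a cluster, the same logic can be sketched on an ordinary Scala collection. This is only an illustration under stated assumptions: the sample lines are made up, `LocalWordCount` and `countWords` are hypothetical names, and `groupBy` followed by a sum stands in for Spark's `reduceByKey`.

```scala
object LocalWordCount {
  // Hypothetical helper: mirrors flatMap -> map -> reduceByKey on a plain Seq.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // same tokenization as the Spark job
      .map(word => (word, 1))             // pair each word with 1
      .groupBy { case (word, _) => word } // stand-in for the shuffle by key
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Made-up sample input, not the contents of test1.txt.
    val sample = Seq("hello spark", "hello hadoop")
    countWords(sample).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Running `main` prints one `(word,count)` tuple per line, in the same textual form that `saveAsTextFile` writes to the part files.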
Once the code is written, package it into a jar file.
Next, upload the generated jar to the cluster.
[hadoop@hadoop000 jars]$ rz
[hadoop@hadoop000 jars]$ ls
scalafirst.jar
[hadoop@hadoop000 jars]$
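With the jar uploaded, we can start the Spark cluster.
Before starting, it is worth checking the file to be counted, /home/hadoop/data/test1.txt.
Starting the master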
[hadoop@hadoop000 sbin]$ pwd
/home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/sbin
[hadoop@hadoop000 sbin]$ ./start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-hadoop000.out
[hadoop@hadoop000 sbin]$
Check the result:
[hadoop@hadoop000 sbin]$ jps
25266 Master
25336 Jps
22815 SparkSubmit
[hadoop@hadoop000 sbin]$
The Master process is listed, so the master started successfully.
Starting the worker
[hadoop@hadoop000 spark-2.2.0-bin-2.6.0-cdh5.7.0]$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://hadoop000:7077
Check the result:
[hadoop@hadoop000 ~]$ jps
25266 Master
25356 Worker
25421 Jps
22815 SparkSubmit
[hadoop@hadoop000 ~]$
The worker has started successfully as well.
Submitting the job and computing the result
[hadoop@hadoop000 spark-2.2.0-bin-2.6.0-cdh5.7.0]$ ./bin/spark-submit --master spark://hadoop000:7077 --class WordCount /home/hadoop/jars/scalafirst.jar
17/12/02 23:05:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/02 23:05:25 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
[Stage 0:> (0 + 0) / 2
[Stage 0:> (0 + 1) / 2
[Stage 0:> (0 + 2) / 2
[Stage 0:=============================> (1 + 1) / 2
[Stage 1:> (0 + 0) / 2
[Stage 1:> (0 + 1) / 2
[Stage 1:=============================> (1 + 1) / 2
[hadoop@hadoop000 spark-2.2.0-bin-2.6.0-cdh5.7.0]$
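The two WARN lines are harmless here. The port 4040 message most likely comes from the SparkSubmit process already listed by jps, which holds the default UI port.
Check the result: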
[hadoop@hadoop000 data]$ pwd
/home/hadoop/data
[hadoop@hadoop000 data]$ cd output/
[hadoop@hadoop000 output]$ ls
part-00000 part-00001 _SUCCESS
[hadoop@hadoop000 output]$ cat part-00000
(hive,1)
(,1)
(hello,5)
(kafka,1)
(sqoop,1)
[hadoop@hadoop000 output]$ cat part-00001
(spark,1)
(hadoop,1)
(flume,1)
(hbase,1)
[hadoop@hadoop000 output]$
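The `(,1)` entry in part-00000 is an empty-string token. The input file is not shown, so the cause is a guess, but a blank line or a run of consecutive spaces would produce it, because `split(" ")` returns an empty string in both cases. One possible variant of the tokenization, sketched on plain strings with a hypothetical `tokenize` helper, splits on any whitespace and drops empty tokens:

```scala
object Tokenize {
  // Hypothetical helper: split on runs of whitespace and discard empty tokens.
  def tokenize(line: String): List[String] =
    line.split("\\s+").filter(_.nonEmpty).toList

  def main(args: Array[String]): Unit = {
    println("".split(" ").toList)     // a blank line yields a single empty token
    println("a  b".split(" ").toList) // two spaces yield an empty token in the middle
    println(tokenize("a  b"))         // only the two real words remain
    println(tokenize(""))             // a blank line contributes nothing
  }
}
```

In the Spark job, the equivalent change would be `input.flatMap(_.split("\\s+")).filter(_.nonEmpty)` in place of the single-space split.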
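You can compare these counts against the input file checked earlier.
That completes the word count, and the results look correct. With just three simple steps (start the master, start the worker, submit the job), we ran our first program on the cluster. If you are a beginner, give it a try.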