Building a Spark project with sbt (Scala + Spark + sbt)
Preparation. The file structure is as follows:
(python2.7) [email protected]:~/Desktop/WordCount$ tree .
├── build.sbt
├── src
│   └── main
│       └── scala
│           └── WordCount.scala
WordCount.scala contains the following:
import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val outputFile = args(1)
    val conf = new SparkConf().setAppName("wordCount")
    // Create a Scala Spark Context.
    val sc = new SparkContext(conf)
    // Load our input data.
    val input = sc.textFile(inputFile)
    // Split up into words.
    val words = input.flatMap(line => line.split(" "))
    // Transform into word and count.
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    // Save the word count back out to a text file, causing evaluation.
    counts.saveAsTextFile(outputFile)
  }
}
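For intuition about what the flatMap / map / reduceByKey pipeline does, the same transformation can be tried interactively in spark-shell. This is a minimal sketch: the sample input literal is made up, and sc is the SparkContext that spark-shell provides automatically.

// Build a tiny in-memory RDD of lines, then apply the same word-count pipeline
val sample = sc.parallelize(Seq("a b a", "b c"))
sample.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()
// e.g. Array((a,2), (b,2), (c,1)) -- the ordering of the pairs is not guaranteed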
The contents of build.sbt are as follows:
name := "learning-spark-mini-example"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"
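The "provided" scope marks spark-core as supplied by the environment (the Spark installation that spark-submit runs under): it is needed for compilation but does not have to be shipped with the application. A side effect is that provided dependencies are excluded from the classpath of sbt run. If you instead wanted to launch the job directly with sbt run, a hypothetical variant of build.sbt (an assumption for illustration, not the build this post uses) would drop "provided"; the program would also need a master, e.g. setMaster("local[*]") on the SparkConf.

// Hypothetical build.sbt variant for running locally with `sbt run`
// (assumption: not the workflow used in this post, which relies on spark-submit)
name := "learning-spark-mini-example"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"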
Note that the HDFS file system is not directly visible from Linux: a directory created with hdfs dfs -mkdir, for example, cannot be found by browsing the Linux file system.
Both the input and the output of this experiment live in HDFS, so they are likewise not visible under Linux.
The detailed steps follow.
1. Before running this example, HDFS must be started, otherwise the job fails with "connection refused". The command is:
./start-dfs.sh
Then run jps and check that both the NameNode and the DataNode are running.
2. Copy the file README.txt from the Linux file system into HDFS:
hdfs dfs -mkdir /user/appleyuchi
hdfs dfs -put README.txt /user/appleyuchi
3. Run sbt package to compile the project and build the JAR.
4. Submit the job with spark-submit:
/home/appleyuchi/bigdata/spark-2.3.1-bin-hadoop2.7/bin/spark-submit \
  --class "WordCount" \
  --master local \
  /home/appleyuchi/Desktop/WordCount/target/scala-2.11/learning-spark-mini-example_2.11-1.0.jar \
  hdfs://localhost:9000/user/appleyuchi/README.txt \
  ./wordcounts
Note: hdfs://localhost:9000/user/appleyuchi/README.txt is not a path on the Linux file system; it is the path of the file after it was uploaded to HDFS in step 2. In other words, the file Spark processes here is not one we can see directly under Linux.
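To make the distinction concrete, sc.textFile accepts both kinds of URIs. A minimal sketch (the local path below is purely illustrative):

// HDFS path -- what this example actually uses
val hdfsInput = sc.textFile("hdfs://localhost:9000/user/appleyuchi/README.txt")
// A local Linux file would need an explicit file:// URI instead (illustrative path)
val localInput = sc.textFile("file:///home/appleyuchi/README.txt")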
5. Copy the results from HDFS back to the Linux file system:
hadoop fs -get /user/appleyuchi/wordcounts ~/wordcounts
cd ~/wordcounts
cat part-00000
The output (one (word,count) pair per line in the actual file) is:
(Hadoop,1) (Commodity,1) (For,1) (this,3) (country,1) (under,1) (it,1) (The,4) (Jetty,1) (Software,2) (Technology,1) (<http://www.wassenaar.org/>,1) (have,1) (http://wiki.apache.org/hadoop/,1) (BIS,1) (classified,1) (This,1) (following,1) (which,2) (security,1) (See,1) (encryption,3) (Number,1) (export,1) (reside,1) (for,3) ((BIS),,1) (any,1) (at:,2) (software,2) (makes,1) (algorithms.,1) (re-export,2) (latest,1) (your,1) (SSL,1) (the,8) (Administration,1) (includes,2) (import,,2) (provides,1) (Unrestricted,1) (country's,1) (if,1) (740.13),1) (Commerce,,1) (country,,1) (software.,2) (concerning,1) (laws,,1) (source,1) (possession,,2) (Apache,1) (our,2) (written,1) (as,1) (License,1) (regulations,1) (libraries,1) (by,1) (please,2) (form,1) (BEFORE,1) (ENC,1) (code.,1) (both,1) (5D002.C.1,,1) (distribution,2) (visit,1) (is,1) (about,1) (website,1) (currently,1) (permitted.,1) (check,1) (Security,1) (Section,1) (on,2) (performing,1) ((see,1) (U.S.,1) (with,1) (in,1) ((ECCN),1) (object,1) (using,2) (cryptographic,3) (mortbay.org.,1) (and/or,1) (Department,1) (manner,1) (from,1) (Core,1) (has,1) (may,1) (Exception,1) (Industry,1) (restrictions,1) (details,1) (http://hadoop.apache.org/core/,1) (project,1) (you,1) (another,1) (or,2) (use,,2) (policies,1) (uses,1) (information,2) (Hadoop,,1) (to,2) (code,1) (software,,2) (Regulations,,1) (more,2) (software:,1) (see,1) (,18) (of,5) (wiki,,1) (Bureau,1) (Control,1) (exception,1) (Government,1) (eligible,1) (Export,2) (information.,1) (Foundation,1) (functions,1) (and,6) (included,1) ((TSU),1) (asymmetric,1)
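The output lands in one or more part-NNNNN files in no particular order. If a single, count-sorted output file were wanted, a hypothetical variant of the last lines of WordCount.scala (an illustration, not what this post runs) could sort and coalesce before saving:

// Hypothetical variant: sort by descending count and write a single part file
val sorted = counts.sortBy(_._2, ascending = false) // order pairs by count, largest first
sorted.coalesce(1).saveAsTextFile(outputFile)       // merge into one partition -> one part-00000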
You can also view the results directly in HDFS with the following command:
hdfs dfs -cat /user/appleyuchi/wordcounts/*
-------------------------------------------------------------
Other commands used:
hdfs dfs -rmr input   (deletes the input directory; a relative path like this resolves under the user's HDFS home directory, /user/<username>)
hdfs dfs -ls
To summarize: first start HDFS, then upload the input file README.txt into HDFS; before running, resolve the dependencies and build the JAR with sbt; then submit and run the code. The results initially live in HDFS, so to inspect them we copy them from HDFS back to the Linux file system, and only then are we done.
Reference:
https://blog.csdn.net/coder__cs/article/details/78992764