
Building a Spark project with sbt (Scala + Spark + sbt)

Preparation: the project layout is as follows:

(python2.7) [email protected]:~/Desktop/WordCount$ tree .
├── build.sbt
├── src
│   └── main
│       └── scala
│           └── WordCount.scala

WordCount.scala is as follows:

import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
    def main(args: Array[String]) {
      val inputFile = args(0)
      val outputFile = args(1)
      val conf = new SparkConf().setAppName("wordCount")
      // Create a Scala Spark Context.
      val sc = new SparkContext(conf)
      // Load our input data.
      val input =  sc.textFile(inputFile)
      // Split up into words.
      val words = input.flatMap(line => line.split(" "))
      // Transform into word and count.
      val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
      // Save the word count back out to a text file, causing evaluation.
      counts.saveAsTextFile(outputFile)
    }
}
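As an aside, the pattern-match closure passed to reduceByKey above can be written more compactly with Scala's placeholder syntax; the one-liner below is a drop-in replacement for the counts line and produces the same RDD:

// Same counting step, using placeholder syntax instead of a pattern match:
val counts = words.map((_, 1)).reduceByKey(_ + _)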

build.sbt contains:

name := "learning-spark-mini-example"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"
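The "provided" scope means spark-core is on the compile classpath but is expected to be supplied by the Spark runtime (spark-submit) rather than packaged with the application. A minimal sketch of a hypothetical build.sbt variant for running with sbt run instead of spark-submit, which simply drops the scope:

// Hypothetical variant for local `sbt run` testing: without "provided",
// spark-core is also on the runtime classpath.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"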

Note that HDFS is a separate file system from the local Linux one: a directory created with hdfs dfs -mkdir does not appear anywhere under Linux. Both the input and the output of this experiment live in HDFS, so you cannot browse them directly from the Linux shell.
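To illustrate that HDFS is its own namespace, here is a minimal sketch that lists an HDFS directory through the Hadoop API rather than the Linux shell. It assumes the NameNode address hdfs://localhost:9000 used in step 4 below and that hadoop-client is on the classpath (spark-core pulls it in); the object name HdfsLs is just for illustration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsLs {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://localhost:9000") // assumed NameNode address
    val fs = FileSystem.get(conf)
    // Equivalent to `hdfs dfs -ls /user/appleyuchi`, but via the Java/Scala API.
    fs.listStatus(new Path("/user/appleyuchi")).foreach(status => println(status.getPath))
  }
}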

The detailed steps are as follows.

1. Before running this example, start HDFS, otherwise spark-submit will fail with "connection refused". The command is:

./start-dfs.sh

Then run jps and check that both the NameNode and DataNode processes are up.

2. Copy the Linux file README.txt into HDFS:

hdfs dfs -mkdir /user/appleyuchi
hdfs dfs -put README.txt /user/appleyuchi
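The same upload can also be done programmatically. A minimal sketch using the Hadoop API, mirroring the two commands above (the object name HdfsUpload and the hdfs://localhost:9000 address are assumptions, as before):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsUpload {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://localhost:9000") // assumed NameNode address
    val fs = FileSystem.get(conf)
    fs.mkdirs(new Path("/user/appleyuchi"))                                   // hdfs dfs -mkdir
    fs.copyFromLocalFile(new Path("README.txt"), new Path("/user/appleyuchi")) // hdfs dfs -put
  }
}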

3. Run sbt package in the project root; this produces target/scala-2.11/learning-spark-mini-example_2.11-1.0.jar.

4. Submit the packaged jar with spark-submit:

/home/appleyuchi/bigdata/spark-2.3.1-bin-hadoop2.7/bin/spark-submit --class "WordCount" --master local /home/appleyuchi/Desktop/WordCount/target/scala-2.11/learning-spark-mini-example_2.11-1.0.jar hdfs://localhost:9000/user/appleyuchi/README.txt ./wordcounts

One thing to note here: hdfs://localhost:9000/user/appleyuchi/README.txt is not a Linux file path; it is the HDFS path created in step 2 when the Linux file was uploaded. Likewise the relative output path ./wordcounts resolves inside HDFS (under /user/appleyuchi, as step 5 shows), which is why neither the input nor the output is visible from the Linux shell while Spark runs.
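For quick experiments you can also pin the master inside the program instead of on the command line. A minimal self-contained sketch (the object name WordCountLocal is hypothetical); note that a master set in code takes precedence over spark-submit's --master flag, which is why WordCount.scala above leaves it out:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process with one worker thread per core.
    val conf = new SparkConf().setAppName("wordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val counts = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println) // print instead of saving, for a quick check
    sc.stop()
  }
}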

5. Copy the result from HDFS back to the Linux file system:

hadoop fs -get /user/appleyuchi/wordcounts ~/wordcounts

cd ~/wordcounts
cat part-00000

(Hadoop,1) (Commodity,1) (For,1) (this,3) (country,1) (under,1) (it,1) (The,4) (Jetty,1) (Software,2) (Technology,1) (<http://www.wassenaar.org/>,1) (have,1) (http://wiki.apache.org/hadoop/,1) (BIS,1) (classified,1) (This,1) (following,1) (which,2) (security,1) (See,1) (encryption,3) (Number,1) (export,1) (reside,1) (for,3) ((BIS),,1) (any,1) (at:,2) (software,2) (makes,1) (algorithms.,1) (re-export,2) (latest,1) (your,1) (SSL,1) (the,8) (Administration,1) (includes,2) (import,,2) (provides,1) (Unrestricted,1) (country's,1) (if,1) (740.13),1) (Commerce,,1) (country,,1) (software.,2) (concerning,1) (laws,,1) (source,1) (possession,,2) (Apache,1) (our,2) (written,1) (as,1) (License,1) (regulations,1) (libraries,1) (by,1) (please,2) (form,1) (BEFORE,1) (ENC,1) (code.,1) (both,1) (5D002.C.1,,1) (distribution,2) (visit,1) (is,1) (about,1) (website,1) (currently,1) (permitted.,1) (check,1) (Security,1) (Section,1) (on,2) (performing,1) ((see,1) (U.S.,1) (with,1) (in,1) ((ECCN),1) (object,1) (using,2) (cryptographic,3) (mortbay.org.,1) (and/or,1) (Department,1) (manner,1) (from,1) (Core,1) (has,1) (may,1) (Exception,1) (Industry,1) (restrictions,1) (details,1) (http://hadoop.apache.org/core/,1) (project,1) (you,1) (another,1) (or,2) (use,,2) (policies,1) (uses,1) (information,2) (Hadoop,,1) (to,2) (code,1) (software,,2) (Regulations,,1) (more,2) (software:,1) (see,1) (,18) (of,5) (wiki,,1) (Bureau,1) (Control,1) (exception,1) (Government,1) (eligible,1) (Export,2) (information.,1) (Foundation,1) (functions,1) (and,6) (included,1) ((TSU),1) (asymmetric,1)

You can also inspect the result directly in HDFS with:

hdfs dfs -cat /user/appleyuchi/wordcounts/*
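The saved part files can also be read back with Spark itself. A minimal sketch (the object name ReadCounts is hypothetical) that parses the "(word,count)" lines back into pairs:

import org.apache.spark.{SparkConf, SparkContext}

object ReadCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("readCounts").setMaster("local[*]"))
    val counts = sc.textFile("hdfs://localhost:9000/user/appleyuchi/wordcounts")
      .map(_.stripPrefix("(").stripSuffix(")")) // "(word,count)" -> "word,count"
      .map { s =>
        val i = s.lastIndexOf(',')              // split on the last comma, since a word may contain commas
        (s.substring(0, i), s.substring(i + 1).toLong)
      }
    counts.take(10).foreach(println)
    sc.stop()
  }
}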

-------------------------------------------------------------
Other commands used:

hdfs dfs -rmr input   (deletes the input directory; note that relative HDFS paths resolve against the user's home directory /user/<username>, not the root, and -rmr is deprecated in favour of hdfs dfs -rm -r)
hdfs dfs -ls

To summarize: first start HDFS; then upload the input file README.txt into HDFS; before running, let sbt resolve the dependencies and package the code; then submit the job. The result is initially written to HDFS, so to inspect it we finally copy it from HDFS back into the Linux file system.

Reference: https://blog.csdn.net/coder__cs/article/details/78992764