Building a Spark project with sbt (Scala + Spark + sbt)
Preparation. The file structure is as follows:
(python2.7) [email protected]:~/Desktop/WordCount$ tree .
├── build.sbt
├── src
│   └── main
│       └── scala
│           └── WordCount.scala
WordCount.scala contains the following:
import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val outputFile = args(1)
    val conf = new SparkConf().setAppName("wordCount")
    // Create a Scala Spark Context.
    val sc = new SparkContext(conf)
    // Load our input data.
    val input = sc.textFile(inputFile)
    // Split up into words.
    val words = input.flatMap(line => line.split(" "))
    // Transform into word and count.
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    // Save the word count back out to a text file, causing evaluation.
    counts.saveAsTextFile(outputFile)
  }
}
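For intuition about what the flatMap / map / reduceByKey pipeline does, the same transformation can be tried interactively in spark-shell. This is a minimal sketch: the sample input literal is made up, and sc is the SparkContext that spark-shell provides automatically.

// Build a tiny in-memory RDD of lines, then apply the same word-count pipeline
val sample = sc.parallelize(Seq("a b a", "b c"))
sample.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()
// e.g. Array((a,2), (b,2), (c,1)) -- the ordering of the pairs is not guaranteed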
The contents of build.sbt are as follows:
name := "learning-spark-mini-example"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"
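The "provided" scope marks spark-core as supplied by the environment (the Spark installation that spark-submit runs under): it is needed for compilation but does not have to be shipped with the application. A side effect is that provided dependencies are excluded from the classpath of sbt run. If you instead wanted to launch the job directly with sbt run, a hypothetical variant of build.sbt (an assumption for illustration, not the build this post uses) would drop "provided"; the program would also need a master, e.g. setMaster("local[*]") on the SparkConf.

// Hypothetical build.sbt variant for running locally with `sbt run`
// (assumption: not the workflow used in this post, which relies on spark-submit)
name := "learning-spark-mini-example"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"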
Note that the HDFS file system is not directly visible from Linux: a directory created with hdfs dfs -mkdir, for example, cannot be found by browsing the Linux file system.
Both the input and the output of this experiment live in HDFS, so they are likewise not visible under Linux.
The detailed steps follow.
1. Before running this example, HDFS must be started, otherwise the job fails with "connection refused". The command is:
./start-dfs.sh
Then run jps and check that both the NameNode and the DataNode are running.
2. Copy the file README.txt from the Linux file system into HDFS:
hdfs dfs -mkdir /user/appleyuchi
hdfs dfs -put README.txt /user/appleyuchi
3. Run sbt package to compile the project and build the JAR.
4. Submit the job with spark-submit:
/home/appleyuchi/bigdata/spark-2.3.1-bin-hadoop2.7/bin/spark-submit \
  --class "WordCount" \
  --master local \
  /home/appleyuchi/Desktop/WordCount/target/scala-2.11/learning-spark-mini-example_2.11-1.0.jar \
  hdfs://localhost:9000/user/appleyuchi/README.txt \
  ./wordcounts
Note: hdfs://localhost:9000/user/appleyuchi/README.txt is not a path on the Linux file system; it is the path of the file after it was uploaded to HDFS in step 2. In other words, the file Spark processes here is not one we can see directly under Linux.
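To make the distinction concrete, sc.textFile accepts both kinds of URIs. A minimal sketch (the local path below is purely illustrative):

// HDFS path -- what this example actually uses
val hdfsInput = sc.textFile("hdfs://localhost:9000/user/appleyuchi/README.txt")
// A local Linux file would need an explicit file:// URI instead (illustrative path)
val localInput = sc.textFile("file:///home/appleyuchi/README.txt")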
5. Copy the results from HDFS back to the Linux file system:
hadoop fs -get /user/appleyuchi/wordcounts ~/wordcounts
cd ~/wordcounts
cat part-00000
The output (one (word,count) pair per line in the actual file) is:
(Hadoop,1) (Commodity,1) (For,1) (this,3) (country,1) (under,1) (it,1) (The,4) (Jetty,1) (Software,2) (Technology,1) (<http://www.wassenaar.org/>,1) (have,1) (http://wiki.apache.org/hadoop/,1) (BIS,1) (classified,1) (This,1) (following,1) (which,2) (security,1) (See,1) (encryption,3) (Number,1) (export,1) (reside,1) (for,3) ((BIS),,1) (any,1) (at:,2) (software,2) (makes,1) (algorithms.,1) (re-export,2) (latest,1) (your,1) (SSL,1) (the,8) (Administration,1) (includes,2) (import,,2) (provides,1) (Unrestricted,1) (country's,1) (if,1) (740.13),1) (Commerce,,1) (country,,1) (software.,2) (concerning,1) (laws,,1) (source,1) (possession,,2) (Apache,1) (our,2) (written,1) (as,1) (License,1) (regulations,1) (libraries,1) (by,1) (please,2) (form,1) (BEFORE,1) (ENC,1) (code.,1) (both,1) (5D002.C.1,,1) (distribution,2) (visit,1) (is,1) (about,1) (website,1) (currently,1) (permitted.,1) (check,1) (Security,1) (Section,1) (on,2) (performing,1) ((see,1) (U.S.,1) (with,1) (in,1) ((ECCN),1) (object,1) (using,2) (cryptographic,3) (mortbay.org.,1) (and/or,1) (Department,1) (manner,1) (from,1) (Core,1) (has,1) (may,1) (Exception,1) (Industry,1) (restrictions,1) (details,1) (http://hadoop.apache.org/core/,1) (project,1) (you,1) (another,1) (or,2) (use,,2) (policies,1) (uses,1) (information,2) (Hadoop,,1) (to,2) (code,1) (software,,2) (Regulations,,1) (more,2) (software:,1) (see,1) (,18) (of,5) (wiki,,1) (Bureau,1) (Control,1) (exception,1) (Government,1) (eligible,1) (Export,2) (information.,1) (Foundation,1) (functions,1) (and,6) (included,1) ((TSU),1) (asymmetric,1)
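The output lands in one or more part-NNNNN files in no particular order. If a single, count-sorted output file were wanted, a hypothetical variant of the last lines of WordCount.scala (an illustration, not what this post runs) could sort and coalesce before saving:

// Hypothetical variant: sort by descending count and write a single part file
val sorted = counts.sortBy(_._2, ascending = false) // order pairs by count, largest first
sorted.coalesce(1).saveAsTextFile(outputFile)       // merge into one partition -> one part-00000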
You can also view the results directly in HDFS with the following command:
hdfs dfs -cat /user/appleyuchi/wordcounts/*
-------------------------------------------------------------
Other commands used:
hdfs dfs -rmr input   (deletes the input directory; a relative path like this resolves under the user's HDFS home directory, /user/<username>)
hdfs dfs -ls
To summarize: first start HDFS, then upload the input file README.txt into HDFS; before running, resolve the dependencies and build the JAR with sbt; then submit and run the code. The results initially live in HDFS, so to inspect them we copy them from HDFS back to the Linux file system, and only then are we done.
Reference:
https://blog.csdn.net/coder__cs/article/details/78992764