Spark Streaming in Practice (3): DStream Window Operations

阿新 • Published: 2019-02-15

Contents of this section

Window Operation
Introductory examples

1. Window Operation

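Spark Streaming provides window operations, as illustrated in the figure below:
[Figure: window operation on a DStream]
In the figure, the solid red line marks the window's current position and the dashed line marks its previous position. Each time the window slides, the RDDs that fall inside it are processed together to produce one RDD of the windowed DStream. A window operation takes two parameters:
(1) Window length: the duration of the window (3 in the figure).
(2) Sliding interval: the interval at which the window operation is performed (2 in the figure).
Both parameters must be integer multiples of the batch interval of the source DStream (the batch interval in the figure is 1).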
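The constraint on the two parameters can be checked with a few lines of plain Scala. The sketch below is not Spark code: `WindowLayout`, `isValid` and `batchesInWindow` are hypothetical names, time is measured in abstract units, and the window end times 3 and 5 are assumed for illustration.

```scala
object WindowLayout {
  /** Both durations must be whole multiples of the batch interval. */
  def isValid(batch: Int, windowLength: Int, slide: Int): Boolean =
    batch > 0 && windowLength > 0 && slide > 0 &&
      windowLength % batch == 0 && slide % batch == 0

  /** For a window that ends at `end`, lists the end times of the batches it covers. */
  def batchesInWindow(end: Int, batch: Int, windowLength: Int): Seq[Int] =
    ((end - windowLength + batch) to end by batch).filter(_ > 0)

  def main(args: Array[String]): Unit = {
    // Values from the text: batch interval 1, window length 3, sliding interval 2.
    val (batch, windowLength, slide) = (1, 3, 2)
    require(isValid(batch, windowLength, slide), "durations must be multiples of the batch interval")
    // Two consecutive windows, one sliding interval apart (assumed end times).
    for (end <- Seq(3, 3 + slide))
      println(s"window ending at $end covers batches ${batchesInWindow(end, batch, windowLength).mkString(", ")}")
  }
}
```

With a window length of 3 and a sliding interval of 2, consecutive windows share one batch, which is why the same input can be counted in more than one window.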

2. Introductory examples

1 WindowWordCount: using the reduceByKeyAndWindow method

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
// StreamingExamples is a helper from the Spark examples module; it must be on the classpath
import org.apache.spark.examples.streaming.StreamingExamples

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    // Example arguments: localhost 9999 30 10
    if (args.length != 4) {
      System.err.println("Usage: WindowWordCount <hostname> <port> <windowDuration> <slideDuration>")
      System.exit(1)
    }
    // Reduce log noise so the streaming output is easier to read
    StreamingExamples.setStreamingLogLevels()

    val conf = new SparkConf().setAppName("WindowWordCount").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Create the StreamingContext with a batch interval of 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))

    // Use a socket as the data source
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_ONLY_SER)

    // Split each line into words
    val words = lines.flatMap(_.split(" "))

    // Window operation: count the words that fall inside each window.
    // windowDuration and slideDuration are in seconds and must be multiples of the batch interval.
    val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(args(2).toInt), Seconds(args(3).toInt))

    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Start a netcat server with the following command:

root@sparkmaster:~# nc -lk 9999

Then run WindowWordCount and type the following sentence into the netcat session:

root@sparkmaster:~# nc -lk 9999
Spark is a fast and general cluster computing system for Big Data. It provides

Observe the output:

-------------------------------------------
Time: 1448778805000 ms (10 s: time of the first sliding window)
-------------------------------------------
(provides,1)
(is,1)
(general,1)
(Big,1)
(fast,1)
(cluster,1)
(Data.,1)
(computing,1)
(Spark,1)
(a,1)
...

-------------------------------------------
Time: 1448778815000 ms (10 s later: time of the second sliding window)
-------------------------------------------
(provides,1)
(is,1)
(general,1)
(Big,1)
(fast,1)
(cluster,1)
(Data.,1)
(computing,1)
(Spark,1)
(a,1)
...

-------------------------------------------
Time: 1448778825000 ms (10 s later: time of the third sliding window)
-------------------------------------------
(provides,1)
(is,1)
(general,1)
(Big,1)
(fast,1)
(cluster,1)
(Data.,1)
(computing,1)
(Spark,1)
(a,1)
...

-------------------------------------------
Time: 1448778835000 ms (another 10 s later: the sentence is now older than the window length and is no longer counted)
-------------------------------------------

-------------------------------------------
Time: 1448778845000 ms
-------------------------------------------

Now enter the same sentence twice more:

root@sparkmaster:~# nc -lk 9999
Spark is a fast and general cluster computing system for Big Data. It provides
Spark is a fast and general cluster computing system for Big Data. It provides
Spark is a fast and general cluster computing system for Big Data. It provides

The output is as follows:

-------------------------------------------
Time: 1448779205000 ms
-------------------------------------------
(provides,2)
(is,2)
(general,2)
(Big,2)
(fast,2)
(cluster,2)
(Data.,2)
(computing,2)
(Spark,2)
(a,2)
...

Enter it once more:

root@sparkmaster:~# nc -lk 9999
Spark is a fast and general cluster computing system for Big Data. It provides
Spark is a fast and general cluster computing system for Big Data. It provides
Spark is a fast and general cluster computing system for Big Data. It provides
Spark is a fast and general cluster computing system for Big Data. It provides

The result is as follows:


-------------------------------------------
Time: 1448779215000 ms
-------------------------------------------
(provides,3)
(is,3)
(general,3)
(Big,3)
(fast,3)
(cluster,3)
(Data.,3)
(computing,3)
(Spark,3)
(a,3)
...

Enter it once more:

root@sparkmaster:~# nc -lk 9999
Spark is a fast and general cluster computing system for Big Data. It provides
Spark is a fast and general cluster computing system for Big Data. It provides
Spark is a fast and general cluster computing system for Big Data. It provides
Spark is a fast and general cluster computing system for Big Data. It provides
Spark is a fast and general cluster computing system for Big Data. It provides

The result is as follows:


-------------------------------------------
Time: 1448779225000 ms
-------------------------------------------
(provides,4)
(is,4)
(general,4)
(Big,4)
(fast,4)
(cluster,4)
(Data.,4)
(computing,4)
(Spark,4)
(a,4)
...

-------------------------------------------
Time: 1448779235000 ms
-------------------------------------------
(provides,2)
(is,2)
(general,2)
(Big,2)
(fast,2)
(cluster,2)
(Data.,2)
(computing,2)
(Spark,2)
(a,2)
...

-------------------------------------------
Time: 1448779245000 ms
-------------------------------------------
(provides,1)
(is,1)
(general,1)
(Big,1)
(fast,1)
(cluster,1)
(Data.,1)
(computing,1)
(Spark,1)
(a,1)
...

-------------------------------------------
Time: 1448779255000 ms
-------------------------------------------

-------------------------------------------
Time: 1448779265000 ms
-------------------------------------------

2 WindowWordCount: using the countByWindow method


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
// StreamingExamples is a helper from the Spark examples module; it must be on the classpath
import org.apache.spark.examples.streaming.StreamingExamples

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      System.err.println("Usage: WindowWordCount <hostname> <port> <windowDuration> <slideDuration>")
      System.exit(1)
    }
    StreamingExamples.setStreamingLogLevels()

    val conf = new SparkConf().setAppName("WindowWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Create the StreamingContext with a batch interval of 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))
    // Use the current directory as the checkpoint directory.
    // countByWindow is built on an inverse-reduce window operation, which requires checkpointing.
    ssc.checkpoint(".")

    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_ONLY_SER)
    val words = lines.flatMap(_.split(" "))

    // countByWindow returns the number of elements of the DStream that fall inside the sliding window
    val countByWindow = words.countByWindow(Seconds(args(2).toInt), Seconds(args(3).toInt))

    countByWindow.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Start the netcat server:

root@sparkmaster:~# nc -lk 9999

Then run WindowWordCount and enter:

root@sparkmaster:~# nc -lk 9999
Spark is a fast and general cluster computing system for Big Data

Check the output:

-------------------------------------------
Time: 1448780625000 ms
-------------------------------------------
0

-------------------------------------------
Time: 1448780635000 ms
-------------------------------------------
12

-------------------------------------------
Time: 1448780645000 ms
-------------------------------------------
12

-------------------------------------------
Time: 1448780655000 ms
-------------------------------------------
12

-------------------------------------------
Time: 1448780665000 ms
-------------------------------------------
0

-------------------------------------------
Time: 1448780675000 ms
-------------------------------------------
0

3 WindowWordCount: using the reduceByWindow method

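// reduceByWindow aggregates the elements of the source DStream over a sliding window
// and returns a new DStream whose RDDs each contain a single element.
val reduceByWindow = words.map(_ => 1).reduceByWindow((a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b, Seconds(args(2).toInt), Seconds(args(3).toInt))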

The example above is in fact how countByWindow itself is implemented, as the countByWindow source code confirms:

def countByWindow(
      windowDuration: Duration,
      slideDuration: Duration): DStream[Long] = ssc.withScope {
    this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
  }

reduceByWindow, in turn, is implemented on top of reduceByKeyAndWindow:

def reduceByWindow(
      reduceFunc: (T, T) => T,
      invReduceFunc: (T, T) => T,
      windowDuration: Duration,
      slideDuration: Duration
    ): DStream[T] = ssc.withScope {
      this.map(x => (1, x))
          .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
          .map(_._2)
  }
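The delegation chain in the two source excerpts can be mimicked on an ordinary collection that stands in for the contents of a single window. This is only a sketch of the idea, not Spark's implementation: `ReduceByWindowChain` and its three helpers are hypothetical, and the inverse-reduce function and the state kept between windows are deliberately left out.

```scala
object ReduceByWindowChain {
  /** Stand-in for reduceByKeyAndWindow on one window: reduce the values of each key. */
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  /** Mirrors reduceByWindow: put every element under the same key, reduce, drop the key. */
  def reduceWindow[T](elems: Seq[T])(f: (T, T) => T): Seq[T] =
    reduceByKey(elems.map(x => (1, x)))(f).values.toSeq

  /** Mirrors countByWindow: map every element to 1L, then sum the ones. */
  def countWindow[T](elems: Seq[T]): Seq[Long] =
    reduceWindow(elems.map(_ => 1L))(_ + _)

  def main(args: Array[String]): Unit = {
    val words = "Spark is a fast and general cluster computing system for Big Data".split(" ").toSeq
    println(countWindow(words))
  }
}
```

An empty window yields an empty result in this sketch; it does not model the incremental state that Spark carries from one window to the next.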

Unlike the reduceByKeyAndWindow call in the earlier example, this overload takes an additional invReduceFunc parameter. Its complete source is:

 /**
   * Return a new DStream by applying incremental `reduceByKey` over a sliding window.
   * The reduced value of over a new window is calculated using the old window's reduced value :
   *  1. reduce the new values that entered the window (e.g., adding new counts)
   *
   *  2. "inverse reduce" the old values that left the window (e.g., subtracting old counts)
   *
   * This is more efficient than reduceByKeyAndWindow without "inverse reduce" function.
   * However, it is applicable to only "invertible reduce functions".
   * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
   * @param reduceFunc associative reduce function
   * @param invReduceFunc inverse reduce function
   * @param windowDuration width of the window; must be a multiple of this DStream's
   *                       batching interval
   * @param slideDuration  sliding interval of the window (i.e., the interval after which
   *                       the new DStream will generate RDDs); must be a multiple of this
   *                       DStream's batching interval
   * @param filterFunc     Optional function to filter expired key-value pairs;
   *                       only pairs that satisfy the function are retained
   */
  def reduceByKeyAndWindow(
      reduceFunc: (V, V) => V,
      invReduceFunc: (V, V) => V,
      windowDuration: Duration,
      slideDuration: Duration = self.slideDuration,
      numPartitions: Int = ssc.sc.defaultParallelism,
      filterFunc: ((K, V)) => Boolean = null
    ): DStream[(K, V)] = ssc.withScope {
    reduceByKeyAndWindow(
      reduceFunc, invReduceFunc, windowDuration,
      slideDuration, defaultPartitioner(numPartitions), filterFunc
    )
  }
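The Scaladoc above describes filterFunc as an optional filter for expired key-value pairs. The sketch below models a single slide of the incremental update with plain Scala maps to illustrate why such a filter can matter; it is not Spark code, and `IncrementalKeyedWindow`, `slide` and the sample counts are assumptions made for illustration.

```scala
object IncrementalKeyedWindow {
  type Counts = Map[String, Int]

  /** Combines `b` into `a` key by key using `f`; missing keys start from 0. */
  private def merge(a: Counts, b: Counts)(f: (Int, Int) => Int): Counts =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, f(acc.getOrElse(k, 0), v)) }

  /** One slide: add the counts that entered, subtract the counts that left, then optionally filter. */
  def slide(previous: Counts, entered: Counts, left: Counts,
            filterFunc: Option[((String, Int)) => Boolean] = None): Counts = {
    val next = merge(merge(previous, entered)(_ + _), left)(_ - _)
    filterFunc.fold(next)(keep => next.filter(keep))
  }

  def main(args: Array[String]): Unit = {
    val previous = Map("Spark" -> 1)
    val left     = Map("Spark" -> 1) // the only occurrence has just left the window
    println(slide(previous, Map.empty, left))
    println(slide(previous, Map.empty, left, Some((kv: (String, Int)) => kv._2 > 0)))
  }
}
```

In this model a key whose count has been subtracted down to zero stays in the result unless the filter drops it, which is the kind of expired pair the filterFunc parameter is documented to remove.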

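The two calls below produce the same result and differ only in efficiency; the second is more efficient:

// Uses the past 5 seconds as the window and computes the word count every second. This call counts the words
// in each second of the past 5 seconds and then adds those counts up to get the word count of the window.
// This is called the additive approach (left side of the figure below).
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5), Seconds(1))

and

// To get the word count of the 5-second window ending at t+4, take the count of the 5-second window ending at t+3,
// add the count for [t+3, t+4], and subtract the count for [t-2, t-1]. This reuses the counts for the interval
// [t-1, t+3] that the two windows share, which makes the computation cheaper. This is called the incremental
// approach (right side of the figure below). It requires checkpointing to be enabled with ssc.checkpoint.
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b, Seconds(5), Seconds(1))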

[Figure: additive approach (left) and incremental approach (right)]
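That the additive and incremental approaches agree can be checked on a plain sequence of per-second counts. The following is a minimal sketch outside Spark; `SlidingSum`, its two methods and the sample counts are hypothetical, and a single word is tracked to keep the arithmetic visible.

```scala
object SlidingSum {
  /** Additive approach: re-sum every value inside the window for each position. */
  def additive(perSecond: Vector[Int], window: Int): Vector[Int] =
    perSecond.indices.toVector.map { i =>
      perSecond.slice(math.max(0, i - window + 1), i + 1).sum
    }

  /** Incremental approach: previous result + value entering - value leaving. */
  def incremental(perSecond: Vector[Int], window: Int): Vector[Int] =
    perSecond.indices.foldLeft(Vector.empty[Int]) { (acc, i) =>
      val previous = acc.lastOption.getOrElse(0)
      val leaving  = if (i - window >= 0) perSecond(i - window) else 0
      acc :+ (previous + perSecond(i) - leaving)
    }

  def main(args: Array[String]): Unit = {
    val counts = Vector(2, 0, 1, 3, 0, 4, 1) // hypothetical per-second counts of one word
    println(additive(counts, 5))    // Vector(2, 2, 3, 6, 6, 8, 9)
    println(incremental(counts, 5)) // Vector(2, 2, 3, 6, 6, 8, 9)
  }
}
```

The additive version touches up to five values per window, while the incremental version touches two regardless of the window length. The price is that the reduce function must be invertible (addition here, undone by subtraction).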

The full set of window operations supported by DStream is listed below:
[Figure: table of DStream window operations]
