Spark Streaming Programming Guide (5): Transformations on DStreams

阿新 • Published: 2019-01-12

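As with RDDs, transformations allow the data from an input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs. Some of the common ones are as follows: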

Transformation	Meaning
map(func)	Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items.
filter(func)	Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions)	Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream)	Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count()	Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func)	Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
countByValue()	When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks])	When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property `spark.default.parallelism`) to do the grouping. You can pass an optional `numTasks` argument to set a different number of tasks.
join(otherStream, [numTasks])	When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks])	When called on DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func)	Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func)	Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
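
The per-batch nature of these transformations can be illustrated without Spark at all. The following is a minimal plain-Scala sketch, not Spark API: it assumes a DStream can be modeled as a sequence of batches, and the object name PerBatchWordCount and its method are hypothetical names chosen for this illustration. Each step mirrors one row of the table above.

```scala
object PerBatchWordCount {
  // Model (assumption): a DStream is a sequence of batches, and every
  // transformation is applied to each batch independently of the others.
  def wordCounts(batches: Seq[Seq[String]]): Seq[Map[String, Int]] =
    batches.map { lines =>
      lines
        .flatMap(_.split(" ").toSeq)   // flatMap: one line -> zero or more words
        .filter(_.nonEmpty)            // filter: keep only non-empty tokens
        .map(word => (word, 1))        // map: word -> (word, 1)
        .groupBy(_._1)                 // reduceByKey(_ + _), modeled as groupBy + sum
        .map { case (word, pairs) => word -> pairs.map(_._2).sum }
    }
}
```

Calling PerBatchWordCount.wordCounts on two batches returns two independent count maps, one per batch; nothing is carried over between batches, which is exactly the limitation that updateStateByKey and the window operations address.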

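A few of these transformations are worth discussing in more detail.

The updateStateByKey Operation

The updateStateByKey operation lets you maintain arbitrary state while continuously updating it with new information. To use it, you need two steps:

1. Define the state: the state can be of an arbitrary data type.

2. Define the state update function: specify with a function how to update the state using the previous state and the new values from the input stream.

Let's illustrate this with an example. Say you want to maintain a running count of each word seen in a text data stream. Here, the running count is the state, and it is an integer. We define the update function as: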

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = ...  // add the new values with the previous running count to get the new count
    Some(newCount)
}

This function is applied to a DStream containing words (say, the pairs DStream from the example in section 2, which contains (word, 1) pairs).

val runningCounts = pairs.updateStateByKey[Int](updateFunction _)

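The update function is called for each word, with newValues being a sequence of 1s (from the (word, 1) pairs) and runningCount holding the previous count. For the complete Scala code, see the example StatefulNetworkWordCount.scala.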
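
The update function above leaves the computation of the new count elided. One plausible way to fill it in, following the comment in that snippet, is to sum the new values and add the previous count. The sketch below is plain Scala rather than Spark code; the names RunningCount, updateRunningCount, and applyBatch are hypothetical, and applyBatch is only a simplified model of how state is carried from one batch to the next.

```scala
object RunningCount {
  // Hypothetical completion of the elided line: sum this batch's values
  // and add them to the previous count (0 when the key has no state yet).
  def updateRunningCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = newValues.sum + runningCount.getOrElse(0)
    Some(newCount)
  }

  // Simplified model of one batch: the update function runs for every key
  // that has either existing state or new values in this batch.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateRunningCount(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }
}
```

Folding applyBatch over successive batches gives the running count per word. Because the update function returns an Option, returning None instead of Some would drop the key from the state in this model.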

Note: using updateStateByKey requires a checkpoint directory to be configured.

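The transform Operation

The transform operation allows an arbitrary RDD-to-RDD function to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, the ability to join every batch in a data stream with another dataset is not directly exposed in the DStream API. However, you can easily do this with transform, which enables very powerful possibilities. For instance, you can do real-time data cleaning by joining the input data stream with precomputed spam information and then filtering based on it.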

val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...) // RDD containing spam information

val cleanedDStream = wordCounts.transform(rdd => {
  rdd.join(spamInfoRDD).filter(...) // join data stream with spam information to do data cleaning
  ...
})

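In fact, you can also use machine learning and graph computation algorithms inside the transform method.

Window Operations

Spark Streaming also provides windowed computations, which let you apply transformations over a sliding window of data. The following figure illustrates this sliding window.

As the figure shows, every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In this case, the operation is applied over the last 3 time units of data and slides by 2 time units. Any window operation needs to specify two parameters:

1. Window length: the duration of the window (3 in the figure).

2. Sliding interval: the interval at which the window operation is performed (2 in the figure).

These two parameters must be multiples of the batch interval of the source DStream (1 in the figure).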
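
To make the two parameters concrete, here is a small plain-Scala model of a sliding window, not Spark API. It assumes a batch interval of 1 time unit, expresses the window length and sliding interval as numbers of batches, and assumes the window fires at multiples of the sliding interval; WindowModel and windows are hypothetical names for this illustration.

```scala
object WindowModel {
  // batches(i) holds the records of the batch ending at time i + 1.
  // Returns, for each time at which the window fires, the records inside it.
  def windows[A](batches: Vector[Seq[A]],
                 windowLength: Int,
                 slideInterval: Int): Vector[(Int, Seq[A])] = {
    require(windowLength > 0 && slideInterval > 0, "both parameters must be positive")
    (slideInterval to batches.length by slideInterval).toVector.map { end =>
      val start = math.max(0, end - windowLength) // early windows are only partly filled
      end -> batches.slice(start, end).flatten
    }
  }
}
```

With a window length of 3 and a sliding interval of 2, consecutive windows overlap by one batch, so a record can contribute to more than one output.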

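Let's illustrate window operations with an example. Extending the earlier word count example, suppose we want to generate word counts over the last 30 seconds of data, every 10 seconds. To do this, we apply the reduceByKey operation to the pairs DStream of (word, 1) pairs over the last 30 seconds of data. This is done using the operation reduceByKeyAndWindow.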

// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
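
What reduceByKeyAndWindow computes for a single window can be sketched in plain Scala. This is an illustrative model, not Spark API: it assumes the batches that fall inside the window have already been collected (for example, three 10-second batches for a 30-second window), and WindowedWordCount and reduceByKeyInWindow are hypothetical names.

```scala
object WindowedWordCount {
  // Reduce the values of each key over all batches inside one window.
  def reduceByKeyInWindow(windowBatches: Seq[Seq[(String, Int)]])
                         (reduce: (Int, Int) => Int): Map[String, Int] =
    windowBatches.flatten
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).reduce(reduce) }
}
```

Passing `_ + _` as the reduce function gives the word count over the whole window, which matches the role of `(a: Int, b: Int) => (a + b)` in the snippet above.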

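Some of the common window operations are as follows. All of them take the two parameters described above, windowLength and slideInterval.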

Transformation	Meaning
window(windowLength, slideInterval)	Return a new DStream which is computed based on windowed batches of the source DStream.
countByWindow(windowLength, slideInterval)	Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval)	Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])	When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property `spark.default.parallelism`) to do the grouping. You can pass an optional `numTasks` argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])	A more efficient version of the above `reduceByKeyAndWindow()` where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enter the sliding window, and "inverse reducing" the old data that leave the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable to only "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in `reduceByKeyAndWindow`, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
countByValueAndWindow(windowLength, slideInterval, [numTasks])	When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in `reduceByKeyAndWindow`, the number of reduce tasks is configurable through an optional argument.
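
The incremental variant with invFunc is easiest to understand with counts: the new window equals the previous window, plus the batches that enter, minus the batches that leave. The following plain-Scala sketch models that idea; it is not Spark's implementation, the names IncrementalWindow, countBatch, and slide are hypothetical, and dropping keys whose count reaches zero is a choice made for this sketch.

```scala
object IncrementalWindow {
  type Counts = Map[String, Int]

  // Count the words of a single batch.
  def countBatch(batch: Seq[String]): Counts =
    batch.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Combine two count maps key by key; keys whose result is 0 are dropped.
  private def merge(a: Counts, b: Counts)(f: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet)
      .map(k => k -> f(a.getOrElse(k, 0), b.getOrElse(k, 0)))
      .filter(_._2 != 0)
      .toMap

  // New window = previous window + entering batches - leaving batches.
  def slide(previous: Counts,
            entering: Seq[Seq[String]],
            leaving: Seq[Seq[String]]): Counts = {
    val added = entering.map(countBatch).foldLeft(previous)((acc, c) => merge(acc, c)(_ + _))
    leaving.map(countBatch).foldLeft(added)((acc, c) => merge(acc, c)(_ - _)) // "inverse reduce"
  }
}
```

The result of slide matches a full recomputation over the batches currently in the window, but it only touches the batches that enter and leave. This works because subtraction undoes addition; a reduce function such as max has no such inverse and cannot be used this way.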

Join Operations

Finally, it is worth highlighting how easily you can perform different kinds of joins in Spark Streaming.

Stream-stream joins

Streams can be joined with other streams very easily:

val stream1: DStream[(String, String)] = ...
val stream2: DStream[(String, String)] = ...
val joinedStream = stream1.join(stream2)

Here, in each batch interval, the RDD generated by stream1 is joined with the RDD generated by stream2. You can also use leftOuterJoin, rightOuterJoin, and fullOuterJoin.
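
The difference between an inner join and an outer join on one batch can be shown with plain Scala collections. This is a sketch of the join semantics only, not Spark API; BatchJoin and its methods are hypothetical names, and each Seq stands in for the key-value records of one batch.

```scala
object BatchJoin {
  // Inner join: one output per (v, w) combination for keys present on both sides.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    left.flatMap { case (k, v) =>
      right.collect { case (`k`, w) => (k, (v, w)) }
    }

  // Left outer join: every left record is kept; the right side is None
  // when the key has no match in the other batch.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches = right.collect { case (`k`, w) => w }
      val rights: Seq[Option[W]] = if (matches.isEmpty) Seq(None) else matches.map(Some(_))
      rights.map(r => (k, (v, r)))
    }
}
```

A key that appears only in the left batch disappears from the inner join but survives in the left outer join with None on the right. rightOuterJoin and fullOuterJoin follow the same pattern with the optional side swapped or applied to both sides.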

Furthermore, it is often very useful to join over windows of the streams. That is easy as well:

val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)

Stream-dataset joins

This was already shown earlier when explaining DStream.transform. Here is another example, joining a windowed stream with a dataset:

val dataset: RDD[(String, String)] = ...
val windowedStream = stream.window(Seconds(20))...
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }

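In fact, you can also dynamically change the dataset you join against. The function passed to transform is evaluated every batch interval, so it uses whatever dataset the dataset reference points to at that time.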
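
The reason the dataset can change between batches is that the function is evaluated anew for every batch and reads the reference at that moment. The following plain-Scala sketch models this behavior; it is not Spark code, DynamicDataset and joinBatch are hypothetical names, and a Map with unique keys stands in for the dataset RDD.

```scala
object DynamicDataset {
  // Hypothetical mutable reference to the dataset being joined against.
  @volatile var dataset: Map[String, String] = Map.empty

  // Plays the role of the function passed to transform: it looks up
  // `dataset` each time it is called, i.e. once per batch.
  def joinBatch(batch: Seq[(String, Int)]): Seq[(String, (Int, String))] =
    batch.flatMap { case (k, v) => dataset.get(k).map(d => (k, (v, d))) }
}
```

Reassigning DynamicDataset.dataset between two calls to joinBatch changes the result of the second call without redefining the function, which mirrors how a later batch picks up a newer dataset.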
