Window Operations in Spark Streaming
阿新 · Published: 2019-01-07
A window operation in Spark Streaming can be understood as periodically processing the data that arrived within a certain span of time.
Rather than belabor it in words, here is the principle in a diagram; a picture is worth a thousand words.
As the figure shows:
1. The red rectangle is a window; a window holds the slice of the data stream that falls within a span of time.
2. Each "time" in the figure is one time unit. In the official example, the window size is 3 time units, and the window slides once every 2 time units.
So a window-based operation needs two parameters:
- window length - The duration of the window (3 in the figure)
- slide interval - The interval at which the window-based operation is performed (2 in the figure).
```scala
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
```
Here `pairs` is a mapped DStream of (word, 1) tuples. reduceByKeyAndWindow aggregates the values by key and accumulates them, much like reduceByKey on an RDD: it applies the reduce function to the values of each key, except that here it runs over every batch inside the window. A fuller sketch follows, and after it the API is pasted for reference.
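To make the snippet self-contained, here is a minimal sketch of the surrounding word-count program. The local[2] master, the 10-second batch interval, and the socket source on localhost:9999 are assumptions for illustration; any DStream of (word, 1) pairs works the same way.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    // Batch interval of 10 seconds; window length and slide must be multiples of it.
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Placeholder source: lines of text arriving on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Reduce the last 30 seconds of data, every 10 seconds.
    val windowedWordCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedWordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```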
Transformation | Meaning |
---|---|
window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel. |
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: by default, this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window and "inverse reducing" the old data that leaves the window. An example would be "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, reduce functions which have a corresponding "inverse reduce" function (taken as the parameter invFunc). As in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. As in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
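The incremental variant in the table is worth a concrete sketch. It reuses `ssc` and `pairs` from the earlier example; the inverse function subtracts the counts of batches leaving the window, and checkpointing must be enabled because the operation carries state across windows (the checkpoint path here is a placeholder):

```scala
// Required for the invFunc variant: state is tracked across windows.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // placeholder path

val incrementalCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // reduce: fold in batches entering the window
  (a: Int, b: Int) => a - b, // inverse reduce: remove batches leaving the window
  Seconds(30),               // window length
  Seconds(10)                // slide interval
)
```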
Output Operations
When an output operator is called, it triggers the computation of a stream. Currently the following output operators are defined:
Output Operation | Meaning |
---|---|
print() | Prints the first ten elements of every batch of data in a DStream on the driver. |
foreachRDD(func) | The fundamental output operator. Applies a function, func, to each RDD generated from the stream. This function should have side effects, such as printing output, saving the RDD to external files, or writing it over the network to an external system. |
saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
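As a usage note on foreachRDD, the most common pattern is to write each batch out to an external system, opening one connection per partition rather than per record. A rough sketch, with a println standing in for the actual write:

```scala
// Each batch's RDD is handed to the function; the per-record work runs on executors.
windowedWordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // In a real job, open one connection per partition here and reuse it.
    partition.foreach { case (word, count) =>
      println(s"$word -> $count") // stand-in for a write to an external system
    }
  }
}
```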