Window Operations in Spark Streaming
阿新 · Published: 2019-01-07
A window operation in Spark Streaming can be understood as periodically processing the data that arrived within a certain span of time.
Rather than belabor it in words, here is the principle in a diagram; a picture is worth a thousand words.
As the figure shows:
1. The red rectangle is a window; a window holds the slice of the data stream that falls within a span of time.
2. Each "time" in the figure is one time unit. In the official example, the window size is 3 time units, and the window slides once every 2 time units.
So a window-based operation needs two parameters:
- window length - The duration of the window (3 in the figure)
- slide interval - The interval at which the window-based operation is performed (2 in the figure).
```scala
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
```
Here `pairs` is a mapped DStream of (word, 1) tuples. reduceByKeyAndWindow aggregates the values by key and accumulates them, much like reduceByKey on an RDD: it applies the reduce function to the values of each key, except that here it runs over every batch inside the window. A fuller sketch follows, and after it the API is pasted for reference.
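To make the snippet self-contained, here is a minimal sketch of the surrounding word-count program. The local[2] master, the 10-second batch interval, and the socket source on localhost:9999 are assumptions for illustration; any DStream of (word, 1) pairs works the same way.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    // Batch interval of 10 seconds; window length and slide must be multiples of it.
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Placeholder source: lines of text arriving on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Reduce the last 30 seconds of data, every 10 seconds.
    val windowedWordCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedWordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```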
Transformation | Meaning |
---|---|
window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel. |
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: by default, this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window and "inverse reducing" the old data that leaves the window. An example would be "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, reduce functions which have a corresponding "inverse reduce" function (taken as the parameter invFunc). As in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. As in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
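The incremental variant in the table is worth a concrete sketch. It reuses `ssc` and `pairs` from the earlier example; the inverse function subtracts the counts of batches leaving the window, and checkpointing must be enabled because the operation carries state across windows (the checkpoint path here is a placeholder):

```scala
// Required for the invFunc variant: state is tracked across windows.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // placeholder path

val incrementalCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // reduce: fold in batches entering the window
  (a: Int, b: Int) => a - b, // inverse reduce: remove batches leaving the window
  Seconds(30),               // window length
  Seconds(10)                // slide interval
)
```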
Output Operations
When an output operator is called, it triggers the computation of a stream. Currently the following output operators are defined:
Output Operation | Meaning |
---|---|
print() | Prints the first ten elements of every batch of data in a DStream on the driver. |
foreachRDD(func) | The fundamental output operator. Applies a function, func, to each RDD generated from the stream. This function should have side effects, such as printing output, saving the RDD to external files, or writing it over the network to an external system. |
saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
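As a usage note on foreachRDD, the most common pattern is to write each batch out to an external system, opening one connection per partition rather than per record. A rough sketch, with a println standing in for the actual write:

```scala
// Each batch's RDD is handed to the function; the per-record work runs on executors.
windowedWordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // In a real job, open one connection per partition here and reuse it.
    partition.foreach { case (word, count) =>
      println(s"$word -> $count") // stand-in for a write to an external system
    }
  }
}
```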