
Watermarking in Spark Structured Streaming

What is a Watermark?

Watermarking is a useful method that helps a Stream Processing Engine deal with late data. Essentially, a watermark is a threshold that specifies how long the system waits for late events. If an arriving event lies within the watermark, it is used to update a query. If it is older than the watermark, it is dropped and not processed further by the Streaming Engine.


But Why Should I Care?

In distributed and networked systems, there is always a chance for disruption: nodes going down, sensors losing connection, and so on. Because of that, it is not guaranteed that data will arrive at a Stream Processing Engine in the order in which it was created. For the sake of fault tolerance, it is therefore necessary to handle such out-of-order data.

To deal with this problem, the state of an aggregate must be preserved. When a late event occurs, the query can then be reprocessed. But this means the state of all aggregates has to be kept indefinitely, which causes the memory usage to grow indefinitely too. That is not practical in a real-world scenario, unless the system has unlimited resources (and an unlimited budget). Watermarking is therefore a useful concept to constrain the system by design and prevent it from exploding at runtime.

How to use it?

Watermarking was introduced into the Structured Streaming API in Spark 2.1. You can enable it by simply adding the withWatermark operator to a query:

withWatermark(eventTime: String, delayThreshold: String): Dataset[T]
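
To make this concrete, here is a minimal sketch of a query that combines withWatermark with a windowed aggregation. The socket source, the column names, and the 10-minute/5-minute values are illustrative assumptions, not prescribed by the API:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder()
  .appName("WatermarkingExample")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Illustrative source: a socket emitting lines like "2021-01-01 12:07:42,sensor-1"
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    val Array(ts, id) = line.split(",")
    (Timestamp.valueOf(ts), id)
  }
  .toDF("eventTime", "sensorId")

// Accept events that are at most 10 minutes late, then count per 5-minute window
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("sensorId"))
  .count()

val query = counts.writeStream
  .outputMode("update") // late rows within the watermark update existing result rows
  .format("console")
  .start()

query.awaitTermination()

In update mode, windows that are still within the watermark are re-emitted when late data arrives for them; once the watermark passes a window, its state is dropped and further events for it are ignored.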

It takes two parameters: a) an event time column (which must be the same column the aggregate works on) and b) a threshold specifying how long late data should be accepted (in event-time units). Spark will then maintain the state of an aggregate until max eventTime - delayThreshold > T, where max eventTime is the latest event time seen by the engine and T is the end time of a window. If late data falls within this threshold, the query eventually gets updated (right image in the figure below). Otherwise the data is dropped and no reprocessing is triggered (left image in the figure below).
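
As a worked example (the numbers are purely illustrative): with a delayThreshold of 10 minutes and 5-minute windows, once the engine has seen an event with eventTime 12:33, the watermark becomes 12:23. The state of the window 12:15–12:20 can then be dropped, since 12:33 - 10 minutes > 12:20, and any event for that window arriving afterwards is ignored. The window 12:20–12:25, on the other hand, is still within the watermark, so a late event for it would still update the result.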