Apache Flink® 入門介紹以及程式設計模型

阿新 • • 發佈：2019-02-05

在業餘時間經常來Flink官網查閱文件，零零散散的看不成系統，最近打算系統的做一下筆記。有官網主頁我們便知道Flink是什麼？

Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.

Flink是一個開源的分散式，高效能的，總是可用的，精確的流處理框架。

Introduction to Apache Flink®

下面是Flink和流處理的一個巨集觀概述，我們先看一下關於Flink的一個概念：

Dataflow Programming Model (資料流程式設計模型)

Levels of Abstraction(抽象層次)

flink支援不同層次的抽象開發流處理或批處理

狀態流處理：底層介面，構建處理的資料Block Core APIs：資料流/資料集的處理介面 Table API：宣告特定域指定語言 SQL：高層次的語言

Programs and Dataflows

Flink構建程式塊的基本方式就是使用streams和transformations（注意：DataSet在內部實現也是使用的streams的API）。

概念上講stream就是一個流資料記錄，transformation就是將一個或者多個stream作為輸入進行處理的操作運算元，然後會產生一個或者多個輸出streams作為結果。當程式執行，Flink程式被對映成streaming dataflows，該streaming dataflows由stream和transformation運算元組成。每一個dataflow由一個或者多個輸入源並且有一個或者多個輸出。dataflows類似於DAG(directed acyclic graphs).

通常在Flink程式中，一個tramsformation和一個dataflow的操作符是一一對應的，但是有時候一個transformation可能包含多個transformation。

Parallel Dataflows

Flink程式必定是並行分散式執行，執行期間，一個stream有一個或者多個stream分割槽，每一個操作符有一個分割槽或者operator subtasks，operator subtasks之間是相互獨立的，執行在不同的執行緒中，或者不同的機器上或者容器中。 operator subtasks的數量是該操作符的並行度。一個stream的並行度由產生該stream的運算元決定，同一個程式的不同操作符的並行度可能不同。

One-to-one streams (for example between the Source and the map() operators in the figure above) preserve the partitioning and ordering of the elements.
Redistributing streams (as between map() and keyBy/window above, as well as between keyBy/window and Sink) change the partitioning of streams.

Windows

聚合事件在流上的處理不同於批處理。例如：由於流是無限的資料，因此就無法執行count計算所有元素個數，相反在count,sum運算元可以計算在視窗上，例如：計算前五分鐘或者計算上100個元素。視窗大小可以有時間確定或者元素個數確定（例如上100個元素）。滑動視窗有：滾動視窗（沒有重疊），滑動視窗（有重疊），和會話視窗(punctuated by a gap of inactivity).

Time

當在streaming程式中提time一般指的是如下不同概念： Event Time ：時間時間一般是指時間建立的時間，通常由一個時間戳描述事件，Flink訪問時間戳可以通過
Ingestion time：攝入時間，一般是指一個時間進入Flink dataflow的源操作符時間
Processing Time ：處理時間是本地時間，每一個操作符基於時間操作執行的時間
More details on how to handle time are in the event time docs.

Stateful Operations

然而許多操作在dataflow看起來只是一個事件，但是有些操作需要記住多個事件的資訊，這些操作我們稱為有狀態的。有狀態的操作的狀態我們可以認為是儲存在一個key/value的容器中。這些狀態和讀取狀態的操作符儲存在一起。

Checkpoints for Fault Tolerance

Flink實現stream repay和checkpoint實現容錯，checkpoint可以理解為一個時間點的備份，當出錯時候可以根據該點checkpoint進行恢復。

Batch on Streaming

批處理是streaming的一個特例，批處理中stream是一個有邊界的資料集，該資料集可以被看做是stream data，上述適用於流處理的概念都適用於批處理，但是有幾個特例： 1：DataSet API不需要使用checkpoint，因為資料是有邊界的，故障護恢復可以重新計算進行恢復，使用checkpoint恢復的成本一般大於常規的計算恢復成本。（對於DataSet的checkpoint的恢復成本較高） 2：Stateful operations in the DataSet API use simplified in-memory/out-of-core data structures, rather than key/value indexes（沒有理解）. 3：The DataSet API introduces special synchronized (superstep-based) iterations, which are only possible on bounded streams. For details, check out the iteration docs（沒有理解）.

Apache Flink® 入門介紹以及程式設計模型

Dataflow Programming Model (資料流程式設計模型)

Levels of Abstraction(抽象層次)

Programs and Dataflows

Parallel Dataflows

Windows

Time

Stateful Operations

Checkpoints for Fault Tolerance

Batch on Streaming

Apache Flink® 入門介紹以及程式設計模型

spark-sql的概述以及程式設計模型的介紹

JDBC 入門介紹以及 DAO 模式簡單介紹

轉載：Spark中文指南(入門篇)-Spark程式設計模型(一)

Apache Flink 入門示例demo

Apache Flink-程式設計指南-概念-程式設計模型

Flink說明文件介紹---資料流程式設計模型

《Apache Flink官方文件》程式設計模型

谷歌DataFlow程式設計模型以及Spark/Flink/StreamCQL的相關實現

Apache Flink 官方文件翻譯之程式設計模型

Apache Flink 1.3.0正式發布及其新功能介紹

【ASP.NET Core快速入門】（九） RoutingMiddleware介紹以及MVC引入

[Flink基本概念與部署]--Flink 程式設計模型【一】

spring—boot入門簡單介紹以及簡單配置

《從0到1學習Flink》—— Apache Flink 介紹

從Storm到Flink：大資料處理的開源系統及程式設計模型（文末福利）

Netty深入分析與Dubbo實戰解析（一）——網路程式設計模型介紹

（三）：Flink資料流程式設計模型

《Apache Flink官方文件》 Apache Flink介紹

Spark：Spark 程式設計模型及快速入門

Apache Flink® 入門介紹以及程式設計模型

Dataflow Programming Model (資料流程式設計模型)

Levels of Abstraction(抽象層次)

Programs and Dataflows

Parallel Dataflows

Windows

Time

Stateful Operations

Checkpoints for Fault Tolerance

Batch on Streaming

相關推薦