
Spark Streaming Fundamentals

What is Spark Streaming
As part of the UC Berkeley cloud-computing software stack, Spark Streaming is an application framework built on top of Spark. It uses Spark's underlying machinery as its execution foundation and adds the DStream abstraction on top of it. Using the API that DStream provides, users can perform operations such as count, join, and aggregate on data streams in real time.

  A Spark Streaming application is very similar to a Spark application; it consists of a driver program that runs the user's main function and continuously executes various parallel operations on input streams of data. The main abstraction Spark Streaming provides is a discretized stream (DStream), which is a continuous sequence of RDDs (distributed collections of elements) representing a continuous stream of data. DStreams can be created from live incoming data (such as data from a socket, Kafka, etc.) or can be generated by transforming existing DStreams using parallel operators like map, reduce, and window.


How to Use Spark Streaming
As an application framework built on top of Spark, Spark Streaming inherits Spark's programming style, so users who already know Spark can pick it up quickly. Let's walk through word count as an example of how Spark Streaming is used:
  import spark.streaming.{Seconds, StreamingContext}
  import spark.streaming.StreamingContext._
  ...
  // Create the context and set up a network input stream to receive from a host:port
  val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
  val lines = ssc.socketTextStream(args(1), args(2).toInt)
  // Split the lines into words, count them, and print some of the counts on the master
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
  wordCounts.print()
  // Start the computation
  ssc.start()

1. Create a StreamingContext object

Just as a Spark program starts by creating a SparkContext, a Spark Streaming program starts by creating a StreamingContext. The parameters needed to construct a StreamingContext are essentially the same as for a SparkContext: the master URL and an application name (here NetworkWordCount). The parameter Seconds(1) deserves special attention: Spark Streaming must be told the interval at which to process data. With 1s as in the example above, Spark Streaming processes data in one-second windows. This parameter should be set according to the application's requirements and the cluster's processing capacity.
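The example above targets the early spark.streaming package, where the master URL and application name are passed straight into the constructor. In newer releases the same setup is normally written against org.apache.spark.streaming through a SparkConf; a minimal sketch, with the master and the 5-second batch interval chosen purely for illustration:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Build the context from a SparkConf rather than a raw master string.
  // The 5-second batch interval is an illustrative value, not a recommendation.
  val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
  val ssc = new StreamingContext(conf, Seconds(5))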

2. Create an InputDStream

Like Storm's Spout, Spark Streaming must be told where its data comes from. In the example above, socketTextStream makes Spark Streaming read its data over a socket connection. Spark Streaming of course supports many other sources, including kafkaStream, flumeStream, fileStream, and networkStream.
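As a rough sketch of what two of these sources look like in code (written against the newer org.apache.spark.streaming API and reusing the ssc created above; the host, port, and directory are placeholders):

  import org.apache.spark.streaming.dstream.DStream

  // Text lines received over a TCP socket connection.
  val socketLines: DStream[String] = ssc.socketTextStream("localhost", 9999)
  // Text lines read from new files that appear in a monitored directory.
  val fileLines: DStream[String] = ssc.textFileStream("hdfs:///tmp/streaming-input")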

3. Operate on the DStream

The user can apply all kinds of operations to the DStream obtained from the source. The example above is a typical word count pipeline: the data received from the source within the current batch window is first split into words, then mapped and reduced in classic MapReduce style, and finally print() outputs the result.
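Besides the per-batch counting shown above, DStreams also offer windowed operations. A sketch of a sliding-window variant of the word count, reusing the lines stream from the example (the 30-second window and 10-second slide are illustrative values and must be multiples of the batch interval):

  // Count words over a 30-second window that slides forward every 10 seconds.
  val windowedCounts = lines
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
  windowedCounts.print()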

4. Start Spark Streaming

Everything up to this point only builds the execution pipeline: the program has not actually connected to the data source, nor has it processed any data; it has merely set up the execution plan. Only once ssc.start() is called does the program really carry out all of the intended operations.
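One caveat: with the newer org.apache.spark.streaming API the driver is normally also made to block after start() and to shut down explicitly, roughly as follows:

  ssc.start()             // start receiving data and scheduling jobs
  ssc.awaitTermination()  // block until stop() is called or an error occurs
  // ssc.stop(stopSparkContext = true)  // graceful shutdown, e.g. from a shutdown hook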

By now we have a rough picture of how Spark Streaming is used; next, let's look into the code behind it.

Spark Streaming Source Code Analysis
StreamingContext
Spark Streaming exposes its public interface through StreamingContext; users build their Spark Streaming applications with the API that StreamingContext provides.

StreamingContext internally maintains a SparkContext instance and performs RDD operations through it (a short sketch of this follows the list).
When instantiating a StreamingContext you must specify batchDuration, which sets the period of Spark Streaming's recurring job.
StreamingContext offers a variety of interfaces for creating DStreams from different data sources.
StreamingContext provides the API for starting and stopping the streaming job.
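As a small illustration of the first point, the wrapped SparkContext can be used directly, for example to build a static lookup RDD that every batch is joined against through transform(). A sketch against the newer API, reusing the words stream from the earlier example; the lookup data is made up:

  // The StreamingContext wraps a SparkContext; use it to create a static lookup RDD.
  val lookup = ssc.sparkContext.parallelize(Seq(("spark", "engine"), ("kafka", "broker")))
  // transform() exposes each batch as a plain RDD, so ordinary RDD operations such as join apply.
  val tagged = words.map(w => (w, 1)).transform(rdd => rdd.join(lookup))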

DStream
Spark Streaming is built on top of Spark: it wraps Spark's RDDs and, on top of them, abstracts a streaming representation of data, the DStream:
  A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data. DStreams can either be created from live data (such as data from HDFS, Kafka or Flume) or they can be generated by transforming existing DStreams using operations such as map, window and reduceByKeyAndWindow. While a Spark Streaming program is running, each DStream periodically generates an RDD, either from live data or by transforming the RDD generated by a parent DStream.



The main internal structure of DStream is shown below:
  abstract class DStream[T: ClassManifest] (
      @transient protected[streaming] var ssc: StreamingContext
    ) extends Serializable with Logging {
    initLogging()
    // =======================================================================
    // Methods that should be implemented by subclasses of DStream
    // =======================================================================
    /** Time interval after which the DStream generates a RDD */
    def slideDuration: Duration
    /** List of parent DStreams on which this DStream depends on */
    def dependencies: List[DStream[_]]
    /** Method that generates a RDD for the given time */
    /** The core function of DStream: every subclass must implement compute().
        Each concrete DStream implements its own specific logic in compute(),
        and the result of the computation is the newly generated RDD. */
    def compute (validTime: Time): Option[RDD[T]]
    // =======================================================================
    // Methods and fields available on all DStreams
    // =======================================================================
    // RDDs generated, marked as protected[streaming] so that testsuites can access it
    /** The HashMap of RDDs that every DStream maintains internally. A DStream is
        essentially a set of RDDs keyed by Time, and every operation on the DStream
        is internally mapped to operations on these RDDs. */
    @transient
    protected[streaming] var generatedRDDs = new HashMap[Time, RDD[T]] ()
    // Time zero for the DStream
    protected[streaming] var zeroTime: Time = null
    // Duration for which the DStream will remember each RDD created
    protected[streaming] var rememberDuration: Duration = null
    // Storage level of the RDDs in the stream
    protected[streaming] var storageLevel: StorageLevel = StorageLevel.NONE
    // Checkpoint details
    protected[streaming] val mustCheckpoint = false
    protected[streaming] var checkpointDuration: Duration = null
    protected[streaming] val checkpointData = new DStreamCheckpointData(this)
    // Reference to whole DStream graph
    /** Every DStream is registered with a DStreamGraph; the DStreamGraph drives the
        execution of all DStreams together with all of their dependencies. */
    protected[streaming] var graph: DStreamGraph = null
    protected[streaming] def isInitialized = (zeroTime != null)
    // Duration for which the DStream requires its parent DStream to remember each RDD created
    protected[streaming] def parentRememberDuration = rememberDuration
    ...


Internally, a DStream maintains a time-indexed set of RDDs; transformations and output operations on a DStream are translated internally into transformations and output operations on those RDDs.
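For instance, the DStream returned by map is, in a simplified sketch of the source of this era, just a thin wrapper that records its parent and, for every batch time, asks the parent for its RDD and applies the map function to it:

  class MappedDStream[T: ClassManifest, U: ClassManifest] (
      parent: DStream[T],
      mapFunc: T => U
    ) extends DStream[U](parent.ssc) {

    override def dependencies = List(parent)

    override def slideDuration: Duration = parent.slideDuration

    // Ask the parent for its RDD at validTime and apply mapFunc to every element.
    override def compute(validTime: Time): Option[RDD[U]] = {
      parent.getOrCompute(validTime).map(_.map[U](mapFunc))
    }
  }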

Let's now look at how a computation on a DStream is mapped onto a computation on RDDs.
  protected[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {
    // If this DStream was not initialized (i.e., zeroTime not set), then do it
    // If RDD was already generated, then retrieve it from HashMap
    generatedRDDs.get(time) match {
      // If an RDD was already generated and is being reused, then
      // probably all RDDs in this DStream will be reused and hence should be cached
      case Some(oldRDD) => Some(oldRDD)
      // if RDD was not generated, and if the time is valid
      // (based on sliding time of this DStream), then generate the RDD
      case None => {
        if (isTimeValid(time)) {
          /** For every computation, the DStream calls the compute() implemented by
              its subclass to produce the new RDD. */
          compute(time) match {
            case Some(newRDD) =>
              if (storageLevel != StorageLevel.NONE) {
                newRDD.persist(storageLevel)
                logInfo("Persisting RDD " + newRDD.id + " for time " + time + " to " + storageLevel + " at time " + time)
              }
              if (checkpointDuration != null && (time - zeroTime).isMultipleOf (checkpointDuration)) {
                newRDD.checkpoint()
                logInfo("Marking RDD " + newRDD.id + " for time " + time + " for checkpointing at time " + time)
              }
              /** The newly generated RDD is put into the HashMap. */
              generatedRDDs.put(time, newRDD)
              Some(newRDD)
            case None =>
              None
          }
        } else {
          None
        }
      }
    }
  }


Each submitted job performs its computation by calling getOrCompute():
  protected[streaming] def generateJob(time: Time): Option[Job] = {
    getOrCompute(time) match {
      case Some(rdd) => {
        val jobFunc = () => {
          val emptyFunc = { (iterator: Iterator[T]) => {} }
          context.sparkContext.runJob(rdd, emptyFunc)
        }
        Some(new Job(time, jobFunc))
      }
      case None => None
    }
  }


Job & Scheduler
From the DStream code we know that when generateJob() is called, the DStream computes or transforms itself through getOrCompute(). So when does Spark Streaming call generateJob()?

When a StreamingContext is instantiated, the user is required to set batchDuration, which defines the period of the recurring job: each time a batchDuration elapses, a new job is generated to compute the DStreams. This can be seen in the Scheduler code:
  val clockClass = System.getProperty("spark.streaming.clock", "spark.streaming.util.SystemClock")
  val clock = Class.forName(clockClass).newInstance().asInstanceOf[Clock]
  /** Spark Streaming creates a RecurringTimer inside the Scheduler. The timer's period is
      the user-specified batchDuration; whenever it fires, it calls the Scheduler's generateJobs(). */
  val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => generateJobs(new Time(longTime)))

The Scheduler's generateJobs(), shown below, calls DStreamGraph's generateJobs() and hands every resulting job to the JobManager to run:

  def generateJobs(time: Time) {
    SparkEnv.set(ssc.env)
    logInfo("\n-----------------------------------------------------\n")
    graph.generateJobs(time).foreach(jobManager.runJob)
    latestTime = time
    doCheckpoint(time)
  }


In DStreamGraph, generateJobs() looks like this:
  def generateJobs(time: Time): Seq[Job] = {
    this.synchronized {
      logInfo("Generating jobs for time " + time)
      val jobs = outputStreams.flatMap(outputStream => outputStream.generateJob(time))
      logInfo("Generated " + jobs.length + " jobs for time " + time)
      jobs
    }
  }


For each outputStream, generateJob() is called to transform or compute the DStream. Because computing an output depends on computing its dependencies, all of its dependencies end up being computed as well, which finally yields the output stream.
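The reason DStreamGraph only has to walk over outputStreams is that output operations such as print() and foreach() are what register a DStream with the graph in the first place. A simplified, paraphrased sketch of that registration (not verbatim source):

  def print() {
    // Wrap the printing logic in a ForEachDStream and register it as an output stream,
    // so that DStreamGraph.generateJobs(time) will generate a job for it every batch.
    def foreachFunc = (rdd: RDD[T], time: Time) => {
      println("-------------------------------------------")
      println("Time: " + time)
      rdd.take(11).foreach(println)
    }
    val outputStream = new ForEachDStream(this, foreachFunc)
    ssc.registerOutputStream(outputStream)
  }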

All of these operations are set in motion only after StreamingContext's start function is called:
  def start() {
    if (checkpointDir != null && checkpointDuration == null && graph != null) {
      checkpointDuration = graph.batchDuration
    }
    validate()
    /** The StreamingContext registers and starts all of the input streams. */
    val networkInputStreams = graph.getInputStreams().filter(s => s match {
        case n: NetworkInputDStream[_] => true
        case _ => false
      }).map(_.asInstanceOf[NetworkInputDStream[_]]).toArray
    if (networkInputStreams.length > 0) {
      // Start the network input tracker (must start before receivers)
      networkInputTracker = new NetworkInputTracker(this, networkInputStreams)
      networkInputTracker.start()
    }
    Thread.sleep(1000)
    // Start the Scheduler that drives the streaming computation
    scheduler = new Scheduler(this)
    scheduler.start()
  }


By now you should have a basic understanding of how Spark Streaming is used and how it is structured internally; we close this article with a flow chart of what happens after Spark Streaming starts.