Spark Streaming: Core Concepts and Programming

阿新 • Published: 2019-01-17

Core concepts:

1) StreamingContext
2) DStream (batched RDDs)
3) Input DStream and Receiver
4) Transformation and Output Operation (the streaming counterparts of RDD transformations and actions)

Commonly used StreamingContext constructors:

    def this(sparkContext: SparkContext, batchDuration: Duration) = {
      this(sparkContext, null, batchDuration)
    }

    def this(conf: SparkConf, batchDuration: Duration) = {
      this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
    }

Set the batch interval according to the latency requirements of your application and the resources available in the cluster.

Once a StreamingContext has been defined, you can do the following:

Define the input sources by creating input DStreams.
Define the streaming computations by applying transformation and output operations to DStreams (much like working with RDDs).
Start receiving data and processing it using streamingContext.start().
Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
The processing can be manually stopped using streamingContext.stop().
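
The steps above can be sketched as a small word count over a socket. This is only an illustration under assumptions: the host localhost, the port 9999, the 5-second batch interval, and the object and helper names are placeholders that do not come from the original post. The context is created with the SparkConf-based constructor.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCountSketch {
  // Hypothetical helper: split a line on spaces and drop empty tokens.
  def tokenize(line: String): Seq[String] =
    line.split(" ").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // local[2]: one thread runs the socket receiver, one processes the data.
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Step 1: define the input source (host and port are placeholders).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Step 2: define the computation with transformations and an output operation.
    val counts = lines
      .flatMap(line => tokenize(line))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    // Step 3: start receiving and processing data.
    ssc.start()

    // Step 4: block until the job is stopped manually or fails.
    ssc.awaitTermination()
  }
}
```

To try it locally, a tool such as `nc -lk 9999` in another terminal can feed lines to the socket.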

Note:

Once a context has been started, no new streaming computations can be set up or added to it.
Once a context has been stopped, it cannot be restarted.
Only one StreamingContext can be active in a JVM at the same time.
stop() on StreamingContext also stops the SparkContext.
- To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
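
The last two notes can be sketched as follows: one SparkContext is shared by two StreamingContexts that run one after the other, and each is stopped with stopSparkContext set to false. The queue-backed input, the 1-second batch interval, the 3-second timeout, and all names are assumptions chosen only to keep the sketch short and self-contained.

```scala
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReuseSparkContextSketch {
  // Runs a short-lived streaming job on the given SparkContext, then stops
  // only the StreamingContext so the SparkContext stays usable.
  def runOnce(sc: SparkContext): Unit = {
    // Uses the constructor that wraps an existing SparkContext.
    val ssc = new StreamingContext(sc, Seconds(1))

    // A queue of RDDs stands in for a real source (made-up data).
    val queue = mutable.Queue[RDD[Int]](sc.parallelize(1 to 10))
    ssc.queueStream(queue).map(_ * 2).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(3000) // wait up to 3 seconds
    ssc.stop(stopSparkContext = false)  // keep the SparkContext alive
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ReuseSparkContextSketch")
    val sc = new SparkContext(conf)

    runOnce(sc) // first StreamingContext
    runOnce(sc) // a second one, created only after the first was stopped

    sc.stop()
  }
}
```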

Discretized Streams (DStreams)

Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.

Applying an operator such as map or flatMap to a DStream is translated under the hood into the same operation on every RDD in that DStream, because a DStream is made up of the RDDs of successive batches.

Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the following figure.

These underlying RDD transformations are computed by the Spark engine.

Input DStreams and Receivers

Every input DStream (except file stream) is associated with a Receiver object which receives the data from a source and stores it in Spark's memory for processing.

Points to remember:

When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL.
- Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data.
- Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
Extending the logic to running on a cluster,
- the number of cores allocated to the Spark Streaming application must be more than the number of receivers.
- Otherwise the system will receive data, but not be able to process it.
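
The thread rule above can be sketched with two socket receivers, which need at least local[3] when run locally. The ports, the batch interval, and the helper localMaster are hypothetical and serve only to illustrate the rule.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoReceiversSketch {
  // Hypothetical helper: the smallest local master URL that leaves one
  // thread free for processing (n must exceed the number of receivers).
  def localMaster(numReceivers: Int): String = s"local[${numReceivers + 1}]"

  def main(args: Array[String]): Unit = {
    val numReceivers = 2
    val conf = new SparkConf()
      .setMaster(localMaster(numReceivers)) // "local[3]" for two receivers
      .setAppName("TwoReceiversSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each socket stream starts its own receiver, and each receiver
    // occupies one thread for the lifetime of the job.
    val streamA = ssc.socketTextStream("localhost", 9998)
    val streamB = ssc.socketTextStream("localhost", 9999)

    // The remaining thread processes the merged data.
    streamA.union(streamB).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With local[2] in this setup, both threads would be taken by the receivers and no batch would ever be processed.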

Spark Streaming provides two categories of built-in streaming sources.

Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
Advanced sources: Sources like Kafka, Flume, etc. are available through extra utility classes.
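
A queue of RDDs is another source that can be created directly on the StreamingContext (via queueStream), which makes it convenient for trying out the earlier point that an operator applied to a DStream runs on each underlying RDD. In the sketch below, flatMap on the DStream and the same flatMap written per RDD inside transform are expected to print the same words for every batch. The sample data, batch interval, timeout, and names are made up for illustration.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PerRddSketch {
  // Hypothetical helper shared by both variants.
  def splitLine(line: String): Seq[String] = line.split(" ").toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("PerRddSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Two batches of made-up input; by default queueStream consumes
    // one queued RDD per batch interval.
    val queue = mutable.Queue[RDD[String]](
      sc.parallelize(Seq("hello spark", "hello streaming")),
      sc.parallelize(Seq("one rdd per batch"))
    )
    val lines = ssc.queueStream(queue)

    // Operator applied at the DStream level.
    val words = lines.flatMap(line => splitLine(line))

    // The same operation spelled out against each underlying RDD.
    val wordsPerRdd = lines.transform(rdd => rdd.flatMap(line => splitLine(line)))

    words.print()
    wordsPerRdd.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run
    ssc.stop()
  }
}
```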

For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created via StreamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory). Spark Streaming will monitor the directory dataDirectory and process any files created in that directory.

A simple directory can be monitored, such as "hdfs://namenode:8040/logs/". All files directly under such a path will be processed as they are discovered.
A POSIX glob pattern can be supplied, such as "hdfs://namenode:8040/logs/2017/*". Here, the DStream will consist of all files in the directories matching the pattern.
- That is: it is a pattern of directories, not of files in directories.
All files must be in the same data format.
A file is considered part of a time period based on its modification time, not its creation time.
Once processed, changes to a file within the current window will not cause the file to be reread.
- That is: updates are ignored.
The more files under a directory, the longer it will take to scan for changes — even if no files have been modified.
If a wildcard is used to identify directories, such as "hdfs://namenode:8040/logs/2016-*", renaming an entire directory to match the path will add the directory to the list of monitored directories. Only the files in the directory whose modification time is within the current window will be included in the stream.
Calling FileSystem.setTimes() to fix the timestamp is a way to have the file picked up in a later window, even if its contents have not changed.
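
A minimal sketch of the file source described above, assuming plain text files. The default directory reuses the example path from the list, and the helper resolveDirectory, the batch interval, and the object name are hypothetical. For text files, textFileStream(dataDirectory) is a shorter way to get the lines.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamSketch {
  // Hypothetical helper: take the directory from the first argument,
  // falling back to the example path used in the text.
  def resolveDirectory(args: Array[String]): String =
    args.headOption.getOrElse("hdfs://namenode:8040/logs/")

  def main(args: Array[String]): Unit = {
    val dataDirectory = resolveDirectory(args)
    val conf = new SparkConf().setMaster("local[2]").setAppName("FileStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Generic form: key class, value class, and (new-API) input format class.
    val records = ssc.fileStream[LongWritable, Text, TextInputFormat](dataDirectory)

    // Convert the Hadoop Text value to a String right away.
    val lines = records.map { case (_, text) => text.toString }

    // Shortcut for text files:
    // val lines = ssc.textFileStream(dataDirectory)

    // Print the number of lines found in newly discovered files per batch.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Only files whose modification time falls inside the current window are picked up, so files already sitting in the directory when the job starts are generally not processed.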
