Getting Streaming data from Kafka with Spark Streaming using Python

If you are looking to use Spark to perform data transformation and manipulation on data ingested with Kafka, then you are in the right place.

In this article, we are going to look at Spark Streaming, one of several libraries exposed by the Spark platform.

Spark Streaming provides a way to process unbounded data, commonly known as streaming data. You can read more at https://spark.apache.org/docs/latest/streaming-programming-guide.html.

With Spark Streaming we process data in small batches, and we can act on each micro-batch as it arrives. The benefit of stream processing is being able to take action while an event is occurring.

The use case I am showing here is very simple: unbounded data read from a Kafka topic. I just print each message as a plain string.

Preparing the Environment

We need to make sure that the packages we use are available to Spark. Instead of downloading jar files and worrying about paths, we can use the --packages option and specify the group/artifact/version based on what's available on Maven, and Spark will handle the downloading. We set PYSPARK_SUBMIT_ARGS so that this gets passed correctly when executing from within a Python shell (a sketch of this follows the findspark snippet below). There are two options for working with PySpark:

1. Install pyspark using pip

2. Use the findspark library if you already have Spark running.

I am choosing option 2 for now, as I am running HDP 2.6 on my end.

import os
import findspark
findspark.init('/usr/hdp/2.5.6.0-40/spark')
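
As mentioned above, the Kafka integration jar is pulled in via --packages, so PYSPARK_SUBMIT_ARGS has to be set before the Spark context is created. Here is a minimal sketch; the artifact coordinates are an assumption and must match your Spark and Scala versions:

# Assumption: pick the spark-streaming-kafka artifact that matches
# your Spark/Scala build; this coordinate is only an example.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 '
    'pyspark-shell')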

Import dependencies

We need to import the necessary PySpark modules for Spark, Spark Streaming, and Spark Streaming with Kafka.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

Create Spark context

The Spark context is the primary object under which everything else is called. The setLogLevel call is optional.

sc = SparkContext(appName="PythonSparkStreamingKafka")
sc.setLogLevel("WARN")

Create Streaming Context

We pass the Spark context (from above) along with the batch duration, which here is set to 60 seconds.

# Batch interval of 60 seconds
ssc = StreamingContext(sc, 60)

Connect to Kafka

Using the native Spark Streaming Kafka capabilities, we use the streaming context from above to connect to our Kafka cluster.

# createStream(ssc, zkQuorum, groupId, {topic: numPartitions})
kafkaStream = KafkaUtils.createStream(ssc, 'victoria.com:2181', 'spark-streaming', {'imagetext': 1})
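
createStream is the receiver-based approach: after the streaming context come the ZooKeeper quorum, a consumer group id, and a dict mapping each topic to the number of partitions to consume. Spark Streaming also offers a receiver-less direct approach; here is a minimal sketch, where the broker address is an assumption (only the ZooKeeper address appears above):

# Assumption: a Kafka broker listening on victoria.com:9092.
directStream = KafkaUtils.createDirectStream(
    ssc, ['imagetext'], {'metadata.broker.list': 'victoria.com:9092'})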

Message Processing

The inbound stream is a DStream, which supports various built-in transformations such as map.

# Each element is a (key, message) tuple; keep just the message payload
lines = kafkaStream.map(lambda x: x[1])
lines.pprint()
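
Printing the raw messages is only the simplest transformation. As an illustration (not part of the original use case), the same DStream API can, for example, count the messages in each batch:

# Count how many messages arrived in each 60-second batch.
counts = lines.count().map(lambda c: 'Messages in this batch: %s' % c)
counts.pprint()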

Start the streaming context

Having defined the streaming context, we are now ready to actually start it. When you run this cell, the program starts, and the output of the pprint calls above appears below it.

ssc.start()  
ssc.awaitTermination()
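
Note that awaitTermination blocks indefinitely. Alternatively, you can bound the wait and then stop the context gracefully, letting in-flight batches finish; a sketch:

# Wait up to 5 minutes, then shut down, allowing queued batches to complete.
ssc.awaitTerminationOrTimeout(300)
ssc.stop(stopSparkContext=True, stopGraceFully=True)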

You can find the full code in my GitHub repo.