Spark Study Notes (15): Spark Streaming Integration with Flume
阿新 · Published 2018-11-06
1 Flume configuration files
Configure JAVA_HOME in flume-env.sh.
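For example, add a line like the following to flume-env.sh (the JDK path is only an illustration; point it at your actual install location):
export JAVA_HOME=/usr/local/jdk1.8.0_161   # hypothetical JDK location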
1.1 flume-pull.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = node1
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
1.2 flume-push.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
# this is the receiver side (the machine running the Spark Streaming receiver)
a1.sinks.k1.hostname = 192.168.30.1
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2 Push example
2.1 pom
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.10</artifactId>
    <version>1.6.3</version>
</dependency>
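The Flume artifact alone is usually not enough to compile the examples below; the Spark core and streaming artifacts are typically declared as well (a sketch, assuming the same 1.6.3 / Scala 2.10 versions as the cluster):
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.3</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.3</version>
</dependency>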
2.2 Source code
package streamingAndflume

import mystreaming.LoggerLevels
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushCount {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("FlumePush").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Flume pushes events to Spark; this IP is the address of the local machine running the receiver
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.30.1", 8888)
    // The real payload of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
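The LoggerLevels helper imported above is not shown in this post. A minimal sketch of what it presumably does (an assumption; it simply quiets Spark's INFO logging so the word counts stay visible):
package mystreaming

import org.apache.log4j.{Level, Logger}

// Sketch of the helper used by the examples (assumption: the original class is not shown in the post)
object LoggerLevels {
  def setStreamingLogLevels(): Unit = {
    // Raise the log level so streaming output is not drowned out by INFO messages
    Logger.getRootLogger.setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  }
}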
2.3 Start Flume
Put some data files under /export/data/flume for the spooldir source to pick up.
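For example (file name and contents are arbitrary; the spooldir source picks up any new file in the directory):
echo "hello spark hello flume" > /export/data/flume/test.log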
From the Flume install directory (apache-flume-1.6.0-bin):
bin/flume-ng agent -n a1 -c conf -f conf/flume-push.conf
3 Poll approach
The corresponding jars must be added to Flume's lib directory; see the list sketched below.
Reference: https://spark.apache.org/docs/latest/streaming-flume-integration.html
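According to that guide, for Spark 1.6.3 with Scala 2.10 the jars to copy into Flume's lib directory are roughly the following (versions are an assumption; match them to your own Spark and Scala versions):
spark-streaming-flume-sink_2.10-1.6.3.jar
scala-library-2.10.5.jar
commons-lang3-3.3.2.jar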
3.1 Poll source code
package streamingAndflume

import java.net.InetSocketAddress

import mystreaming.LoggerLevels
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePollCount {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("FlumePoll").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull data from Flume: the addresses of the agents' SparkSinks (several can be listed)
    val address = Seq(new InetSocketAddress("node1", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    // The real payload of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
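Since createPollingStream takes a sequence of addresses, one stream can pull from several SparkSink agents at once. A small sketch reusing the ssc and imports above (node2 is a hypothetical second agent):
// Poll from two Flume agents, each exposing a SparkSink on port 8888 (node2 is hypothetical)
val addresses = Seq(
  new InetSocketAddress("node1", 8888),
  new InetSocketAddress("node2", 8888)
)
val multiStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK)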
3.2 Start Flume first, then start the program
From the Flume install directory (apache-flume-1.6.0-bin):
bin/flume-ng agent -n a1 -c conf -f conf/flume-pull.conf
4 Running the push approach on a cluster
Start the Spark cluster:
/home/hadoop/apps/spark-1.6.3-bin-hadoop2.6/sbin/start-all.sh
4.1 Source code
The source is packaged into a jar with Maven (a sample packaging command follows the code below).
package streamingAndflume

import mystreaming.LoggerLevels
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushCount {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val host = args(0)
    val port = args(1).toInt
    val conf = new SparkConf().setAppName("FlumePush")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Flume pushes events to Spark; host and port are taken from the command-line arguments
    val flumeStream = FlumeUtils.createStream(ssc, host, port)
    // The real payload of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
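A minimal sketch of the packaging step (assuming a standard Maven layout; the jar name depends on your artifactId and version, and here it is copied to the /home/hadoop/push.jar path used by spark-submit below):
mvn clean package
cp target/<your-artifact>.jar /home/hadoop/push.jar   # <your-artifact> is a placeholder
Note that spark-streaming-flume_2.10 is not part of the Spark distribution, so it and its transitive dependencies must either be bundled into the application jar (for example with the maven-shade plugin) or supplied at submit time via --jars or --packages.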
4.2 Submit the jar
Modify flume-push.conf:
a1.sinks.k1.hostname = node2
/home/hadoop/apps/spark-1.6.3-bin-hadoop2.6/bin/spark-submit --master spark://node1:7077 --class streamingAndflume.FlumePushCount /home/hadoop/push.jar node2 8888
4.3 Start Flume
From the Flume install directory (apache-flume-1.6.0-bin):
bin/flume-ng agent -n a1 -c conf -f conf/flume-push.conf