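Spark Streaming + Flume for in-memory computation (small data volumes)
Architecture analysis
Spark Streaming is usually paired with Kafka, but when the data volume is small you can skip building a Kafka cluster and feed Spark Streaming directly from Flume. Flume can hand data to Spark Streaming in two ways: push, where Flume sends events to a Spark receiver, and pull, where Spark Streaming fetches events from Flume. Pull generally works better, because push can deliver data to only one Spark receiver, whereas pull lets Spark Streaming fetch from several Flume agents.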
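In the Spark API, the ability to pull from several agents shows up as the list of addresses passed to FlumeUtils.createPollingStream. The following is a minimal sketch of building that list. The helper name FlumeAddresses and the second agent address 192.168.1.104 are assumptions made for illustration and do not come from this setup.

```scala
import java.net.InetSocketAddress

object FlumeAddresses {
  // Hypothetical helper: turns "host:port,host:port" into the address list
  // that FlumeUtils.createPollingStream accepts as its second argument.
  // A malformed entry (no colon, non-numeric port) fails with an exception.
  def parse(spec: String): Seq[InetSocketAddress] =
    spec.split(",").toSeq.map(_.trim).filter(_.nonEmpty).map { entry =>
      val Array(host, port) = entry.split(":")
      new InetSocketAddress(host, port.toInt)
    }

  def main(args: Array[String]): Unit = {
    // 192.168.1.104 is an assumed second Flume agent, for illustration only
    parse("192.168.1.103:8888,192.168.1.104:8888").foreach(println)
  }
}
```

The resulting sequence could then be passed as FlumeUtils.createPollingStream(ssc, FlumeAddresses.parse(...), StorageLevel.MEMORY_AND_DISK), with one SparkSink configured on each agent.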
1. Install Flume on the server
Upload apache-flume-1.6.0-bin.tar.gz to the server.
Extract it:
tar -zxf apache-flume-1.6.0-bin.tar.gz
Rename the directory:
mv apache-flume-1.6.0-bin flume
2. Install the JDK
See my other blog post.
2. Edit the configuration file (key step)
cd flume/conf/
vi flume-pull.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
# Must be the address of the machine that Flume runs on
a1.sinks.k1.hostname = 192.168.1.103
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Save the file.
Alternatively, use the push approach:
vi flume-push.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
# This is the receiving side: use the IP address of the Spark worker that runs the receiver
a1.sinks.k1.hostname = 192.168.31.172
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Save the file.
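The Spark program shown later in this post covers only the pull approach. For the push configuration above, the Spark side would use FlumeUtils.createStream instead, which starts an Avro receiver that the Flume avro sink connects to. The following is a minimal sketch under that assumption. The object name FlumePushWordCount and the helper toWords are made up for illustration, and the host and port are copied from flume-push.conf.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushWordCount {
  // Hypothetical helper: decodes one Flume event body into words
  def toWords(body: Array[Byte]): Seq[String] =
    new String(body, "UTF-8").split(" ").toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumePushWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: Spark listens on this host and port, and Flume's avro sink sends to it.
    // The values must match a1.sinks.k1.hostname and a1.sinks.k1.port in flume-push.conf.
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.31.172", 8888)
    val counts = flumeStream
      .flatMap(e => toWords(e.event.getBody.array()))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because Flume is the one that connects in push mode, this program is usually started before the Flume agent so that the sink finds a listener.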
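3. Copy the required jars into Flume (important)
The pull approach needs these jars because Flume itself has to load the SparkSink class.
Copy spark-streaming-flume-sink_2.10-1.6.1.jar into Flume's lib directory
Copy commons-lang3-3.3.2.jar into Flume's lib directory
Copy scala-library-2.10.5.jar into Flume's lib directory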
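Those jars cover the Flume side only. The Spark program in step 6 also needs the Flume integration library on its own classpath. This post does not say which build tool was used, so the following is only a sketch assuming sbt, with versions chosen to match the jars listed above.

```scala
// build.sbt (assumed build file, not part of the original setup)
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // Core streaming API used by the program
  "org.apache.spark" %% "spark-streaming" % "1.6.1",
  // Provides org.apache.spark.streaming.flume.FlumeUtils
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"
)
```

With Maven, the equivalent artifacts would be spark-streaming_2.10 and spark-streaming-flume_2.10 at the same version.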
4. Set JAVA_HOME for Flume
vi flume-env.sh
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_161
Save the file.
5. Start Flume first
bin/flume-ng agent --conf conf --conf-file conf/flume-pull.conf --name a1 -Dflume.root.logger=INFO,console
6. Run the program locally
package cn.itcast.spark.day5

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Command used to start Flume:
// bin/flume-ng agent --conf conf --conf-file conf/flume-pull.conf --name a1 -Dflume.root.logger=INFO,console
object FlumePollWordCount {
  def main(args: Array[String]): Unit = {
    // Set the log level (LoggerLevels is a helper that is not shown in this post)
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    // Process the stream in 5-second batches
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull data from Flume (this is the Flume agent's address)
    val address = Seq(new InetSocketAddress("192.168.1.103", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    // Decode each event body, split it into words, and pair every word with 1
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    // Sum the counts per word within each batch
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
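The program calls LoggerLevels.setStreamingLogLevels(), but that helper is not shown here. The following is a minimal sketch of what such a helper might look like, assuming its only job is to lower Spark's log output to WARN through the log4j 1.x API that ships with Spark 1.6. The real helper may do more. To be usable from the program without an import, it would have to live in the same package (cn.itcast.spark.day5).

```scala
import org.apache.log4j.{Level, Logger}

object LoggerLevels {
  // Assumed behaviour: silence INFO logging so that the word counts
  // printed by results.print() stay readable on the console.
  def setStreamingLogLevels(): Unit = {
    Logger.getRootLogger.setLevel(Level.WARN)
  }
}
```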
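7. Create a file named test.txt on the server
vi test.txt
zhangsan is holle
lisi is hello
Save the file, then copy it into the spooling directory:
cp test.txt /export/data/flume/test.txt
Check the console output of the Spark program.
Done.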
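To know what to look for on the console, the same split, pair, and sum steps can be run on a plain Scala collection. This is only a sketch that mirrors the streaming logic without Spark or Flume. The object name ExpectedCounts is made up, and it assumes both lines of test.txt land in the same 5-second batch.

```scala
object ExpectedCounts {
  // Same steps as the streaming job: split on spaces, pair each word with 1,
  // then sum the ones per word.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // The two lines of test.txt, exactly as written above
    val lines = Seq("zhangsan is holle", "lisi is hello")
    count(lines).toSeq.sortBy(_._1).foreach(println)
  }
}
```

With that input, "is" is counted twice and every other word once, including "holle" and "hello" as separate words.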