Spark Streaming + Flume for in-memory computation (small data volumes)

阿新 • Published: 2018-12-12

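Architecture overview
Spark Streaming is usually paired with Kafka, but when your data volume is small you can skip building a Kafka cluster and feed Spark Streaming directly from Flume. Flume can hand data to Spark Streaming in two ways: push and pull. With push, Flume sends events to a single Spark receiver at a fixed address. With pull, Spark Streaming fetches the events from Flume, and one stream can pull from several Flume agents. For that reason pull generally works better.
1. Install Flume on the server
Upload apache-flume-1.6.0-bin.tar.gz to the server.
Extract it: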
tar -zxf apache-flume-1.6.0-bin.tar.gz
Rename the directory:
mv apache-flume-1.6.0-bin flume
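Flume also needs a JDK on the server. For installation steps, see my other blog post (https://blog.csdn.net/qq_16563637/article/details/81738113).
2. Edit the configuration file (key step)
For the pull approach, create flume-pull.conf: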
cd flume/conf/
vi flume-pull.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
# Must be the address of the machine that Flume itself runs on
a1.sinks.k1.hostname = 192.168.1.103
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Save the file.
Alternatively, for the push approach, create flume-push.conf:
vi flume-push.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
# This is the receiving side: use the IP address of the Spark worker that runs the receiver
a1.sinks.k1.hostname = 192.168.31.172
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Save the file.
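The post only shows the Spark program for the pull approach. For the push configuration, the Spark side has to run an Avro receiver that Flume's avro sink connects to. The following is a minimal sketch, assuming Spark 1.6.x with the spark-streaming-flume library on the classpath. The object name FlumePushWordCount is made up for illustration, and the host and port simply mirror a1.sinks.k1.hostname and a1.sinks.k1.port from flume-push.conf.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical name; this object is not part of the original post.
object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing the batches
    val conf = new SparkConf().setAppName("FlumePushWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Starts an Avro receiver on this host and port.
    // Both values are assumed to match the avro sink in flume-push.conf.
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.31.172", 8888)

    // Decode each event body, split it into words, and count them per batch
    val words = flumeStream
      .flatMap(e => new String(e.event.getBody.array(), "UTF-8").split(" "))
      .map((_, 1))
    words.reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In push mode the start order is the reverse of the pull setup: this program should be running first so that the avro sink has a receiver to connect to, and Flume is then started with conf/flume-push.conf. The address passed to createStream has to be one that the machine running the receiver can actually bind to.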
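3. Copy the jars needed by SparkSink into Flume's lib directory (important, pull approach)
Copy spark-streaming-flume-sink_2.10-1.6.1.jar into Flume's lib directory.
Copy commons-lang3-3.3.2.jar into Flume's lib directory.
Copy scala-library-2.10.5.jar into Flume's lib directory.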
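Those three jars are for the Flume side. The Spark program also needs the Flume integration library on its own classpath. The post does not show its build file, so the fragment below is only a guess at a compatible setup, assuming an sbt build with Spark 1.6.1 on Scala 2.10 to match the jar versions listed here.

```scala
// build.sbt (assumed; the original post does not include a build definition)
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // Core streaming API: StreamingContext, Seconds
  "org.apache.spark" %% "spark-streaming"       % "1.6.1",
  // Flume integration: FlumeUtils.createPollingStream / createStream
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"
)
```

Keeping the Scala and Spark versions aligned with the sink jar copied into Flume avoids mixing incompatible builds on the two sides.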
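4. Set JAVA_HOME for Flume
In flume/conf, edit flume-env.sh (if it does not exist yet, copy it from flume-env.sh.template):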
vi flume-env.sh
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_161
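Save the file.
5. Start Flume first
Run this from the flume directory, so that the sink is already listening when the Spark program starts pulling: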
bin/flume-ng agent --conf conf --conf-file conf/flume-pull.conf --name a1 -Dflume.root.logger=INFO,console
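Before launching the Spark program from a local machine, it can help to confirm that the sink's port is reachable from there. This is a small sketch using only the JDK socket API. PortCheck and isReachable are hypothetical names, and the address and port are the ones from flume-pull.conf.

```scala
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper; not part of the original post.
object PortCheck {
  /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
  def isReachable(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      // Covers refused connections, timeouts, and unresolved hosts
      case _: java.io.IOException => false
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Address and port taken from a1.sinks.k1 in flume-pull.conf
    println(isReachable("192.168.1.103", 8888))
  }
}
```

If the check fails, the usual suspects are a firewall between the two machines or a hostname in the config that does not match the machine's real address.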
6. Run the program locally

package cn.itcast.spark.day5

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Command used to start Flume:
//bin/flume-ng agent --conf conf --conf-file conf/flume-pull.conf --name a1 -Dflume.root.logger=INFO,console
object FlumePollWordCount {
  def main(args: Array[String]): Unit = {
    // Set the log level (LoggerLevels is a helper from the author's project, not shown here)
    LoggerLevels.setStreamingLogLevels()

    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull data from Flume (this is the Flume agent's address)
    val address = Seq(new InetSocketAddress("192.168.1.103", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_,1))
    val results = words.reduceByKey(_+_)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
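The program calls LoggerLevels.setStreamingLogLevels(), a helper that the post does not include. Below is a minimal stand-in, assuming log4j 1.x (the logging backend bundled with Spark 1.6) and assuming the helper's only job is to quiet Spark's INFO output. The author's real helper may do more.

```scala
package cn.itcast.spark.day5

import org.apache.log4j.{Level, Logger}

// Assumed stand-in for the helper referenced by FlumePollWordCount.
object LoggerLevels {
  def setStreamingLogLevels(): Unit = {
    // Show only warnings and errors so the printed word counts stay readable
    Logger.getRootLogger.setLevel(Level.WARN)
  }
}
```

Placing it in the same package as FlumePollWordCount lets the program compile without an extra import.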

7. Create a file test.txt on the server with the following two lines
vi test.txt
zhangsan is holle
lisi is hello
Save the file, then copy it into the spooling directory:
cp test.txt /export/data/flume/test.txt
Check the console output of the Spark program.
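To see which counts to expect, the same flatMap, map, and reduce logic can be walked through locally without Spark or Flume. This sketch assumes each line of test.txt arrives as one event body. WordCountSketch and countWords are hypothetical names.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical local walk-through; no Spark or Flume required.
object WordCountSketch {
  /** Decodes each event body as UTF-8, splits it on spaces, and counts the words. */
  def countWords(bodies: Seq[ByteBuffer]): Map[String, Int] =
    bodies
      .flatMap(b => StandardCharsets.UTF_8.decode(b.duplicate()).toString.split(" "))
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // The two lines of test.txt, each wrapped as an event body
    val lines = Seq("zhangsan is holle", "lisi is hello")
    val bodies = lines.map(l => ByteBuffer.wrap(l.getBytes(StandardCharsets.UTF_8)))
    countWords(bodies).toSeq.sortBy(_._1).foreach(println)
  }
}
```

For the sample file this gives a count of 2 for "is" and 1 for every other word. In the real job the counts are per 5-second batch, so they match only if both lines land in the same batch.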
Done.
