
Spark Streaming: receiving data from Kafka and writing it to HBase

Requirements

Kafka + Spark Streaming + Spark SQL + HBase
Output the TOP-5 ranking results
Use the rank as the RowKey, with word and count as the columns
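
The snippets below write to an HBase table named window with a single column family cf (both names come from the code later in the post). A minimal sketch, not part of the original post, for creating that table with the HBase 1.x client API, assuming hbase-site.xml is on the classpath:

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

object CreateWindowTable {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val admin = conn.getAdmin
    val name = TableName.valueOf("window")
    if (!admin.tableExists(name)) {
      // a single column family "cf" holds the word and count columns
      val desc = new HTableDescriptor(name)
      desc.addFamily(new HColumnDescriptor("cf"))
      admin.createTable(desc)
    }
    admin.close()
    conn.close()
  }
}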

Implementation

Create a Kafka producer that simulates randomly generated data

import java.util.Properties

import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

object producer {
  def main(args: Array[String]): Unit = {
    val topic = "words"
    val brokers = "master:9092,slave1:9092,slave2:9092"
    // old (Kafka 0.8) producer API configuration
    val prop = new Properties()
    prop.put("metadata.broker.list", brokers)
    prop.put("serializer.class", "kafka.serializer.StringEncoder")
    val kafkaConfig = new ProducerConfig(prop)
    val producer = new Producer[String, String](kafkaConfig)
    val content: Array[String] = new Array[String](5)
    content(0) = "kafka kafka produce"
    content(1) = "kafka produce message"
    content(2) = "hello world hello"
    content(3) = "wordcount topK topK"
    content(4) = "hbase spark kafka"
    while (true) {
      // pick one of the five sentences at random and send it to the topic
      val i = (math.random * 5).toInt
      producer.send(new KeyedMessage[String, String](topic, content(i)))
      println(content(i))
      Thread.sleep(200)
    }
  }
}

Create the Spark Streaming context

val conf = new SparkConf().setMaster("local[2]").setAppName("Networkcount")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1)) // 1-second batch interval

Configure Kafka and read the incoming data with KafkaUtils.createDirectStream

val topic = Set("words")
val brokers = "master:9092,slave1:9092,slave2:9092"
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")
// direct stream: each record arrives as a (key, message) pair
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topic)
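
createDirectStream with StringDecoder is the Kafka 0.8 direct API, so the project needs the spark-streaming-kafka artifact in addition to the Spark and HBase clients. A hedged build.sbt sketch of the dependencies (the versions are assumptions, not taken from the original post; adjust them to your cluster):

// build.sbt (versions are assumptions)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.2",
  "org.apache.spark" %% "spark-sql"             % "1.6.2",
  "org.apache.spark" %% "spark-streaming"       % "1.6.2",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.2",
  "org.apache.hbase" %  "hbase-client"          % "1.2.6",
  "org.apache.hbase" %  "hbase-common"          % "1.2.6",
  "org.apache.hbase" %  "hbase-server"          % "1.2.6"   // TableOutputFormat lives here
)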

Use Spark SQL for the word count and top-N processing, then write the results to HBase

    var rank = 0 // tracks the rank (row number) of the current record
    val sqlcontext = new SQLContext(sc)
    import sqlcontext.implicits._

    kafkaStream.window(Seconds(10), Seconds(3)).flatMap(line => {
      // the message body is the value of each (key, value) pair coming from Kafka
      Some(line._2.toString)
    }).foreachRDD({ rdd: RDD[String] =>

      // split each line into words and turn the RDD into a single-column DataFrame
      val df = rdd.flatMap(_.split(" ")).toDF.withColumnRenamed("_1", "word")
      df.registerTempTable("words")

      // word count ordered by frequency; keep the top 5 and attach a rank
      val ans = sqlcontext.sql("select word, count(*) as total from words group by word order by count(*) desc").limit(5).map(x => {
        rank += 1
        (rank, x.getString(0), x.getLong(1))
      })
      rank = 0

Method 1 for writing data to HBase (batch write)

      ans.map(x => {
        // one Put per record: the rank is the RowKey, word and count go into family "cf"
        val put = new Put(Bytes.toBytes(x._1.toString))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("word"), Bytes.toBytes(x._2.toString))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(x._3.toString))
        put
      }).foreachPartition(x => {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create)
        val table = conn.getTable(TableName.valueOf("window"))
        // an alternative way to obtain the table:
        //   var jobConf = new JobConf(HBaseConfiguration.create)
        //   val table = new HTable(jobConf, TableName.valueOf("window"))
        import scala.collection.JavaConversions._
        table.put(seqAsJavaList(x.toSeq)) // write all Puts of this partition in one batch
        table.close()
        conn.close()
      })

Method 2 for writing data to HBase (single-record writes)

      ans.foreachPartition(partitionRecords => {
        val tablename = "window"
        val hbaseconf = HBaseConfiguration.create()
        val conn = ConnectionFactory.createConnection(hbaseconf)
        val tableName = TableName.valueOf(tablename)
        val table = conn.getTable(tableName)
        partitionRecords.foreach(x => {
          // one Put per record, written immediately
          val put = new Put(Bytes.toBytes(x._1.toString))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("word"), Bytes.toBytes(x._2.toString))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(x._3.toString))
          table.put(put)
        })
        table.close()
        conn.close()
      })
    })

Using saveAsHadoopDataset
saveAsHadoopFile saves an RDD to files on HDFS and uses the old Hadoop API;
saveAsHadoopDataset saves an RDD to storage systems other than HDFS, such as HBase.

      val jobConf = new JobConf(HBaseConfiguration.create)
      jobConf.set(TableOutputFormat.OUTPUT_TABLE, "window")
      jobConf.setOutputFormat(classOf[TableOutputFormat]) // without this line the job fails with "Undefined job output-path"
      // A JobConf usually needs five things set or checked: the output path, the key class,
      // the value class, the OutputFormat of the RDD, and compression-related settings.
      ans.map(x => {
        val put = new Put(Bytes.toBytes(x._1.toString))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("word"), Bytes.toBytes(x._2.toString))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(x._3.toString))
        (new ImmutableBytesWritable, put)
      }).saveAsHadoopDataset(jobConf)

Using the new API: saveAsNewAPIHadoopDataset

      sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "lxw1234")
      var job = new Job(sc.hadoopConfiguration)
      job.setOutputKeyClass(classOf[ImmutableBytesWritable])
      job.setOutputValueClass(classOf[Result])
      job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]]) // this line raises an error; the cause is unclear
      ans.map(x => {
        val put = new Put(Bytes.toBytes(x._1.toString))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("word"), Bytes.toBytes(x._2.toString))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(x._3.toString))
        (new ImmutableBytesWritable, put)
      }).saveAsNewAPIHadoopDataset(job.getConfiguration) // running this throws a NullPointerException
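
Two things in the snippet above are worth flagging: the old-API TableOutputFormat (org.apache.hadoop.hbase.mapred) and the new-API one (org.apache.hadoop.hbase.mapreduce) are easy to mix up, and the output value class should be the Put we actually emit rather than Result. A hedged sketch of the new-API path under those assumptions, targeting the same window table as the rest of the post (my guess at a fix, not verified against the original environment):

      // assumes: import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
      //          import org.apache.hadoop.mapreduce.Job
      sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "window")
      val job = Job.getInstance(sc.hadoopConfiguration)
      job.setOutputKeyClass(classOf[ImmutableBytesWritable])
      job.setOutputValueClass(classOf[Put]) // the value we emit is a Put, not a Result
      job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
      ans.map(x => {
        val put = new Put(Bytes.toBytes(x._1.toString))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("word"), Bytes.toBytes(x._2.toString))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(x._3.toString))
        (new ImmutableBytesWritable, put)
      }).saveAsNewAPIHadoopDataset(job.getConfiguration)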

Start the Spark Streaming context

ssc.start()
ssc.awaitTermination() // wait for the computation to finish; call stop() to terminate manually

Run

The producer generating data:

(screenshot)

HBase data updated in real time:

Batch write: all cells of a row share the same timestamp.

(screenshot)

Single-record writes: the timestamps differ slightly.

(screenshot)