
Kafka->SparkStreaming->Hbase (Part 2)

  Based on business requirements, data is extracted from Kafka and inserted into HBase. Many related articles can already be found online; this series introduces an open-source tool from GitHub.

  The previous chapter explained why SparkOnHbase was chosen as the main prototype and modified into the source code we need. The modified source is given here; it matches our business requirements while avoiding unnecessary side effects as far as possible. Execution efficiency will be optimized at a later stage.

Source Code

class HBaseContext(
  @transient sc:        SparkContext,
  @transient config:    Configuration,
  metas:                java.util.HashMap[String, java.util.HashMap[String, java.util.HashMap[String, ColumnInfo]]],
  val tmpHdfsConfgFile: String                                                                                      = null)
extends Serializable with Logging {
  @transient var credentials = SparkHadoopUtil.get.getCurrentUserCredentials()
  @transient var tmpHdfsConfiguration: Configuration = config
  @transient var appliedCredentials = false
  @transient var metasLocal = metas
  @transient val job = new Job(config)
  TableMapReduceUtil.initCredentials(job)

  val broadcastedConf = sc.broadcast(new SerializableWritable(config))
  val credentialsConf = sc.broadcast(new SerializableWritable(job.getCredentials()))
  val broadcastMetas = sc.broadcast(metas)

  if (tmpHdfsConfgFile != null && config != null) {
    val fs = FileSystem.newInstance(config)
    val tmpPath = new Path(tmpHdfsConfgFile)
    if (!fs.exists(tmpPath)) {
      val outputStream = fs.create(tmpPath)
      config.write(outputStream)
      outputStream.close()
    } else {
      logWarning("tmpHdfsConfigDir " + tmpHdfsConfgFile + " exist!!")
    }
  }

  def mapPartition[T, R: ClassTag](
    rdd: RDD[T],
    mp: (Iterator[T], HConnection) => Iterator[R]): RDD[R] = {
    rdd.mapPartitions[R](it => hbaseMapPartition[T, R](
      broadcastedConf,
      it,
      mp), true)
  }

  def applyCreds[T](configBroadcast: Broadcast[SerializableWritable[Configuration]]) {
    credentials = SparkHadoopUtil.get.getCurrentUserCredentials()
    logInfo("appliedCredentials:" + appliedCredentials + ",credentials:" + credentials)
    if (appliedCredentials == false && credentials != null) {
      appliedCredentials = true
      logCredInformation(credentials)
      @transient val ugi = UserGroupInformation.getCurrentUser()
      ugi.addCredentials(credentials)
      ugi.setAuthenticationMethod(AuthenticationMethod.PROXY)
      ugi.addCredentials(credentialsConf.value.value)
    }
  }

  def logCredInformation[T](credentials2: Credentials) {
    logInfo("credentials:" + credentials2)
    for (a <- 0 until credentials2.getAllSecretKeys.size()) {
      logInfo("getAllSecretKeys:" + a + ":" + credentials2.getAllSecretKeys.get(a))
    }
    val it = credentials2.getAllTokens.iterator()
    while (it.hasNext) {
      logInfo("getAllTokens:" + it.next())
    }
  }

  def bulkMutation[T](rdd: RDD[T], fun: (T) => (DataEntity), autoFlush: Boolean) {
    rdd.foreachPartition(
      it => {
        hbaseForeachPartition[T](
          broadcastedConf,
          broadcastMetas,
          it,
          (iter, hConnection, metas) => {
            iter.foreach(item => {
              val entity = fun(item)
              val dbName = entity.dbName
              val tabName = entity.tabName
              if (metas.containsKey(dbName) && metas.get(dbName).containsKey(tabName)) {
                val htable = hConnection.getTable(entity.dbName + ":" + entity.tabName)
                htable.setAutoFlush(autoFlush, true)
                entity.`type` match {
                  case "INSERT" | "insert" => {
                    val insertPuts = Instance.insert(entity, metas)
                    if (null != insertPuts && insertPuts.size() > 0) htable.batch(insertPuts)
                  }
                  case "UPDATE" | "update" => {
                    val updatePuts = Instance.update(entity, metas)
                    if (null != updatePuts && updatePuts.size() > 0) htable.batch(updatePuts)
                  }
                  case "DELETE" | "delete" => {
                    val deleteDels = Instance.delete(entity)
                    if (null != deleteDels && deleteDels.size() > 0) htable.batch(deleteDels)
                  }
                  case all: Any => {
                    logInfo("other operation: " + all)
                  }
                }
                htable.flushCommits()
                htable.close()
              }
            })
          })
      })
  }

  def hbaseRDD[U: ClassTag](tableName: String, scan: Scan, f: ((ImmutableBytesWritable, Result)) => U): RDD[U] = {
    var job: Job = new Job(getConf(broadcastedConf))
    TableMapReduceUtil.initCredentials(job)
    TableMapReduceUtil.initTableMapperJob(tableName, scan, classOf[IdentityTableMapper], null, null, job)
    sc.newAPIHadoopRDD(
      job.getConfiguration(),
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result]).map(f)
  }

  def hbaseRDD(tableName: String, scans: Scan): RDD[(Array[Byte], java.util.List[(Array[Byte], Array[Byte], Array[Byte])])] = {
    hbaseRDD[(Array[Byte], java.util.List[(Array[Byte], Array[Byte], Array[Byte])])](
      tableName,
      scans,
      (r: (ImmutableBytesWritable, Result)) => {
        val it = r._2.list().iterator()
        val list = new ArrayList[(Array[Byte], Array[Byte], Array[Byte])]()
        while (it.hasNext()) {
          val kv = it.next()
          list.add((kv.getFamily(), kv.getQualifier(), kv.getValue()))
        }
        (r._1.copyBytes(), list)
      })
  }

  private def hbaseForeachPartition[T](
    configBroadcast: Broadcast[SerializableWritable[Configuration]],
    metasBroadcast: Broadcast[HashMap[String, HashMap[String, HashMap[String, ColumnInfo]]]],
    it: Iterator[T],
    fun: (Iterator[T], HConnection, HashMap[String, HashMap[String, HashMap[String, ColumnInfo]]]) => Unit) = {
    val config = getConf(configBroadcast)
    val metas = getMetas(metasBroadcast)
    applyCreds(configBroadcast)
    val hConnection = HConnectionManager.createConnection(config)
    fun(it, hConnection, metas)
    hConnection.close()
  }

  /**
   * @desc get METAS from broadcast or driver's configure
   */
  private def getMetas(metasBroadcast: Broadcast[HashMap[String, HashMap[String, HashMap[String, ColumnInfo]]]]): HashMap[String, HashMap[String, HashMap[String, ColumnInfo]]] = {
    if (null != metasLocal) {
      return metasLocal
    } else {
      try {
        metasLocal = metasBroadcast.value
        metasLocal
      } catch {
        case ex: Exception => {
          logInfo("Unable to getConfig from broadcast")
        }
      }
    }
    metasLocal
  }

  private def getConf(configBroadcast: Broadcast[SerializableWritable[Configuration]]): Configuration = {
    if (tmpHdfsConfiguration != null) {
      tmpHdfsConfiguration
    } else if (tmpHdfsConfgFile != null) {
      val fs = FileSystem.newInstance(SparkHadoopUtil.get.conf)
      val inputStream = fs.open(new Path(tmpHdfsConfgFile))
      tmpHdfsConfiguration = new Configuration(false)
      tmpHdfsConfiguration.readFields(inputStream)
      inputStream.close()
      tmpHdfsConfiguration
    }
    if (tmpHdfsConfiguration == null) {
      try {
        tmpHdfsConfiguration = configBroadcast.value.value
        tmpHdfsConfiguration
      } catch {
        case ex: Exception => {
          println("Unable to getConfig from broadcast")
        }
      }
    }
    tmpHdfsConfiguration
  }

  private def hbaseMapPartition[K, U](
    configBroadcast: Broadcast[SerializableWritable[Configuration]],
    it: Iterator[K],
    mp: (Iterator[K], HConnection) => Iterator[U]): Iterator[U] = {
    val config = getConf(configBroadcast)
    applyCreds(configBroadcast)
    val hConnection = HConnectionManager.createConnection(config)
    val res = mp(it, hConnection)
    hConnection.close()
    res
  }

  private class GetMapPartition[T, U](
    tableName: String,
    batchSize: Integer,
    makeGet: (T) => Get,
    convertResult: (Result) => U) extends Serializable {

    def run(iterator: Iterator[T], hConnection: HConnection): Iterator[U] = {
      val htable = hConnection.getTable(tableName)
      val gets = new ArrayList[Get]()
      var res = List[U]()
      while (iterator.hasNext) {
        gets.add(makeGet(iterator.next))
        if (gets.size() == batchSize) {
          val results = htable.get(gets)
          res = res ++ results.map(convertResult)
          gets.clear()
        }
      }
      if (gets.size() > 0) {
        val results = htable.get(gets)
        res = res ++ results.map(convertResult)
        gets.clear()
      }
      htable.close()
      res.iterator
    }
  }

  def fakeClassTag[T]: ClassTag[T] = ClassTag.AnyRef.asInstanceOf[ClassTag[T]]
}
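
  To make the bulkMutation contract concrete, below is a minimal sketch of what the DataEntity produced by the parse function could look like. Only dbName, tabName and type are actually read by the code above; rowKey and columns are assumptions about what Instance.insert/update/delete would need, not the project's real definition.

    // Hypothetical shape of the entity consumed by bulkMutation. dbName, tabName
    // and `type` are the fields referenced in HBaseContext above; rowKey and
    // columns are assumed fields, used here only for illustration.
    case class DataEntity(
      dbName:  String,                          // HBase namespace, joined as "dbName:tabName"
      tabName: String,                          // table name inside that namespace
      `type`:  String,                          // "INSERT" | "UPDATE" | "DELETE"
      rowKey:  String,                          // row key derived from the source record
      columns: java.util.Map[String, String])   // qualifier -> value

    // Purely illustrative instance for an insert into "demo:user".
    val entity = DataEntity("demo", "user", "INSERT", "10001",
      java.util.Collections.singletonMap("name", "alice"))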

  According to our requirements, the HBaseContext source code was refactored and unnecessary code was removed, guaranteeing at the source level that the program fits our application scenario.
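
  As a quick illustration of the read path that was kept (hbaseRDD), the sketch below drives it with a plain Scan; the table name and scan settings are placeholders, and hbaseContext is assumed to be an HBaseContext built as in the streaming code that follows.

    import org.apache.hadoop.hbase.client.Scan
    import org.apache.hadoop.hbase.util.Bytes

    // Placeholder table and scan tuning; adjust to the real namespace:table.
    val scan = new Scan()
    scan.setCaching(500)        // rows fetched per RPC
    scan.setCacheBlocks(false)  // avoid polluting the block cache during a full scan

    // Each element is (rowKey, java.util.List of (family, qualifier, value)).
    val rows = hbaseContext.hbaseRDD("ns:table", scan)
    rows.take(10).foreach { case (rowKey, cells) =>
      println(Bytes.toString(rowKey) + " -> " + cells.size() + " cells")
    }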

Spark Streaming Code


    /** initialize ZK UTIL */
    @transient val zkUtil = new CuratorUtil()

    /** get initialize parameters */
    val offsetPath = PropertiesUtil.getProperty(ConstantUtil.ZOOKEEPER_SPARK_PATH)
    zkUtil.createZKNodePer(offsetPath, null)

    val topic = PropertiesUtil.getProperty(ConstantUtil.KAFKA_TOPIC_NAME)
    val recTime = Integer.parseInt(PropertiesUtil.getProperty(ConstantUtil.STREAMING_RECTCKE_TIME))
    val ZK_MYSQL_PATH = PropertiesUtil.getProperty(ConstantUtil.ZOOKEEPER_NAMESPACE_MYSQL_TABLES);
    val brokerList = PropertiesUtil.getProperty(ConstantUtil.KAFKA_BROKER_LIST);

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokerList,
      "zookeeper.connect" -> PropertiesUtil.getProperty(ConstantUtil.ZOOKEEPER_SERVER_LIST),
      "group.id" -> PropertiesUtil.getProperty(ConstantUtil.KAFKA_CONSUMER_GROUPID))

    /** initialize HBASE METAS for filter */
    @transient @volatile var metas: java.util.HashMap[String, java.util.HashMap[String, java.util.HashMap[String, ColumnInfo]]] = Instance.paserMetas(zkUtil, ZK_MYSQL_PATH)
    if (metas.size() < 1) {
      println("load hbase tablem metas failed!")
      return ;
    }

    /**  initialize Context */
    // configure
    @transient val sparkConf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", PropertiesUtil.getProperty(ConstantUtil.STREAMING_BACK_ENABLED)) // 設定可以限制
      .set("spark.streaming.kafka.maxRatePerPartition", PropertiesUtil.getProperty(ConstantUtil.STREAMING_KAFKA_MAXRATE)) // 設定具體限制數量:records/SEC
      .set("spark.streaming.stopGracefullyOnShutdown", PropertiesUtil.getProperty(ConstantUtil.STREAMING_SHUTDOWN_GRACEFULLLY)) // 設定Gracefully stop
      .set("serializer.class", "kafka.serializer.StringEncoder")
    @transient val hbaseConf = HBaseConfiguration.create();
    hbaseConf.addResource("/etc/hbase/conf.cloudera.hbase/hbase-site.xml")
    hbaseConf.addResource("/etc/hbase/conf.cloudera.hbase/core-site.xml")
    @transient val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(recTime));

    val fromOffsets = readOffsetData(zkUtil, offsetPath, topic, brokerList, 9092)
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message()))

    stream.foreachRDD(rdd => {

      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges.map { offset => (offset.partition, offset.fromOffset) }
      writeOffsetData(zkUtil, offsetPath, offsets)

      val hbaseContext = new HBaseContext(sc, hbaseConf, metas)
      hbaseContext.bulkMutation(rdd.map(item => item._2), (KV: String) => {
        Instance.parse(KV)
      }, false)

    })

    /** add gracefully stop control */
    Runtime.getRuntime.addShutdownHook(new Thread {
      override def run(): Unit = {
        try {
          zkUtil.close()
        } catch {
          case e: Exception => {
          }
        }
        ssc.stop(true, true)
      }
    })

    /** spark streaming start and wait termination */
    ssc.start()
    ssc.awaitTermination()

  }

 /**
   * @desc read data from Zookeeper
   */
  def readOffsetData(zkUtil: CuratorUtil, offsetPath: String, topic: String, brokerList: String, kafkaPort: Integer): Map[TopicAndPartition, Long] = {

    val orgData = zkUtil.readDataForPath(offsetPath)
    if (null == orgData) {
      val util = KafkaUtil.getInstance();
      util.init(brokerList, kafkaPort, topic);
      val offsets = util.getLeastOffsets
      val fromOffsets = for (i <- 0 to offsets.size() - 1)
        yield TopicAndPartition.apply(topic, i) -> offsets.get(i).toLong
      return fromOffsets.toMap
    }

    val data = JSON.parseFull(orgData).get.asInstanceOf[Map[String, String]]
    val fromOffsets = data.map(item => {
      TopicAndPartition.apply(topic, item._1.toInt) -> item._2.toLong
    })
    return fromOffsets

  }

  /**
   * @desc write offset data to Zookeeper
   */
  def writeOffsetData(zkUtil: CuratorUtil, offsetPath: String, data: Array[(Int, Long)]): Unit = {

    val map = data.toMap[Int, Long].map(item => {
      item._1.toString() -> item._2.toString()
    })
    zkUtil.setDataForPath(offsetPath, JSONObject(map).toString)

  }
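
  For reference, a self-contained sketch of the offset round trip that writeOffsetData and readOffsetData perform through ZooKeeper, using the same scala.util.parsing.json helpers; the partition and offset values are made up.

    import scala.util.parsing.json.{ JSON, JSONObject }

    // Offsets as collected from HasOffsetRanges: (partition, fromOffset).
    val offsets: Array[(Int, Long)] = Array((0, 120L), (1, 98L))

    // Serialize the way writeOffsetData does: partition -> offset, both as strings.
    val json = JSONObject(offsets.toMap.map { case (p, o) => p.toString -> o.toString }).toString
    // json is e.g. {"0" : "120", "1" : "98"}, which is what lands in the ZK node.

    // Parse it back the way readOffsetData does once the node already exists.
    val parsed = JSON.parseFull(json).get.asInstanceOf[Map[String, String]]
    val fromOffsets = parsed.map { case (p, o) => p.toInt -> o.toLong }
    // fromOffsets: Map(0 -> 120, 1 -> 98), ready to build TopicAndPartition keys.

  Storing both keys and values as strings keeps the JSON flat, which is why readOffsetData can cast the parsed result directly to Map[String, String].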