A Real-Time Clickstream Case Based on Kafka + Spark Streaming + HBase
Published: 2018-12-15
Background
Kafka ingests data in real time from collection tools such as Flume or from real-time business-system interfaces, and acts as a message buffer that provides a reliable data feed to the downstream real-time computing framework. Since version 1.3, Spark has supported two mechanisms for integrating with Kafka (the Receiver-based Approach and the Direct Approach); see the official documentation linked at the end of this article for details. HBase is used for storage.
Implementation Approach
Implement a Kafka message producer simulator
Have Spark Streaming consume the Kafka data in real time using the Direct Approach
Have Spark Streaming perform the business computation and write the results to HBase (the target table must exist beforehand; a minimal creation sketch follows this list)
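Below is a minimal sketch of creating that table with the HBase 0.96 client API, assuming the table name "PageViewStream" and column family "Stat" used by the Spark Streaming code further down; the same can of course be done from the HBase shell instead.

package clickstream

import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}

// One-off helper: create the target table before starting the streaming job.
object CreateClickTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "hadoop1")            // single-node HBase on hadoop1
    conf.set("hbase.zookeeper.property.clientPort", "2181")

    val admin = new HBaseAdmin(conf)
    if (!admin.tableExists("PageViewStream")) {
      val desc = new HTableDescriptor(TableName.valueOf("PageViewStream"))
      desc.addFamily(new HColumnDescriptor("Stat"))          // column family written by the job
      admin.createTable(desc)
    }
    admin.close()
  }
}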
Local VM Cluster Environment
Because the author's machine has limited resources, the Hadoop, ZooKeeper, and Kafka clusters are co-located on hosts named hadoop1, hadoop2, and hadoop3; HBase runs as a single node on hadoop1.
Known Limitations
Owing to the author's limited experience, the code has some design flaws; in particular, the logic that stores the Spark Streaming results in HBase performs poorly. Suggestions for improvement are welcome so the post can be corrected promptly.
Code
Kafka message simulator
package clickstream

import java.util.{Properties, Random, UUID}

import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.codehaus.jettison.json.JSONObject

/**
 * Created by 郭飛 on 2016/5/31.
 */
object KafkaMessageGenerator {

  private val random = new Random()
  private var pointer = -1
  private val os_type = Array(
    "Android", "IPhone OS", "None", "Windows Phone")

  def click(): Double = {
    random.nextInt(10)
  }

  def getOsType(): String = {
    pointer = pointer + 1
    if (pointer >= os_type.length) {
      pointer = 0
      os_type(pointer)
    } else {
      os_type(pointer)
    }
  }

  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    // Kafka broker list of the local VM cluster
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while (true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", UUID.randomUUID())                         // randomly generated user id
        .put("event_time", System.currentTimeMillis.toString)  // event timestamp
        .put("os_type", getOsType)                             // device type
        .put("click_count", click)                             // click count

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)

      Thread.sleep(200)
    }
  }
}
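For reference, each message pushed to the user_events topic is a small JSON document along these lines (the values shown are illustrative only):

{"uid":"e3a1c7d2-5f44-4d1b-9b6a-0c2f8a1b3c4d","event_time":"1464681600000","os_type":"Android","click_count":7}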
Spark Streaming main class
package clickstream

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by 郭飛 on 2016/5/31.
 */
object PageViewStream {
  def main(args: Array[String]): Unit = {
    var masterUrl = "local[2]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("PageViewStream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    // NOTE: this topic must match the one the producer writes to
    // (the simulator above sends to "user_events")
    val topics = Set("PageViewStream")
    // Kafka broker list of the local VM cluster
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "serializer.class" -> "kafka.serializer.StringEncoder")

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute click counts per user
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          // HBase configuration (created per record here, which is slow; see the improved sketch below)
          val tableName = "PageViewStream"
          val hbaseConf = HBaseConfiguration.create()
          hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")  // ZooKeeper host, not the Kafka port
          hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
          hbaseConf.set("hbase.defaults.for.version.skip", "true")

          // user ID
          val uid = pair._1
          // click count
          val click = pair._2
          // assemble the row
          val put = new Put(Bytes.toBytes(uid))
          put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))

          val StatTable = new HTable(hbaseConf, TableName.valueOf(tableName))
          StatTable.setAutoFlush(false, false)
          // write buffer size
          StatTable.setWriteBufferSize(3 * 1024 * 1024)
          StatTable.put(put)
          // commit
          StatTable.flushCommits()
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
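As admitted under Known Limitations, the write path above builds a new HBaseConfiguration and HTable for every record, which is very slow. A minimal sketch of one common improvement, under the same table and column-family assumptions as above: open the connection once per partition, buffer the Puts, and flush a single batch. Keeping all HBase objects inside the foreachPartition closure also avoids shipping non-serializable objects to the executors (see the serialization error discussed at the end of the article).

// Same imports as PageViewStream above.
userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // Build the HBase connection once per partition instead of once per record.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    hbaseConf.set("hbase.defaults.for.version.skip", "true")

    val statTable = new HTable(hbaseConf, TableName.valueOf("PageViewStream"))
    statTable.setAutoFlush(false, false)
    statTable.setWriteBufferSize(3 * 1024 * 1024)

    // Buffer one Put per record; the client sends them as a single batch on flush.
    partitionOfRecords.foreach { case (uid, click) =>
      val put = new Put(Bytes.toBytes(uid))
      put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
      statTable.put(put)
    }

    statTable.flushCommits()
    statTable.close()
  })
})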
Maven POM file
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.guofei.spark</groupId>
<artifactId>RiskControl</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<name>RiskControl</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<!-- Spark core and streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<!-- Spark-Kafka integration -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<!-- HBase integration -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<version>0.96.2-hadoop2</version>
<type>pom</type>
</dependency>
<!-- HBase dependencies -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>0.96.2-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>0.96.2-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>0.96.2-hadoop2</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>1.3.2</version>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.1.3</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>2.5.0</version>
</dependency>
<dependency>
<groupId>io.netty</groupId>
<artifactId>netty</artifactId>
<version>3.6.6.Final</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>0.96.2-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
<version>3.4.5</version>
</dependency>
<dependency>
<groupId>org.cloudera.htrace</groupId>
<artifactId>htrace-core</artifactId>
<version>2.01</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-jaxrs</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-xc</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.6.4</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.6.4</version>
</dependency>
<!-- Hadoop dependencies -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.4</version>
</dependency>
<dependency>
<groupId>commons-configuration</groupId>
<artifactId>commons-configuration</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>2.6.4</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.4</version>
</dependency>
<dependency>
<groupId>net.sf.json-lib</groupId>
<artifactId>json-lib</artifactId>
<version>2.4</version>
<classifier>jdk15</classifier>
</dependency>
<dependency>
<groupId>org.codehaus.jettison</groupId>
<artifactId>jettison</artifactId>
<version>1.1</version>
</dependency>
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-pool2</artifactId>
<version>2.2</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-make:transitive</arg>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Error when running the Spark Streaming job
org.apache.spark.SparkException: Task not serializable
userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    partitionOfRecords.foreach(
      // every object referenced by the code inside this closure must be serializable,
      // or be created inside the closure itself
    )
  })
})
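In this code the error typically appears when an HBase object is created on the driver (outside foreachPartition) and then referenced inside the closure: Spark must serialize the closure to ship it to the executors, and HTable/HBaseConfiguration are not serializable. A minimal sketch of the problematic pattern, under the same assumptions as the main class:

// Problematic: the HTable is built on the driver and captured by the closure;
// Spark has to serialize the closure for the executors -> "Task not serializable".
val table = new HTable(HBaseConfiguration.create(), TableName.valueOf("PageViewStream"))
userClicks.foreachRDD(rdd =>
  rdd.foreachPartition(_.foreach { case (uid, click) =>
    val put = new Put(Bytes.toBytes(uid))
    put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
    table.put(put)   // "table" comes from the driver, triggering the exception
  }))

// Safe: create the HTable inside foreachPartition (as in the per-partition sketch above),
// so only the serializable (uid, click) pairs cross the driver/executor boundary.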
Error when packaging with Maven: a dependency jar cannot be found
error:not found: object kafka
ERROR import kafka.javaapi.producer.Producer
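The "not found: object kafka" error means the Kafka client classes are not on the compile classpath. spark-streaming-kafka_2.10 should normally pull them in transitively, but a common workaround is to declare the Kafka client explicitly in the POM; the version below is an assumption (0.8.1.1 is what Spark 1.3's Kafka integration was built against) and should be checked against the spark-streaming-kafka POM:

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka_2.10</artifactId>
  <version>0.8.1.1</version>
</dependency>

Note also that the error message refers to kafka.javaapi.producer.Producer, while the simulator above imports kafka.producer.Producer (the Scala API); both live in the same kafka_2.10 jar, but the import must match the class actually used.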