HBase Development Based on MapReduce

阿新 • Published: 2019-01-18

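In pseudo-distributed and fully distributed modes, HBase runs on top of HDFS, so the MapReduce programming framework can be combined with HBase directly. In other words, HBase serves as the underlying storage layer, and MapReduce jobs read from and write to HBase to do their processing. This combines the strengths of HBase as a large distributed database with those of MapReduce as a parallel computing framework.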

HBase classes that correspond to the MapReduce classes:

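1) InputFormat: HBase implements the TableInputFormatBase class, which provides most of the operations on table data. Its subclass TableInputFormat supplies the complete implementation, processing table data and generating key-value pairs. TableInputFormat divides a table into splits by Region, that is, there are as many splits as there are Regions. Each Region is then turned into <key,value> pairs by row key: the key is the row key and the value is the data contained in that row.
2) Mapper and Reducer: HBase implements the TableMapper and TableReducer classes. TableMapper has no functionality of its own; it only fixes the types of the input <key,value> pair to ImmutableBytesWritable (the row key) and Result (the row data), respectively. IdentityTableMapper and IdentityTableReducer are concrete implementations of these two classes. Like the base Mapper and Reducer classes, they simply pass the <key,value> pairs on to the next stage.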

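3) OutputFormat: HBase's TableOutputFormat writes the output <key,value> pairs to the specified HBase table. This class has no setting of its own for the WAL (Write-Ahead Log), and writes that skip the WAL risk data loss if a server fails. The MultiTableOutputFormat class can be used to address this, since it allows configuring whether the WAL is written.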
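To make the Region-to-split rule in item 1) concrete, here is a minimal plain-Java sketch. It does not call the HBase API: the class RegionSplitSketch, the method splitByRegion, and the sample keys are hypothetical names introduced only for illustration. It assumes each Region is identified by its start key (the first Region's start key being empty) and assigns every row to the Region with the greatest start key not exceeding the row key. Real HBase compares row keys as raw bytes; String ordering is used here as a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation (NOT the HBase API): one split per Region,
// each split holding the <rowKey,value> pairs that fall inside that Region.
public class RegionSplitSketch {

    // regionStartKeys must include "" so that every row key has a Region.
    public static TreeMap<String, List<String>> splitByRegion(
            List<String> regionStartKeys, TreeMap<String, String> rows) {
        TreeMap<String, List<String>> splits = new TreeMap<>();
        for (String start : regionStartKeys) {
            splits.put(start, new ArrayList<>());   // one split per Region
        }
        for (Map.Entry<String, String> row : rows.entrySet()) {
            // Greatest Region start key that is <= the row key
            String regionStart = splits.floorKey(row.getKey());
            splits.get(regionStart).add(row.getKey() + "=" + row.getValue());
        }
        return splits;
    }

    public static void main(String[] args) {
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put("apple", "3");
        rows.put("hbase", "7");
        rows.put("zebra", "1");

        // Two hypothetical Regions: ["", "h") and ["h", end of table)
        TreeMap<String, List<String>> splits =
                splitByRegion(Arrays.asList("", "h"), rows);
        for (Map.Entry<String, List<String>> split : splits.entrySet()) {
            System.out.println("split starting at \"" + split.getKey()
                    + "\": " + split.getValue());
        }
    }
}
```

Each split would be handed to one map task, which is why the number of Regions bounds the map-side parallelism when a table is used as job input.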

Code:

import java.io.IOException; 
import java.util.Iterator; 
import java.util.StringTokenizer; 
 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.hbase.HBaseConfiguration; 
import org.apache.hadoop.hbase.HColumnDescriptor; 
import org.apache.hadoop.hbase.HTableDescriptor; 
import org.apache.hadoop.hbase.client.HBaseAdmin; 
import org.apache.hadoop.hbase.client.Put; 
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat; 
import org.apache.hadoop.hbase.mapreduce.TableReducer; 
import org.apache.hadoop.hbase.util.Bytes; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.io.NullWritable; 
import org.apache.hadoop.mapreduce.Job; 
import org.apache.hadoop.mapreduce.Mapper; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 
 
public class WordCountHBase { 
 
  // Map class: emits (word, 1) for each token of an input line
  public static class Map extends 
      Mapper<LongWritable, Text, Text, IntWritable> { 
    private final static IntWritable one = new IntWritable(1); 
    private Text word = new Text(); 
 
    public void map(LongWritable key, Text value, Context context) 
        throws IOException, InterruptedException { 
      StringTokenizer itr = new StringTokenizer(value.toString()); 
      while (itr.hasMoreTokens()) { 
        word.set(itr.nextToken()); 
        context.write(word, one); 
      } 
    } 
  } 
 
  // Reduce class: sums the counts of each word and writes the total to HBase
  public static class Reduce extends 
      TableReducer<Text, IntWritable, NullWritable> { 
 
    public void reduce(Text key, Iterable<IntWritable> values, 
        Context context) throws IOException, InterruptedException { 
 
      int sum = 0; 
 
      Iterator<IntWritable> iterator = values.iterator(); 
      while (iterator.hasNext()) { 
        sum += iterator.next().get(); 
      } 
 
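      // Create a Put; each word is stored in its own row, keyed by the word
      Put put = new Put(Bytes.toBytes(key.toString()));
      // Column family "content", column qualifier "count", value = the count (stored as text)
      put.add(Bytes.toBytes("content"), Bytes.toBytes("count"),
          Bytes.toBytes(String.valueOf(sum)));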
 
      context.write(NullWritable.get(), put); 
    } 
  } 
 
  // Create the HBase table, dropping any existing table with the same name
  public static void createHBaseTable(String tableName)
      throws IOException {
    // Table descriptor
    HTableDescriptor htd = new HTableDescriptor(tableName);
    // Column family descriptor
    HColumnDescriptor col = new HColumnDescriptor("content");
    htd.addFamily(col);

    // HBase configuration
    Configuration conf = HBaseConfiguration.create();

    conf.set("hbase.zookeeper.quorum", "master");
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    HBaseAdmin hAdmin = new HBaseAdmin(conf);

    try {
      if (hAdmin.tableExists(tableName)) {
        System.out.println("Table already exists; recreating it.");
        hAdmin.disableTable(tableName);
        hAdmin.deleteTable(tableName);
      }

      System.out.println("Creating table: " + tableName);
      hAdmin.createTable(htd);
    } finally {
      // Release the connection held by the admin client
      hAdmin.close();
    }
  }
 
  public static void main(String[] args) throws Exception {
    String tableName = "wordcount";
    // Step 1: create the table
    WordCountHBase.createHBaseTable(tableName);

    // Step 2: run the MapReduce job
    // Configure MapReduce
    Configuration conf = new Configuration();
    // These settings are essential: JobTracker address, ZooKeeper quorum and port, output table
    conf.set("mapred.job.tracker", "master:9001");
    conf.set("hbase.zookeeper.quorum", "master");
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    conf.set(TableOutputFormat.OUTPUT_TABLE, tableName);

    Job job = new Job(conf, "New Word Count");
    job.setJarByClass(WordCountHBase.class);

    // Set the Map and Reduce classes
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    // Set the map output types
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    // Set the input and output formats
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TableOutputFormat.class);

    // Set the input directory
    FileInputFormat.addInputPath(job, new Path("hdfs://master:9000/in/"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
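The data flow of the job above can be traced without a cluster. The following is a plain-Java sketch, not part of the original program: WordCountFlowSketch and its methods are hypothetical names, and the map and reduce phases are collapsed into a single pass. It also shows a consequence of the reducer calling Bytes.toBytes(String.valueOf(sum)): the cell holds the decimal text of the count, so a reader has to parse it as a string. The 4-byte variant is included only for contrast, on the assumption that an int-based encoding would be a 4-byte big-endian value.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java trace of the word-count flow (no Hadoop or HBase classes involved).
public class WordCountFlowSketch {

    // Tokenizes like the Map class and sums like the Reduce class.
    public static Map<String, Integer> countWords(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // What the reducer stores in content:count: the count as UTF-8 text.
    public static byte[] encodeAsText(int sum) {
        return String.valueOf(sum).getBytes(StandardCharsets.UTF_8);
    }

    // For contrast only: a fixed-width 4-byte big-endian encoding of the same count.
    public static byte[] encodeAsInt(int sum) {
        return ByteBuffer.allocate(4).putInt(sum).array();
    }

    // How a reader would turn a text-encoded cell back into a number.
    public static int decodeText(byte[] cell) {
        return Integer.parseInt(new String(cell, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("hello hbase", "hello mapreduce");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byte[] cell = encodeAsText(e.getValue());
            System.out.println("row=" + e.getKey()
                    + " content:count=" + decodeText(cell)
                    + " (" + cell.length + " byte(s) as text, "
                    + encodeAsInt(e.getValue()).length + " as int)");
        }
    }
}
```

Storing the count as text keeps the values readable in the HBase shell, at the cost of needing a parse step when another job consumes the table.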

Common errors and their fixes:

1. java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.TableOutputFormat

Excerpt of the error output:

13/09/10 21:14:01 INFO mapred.JobClient: Running job: job_201308101437_0016
13/09/10 21:14:02 INFO mapred.JobClient:  map 0% reduce 0%
13/09/10 21:14:16 INFO mapred.JobClient: Task Id : attempt_201308101437_0016_m_000007_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.TableOutputFormat
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:849)
	at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:235)
	at org.apache.hadoop.mapred.Task.initialize(Task.java:513)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.TableOutputFormat
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:249)
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:802)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:847)
	... 8 more

Cause:

The required HBase class files are not on the classpath of the Hadoop cluster.

Fix:

Step 1. Stop the HBase database:

$ stop-hbase.sh
stopping hbase............
master: stopping zookeeper.
$ jps
16186 Jps
26186 DataNode
26443 TaskTracker
26331 JobTracker
26063 NameNode

Stop the Hadoop cluster:

$ stop-all.sh
Warning: $HADOOP_HOME is deprecated.

stopping jobtracker
master: Warning: $HADOOP_HOME is deprecated.
master: 
master: stopping tasktracker
node1: Warning: $HADOOP_HOME is deprecated.
node1: 
node1: stopping tasktracker
stopping namenode
master: Warning: $HADOOP_HOME is deprecated.
master: 
master: stopping datanode
node1: Warning: $HADOOP_HOME is deprecated.
node1: stopping datanode
node1: 
node1: Warning: $HADOOP_HOME is deprecated.
node1: 
node1: stopping secondarynamenode
$ jps
16531 Jps

Step 2. On every machine in the Hadoop cluster, find the hadoop-env.sh file in the conf subdirectory of the Hadoop directory and add the following:

# set hbase environment
export HBASE_HOME=/opt/modules/hadoop/hbase/hbase-0.94.11-security
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.94.11-security.jar:$HBASE_HOME/hbase-0.94.11-security-tests.jar:$HBASE_HOME/conf:$HBASE_HOME/lib/zookeeper-3.4.5.jar

Step 3. Restart the cluster and the HBase database.

2. Error: java.lang.ClassNotFoundException: com.google.protobuf.Message

Excerpt of the error output:

2013-09-12 12:38:57,833 INFO  mapred.JobClient (JobClient.java:monitorAndPrintJob(1363)) -  map 0% reduce 0%
2013-09-12 12:39:12,490 INFO  mapred.JobClient (JobClient.java:monitorAndPrintJob(1392)) - Task Id : attempt_201309121232_0001_m_000007_0, Status : FAILED
Error: java.lang.ClassNotFoundException: com.google.protobuf.Message
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

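Cause:

The protobuf-java-2.4.0a.jar package was clearly not found. Add the path of this jar to hadoop-env.sh, alongside the HADOOP_CLASSPATH entries configured for the first error.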
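Both errors come down to a class being absent from the classpath, so a quick check before resubmitting the job can save a round trip. Below is a small diagnostic sketch using only the standard library; ClasspathCheck and its missing method are hypothetical names, not part of Hadoop or HBase. It only reports on the classpath of the JVM it runs in, so it is meaningful only when launched with the same classpath the cluster nodes use for their tasks, which is an assumption about your setup.

```java
import java.util.ArrayList;
import java.util.List;

// Diagnostic sketch: reports which of the given classes cannot be loaded
// from the classpath of the JVM that runs this program.
public class ClasspathCheck {

    public static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: locate the class without running its static initializers
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // The two classes named in the errors above
        List<String> notFound = missing(
                "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
                "com.google.protobuf.Message");
        if (notFound.isEmpty()) {
            System.out.println("All required classes are on the classpath.");
        } else {
            for (String name : notFound) {
                System.out.println("Missing from classpath: " + name);
            }
        }
    }
}
```

If a class is still reported missing after editing hadoop-env.sh, the usual suspects are a jar path that does not exist on that particular node or a cluster that was not restarted after the change.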
