HBase and Hadoop Integration: Data Migration

阿新 • Published: 2020-04-04

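The previous article covered the basic architecture of HBase, which was mostly theory. I have been working on HBase data migration lately, so today's post is a hands-on one that goes back over the basics of integrating HBase with Hadoop. When I started out I had trouble finding my bearings, so writing it down also makes it easier to look things up later.

Until now I mostly used HBase as a real-time database for queries, mainly through its basic API, and had never worked systematically on integrating HBase with Hadoop and Hive. Without further ado, here is the structure of this article:

[Figure: structure diagram of this article]

PS: the code mentioned in this article is linked in the reference material at the end.

The focus is on the first two parts; the later parts will be familiar to most readers.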

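1 Integrating HBase with Hadoop

Operations between HBase and Hadoop fall into three main cases:

Importing the data of one HBase table into another HBase table
Importing HDFS data into an HBase table
Importing HDFS data (very large data sets) into an HBase table

All three kinds of migration rely on MapReduce (MR) programs, so the focus comes back to MR programming.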

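01 Importing data from an HBase table

Approach: write an MR program that reads one HBase table and writes it into another HBase table.

Note: the column families involved must be the same in both tables; when reading from an HBase table that holds data, check that the Put is not empty before writing it out.

Preparation:

Create table user1 with column family f1, which holds the columns age and name, as the input table;

Create table user2 with column family f1 as the output table.

Main code:

Mapper side: note that it extends TableMapper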

import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class HBaseReadMapper extends TableMapper<Text,Put> {
    /**
     * @param key     row key
     * @param value   the data of this row, as a Result
     * @param context job context used to emit the output
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {
        // Row key bytes of the current row
        byte[] rowkeyBytes = key.copyBytes();
        // Put object that is passed downstream
        Put put = new Put(rowkeyBytes);
        // The row key as Text is the map output key
        Text text = new Text(Bytes.toString(rowkeyBytes));
        // Copy the cells of family f1 (name and age) into the Put;
        // the output table user2 only has f1, so other families are skipped
        for (Cell cell : value.rawCells()) {
            String familyStr = Bytes.toString(CellUtil.cloneFamily(cell));
            if ("f1".equals(familyStr)) {
                put.add(cell);
            }
        }
        // Writing an empty Put raises an error, so check first
        if (!put.isEmpty()) {
            context.write(text, put);
        }
    }
}

Reducer side, using TableReducer:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class HbaseWriteReducer extends TableReducer<Text,Put,ImmutableBytesWritable> {

    /**
     * Writes out the Puts passed over from the map side.
     * @param key     row key as Text
     * @param values  Puts for this row key
     * @param context job context used to emit the output
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<Put> values, Context context) throws IOException, InterruptedException {
        // Wrap the row key as the output key
        ImmutableBytesWritable rowkey = new ImmutableBytesWritable(Bytes.toBytes(key.toString()));
        for (Put value : values) {
            context.write(rowkey, value);
        }
    }
}

Launcher class, which writes the age and name values under column family f1 of user1 into user2:

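import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Hbase2HbaseMR extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        // ZooKeeper quorum used by HBase
        configuration.set("hbase.zookeeper.quorum","hadoop102:2181,hadoop103:2181,hadoop104:2181");
        int run = ToolRunner.run(configuration, new Hbase2HbaseMR(), args);
        System.exit(run);
    }

    @Override
    public int run(String[] strings) throws Exception {
        Job job = Job.getInstance(super.getConf());
        job.setJarByClass(Hbase2HbaseMR.class);
        // Mapper: full scan of the input table user1
        TableMapReduceUtil.initTableMapperJob(TableName.valueOf("user1"),new Scan(), HBaseReadMapper.class,Text.class,Put.class,job);
        // Reducer: write into the output table user2
        TableMapReduceUtil.initTableReducerJob("user2",HbaseWriteReducer.class,job);
        boolean b = job.waitForCompletion(true);

        return b?0:1;
    }
}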

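02 Importing HDFS data into HBase

Approach: write an MR program that writes HDFS data into an HBase table.

Note:

The Mapper reads from HDFS, while the Reducer writes to HBase;

The format of the HDFS data must match the layout of the HBase table.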
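To make the second note concrete, here is a minimal sketch in plain Java (no Hadoop or HBase dependencies) of checking that an input line matches the expected layout before any Put is built. The class name UserRecord and its fields are hypothetical and only for illustration; the layout assumed is the one used in this section: row key, name and age separated by tabs.

```java
import java.util.Optional;

/** Hypothetical holder for one input line: row key plus the values for f1:name and f1:age. */
final class UserRecord {
    final String rowKey;
    final String name;
    final String age;

    UserRecord(String rowKey, String name, String age) {
        this.rowKey = rowKey;
        this.name = name;
        this.age = age;
    }

    /** Parses a tab-separated line; returns empty unless it has exactly three fields and a non-empty row key. */
    static Optional<UserRecord> parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 3 || fields[0].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new UserRecord(fields[0], fields[1], fields[2]));
    }

    public static void main(String[] args) {
        // A well-formed line: three tab-separated fields
        System.out.println(UserRecord.parse("07\tzhangsan\t18").isPresent()); // prints "true"
        // A line separated by spaces does not match the layout
        System.out.println(UserRecord.parse("08 lisi 25").isPresent());       // prints "false"
    }
}
```

A check like this could sit at the top of the map or reduce method so that malformed lines are skipped instead of failing the task with an array index error.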

Preparation:

Put the input data on HDFS;

Create table user2 with column family f1 as the output table.

Main code:

Mapper side, using a regular Mapper

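import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HdfsMapper extends Mapper<LongWritable,Text,Text,NullWritable>{

    /**
     * HDFS -- HBase
     *
     * @param key     byte offset of the line in the input file
     * @param value   one line of input
     * @param context job context used to emit the output
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Pass the line through unchanged
        context.write(value,NullWritable.get());
    }
}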

Reducer side, using TableReducer:

public static class HBASEReducer extends TableReducer<Text,NullWritable,ImmutableBytesWritable>{
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        /*
         * key --> one line of input
         * Sample data (fields are separated by tabs):
         * 07 zhangsan 18
         * 08 lisi 25
         * 09 wangwu 20
         */
        // Split the line on tabs
        String[] split = key.toString().split("\t");
        // Skip lines that do not contain rowkey, name and age
        if (split.length < 3) {
            return;
        }
        byte[] rowkey = Bytes.toBytes(split[0]);
        // Build the Put object
        Put put = new Put(rowkey);
        put.addColumn(Bytes.toBytes("f1"),Bytes.toBytes("name"),Bytes.toBytes(split[1]));
        put.addColumn(Bytes.toBytes("f1"),Bytes.toBytes("age"),Bytes.toBytes(split[2]));
        context.write(new ImmutableBytesWritable(rowkey),put);
    }
}

Launcher class:

public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    Configuration configuration = HBaseConfiguration.create();
    // ZooKeeper quorum used by HBase
    configuration.set("hbase.zookeeper.quorum","hadoop102:2181,hadoop103:2181,hadoop104:2181");
    Job job = Job.getInstance(configuration);
    job.setJarByClass(Hdfs2HbaseMR.class);
    // Input file path
    FileInputFormat.addInputPath(job,new Path("hdfs://hadoop102:9000/hbase/input"));
    job.setMapperClass(HdfsMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(NullWritable.class);
    // Name of the HBase table to write to
    TableMapReduceUtil.initTableReducerJob("user2",HBASEReducer.class,job);
    // Number of reduce tasks
    job.setNumReduceTasks(1);
    boolean b = job.waitForCompletion(true);
    System.exit(b?0:1);
}

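03 Importing large HDFS data sets into HBase

Approach: unlike case 02, the data volume here is large. A regular MR job may take a very long time and hold on to resources the whole way through.

Instead, we can first convert the files stored on HDFS into HFiles, the file format HBase uses for its underlying storage, and then hand the converted HFiles over to the target HBase table. This is the bulk load way of loading data in batches. The rough flow is: an MR job turns the HDFS files into HFiles, and a load step then attaches those HFiles to the table.

Note:

Because this is a file format conversion with no computation, you only need to write the Mapper and no Reducer of your own (HFileOutputFormat2.configureIncrementalLoad sets up the sorting reducer and the partitioner for the job);

After the conversion, the HFiles still have to be mapped to (loaded into) the HBase table.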
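One reason the job still needs a sort phase even though you write no Reducer: an HFile keeps its cells sorted by row key, and HBase compares row keys byte by byte as unsigned values rather than as numbers. The sketch below uses only the JDK (Arrays.compareUnsigned requires Java 9 or later) to show what that order looks like for a few string row keys. The class name RowKeyOrder is hypothetical and the code is an illustration of the ordering, not part of the bulk load job.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that sorts string row keys in unsigned byte order. */
final class RowKeyOrder {

    /** Returns the keys sorted byte by byte, treating each byte as unsigned. */
    static List<String> sortAsRowKeys(List<String> keys) {
        List<byte[]> raw = new ArrayList<>();
        for (String key : keys) {
            raw.add(key.getBytes(StandardCharsets.UTF_8));
        }
        // Lexicographic comparison on unsigned bytes (Java 9+)
        raw.sort((a, b) -> Arrays.compareUnsigned(a, b));
        List<String> sorted = new ArrayList<>();
        for (byte[] bytes : raw) {
            sorted.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return sorted;
    }

    public static void main(String[] args) {
        // "10" sorts before "9" because the byte '1' is smaller than the byte '9'
        System.out.println(sortAsRowKeys(Arrays.asList("9", "10", "07", "08"))); // prints "[07, 08, 10, 9]"
    }
}
```

This is also why numeric row keys are often zero-padded to a fixed width, as in the sample data (07, 08, 09): padded keys sort in the same order as the numbers they represent.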

Preparation:

Put the input data on HDFS;

Create table user2 with column family f1 as the output table.

Main code:

Mapper side, using a regular Mapper

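import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Hdfs2HFileMapper extends Mapper<LongWritable,Text,ImmutableBytesWritable,Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each line is tab-separated: rowkey, name, age
        String[] split = value.toString().split("\t");
        byte[] rowkey = Bytes.toBytes(split[0]);
        // Wrap the fields in a Put
        Put put = new Put(rowkey);
        put.addColumn(Bytes.toBytes("f1"),Bytes.toBytes("name"),Bytes.toBytes(split[1]));
        put.addColumn(Bytes.toBytes("f1"),Bytes.toBytes("age"),Bytes.toBytes(split[2]));
        // Emit the Put, keyed by the row key wrapped in an ImmutableBytesWritable
        context.write(new ImmutableBytesWritable(rowkey),put);
    }
}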

Launcher class:

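import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Writes HDFS files out in HFile format.
 */
public class Hdfs2HileOut extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum","hadoop102:2181,hadoop103:2181,hadoop104:2181");
        int run = ToolRunner.run(configuration, new Hdfs2HileOut(), args);
        System.exit(run);
    }

    @Override
    public int run(String[] strings) throws Exception {
        Configuration conf = super.getConf();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Hdfs2HileOut.class);
        FileInputFormat.addInputPath(job,new Path("hdfs://hadoop102:9000/hbase/input"));

        job.setMapperClass(Hdfs2HFileMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        TableName tableName = TableName.valueOf("user2");
        // try-with-resources closes the connection, table and locator when the job is done
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(tableName);
             RegionLocator locator = connection.getRegionLocator(tableName)) {
            // Configure the job to produce HFiles that fit the regions of user2 (incremental load)
            HFileOutputFormat2.configureIncrementalLoad(job,table,locator);
            // The data is written back to HDFS as HFiles, so the output format is HFileOutputFormat2
            job.setOutputFormatClass(HFileOutputFormat2.class);
            // Output path of the HFiles; it is the input of the load step that maps them to the table
            HFileOutputFormat2.setOutputPath(job,new Path("hdfs://hadoop102:9000/hbase/out_hfile2"));
            // Run the job
            boolean b = job.waitForCompletion(true);
            return b? 0: 1;
        }
    }
}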

Loader class:

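import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
public class LoadHFile2Hbase {
    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum","hadoop102:2181,hadoop103:2181,hadoop104:2181");
        TableName tableName = TableName.valueOf("user2");
        // Open the connection; try-with-resources closes everything afterwards
        try (Connection connection = ConnectionFactory.createConnection(configuration);
             Admin admin = connection.getAdmin();
             Table table = connection.getTable(tableName);
             RegionLocator locator = connection.getRegionLocator(tableName)) {
            // Load the HFiles written by the previous step and map them to the table
            new LoadIncrementalHFiles(configuration).doBulkLoad(new Path("hdfs://hadoop102:9000/hbase/out_hfile2"),admin,table,locator);
        }
    }
}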

With that, the migration of HDFS data into HBase is complete.

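2 Integrating HBase with Hive

Data migration between HBase and Hive comes in two forms:

Hive table results ---> HBase table
HBase table data ---> Hive table

This part needs no program code; everything can be done from the Hive and HBase clients.

01 Preparation

1 First, copy the following five HBase jars into Hive's lib directory; symbolic links are recommended: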

ln -s /home/hadoop/module/hbase-1.2.0-cdh5.14.2/lib/hbase-client-1.2.0-cdh5.14.2.jar /home/hadoop/module/hive-1.1.0-cdh5.14.2/lib/hbase-client-1.2.0-cdh5.14.2.jar
ln -s /home/hadoop/module/hbase-1.2.0-cdh5.14.2/lib/hbase-hadoop2-compat-1.2.0-cdh5.14.2.jar /home/hadoop/module/hive-1.1.0-cdh5.14.2/lib/hbase-hadoop2-compat-1.2.0-cdh5.14.2.jar
ln -s /home/hadoop/module/hbase-1.2.0-cdh5.14.2/lib/hbase-hadoop-compat-1.2.0-cdh5.14.2.jar /home/hadoop/module/hive-1.1.0-cdh5.14.2/lib/hbase-hadoop-compat-1.2.0-cdh5.14.2.jar
ln -s /home/hadoop/module/hbase-1.2.0-cdh5.14.2/lib/hbase-it-1.2.0-cdh5.14.2.jar /home/hadoop/module/hive-1.1.0-cdh5.14.2/lib/hbase-it-1.2.0-cdh5.14.2.jar
ln -s /home/hadoop/module/hbase-1.2.0-cdh5.14.2/lib/hbase-server-1.2.0-cdh5.14.2.jar /home/hadoop/module/hive-1.1.0-cdh5.14.2/lib/hbase-server-1.2.0-cdh5.14.2.jar

2 Edit Hive's configuration file hive-site.xml and add your own ZooKeeper information:

<property>
  <name>hive.zookeeper.quorum</name>
  <value>hadoop102,hadoop103,hadoop104</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hadoop102,hadoop103,hadoop104</value>
</property>

3 Edit Hive's configuration file hive-env.sh and add the following (adjust the paths to your own installation):

export HADOOP_HOME=/kkb/install/hadoop-2.6.0-cdh5.14.2/
export HBASE_HOME=/kkb/install/servers/hbase-1.2.0-cdh5.14.2
export HIVE_CONF_DIR=/kkb/install/hive-1.1.0-cdh5.14.2/conf

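The preparation is now complete.

02 Importing a Hive table into HBase

Create a managed (internal) table in Hive that is mapped to an HBase table; data written to the Hive managed table is then stored in the HBase table. The statement is: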

create table course.hbase_score(id int,cname string,score int) 
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
with serdeproperties("hbase.columns.mapping" = "cf:name,cf:score") 
tblproperties("hbase.table.name" = "hbase_score");

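As the statement shows, hbase.table.name is the name of the HBase table, and hbase.columns.mapping lists the column-family:qualifier targets of the Hive columns. The mapping here has no explicit :key entry, so the Hive column id is stored as the row key of the HBase table.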
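The pairing between Hive columns and mapping entries is purely positional, which is easy to get wrong. The following plain-Java sketch models that pairing for the table above. It is an illustration only, not Hive's implementation; the class name ColumnMappingSketch is hypothetical, and the rule that a missing :key entry makes the first Hive column the row key is coded here as an assumption matching the behaviour described in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of how Hive columns line up with hbase.columns.mapping entries. */
final class ColumnMappingSketch {

    /** Pairs each Hive column with its HBase target, in declaration order. */
    static Map<String, String> pair(List<String> hiveColumns, String mapping) {
        List<String> targets = new ArrayList<>(Arrays.asList(mapping.split(",")));
        // Assumption: without an explicit ":key" entry, the first Hive column becomes the row key
        if (!targets.contains(":key")) {
            targets.add(0, ":key");
        }
        if (targets.size() != hiveColumns.size()) {
            throw new IllegalArgumentException("mapping has " + targets.size()
                    + " entries but the table has " + hiveColumns.size() + " columns");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < hiveColumns.size(); i++) {
            result.put(hiveColumns.get(i), targets.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Columns and mapping taken from the create table statement for course.hbase_score
        System.out.println(pair(Arrays.asList("id", "cname", "score"), "cf:name,cf:score"));
        // prints "{id=:key, cname=cf:name, score=cf:score}"
    }
}
```

Writing the mapping with an explicit :key entry, as the external table in the next subsection does, makes the pairing visible in the statement itself.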

Inserting query results into this managed table completes the import:

insert overwrite table course.hbase_score select id,cname,score from course.score;

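Finally, check the HBase table and the data is there.

03 Importing an HBase table into Hive

Mapping HBase data to a Hive table is simpler; just create a Hive external table: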

CREATE external TABLE hbase2hive(id int, name string, score int) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name,cf:score") 
TBLPROPERTIES("hbase.table.name" ="hbase_hive_score");

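As the statement shows, hbase.table.name is the name of the HBase table, the entries of hbase.columns.mapping correspond in order to the columns of the Hive table, and the Hive column id takes its value from the row key of the HBase table (:key).

With that, data migration between HBase and Hive is complete.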

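3 HBase coprocessors and the basic API

The basic API is covered in detail in the code, so here we only briefly discuss HBase coprocessors.

Coprocessors were introduced to address limitations of early HBase versions, such as building secondary indexes, complex filters, SQL-like operations like sum, count and grouped count, and access control.

HBase provides two kinds of coprocessor:

observer: similar to a database trigger; I personally think of it as an interceptor;
endpoint: similar to a database stored procedure; it can implement SQL-like aggregate operations.

Ways to load a coprocessor

01 Static loading

Static loading is done by editing hbase-site.xml, for example to enable aggregation globally so that it can operate on the data of all tables. Just add the following configuration to hbase-site.xml; after changing it, the HBase cluster must be restarted.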

<property>
  <name>hbase.coprocessor.user.region.classes</name>
  <value>org.apache.hadoop.hbase.coprocessor.AggregateImplementation</value>
</property>

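This loads one coprocessor class for all tables; to load several classes, separate them with commas (",").

02 Dynamic loading

Enable aggregation at table level, so that it only takes effect for the specific table.

The steps below use an observer coprocessor as a simple example:

1 Create two HBase tables, user1 and user2: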

create 'user1','info'
create 'user2','info'

2 Write the coprocessor so that when data is put into user1, it is first put into user2 as well. The code:

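import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

public class MyProcessor extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) throws IOException {
        // A Put may carry several cells (versions) for info:name
        List<Cell> cells = put.get("info".getBytes(), "name".getBytes());
        if (cells.isEmpty()) {
            return; // no name column in this Put, nothing to copy
        }
        // Copy the name cell written to user1 into user2 as well
        Put put1 = new Put(put.getRow());
        put1.add(cells.get(0));
        // Get a connection; try-with-resources closes the table and the connection
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum","hadoop102:2181,hadoop103:2181,hadoop104:2181");
        try (Connection connection = ConnectionFactory.createConnection(configuration);
             Table table = connection.getTable(TableName.valueOf("user2"))) {
            table.put(put1);
        }
    }
}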

3 Package the project and upload the jar to HDFS. The path is up to you; suppose it is:

hdfs://hadoop102:9000/processor/processor.jar

4 Attach the jar to the user1 table:

disable 'user1'
alter 'user1',METHOD => 'table_att','coprocessor'=>'hdfs://hadoop102:9000/processor/processor.jar|com.bigdata.comprocessor.MyProcessor|1001|'
enable 'user1'

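com.bigdata.comprocessor.MyProcessor: the fully qualified class name of your coprocessor;

1001: the priority of the coprocessor, an integer of your own choosing; give each coprocessor on the table a different value.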
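The attribute value passed to alter is a single string of four fields separated by "|": jar path, class name, priority and optional arguments (empty in this article, which is why the string ends with a bare "|"). As a small illustration in plain Java, the hypothetical helper below assembles that string from its parts; it is only a sketch for checking the format and does not call any HBase API.

```java
/** Hypothetical helper that assembles the table coprocessor attribute value. */
final class CoprocessorSpec {

    /** Joins jar path, class name, priority and arguments with "|" in that order. */
    static String build(String jarPath, String className, int priority, String args) {
        return String.join("|", jarPath, className, Integer.toString(priority), args);
    }

    public static void main(String[] args) {
        String spec = build(
                "hdfs://hadoop102:9000/processor/processor.jar",
                "com.bigdata.comprocessor.MyProcessor",
                1001,
                ""); // no arguments, so the value ends with "|"
        System.out.println(spec);
        // prints "hdfs://hadoop102:9000/processor/processor.jar|com.bigdata.comprocessor.MyProcessor|1001|"
    }
}
```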

5 Test: put data into user1 and check whether user2 receives it as well:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.junit.Test;

public class TestObserver {

    @Test
    public void testPut() throws IOException {
        // Build the configuration
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "hadoop102:2181,hadoop103:2181,hadoop104:2181");
        // Create the connection object
        Connection connection = ConnectionFactory.createConnection(configuration);
        Table proc1 = connection.getTable(TableName.valueOf("user1"));
        Put put = new Put("1110001112".getBytes());

        put.addColumn("info".getBytes(),"name".getBytes(),"hello".getBytes());
        put.addColumn("info".getBytes(),"gender".getBytes(),"male".getBytes());
        put.addColumn("info".getBytes(),"nationality".getBytes(),"test".getBytes());
        proc1.put(put);
        proc1.close();
        connection.close();
        System.out.println("success");
    }
}

To unload the coprocessor:

disable 'user1'
alter 'user1',METHOD=>'table_att_unset',NAME=>'coprocessor$1'
enable 'user1'

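That is the rough development flow for an observer coprocessor. The basic API examples are in the project listed under the reference material.

One question remains: the HBase endpoint coprocessor. The problem it solves is implementing SQL functions such as min, max, avg, sum, distinct and group by. We will leave that for the next post, which introduces Phoenix, a framework built on top of HBase. Phoenix is to HBase what Hive is to Hadoop, and it provides SQL queries over HBase.

Project code: https://github.com/fanpengyi/hbase-api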

-- THE END --
