Integrating Hadoop with MongoDB (Hive Edition)

阿新 • Published: 2019-01-23

1. Background

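The company wants to use MongoDB as the back-end business database and the Hadoop platform as the data platform. Initially we exported the data from MongoDB, uploaded it to HDFS, and then processed it with Hive/MR. That felt far too cumbersome, and surely someone had already run into this problem, so I searched around and did indeed find a solution: the MongoDB Connector for Hadoop.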

2. A brief introduction to MongoDB (excerpted from the book "mongodb" by 鄒貴金)

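Compared with traditional relational databases, NoSQL databases are simple to operate, completely free, open source, and available for download at any time, and they may be used for all kinds of commercial purposes. As a result, NoSQL products are widely used by large portals and specialist websites, where they greatly reduce operating costs.
In 2010, with the rise of Web 2.0 sites, NoSQL set off a wave of enthusiasm in China, and MongoDB was the most prominent product of all. More and more companies have put MongoDB into real production environments, and many start-up teams have made it their database of choice, building a great many mobile Internet applications on it.
MongoDB's document model is free and flexible, which makes development very smooth. For Internet applications with large data volumes, high concurrency, and weak transactional requirements, MongoDB copes comfortably. Its built-in horizontal scaling mechanism handles data volumes from millions to billions of records, which is enough for the data storage needs of Web 2.0 and mobile Internet applications, and its out-of-the-box nature also greatly reduces operations costs for small and medium-sized websites.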

3. Setting up a Hadoop HA cluster and installing Hive

4. Getting started

Installation

Obtain the MongoDB Hadoop Connector. You can either build it or download the jars. For Hive, you'll need the "core" jar and the "hive" jar.
Get a JAR for the MongoDB Java Driver. The connector requires at least version 3.0.0 of the driver "uber" jar (called "mongo-java-driver.jar").
In your Hive script, use ADD JAR commands to include these JARs (core, hive, and the Java driver), e.g., ADD JAR /path-to/mongo-hadoop-hive-<version>.jar;.

Requirements
Supported Hadoop and Hive versions

As of August 2013, only Hive versions <= 0.10 are stable. Mongo-Hadoop currently supports Hive versions >= 0.9. Some classes and functions are deprecated in Hive 0.11, but they’re still functional.

Hadoop versions greater than 0.20.x are supported. CDH4 is supported, but CDH3 with its native Hive 0.7 is not. However, CDH3 is compatible with newer versions of Hive, so CDH3 with a non-native Hive version installed can be used with Mongo-Hadoop.
1. Make sure the versions match the requirements above. The JARs can be downloaded from http://mvnrepository.com/; for Hive only three are needed:
mongo-hadoop-core-1.5.1.jar
mongo-hadoop-hive-1.5.1.jar
mongo-java-driver-3.2.1.jar
2. Copy the JARs into ${HADOOP_HOME}/lib and ${HIVE_HOME}/lib, then start Hive and add the JARs.

$ hive

Logging initialized using configuration in jar:file:/home/hadoop/opt/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> add jar /home/hadoop/opt/hive/lib/mongo-hadoop-core-1.5.1.jar;
(Add all three JARs the same way; only the first is shown here.)

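3. Hive usage: there are two ways to connect.

The first, MongoDB-based, connects directly to a hidden node and uses com.mongodb.hadoop.hive.MongoStorageHandler to read and write the data.
The second, BSON-based, dumps the data to BSON files, uploads them to HDFS, and uses com.mongodb.hadoop.hive.BSONSerDe.

The MongoDB-based approach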

hive> CREATE TABLE eventlog
    > ( 
    >   id string,
    >   userid string,
    >   type string,
    >   objid string,
    >   time string,
    >   source string
    > )
    > STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler' 
    > WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id"}') 
    > TBLPROPERTIES('mongo.uri'='mongodb://username:password@host:port/xxx.xxxxxx');
hive> select * from eventlog limit 10;
OK
5757c2783d6b243330ec6b25    NULL    shb NULL    2016-06-08 15:00:07 NULL
5757c27a3d6b243330ec6b26    NULL    shb NULL    2016-06-08 15:00:10 NULL
5757c27e3d6b243330ec6b27    NULL    shb NULL    2016-06-08 15:00:14 NULL
5757c2813d6b243330ec6b28    NULL    shb NULL    2016-06-08 15:00:17 NULL
5757ee443d6b242900aead78    NULL    shb NULL    2016-06-08 18:06:59 NULL
5757ee543d6b242900aead79    NULL    smb NULL    2016-06-08 18:07:16 NULL
5757ee553d6b242900aead7a    NULL    cmcs    NULL    2016-06-08 18:07:17 NULL
5757ee593d6b242900aead7b    NULL    vspd    NULL    2016-06-08 18:07:21 NULL
575b73b2de64cc26942c965c    NULL    shb NULL    2016-06-11 10:13:06 NULL
575b73b5de64cc26942c965d    NULL    shb NULL    2016-06-11 10:13:09 NULL
Time taken: 0.101 seconds, Fetched: 10 row(s)
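
The 'mongo.columns.mapping' property in the table definition above maps the Hive column id to the MongoDB field _id. The following is a minimal Java sketch of that lookup idea, not the connector's actual code: the class ColumnMappingSketch and its methods are hypothetical, and the fallback to the column's own name when no mapping entry exists is an assumption about how the connector behaves. It also illustrates why columns that a document does not contain show up as NULL in the query output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnMappingSketch {
    // Resolve the MongoDB field for a Hive column: use the mapping entry if
    // present, otherwise fall back to the column name (assumed behaviour).
    static String mongoField(Map<String, String> mapping, String hiveColumn) {
        return mapping.getOrDefault(hiveColumn, hiveColumn);
    }

    // Build one Hive row from a document; fields the document lacks become null.
    static Object[] toRow(Map<String, Object> doc, String[] columns,
                          Map<String, String> mapping) {
        Object[] row = new Object[columns.length];
        for (int i = 0; i < columns.length; i++) {
            row[i] = doc.get(mongoField(mapping, columns[i]));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("id", "_id"); // same mapping as in the CREATE TABLE statement

        // A document shaped like the first row of the query output.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_id", "5757c2783d6b243330ec6b25");
        doc.put("type", "shb");
        doc.put("time", "2016-06-08 15:00:07");

        String[] columns = {"id", "userid", "type", "objid", "time", "source"};
        // prints [5757c2783d6b243330ec6b25, null, shb, null, 2016-06-08 15:00:07, null]
        System.out.println(Arrays.toString(toRow(doc, columns, mapping)));
    }
}
```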

At this point no data is stored in HDFS; there is only a directory with the same name as the table.
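When you run a query, it works directly against the latest data in MongoDB. Of course, if you want to store the data in HDFS you can do that too, using a CTAS (CREATE TABLE AS SELECT) statement: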
hive> create table qsstest as select * from eventlog limit 10;

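The result can be downloaded as well.
PS: the MongoDB user needs read and write permissions, and don't forget to copy the JARs in!
The other approach seemed somewhat unnecessary to me, so I did not try it, but I found a blog post that describes it in detail.
The following is reproduced from the post "MongoDB與Hadoop技術棧的整合應用" (Integrating MongoDB with the Hadoop technology stack).
The BSON-based approach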

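The BSON-based approach requires dumping the data first. Unlike an export, a dump does not need to know anything about the actual data contents, and no field list has to be specified.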

mongodump --host=datatask01:29017 --db=test --collection=ldc_test --out=/tmp
hdfs dfs -mkdir -p /dev_test/dli/bson_demo/
hdfs dfs -put /tmp/test/ldc_test.bson /dev_test/dli/bson_demo/
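
A dump needs no field list because the .bson file is simply a sequence of raw BSON documents written one after another, each carrying its own field names. Every BSON document begins with a little-endian int32 holding its total size in bytes, so the file can be walked without interpreting the contents. The sketch below counts documents that way. The class BsonDumpCounter is hypothetical and is only an illustration of the file layout; it is not part of the connector or of the MongoDB tools.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BsonDumpCounter {
    // Count the documents in a dump by hopping from one length prefix to the next.
    static int countDocuments(byte[] dump) {
        ByteBuffer buf = ByteBuffer.wrap(dump).order(ByteOrder.LITTLE_ENDIAN);
        int count = 0;
        while (buf.remaining() > 0) {
            if (buf.remaining() < 4) {
                throw new IllegalArgumentException("truncated length prefix");
            }
            // Absolute read: the size includes the 4 prefix bytes themselves.
            int size = buf.getInt(buf.position());
            // The smallest valid document is 5 bytes: the prefix plus a 0x00 terminator.
            if (size < 5 || size > buf.remaining()) {
                throw new IllegalArgumentException("invalid document size: " + size);
            }
            buf.position(buf.position() + size);
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // e.g. java BsonDumpCounter /tmp/test/ldc_test.bson
            System.out.println(countDocuments(Files.readAllBytes(Paths.get(args[0]))));
        } else {
            // Two empty BSON documents back to back.
            byte[] twoEmptyDocs = {5, 0, 0, 0, 0, 5, 0, 0, 0, 0};
            System.out.println(countDocuments(twoEmptyDocs)); // prints 2
        }
    }
}
```

Reading the whole file into memory keeps the sketch short; a large dump would call for streaming instead.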
-- Create the mapping table
create external table temp.ldc_test_bson
(
  id string,
  fav_id array<int>,
  info struct<github:string, location:string>
)
ROW FORMAT SERDE "com.mongodb.hadoop.hive.BSONSerDe"
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"id","fav_id":"fav_id","info.github":"info.github","info.location":"info.location"}')
STORED AS INPUTFORMAT "com.mongodb.hadoop.mapred.BSONFileInputFormat"
OUTPUTFORMAT "com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat"
location '/dev_test/dli/bson_demo/';

OK, let's first look at the query result:

0: jdbc:hive2://hd-cosmos-01:10000/default> select * from temp.ldc_test_mongo;
+--------------------+------------------------+-----------------------------------------------------------------+--+
| ldc_test_mongo.id  | ldc_test_mongo.fav_id  |                       ldc_test_mongo.info                       |
+--------------------+------------------------+-----------------------------------------------------------------+--+
| @Tony_老七           | [3,33,333,3333,33333]  | {"github":"https://github.com/tonylee0329","location":"SH/CN"}  |
+--------------------+------------------------+-----------------------------------------------------------------+--+
1 row selected (0.345 seconds)

The data is now available as a table. If you need to expand the array into rows, you can do it like this:

SELECT id, fid
FROM temp.ldc_test_mongo LATERAL VIEW explode(fav_id) favids AS fid;
-- Access a field of the struct
select id, info.github from temp.ldc_test_mongo;
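
To make the effect of LATERAL VIEW explode(fav_id) concrete, here is a small Java sketch that performs the same flattening on the sample row from the query result: one input row with a five-element array becomes five (id, fid) rows. The class ExplodeSketch is hypothetical and only models the semantics; it does not call Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExplodeSketch {
    // Emit one (id, fid) pair per array element, as the LATERAL VIEW query does.
    static List<String[]> explode(String id, List<Integer> favIds) {
        List<String[]> rows = new ArrayList<>();
        for (Integer fid : favIds) {
            rows.add(new String[] {id, String.valueOf(fid)});
        }
        return rows;
    }

    public static void main(String[] args) {
        // The single row shown in the query result above.
        List<String[]> rows =
            explode("@Tony_老七", Arrays.asList(3, 33, 333, 3333, 33333));
        for (String[] row : rows) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

A row whose array is empty yields no output rows in this sketch, which matches a plain LATERAL VIEW; Hive's LATERAL VIEW OUTER keeps such rows and fills the exploded column with NULL.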

// Deserialize according to the data type. Complex types loop over their elements, and everything ultimately resolves to operations on primitive types.
 public Object deserializeField(final Object value, final TypeInfo valueTypeInfo, final String ext) {
        if (value != null) {
            switch (valueTypeInfo.getCategory()) {
                case LIST:
                    return deserializeList(value, (ListTypeInfo) valueTypeInfo, ext);
                case MAP:
                    return deserializeMap(value, (MapTypeInfo) valueTypeInfo, ext);
                case PRIMITIVE:
                    return deserializePrimitive(value, (PrimitiveTypeInfo) valueTypeInfo);
                case STRUCT:
                    // Supports both struct and map, but should use struct 
                    return deserializeStruct(value, (StructTypeInfo) valueTypeInfo, ext);
                case UNION:
                    // Mongo also has no union
                    LOG.warn("BSONSerDe does not support unions.");
                    return null;
                default:
                    // Must be an unknown (a Mongo specific type)
                    return deserializeMongoType(value);
            }
        }
        return null;
    }

// Convert to Java primitive types for storage.
 private Object deserializePrimitive(final Object value, final PrimitiveTypeInfo valueTypeInfo) {
        switch (valueTypeInfo.getPrimitiveCategory()) {
            case BINARY:
                return value;
            case BOOLEAN:
                return value;
            case DOUBLE:
                return ((Number) value).doubleValue();
            case FLOAT:
                return ((Number) value).floatValue();
            case INT:
                return ((Number) value).intValue();
            case LONG:
                return ((Number) value).longValue();
            case SHORT:
                return ((Number) value).shortValue();
            case STRING:
                return value.toString();
            case TIMESTAMP:
                if (value instanceof Date) {
                    return new Timestamp(((Date) value).getTime());
                } else if (value instanceof BSONTimestamp) {
                    return new Timestamp(((BSONTimestamp) value).getTime() * 1000L);
                } else if (value instanceof String) {
                    return Timestamp.valueOf((String) value);
                } else {
                    return value;
                }
            default:
                return deserializeMongoType(value);
        }
    }
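
Two details of deserializePrimitive are easy to miss. Numeric values are cast through Number, so a value stored in MongoDB as a double still lands in a Hive INT column (truncated toward zero). The BSONTimestamp branch multiplies by 1000L because that type carries seconds, while java.sql.Timestamp expects milliseconds. The sketch below reproduces both steps with plain JDK types. The class PrimitiveCoercionSketch and its methods are hypothetical stand-ins, and an int of epoch seconds takes the place of BSONTimestamp.getTime().

```java
import java.sql.Timestamp;

public class PrimitiveCoercionSketch {
    // Same cast as the INT branch: any Number is narrowed to int.
    static int toInt(Object value) {
        return ((Number) value).intValue();
    }

    // Same arithmetic as the BSONTimestamp branch: seconds to milliseconds.
    // The long literal forces 64-bit multiplication and avoids int overflow.
    static Timestamp fromEpochSeconds(int seconds) {
        return new Timestamp(seconds * 1000L);
    }

    public static void main(String[] args) {
        System.out.println(toInt(3.0d)); // prints 3
        System.out.println(toInt(3.9d)); // prints 3 (truncated, not rounded)
        System.out.println(toInt(42L));  // prints 42

        int seconds = 1465369207;
        System.out.println(fromEpochSeconds(seconds).getTime()); // prints 1465369207000

        // With an int literal the product wraps around before widening to long.
        long wrong = seconds * 1000;
        System.out.println(wrong == 1465369207000L); // prints false
    }
}
```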
