UDF development: decoding pb (protobuf) binary data in a Hive external table

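Goal: an HBase table stores its data as pb binary to save storage space. A Hive external table has been created on top of it, and a UDF is needed to decode the pb binary data.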

I. The data stored in HBase is first serialized to binary with pb, then Base64-encoded and stored as a string:

1. Create an external table in Hive with the following structure:

create external table ext_toutiao_feed_incr (f_id string,tagPb string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
"hbase.columns.mapping" = ":key,data:tagPb" 
)TBLPROPERTIES ("hbase.table.name" = "toutiao_feed_incr");

hive> desc ext_toutiao_feed_incr;
OK
f_id                	string              	from deserializer   
tagpb               	string              	from deserializer


1) Look at one row in HBase:

hbase(main):003:0> get 'toutiao_feed_incr',10000000570
COLUMN                         CELL                                                                                  
 data:tagPb                    timestamp=1482862346773, value=CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHue
                               bm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/                                         
2 row(s) in 0.4400 seconds

2) Look at the same row in Hive:

hive> select * from ext_toutiao_feed_incr where f_id=10000000570;     
WARNING: Comparing a bigint and a string may result in a loss of precision.
Total jobs = 1
...
OK
10000000570	CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/
Time taken: 36.179 seconds, Fetched: 1 row(s)

3) Decode the pb with Java:

fid:10000000570,type:0,channels:[],tags:[{tag=幼兒, score=0.32119}, {tag=型別, score=0.3181}, {tag=盛世驕陽英文童謠大全, score=0.53446}]
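
The decoding step can be checked by hand without the generated pb classes. Below is a minimal sketch in plain Java, not the author's decoder: it Base64-decodes the cell value shown above and reads only the first field. The class and method names are hypothetical, and the assumption that field 1 is the varint-encoded fid is inferred from the sample bytes, not from the original .proto file.

```java
import java.util.Base64;

/** Hypothetical helper (not the author's DecodePbUdf): reads only the leading fid field. */
final class Base64PbSketch {

    /** Decodes one protobuf base-128 varint starting at pos[0] and advances pos[0] past it. */
    static long readVarint(byte[] buf, int[] pos) {
        long value = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buf[pos[0]++];
            value |= (long) (b & 0x7F) << shift;   // low 7 bits carry the payload
            if ((b & 0x80) == 0) {                 // high bit clear: last byte of the varint
                return value;
            }
        }
        throw new IllegalArgumentException("varint is longer than 10 bytes");
    }

    /** Base64-decodes the cell and returns field 1 (assumed to be the fid, wire type varint). */
    static long readFid(String base64Cell) {
        byte[] raw = Base64.getDecoder().decode(base64Cell);
        int[] pos = {0};
        long key = readVarint(raw, pos);           // key = (field number << 3) | wire type
        if ((key >>> 3) != 1 || (key & 7) != 0) {
            throw new IllegalArgumentException("expected field 1 with varint wire type");
        }
        return readVarint(raw, pos);
    }

    public static void main(String[] args) {
        String cell = "CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4"
                + "lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/";
        System.out.println("fid:" + readFid(cell)); // prints fid:10000000570
    }
}
```

In a real UDF the same two steps (Base64 decode, then parse) would presumably sit inside an `evaluate` method, with the parsing done by the generated pb class instead of by hand.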

2. Result of running the UDF:

add jar /home/qytt/ttbrain-log-manager-jar-with-dependencies.jar;
create temporary function udf_pb_lx as 'com.abc.ttbrain.log.manager.hive.DecodePbUdf';

hive> select *,udf_pb_lx(tagpb) from ext_toutiao_feed_incr where f_id=10000000570;                         
WARNING: Comparing a bigint and a string may result in a loss of precision.
Total jobs = 1
...
OK
10000000570	CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/ fid:10000000570,type:0,channels:[],tags:[{tag=幼兒, score=0.32119}, {tag=型別, score=0.3181}, {tag=盛世驕陽英文童謠大全, score=0.53446}]

II. The data stored in HBase is the raw pb binary, written directly:

1. Create an external table in Hive with the following structure:

create external table ext_test (f_id string,tagPb BINARY,tag string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
"hbase.columns.mapping" = ":key,data:tagPb,data:tagPb" 
)TBLPROPERTIES ("hbase.table.name" = "test_liu");

hive> desc ext_test;
OK
f_id                	string              	from deserializer   
tagpb               	binary              	from deserializer   
Time taken: 0.164 seconds, Fetched: 2 row(s)


1) Query in HBase:

hbase(main):037:0> scan 'test_liu'
ROW                            COLUMN+CELL                                                                           
 10000000570                   column=data:tagPb, timestamp=1491884382969, value=\x08\xBA\xCC\xAF\xA0%\x12\x0D\x0A\x0
                               6\xE5\xB9\xBC\xE5\x84\xBF\x15\x04s\xA4>\x12\x0D\x0A\x06\xE7\xB1\xBB\xE5\x9E\x8B\x15\x0
                               1\xDE\xA2>\x12%\x0A\x1E\xE7\x9B\x9B\xE4\xB8\x96\xE9\xAA\x84\xE9\x98\xB3\xE8\x8B\xB1\xE
                               6\x96\x87\xE7\xAB\xA5\xE8\xB0\xA3\xE5\xA4\xA7\xE5\x85\xA8\x15_\xD2\x08?               
1 row(s) in 0.0080 seconds


2) Look at the row in Hive:

hive> select * from ext_test;
OK
10000000570    �̯�% 
幼兒s�> 
型別ޢ>%
盛世驕陽英文童謠大全_.?�̯�% 
幼兒s�> 
型別ޢ>%
盛世驕陽英文童謠大全_.?
Time taken: 0.11 seconds, Fetched: 1 row(s)


2. Result of running the UDF:

add jar /home/qytt/ttbrain-log-manager-jar-with-dependencies.jar;

create temporary function udf_pb_kevinliu as 'com.abc.ttbrain.log.manager.hive.DecodePbUdf4Byte';

1) Works (binary column plus a second, empty string argument):

hive> select udf_pb_kevinliu(tagPb,'') from ext_test;
Total jobs = 1
...
Total MapReduce CPU Time Spent: 4 seconds 40 msec
OK
fid:10000000570,type:0,channels:[],tags:[{tag=幼兒, score=0.32119}, {tag=型別, score=0.3181}, {tag=盛世驕陽英文童謠大全, score=0.53446}]
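
To see what the UDF has to do with the raw bytes, here is a minimal hand-written sketch of the pb wire format in plain Java. It is not the author's `DecodePbUdf4Byte`. The class name is hypothetical, and the field layout (field 1 = fid as a varint, field 2 = repeated tag message holding a string in field 1 and a 32-bit float in field 2) is an assumption inferred from the scan output above, not taken from the original .proto file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

/** Hypothetical wire-format reader; the field numbers are inferred from the sample bytes. */
final class RawPbSketch {

    static long readVarint(ByteBuffer in) {
        long value = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = in.get();
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        throw new IllegalArgumentException("varint is longer than 10 bytes");
    }

    /** Reads a length prefix and returns a view over exactly that many bytes. */
    static ByteBuffer readLengthDelimited(ByteBuffer in) {
        int len = (int) readVarint(in);
        ByteBuffer part = in.slice().order(ByteOrder.LITTLE_ENDIAN);
        part.limit(len);
        in.position(in.position() + len);
        return part;
    }

    /** Skips a field this sketch does not know about. */
    static void skip(ByteBuffer in, int wireType) {
        switch (wireType) {
            case 0: readVarint(in); break;
            case 1: in.position(in.position() + 8); break;
            case 2: readLengthDelimited(in); break;
            case 5: in.position(in.position() + 4); break;
            default: throw new IllegalArgumentException("unsupported wire type " + wireType);
        }
    }

    /** Nested message, assumed layout: field 1 = tag text (UTF-8), field 2 = score (fixed32 float). */
    static String decodeTag(ByteBuffer msg) {
        String tag = "";
        float score = 0f;
        while (msg.hasRemaining()) {
            long key = readVarint(msg);
            int field = (int) (key >>> 3);
            int wireType = (int) (key & 7);
            if (field == 1 && wireType == 2) {
                tag = StandardCharsets.UTF_8.decode(readLengthDelimited(msg)).toString();
            } else if (field == 2 && wireType == 5) {
                score = msg.getFloat();            // fixed32 is little-endian on the wire
            } else {
                skip(msg, wireType);
            }
        }
        return "{tag=" + tag + ", score=" + score + "}";
    }

    /** Top-level message, assumed layout: field 1 = fid (varint), field 2 = repeated tag message. */
    static String decode(byte[] raw) {
        ByteBuffer in = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        long fid = 0;
        List<String> tags = new ArrayList<>();
        while (in.hasRemaining()) {
            long key = readVarint(in);
            int field = (int) (key >>> 3);
            int wireType = (int) (key & 7);
            if (field == 1 && wireType == 0) {
                fid = readVarint(in);
            } else if (field == 2 && wireType == 2) {
                tags.add(decodeTag(readLengthDelimited(in)));
            } else {
                skip(in, wireType);
            }
        }
        return "fid:" + fid + ",tags:" + tags;
    }

    public static void main(String[] args) {
        // Same 75 bytes as the HBase scan above, written as Base64 to keep the literal short.
        byte[] raw = Base64.getDecoder().decode(
                "CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4"
                + "lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/");
        System.out.println(decode(raw));
    }
}
```

In practice the generated pb class does this parsing; the sketch only shows that the bytes Hive hands over for a `binary` column are already a complete pb message.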

2) Error 1: passing the string-mapped column `tag`:

hive> select udf_pb_kevinliu(tag) from ext_test;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1490153150757_1824274, Tracking URL = http://hadoop-jy-resourcemanager01:8088/proxy/application_1490153150757_1824274/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1490153150757_1824274
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-11 15:41:17,541 Stage-1 map = 0%,  reduce = 0%
2017-04-11 15:41:29,747 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.51 sec
MapReduce Total cumulative CPU time: 3 seconds 510 msec
Ended Job = job_1490153150757_1824274
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.51 sec   HDFS Read: 278 HDFS Write: 21 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 510 msec
OK


3) Error 2: passing the binary column `tagPb` as the only argument:

hive> select udf_pb_kevinliu(tagPb) from ext_test;
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments 'tagPb': No matching method for class com.abc.ttbrain.log.manager.hive.DecodePbUdf4Byte with (binary). Possible choices: _FUNC_(binary)  _FUNC_(binary, string)  _FUNC_(string)  

3. Summary:

Here the pb binary was written directly into HBase, and the Hive external table maps the same HBase column twice, once as binary and once as string. Two problems showed up:

1) A string column cannot correctly represent data that was written into HBase as pb binary;

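2) With a binary column, the UDF has to be called with two arguments. With a single argument it fails with a puzzling error (the message even lists `_FUNC_(binary)` among the possible choices), which may be a Hive bug.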

So, where possible, Base64-encode the binary produced by pb before storing it.
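
One plausible reason the string mapping fails, assumed here and not confirmed in the experiments above, is that arbitrary bytes do not survive being treated as UTF-8 text, while Base64 text always does. The following sketch illustrates this with the first six bytes of the sample value; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

/** Hypothetical demo: raw pb bytes are not safe as text, Base64 text is. */
final class StringMappingSketch {

    /** True if the bytes come back unchanged after being read as UTF-8 text and re-encoded. */
    static boolean survivesUtf8(byte[] raw) {
        byte[] back = new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, back);
    }

    /** True if the bytes come back unchanged after a Base64 encode/decode round trip. */
    static boolean survivesBase64(byte[] raw) {
        String text = Base64.getEncoder().encodeToString(raw);
        return Arrays.equals(raw, Base64.getDecoder().decode(text));
    }

    public static void main(String[] args) {
        // First six bytes of the sample value: the field key 0x08 plus the 5-byte varint fid.
        byte[] head = {0x08, (byte) 0xBA, (byte) 0xCC, (byte) 0xAF, (byte) 0xA0, 0x25};
        System.out.println(survivesUtf8(head));    // false: 0xBA is not valid UTF-8 on its own
        System.out.println(survivesBase64(head));  // true
    }
}
```

The cost of Base64 is roughly a third more storage (100 characters for the 75-byte sample), which is the trade-off against the storage saving that motivated pb in the first place.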