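UDF development: decoding pb binary data in a Hive external table
Goal: an HBase table stores its data as pb (protobuf) binary to save space. An external table has been created over it in Hive, and a UDF is needed to decode the pb binary data.
I. The data stored in HBase is first serialized to binary with pb, then Base64-encoded and stored as a string:
1. Create an external table in Hive with the following structure: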
create external table ext_toutiao_feed_incr (f_id string,tagPb string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,data:tagPb"
)TBLPROPERTIES ("hbase.table.name" = "toutiao_feed_incr");
hive> desc ext_toutiao_feed_incr;
OK
f_id string from deserializer
tagpb string from deserializer
1) View one row in HBase:
hbase(main):003:0> get 'toutiao_feed_incr',10000000570
COLUMN CELL
data:tagPb timestamp=1482862346773, value=CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHue
bm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/
2 row(s) in 0.4400 seconds
2) View one row in Hive:
hive> select * from ext_toutiao_feed_incr where f_id=10000000570;
WARNING: Comparing a bigint and a string may result in a loss of precision.
Total jobs = 1
...
OK
10000000570 CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/
Time taken: 36.179 seconds, Fetched: 1 row(s)
3) Decode the pb in Java:
fid:10000000570,type:0,channels:[],tags:[{tag=幼兒, score=0.32119}, {tag=型別, score=0.3181}, {tag=盛世驕陽英文童謠大全, score=0.53446}]
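The Java decoder used for this step is not shown here, so the following is only a minimal sketch of the first part of that decoding. It uses nothing but the JDK: it Base64-decodes the stored value and reads the leading varint by hand, assuming (from the leading byte 0x08) that fid is field 1 of the message. The class name PbFidDemo and the method decodeFid are hypothetical. A real UDF would more likely call the generated protobuf class from an evaluate method.

```java
import java.util.Base64;

final class PbFidDemo {
    /** Base64-decodes the stored value and returns the leading varint field. */
    static long decodeFid(String base64) {
        byte[] buf = Base64.getDecoder().decode(base64);
        // 0x08 is the protobuf key for field 1 with wire type 0 (varint).
        // Treating field 1 as "fid" is an assumption inferred from the bytes.
        if (buf.length == 0 || buf[0] != 0x08) {
            throw new IllegalArgumentException("value does not start with field 1 (varint)");
        }
        long value = 0;
        // Varints carry 7 payload bits per byte, least significant group first.
        for (int i = 1, shift = 0; i < buf.length && shift < 64; i++, shift += 7) {
            value |= (long) (buf[i] & 0x7F) << shift;
            if ((buf[i] & 0x80) == 0) {
                return value; // high bit clear: last byte of the varint
            }
        }
        throw new IllegalArgumentException("truncated varint");
    }

    public static void main(String[] args) {
        // Value of data:tagPb for row 10000000570, as returned by Hive.
        String stored = "CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4"
                + "lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/";
        System.out.println("fid:" + decodeFid(stored)); // fid:10000000570
    }
}
```

Because the Base64 text is plain ASCII, it passes through the Hive string column unchanged, which is why the UDF in this section can take a single string argument.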
2. Result of running the UDF:
add jar /home/qytt/ttbrain-log-manager-jar-with-dependencies.jar;
create temporary function udf_pb_lx as 'com.abc.ttbrain.log.manager.hive.DecodePbUdf';
hive> select *,udf_pb_lx(tagpb) from ext_toutiao_feed_incr where f_id=10000000570;
WARNING: Comparing a bigint and a string may result in a loss of precision.
Total jobs = 1
...
OK
10000000570 CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/ fid:10000000570,type:0,channels:[],tags:[{tag=幼兒, score=0.32119}, {tag=型別, score=0.3181}, {tag=盛世驕陽英文童謠大全, score=0.53446}]
II. The data stored in HBase is the raw pb binary, with no further encoding:
1. Create an external table in Hive with the following structure:
create external table ext_test (f_id string,tagPb BINARY,tag string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,data:tagPb,data:tagPb"
)TBLPROPERTIES ("hbase.table.name" = "test_liu");
hive> desc ext_test;
OK
f_id string from deserializer
tagpb binary from deserializer
Time taken: 0.164 seconds, Fetched: 2 row(s)
1) Query in HBase:
hbase(main):037:0> scan 'test_liu'
ROW COLUMN+CELL
10000000570 column=data:tagPb, timestamp=1491884382969, value=\x08\xBA\xCC\xAF\xA0%\x12\x0D\x0A\x0
6\xE5\xB9\xBC\xE5\x84\xBF\x15\x04s\xA4>\x12\x0D\x0A\x06\xE7\xB1\xBB\xE5\x9E\x8B\x15\x0
1\xDE\xA2>\x12%\x0A\x1E\xE7\x9B\x9B\xE4\xB8\x96\xE9\xAA\x84\xE9\x98\xB3\xE8\x8B\xB1\xE
6\x96\x87\xE7\xAB\xA5\xE8\xB0\xA3\xE5\xA4\xA7\xE5\x85\xA8\x15_\xD2\x08?
1 row(s) in 0.0080 seconds
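To make the scan output above easier to read, here is a rough sketch that decodes the first tags entry (the 15 bytes starting at \x12\x0D) by hand. The .proto schema is not given, so the field layout is inferred from the bytes: key 0x12 is taken to be a length-delimited tags entry, 0x0A its tag text, and 0x15 a 32-bit little-endian float score. The class PbTagDemo and its methods are hypothetical, and the parser only handles this fixed layout with single-byte lengths.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class PbTagDemo {
    // First "tags" entry copied from the HBase scan output:
    // \x12\x0D \x0A\x06 <6 UTF-8 bytes> \x15 \x04 s \xA4 >
    static final byte[] ENTRY = {
        0x12, 0x0D,
        0x0A, 0x06,
        (byte) 0xE5, (byte) 0xB9, (byte) 0xBC, (byte) 0xE5, (byte) 0x84, (byte) 0xBF,
        0x15, 0x04, 0x73, (byte) 0xA4, 0x3E
    };

    /** Returns the UTF-8 tag text of one entry (assumed layout, lengths below 128). */
    static String tagText(byte[] e) {
        if (e[0] != 0x12 || e[2] != 0x0A) {
            throw new IllegalArgumentException("unexpected field keys");
        }
        int len = e[3]; // single-byte length prefix of the tag text
        return new String(e, 4, len, StandardCharsets.UTF_8);
    }

    /** Returns the fixed32 float that follows the tag text. */
    static float score(byte[] e) {
        int pos = 4 + e[3]; // skip entry key, entry length, text key, text length, text
        if (e[pos] != 0x15) {
            throw new IllegalArgumentException("expected a fixed32 field");
        }
        // Protobuf stores fixed32 values in little-endian byte order.
        return ByteBuffer.wrap(e, pos + 1, 4).order(ByteOrder.LITTLE_ENDIAN).getFloat();
    }

    public static void main(String[] args) {
        System.out.println("tag=" + tagText(ENTRY) + ", score=" + score(ENTRY));
    }
}
```

The score printed by this sketch should be close to the 0.32119 reported for the first tag by the Java decoder.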
2) View one row in Hive:
hive> select * from ext_test;
OK
10000000570 �̯�%
幼兒s�>
型別ޢ>%
盛世驕陽英文童謠大全_.?�̯�%
幼兒s�>
型別ޢ>%
盛世驕陽英文童謠大全_.?
Time taken: 0.11 seconds, Fetched: 1 row(s)
2. Result of running the UDF:
add jar /home/qytt/ttbrain-log-manager-jar-with-dependencies.jar;
create temporary function udf_pb_kevinliu as 'com.abc.ttbrain.log.manager.hive.DecodePbUdf4Byte';
1) Normal case (binary column, two arguments):
hive> select udf_pb_kevinliu(tagPb,'') from ext_test;
Total jobs = 1
...
Total MapReduce CPU Time Spent: 4 seconds 40 msec
OK
fid:10000000570,type:0,channels:[],tags:[{tag=幼兒, score=0.32119}, {tag=型別, score=0.3181}, {tag=盛世驕陽英文童謠大全, score=0.53446}]
2) Error 1 (string column):
hive> select udf_pb_kevinliu(tag) from ext_test;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1490153150757_1824274, Tracking URL = http://hadoop-jy-resourcemanager01:8088/proxy/application_1490153150757_1824274/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1490153150757_1824274
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-11 15:41:17,541 Stage-1 map = 0%, reduce = 0%
2017-04-11 15:41:29,747 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.51 sec
MapReduce Total cumulative CPU time: 3 seconds 510 msec
Ended Job = job_1490153150757_1824274
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.51 sec HDFS Read: 278 HDFS Write: 21 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 510 msec
OK
3) Error 2 (binary column, one argument):
hive> select udf_pb_kevinliu(tagPb) from ext_test;
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments 'tagPb': No matching method for class com.abc.ttbrain.log.manager.hive.DecodePbUdf4Byte with (binary). Possible choices: _FUNC_(binary) _FUNC_(binary, string) _FUNC_(string)
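3. Summary:
Here the pb binary was written directly into HBase. The Hive external table maps the same HBase column twice, once as binary and once as string. Two problems showed up:
1) A string column cannot correctly represent the pb binary written to HBase;
2) With a binary column, the UDF has to be called with two arguments. Calling it with one argument fails with the "No matching method" error shown above, even though _FUNC_(binary) is listed among the possible choices. This may be a Hive bug.
So, where possible, Base64-encode the pb binary before writing it to HBase.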
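One plausible reason for problem 1) is that arbitrary protobuf bytes are not valid UTF-8, so they do not survive a trip through a Java String, while Base64 text does. The sketch below illustrates only that general point, using the first six bytes of the row above. Whether Hive's string mapping performs exactly this conversion is an assumption, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class BinaryAsStringDemo {
    // First six bytes of the stored message: key 0x08 followed by the fid varint.
    static final byte[] PB_PREFIX = {
        0x08, (byte) 0xBA, (byte) 0xCC, (byte) 0xAF, (byte) 0xA0, 0x25
    };

    /** True if the bytes are unchanged after bytes -> UTF-8 String -> bytes. */
    static boolean survivesUtf8(byte[] raw) {
        // Invalid sequences such as a lone 0xBA are replaced with U+FFFD on decoding.
        byte[] back = new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, back);
    }

    /** True if the bytes are unchanged after bytes -> Base64 text -> bytes. */
    static boolean survivesBase64(byte[] raw) {
        String text = Base64.getEncoder().encodeToString(raw);
        return Arrays.equals(raw, Base64.getDecoder().decode(text));
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 round trip intact:  " + survivesUtf8(PB_PREFIX));   // false
        System.out.println("Base64 round trip intact: " + survivesBase64(PB_PREFIX)); // true
    }
}
```

The cost of Base64 is roughly one third more storage (4 output characters per 3 input bytes), which has to be weighed against the space that pb was chosen to save.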