A mapjoin pitfall in Hive 1.1
By tuning the hive.auto.convert.join.noconditionaltask.size parameter, you can have Hive turn a small table into a hashtable and ship it to every worker node as a distributed cache file, which gives you a map-side join. A map-side join has a number of advantages; as the name suggests, there is no reduce phase at all, which makes it a handy way to work around data skew during joins.
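For illustration, a minimal session sketch (the 20MB threshold and the tables a and b are made up here; hive.auto.convert.join itself is already on by default in Hive 1.1):

    -- make sure automatic mapjoin conversion is enabled (the default in Hive 1.1)
    SET hive.auto.convert.join = true;
    SET hive.auto.convert.join.noconditionaltask = true;
    -- threshold in bytes; a join whose small side fits under it is converted to a map-side join
    SET hive.auto.convert.join.noconditionaltask.size = 20000000;

    -- b is the small table: Hive builds a hashtable from it and joins inside the mappers
    SELECT a.*, b.uid
    FROM a
    LEFT OUTER JOIN b ON a.f_uid = b.uid;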
This is typically used when joining a big table against a small one: set hive.auto.convert.join.noconditionaltask.size (the default is 10MB) to something larger than the small table, and Hive will convert the join into a map-side join automatically. An explain plan for such a query looks like this:
The main things to look for are the Map Reduce Local Work stage, and the TableScan that shows which table was turned into the hashtable:

    STAGE DEPENDENCIES:
      Stage-5 is a root stage
      Stage-2 depends on stages: Stage-5
      Stage-0 depends on stages: Stage-2

    STAGE PLANS:
      Stage: Stage-5
        Map Reduce Local Work
          Alias -> Map Local Tables:
            a:b
              Fetch Operator
                limit: -1
          Alias -> Map Local Operator Tree:
            a:b
              TableScan
                alias: b
                Statistics: Num rows: 100 Data size: 1714 Basic stats: COMPLETE Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 UDFToString(f_uid) (type: string)
                    1 uid (type: string)

      Stage: Stage-2
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: a
                Statistics: Num rows: 385139074 Data size: 4621669010 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: ('20160630' BETWEEN '20160630' AND '20160631' and (f_type) IN (101, 102)) (type: boolean)
                  Statistics: Num rows: 96284768 Data size: 1155417246 Basic stats: COMPLETE Column stats: NONE
                  Map Join Operator
                    condition map:
                         Left Outer Join0 to 1
                    keys:
                      0 UDFToString(f_uid) (type: string)
                      1 uid (type: string)
                    outputColumnNames: _col6, _col54, _col55, _col56, _col63
                    Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: concat(_col54, _col55, _col56) (type: string), _col6 (type: double), _col63 (type: string)
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(DISTINCT _col1), count(DISTINCT _col2)
                        keys: _col0 (type: string), _col1 (type: double), _col2 (type: string)
                        mode: hash
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4
                        Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: _col0 (type: string), _col1 (type: double), _col2 (type: string)
                          sort order: +++
                          Map-reduce partition columns: _col0 (type: string)
                          Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
          Local Work:
            Map Reduce Local Work
          Reduce Operator Tree:
            Group By Operator
              aggregations: count(DISTINCT KEY._col1:0._col0), count(DISTINCT KEY._col1:1._col0)
              keys: KEY._col0 (type: string)
              mode: mergepartial
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 52956623 Data size: 635479492 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 52956623 Data size: 635479492 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            ListSink
But... this damned Hive 1.1.0 release has a BUG: when the two join keys are both of type double, the step that builds the hashtable is unbearably slow. Details: https://issues.apache.org/jira/browse/HIVE-11502
Our join was ON a.f_uid = B.uid, where a.f_uid is a double and B.uid is a string. When the types differ, Hive automatically converts b.uid to double to do the matching, so both keys end up as doubles, and that is exactly how we stepped into the pit.
The fix is to cast both sides to string and join on that: ON cast(a.f_uid as string) = B.uid
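Put together, the repaired query looks roughly like this (the SELECT list is invented for illustration; the ON clause is the actual fix):

    SELECT a.*, b.uid
    FROM a
    LEFT OUTER JOIN b
      -- compare both keys as strings so the hashtable is built over string keys,
      -- sidestepping the double-keyed hashtable slowness of HIVE-11502
      ON cast(a.f_uid AS string) = b.uid;

This is what the explain plan above reflects: both join keys appear as strings, 0 UDFToString(f_uid) (type: string) and 1 uid (type: string).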
Finally, a few words about that annoying scientific notation. The floating-point types (float, double) fall back to scientific notation once the value gets large (integer types like int and bigint do not, which is what the workaround below relies on), for example:
hive> select pow(10,8) from dual;
OK
1.0E8
You can strip it by casting to bigint first and then to string: cast(cast(a.f_uid as bigint) as string)
This doesn't work once there are decimal places, though. Luckily our join keys were integers.
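To see the difference in the hive shell (dual here is the same one-row helper table used above):

    -- a straight string cast keeps the scientific notation
    SELECT cast(pow(10, 8) AS string) FROM dual;                 -- 1.0E8
    -- going through bigint first gives the plain digits
    SELECT cast(cast(pow(10, 8) AS bigint) AS string) FROM dual; -- 100000000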