A mapjoin pitfall in Hive 1.1

Setting the hive.auto.convert.join.noconditionaltask.size parameter lets Hive turn a small table into a hashtable and ship it to each worker node as a distributed-cache file, implementing a map-side join. A map-side join has a number of advantages: as the name implies, there is no reduce phase, which makes it a handy way to work around data skew during a join.
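
A minimal sketch of the session settings involved (the parameter names are standard Hive settings; the 100MB value is purely illustrative):

-- Let Hive rewrite eligible joins as map-side joins (on by default in recent releases)
SET hive.auto.convert.join = true;
SET hive.auto.convert.join.noconditionaltask = true;
-- Threshold in bytes; the default is 10000000 (~10MB). Raise it above the small table's size.
SET hive.auto.convert.join.noconditionaltask.size = 100000000;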

This is typically used when joining a large table with a small one: set hive.auto.convert.join.noconditionaltask.size (which defaults to 10MB) larger than the small table, and Hive will automatically convert the join into a map-side join. A typical explain plan looks like this:

STAGE DEPENDENCIES:
  Stage-5 is a root stage
  Stage-2 depends on stages: Stage-5
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-5
    Map Reduce Local Work
      Alias -> Map Local Tables:
        a:b 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        a:b 
          TableScan
            alias: b
            Statistics: Num rows: 100 Data size: 1714 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 UDFToString(f_uid) (type: string)
                1 uid (type: string)

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 385139074 Data size: 4621669010 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ('20160630' BETWEEN '20160630' AND '20160631' and (f_type) IN (101, 102)) (type: boolean)
              Statistics: Num rows: 96284768 Data size: 1155417246 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 UDFToString(f_uid) (type: string)
                  1 uid (type: string)
                outputColumnNames: _col6, _col54, _col55, _col56, _col63
                Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: concat(_col54, _col55, _col56) (type: string), _col6 (type: double), _col63 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator
                    aggregations: count(DISTINCT _col1), count(DISTINCT _col2)
                    keys: _col0 (type: string), _col1 (type: double), _col2 (type: string)
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4
                    Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: double), _col2 (type: string)
                      sort order: +++
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0), count(DISTINCT KEY._col1:1._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 52956623 Data size: 635479492 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 52956623 Data size: 635479492 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
The key things to notice are the Map Reduce Local Work stage, and the TableScan showing which table was turned into the hashtable.

But... the cursed Hive 1.1.0 release has a bug: when the two join keys are both of type double, building that hashtable is excruciatingly slow. For details see https://issues.apache.org/jira/browse/HIVE-11502

Our join was ON a.f_uid = B.uid, where a.f_uid is a double and B.uid is a string. When the types differ, Hive implicitly converts B.uid to double to do the matching, which is exactly how we stepped into that pit.

The fix is to cast both sides to string for the join: ON cast(a.f_uid as string) = B.uid
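
An illustrative sketch (the selected columns are hypothetical; only the join condition matters, and it matches the explain plan above, where the key shows up as UDFToString(f_uid)):

-- Slow in Hive 1.1.0: with ON a.f_uid = B.uid, the keys are matched as double
-- and the hashtable build hits HIVE-11502.
-- Workaround: join on string on both sides instead.
SELECT a.f_type, B.uid
FROM a
LEFT OUTER JOIN B
  ON cast(a.f_uid AS string) = B.uid;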

Finally, a word about that wretched scientific notation... When they hold large values, floating-point types (float and double; note that functions such as pow return double even for integer arguments) are displayed in scientific notation. For example:

hive> select pow(10,8) from dual;
OK
1.0E8

This can be eliminated by casting to bigint first and then to string:

cast(cast(a.f_uid as bigint) as string)

That won't work when there are decimal places, though. Fortunately, our join keys are integers.
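
A quick sketch of the difference (dual follows the example above; printf is a built-in Hive UDF, suggested here as one option when the value does carry decimal places):

select pow(10,8) from dual;                                  -- 1.0E8
select cast(cast(pow(10,8) as bigint) as string) from dual;  -- 100000000
-- When the value has a fractional part, explicit formatting is one alternative:
select printf('%.2f', pow(10,8) + 0.25) from dual;           -- 100000000.25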