A mapjoin pitfall in Hive 1.1
By tuning the hive.auto.convert.join.noconditionaltask.size parameter, you can have Hive turn a small table into a hashtable and ship it to every worker node as a distributed cache file, which gives you a map-side join. A map-side join has a number of advantages; as the name suggests, there is no reduce phase at all, which makes it a handy way to work around data skew during joins.
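For illustration, a minimal session sketch (the 20MB threshold and the tables a and b are made up here; hive.auto.convert.join itself is already on by default in Hive 1.1):

    -- make sure automatic mapjoin conversion is enabled (the default in Hive 1.1)
    SET hive.auto.convert.join = true;
    SET hive.auto.convert.join.noconditionaltask = true;
    -- threshold in bytes; a join whose small side fits under it is converted to a map-side join
    SET hive.auto.convert.join.noconditionaltask.size = 20000000;

    -- b is the small table: Hive builds a hashtable from it and joins inside the mappers
    SELECT a.*, b.uid
    FROM a
    LEFT OUTER JOIN b ON a.f_uid = b.uid;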
This is typically used when joining a big table against a small one: set hive.auto.convert.join.noconditionaltask.size (the default is 10MB) to something larger than the small table, and Hive will convert the join into a map-side join automatically. An explain plan for such a query looks like this:
The main things to look for are the Map Reduce Local Work stage, and the TableScan that shows which table was turned into the hashtable:

    STAGE DEPENDENCIES:
      Stage-5 is a root stage
      Stage-2 depends on stages: Stage-5
      Stage-0 depends on stages: Stage-2

    STAGE PLANS:
      Stage: Stage-5
        Map Reduce Local Work
          Alias -> Map Local Tables:
            a:b
              Fetch Operator
                limit: -1
          Alias -> Map Local Operator Tree:
            a:b
              TableScan
                alias: b
                Statistics: Num rows: 100 Data size: 1714 Basic stats: COMPLETE Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 UDFToString(f_uid) (type: string)
                    1 uid (type: string)

      Stage: Stage-2
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: a
                Statistics: Num rows: 385139074 Data size: 4621669010 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: ('20160630' BETWEEN '20160630' AND '20160631' and (f_type) IN (101, 102)) (type: boolean)
                  Statistics: Num rows: 96284768 Data size: 1155417246 Basic stats: COMPLETE Column stats: NONE
                  Map Join Operator
                    condition map:
                         Left Outer Join0 to 1
                    keys:
                      0 UDFToString(f_uid) (type: string)
                      1 uid (type: string)
                    outputColumnNames: _col6, _col54, _col55, _col56, _col63
                    Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: concat(_col54, _col55, _col56) (type: string), _col6 (type: double), _col63 (type: string)
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(DISTINCT _col1), count(DISTINCT _col2)
                        keys: _col0 (type: string), _col1 (type: double), _col2 (type: string)
                        mode: hash
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4
                        Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: _col0 (type: string), _col1 (type: double), _col2 (type: string)
                          sort order: +++
                          Map-reduce partition columns: _col0 (type: string)
                          Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
          Local Work:
            Map Reduce Local Work
          Reduce Operator Tree:
            Group By Operator
              aggregations: count(DISTINCT KEY._col1:0._col0), count(DISTINCT KEY._col1:1._col0)
              keys: KEY._col0 (type: string)
              mode: mergepartial
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 52956623 Data size: 635479492 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 52956623 Data size: 635479492 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            ListSink
But... this damned Hive 1.1.0 release has a BUG: when the two join keys are both of type double, the step that builds the hashtable is unbearably slow. Details: https://issues.apache.org/jira/browse/HIVE-11502
Our join was ON a.f_uid = B.uid, where a.f_uid is a double and B.uid is a string. When the types differ, Hive automatically converts b.uid to double to do the matching, so both keys end up as doubles, and that is exactly how we stepped into the pit.
The fix is to cast both sides to string and join on that: ON cast(a.f_uid as string) = B.uid
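Put together, the repaired query looks roughly like this (the SELECT list is invented for illustration; the ON clause is the actual fix):

    SELECT a.*, b.uid
    FROM a
    LEFT OUTER JOIN b
      -- compare both keys as strings so the hashtable is built over string keys,
      -- sidestepping the double-keyed hashtable slowness of HIVE-11502
      ON cast(a.f_uid AS string) = b.uid;

This is what the explain plan above reflects: both join keys appear as strings, 0 UDFToString(f_uid) (type: string) and 1 uid (type: string).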
Finally, a few words about that annoying scientific notation. The floating-point types (float, double) fall back to scientific notation once the value gets large (integer types like int and bigint do not, which is what the workaround below relies on), for example:
hive> select pow(10,8) from dual;
OK
1.0E8
You can strip it by casting to bigint first and then to string: cast(cast(a.f_uid as bigint) as string)
This doesn't work once there are decimal places, though. Luckily our join keys were integers.
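To see the difference in the hive shell (dual here is the same one-row helper table used above):

    -- a straight string cast keeps the scientific notation
    SELECT cast(pow(10, 8) AS string) FROM dual;                 -- 1.0E8
    -- going through bigint first gives the plain digits
    SELECT cast(cast(pow(10, 8) AS bigint) AS string) FROM dual; -- 100000000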