
Learning Hive Together: A Summary of Common Hive Optimization Tips

Today I'll summarize some optimization techniques I've picked up while using Hive, and I hope they help you. Hive optimization is where a programmer's skill shows most clearly, and it is one of the interview questions interviewers love to ask.

Tip 1: Control the number of reducers

The following output is printed to the console every time we run a SQL statement in the Hive CLI:

In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>

Many people wonder what this output actually means. Let's go through it one item at a time, starting with:

set hive.exec.reducers.bytes.per.reducer=<number> is a Hive command that sets the maximum number of bytes each reducer processes while the SQL runs. It can be set in the configuration file or directly on the command line. If the input exceeds <number> bytes, additional reducers are generated; for example, if <number> is 1 MB and the input is 10 MB, about 10 reducers are generated. Let's verify this claim:

  1. Run set hive.exec.reducers.bytes.per.reducer=200000; to set the maximum number of bytes per reducer to 200000.
  2. Run the SQL:
select user_id,count(1) as cnt 
  from orders group by user_id limit 20; 

Running the SQL above prints the following to the console:

  Number of reduce tasks not specified. Estimated from input data size: 159
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0020, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0020/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 159

The first line of the console output says: Number of reduce tasks not specified. Estimated from input data size: 159. In other words, no reducer count was specified, so Hive estimated 159 reducer tasks from the input data size. The last line, number of mappers: 1; number of reducers: 159, confirms that the SQL ultimately ran with 159 reducers. So if we know the size of the input, we can control the number of reducers simply by setting how much data each reducer handles via set hive.exec.reducers.bytes.per.reducer.

Next, look at set hive.exec.reducers.max=<number>. This is also a Hive command; it sets an upper bound on the number of reducers. If we set number to 50, the reducer count can be at most 50.
Let's verify this claim:

  1. Run set hive.exec.reducers.max=8; to cap the number of reducers at 8.
  2. Run the same SQL again:
select user_id,count(1) as cnt 
  from orders group by user_id limit 20; 

The console prints:

Number of reduce tasks not specified. Estimated from input data size: 8
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0020, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0020/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 8

The first line of the console output, Number of reduce tasks not specified. Estimated from input data size: 8, shows that the estimated reducer count is now 8, which confirms the claim: set hive.exec.reducers.max=8; sets an upper bound on the number of reducers.

Finally, let's look at set mapreduce.job.reduces=<number>. This command sets the exact number of reducers, i.e. how many reducers will process the data when the SQL runs. We can verify it using the same method as above:

  1. Run set mapreduce.job.reduces=5; to set the number of reducers to 5.
  2. Run the same SQL again:
select user_id,count(1) as cnt 
  from orders group by user_id limit 20; 

The console prints:

Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0026, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0026/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0026
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5

From Number of reduce tasks not specified. Defaulting to jobconf value of: 5 and number of mappers: 1; number of reducers: 5, we can see that 5 reducers were launched.

If we change the value from 5 to 15 and run the same SQL (select user_id,count(1) as cnt from orders group by user_id limit 20;), the console prints:

Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 15
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0027, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0027/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0027
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 15

As you can see, the number of reducers has changed from 5 to 15.

To sum up, there are three ways to control the number of reducers in Hive:

set hive.exec.reducers.bytes.per.reducer=<number> 
set hive.exec.reducers.max=<number>
set mapreduce.job.reduces=<number>

Among them, set mapreduce.job.reduces=<number> has the highest precedence, set hive.exec.reducers.max=<number> comes next, and set hive.exec.reducers.bytes.per.reducer=<number> has the lowest. Starting with Hive 0.14, the default amount of data each reducer processes is 256 MB.
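Putting the three settings together, Hive's estimate works roughly as sketched below (a simplification; the exact logic lives inside Hive's planner):

-- If mapreduce.job.reduces is set to a positive value, it is used as-is.
-- Otherwise:
--   reducers = min(hive.exec.reducers.max,
--                  ceil(input_size / hive.exec.reducers.bytes.per.reducer))
-- With the defaults (256 MB per reducer, max 1009 since Hive 0.14), a 10 GB
-- input would get about ceil(10240 MB / 256 MB) = 40 reducers.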

More reducers is not always better. Each reducer writes its own output file, so too many reducers means many small files, which waste resources in HDFS (every file's metadata has to be tracked by the NameNode). If there are too few reducers, a single reducer may be stuck processing a huge amount of data (exactly what happens under data skew), which defeats Hadoop's divide-and-conquer model and can even trigger OOM errors. The right number of reducers depends on the business scenario, and different scenarios call for different approaches.
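As a side note on the small-file problem, Hive also provides merge options that compact small output files after a job finishes. These are real Hive properties, but defaults can differ between versions, so treat the values below as a sketch:

set hive.merge.mapfiles=true;               -- merge small files from map-only jobs
set hive.merge.mapredfiles=true;            -- also merge the output of map-reduce jobs
set hive.merge.size.per.task=256000000;     -- target size (bytes) of a merged file
set hive.merge.smallfiles.avgsize=16000000; -- merge when the average output file is below this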

Tip 2: Use map join

When a SQL statement joins multiple tables and one of them is smaller than 1 GB, a map join can noticeably speed up the query. If even the smallest table is larger than 1 GB, a map join is likely to fail with OOM errors.
Usage:

select /*+ MAPJOIN(a) */ a.*,b.* from table_a a join table_b b on a.id = b.id;
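Besides the explicit hint, newer Hive versions can convert a common join into a map join automatically once the small table is below a size threshold. A minimal sketch using real Hive settings (the threshold shown is the usual default):

set hive.auto.convert.join=true;               -- let Hive choose map join automatically
set hive.mapjoin.smalltable.filesize=25000000; -- tables under ~25 MB are held in memory
select a.*,b.*
from table_a a
join table_b b on a.id = b.id;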

Tip 3: Use distinct + union all instead of union

When you need union's deduplication semantics, distinct + union all performs better than union.
Usage of distinct + union all:

select count(distinct order_id,user_id,order_type)
from (
select order_id,user_id,order_type from orders where order_type='0' union all
select order_id,user_id,order_type from orders where order_type='1' union all
select order_id,user_id,order_type from orders where order_type='1'
) a;

Usage of union:

select count(*)
from (
select order_id,user_id,order_type from orders where order_type='0' union
select order_id,user_id,order_type from orders where order_type='0' union
select order_id,user_id,order_type from orders where order_type='1'
) t;
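Another common rewrite with the same effect is to deduplicate with group by instead of distinct; a sketch against the same orders table:

select count(*)
from (
select order_id,user_id,order_type
from orders
where order_type in ('0','1')
group by order_id,user_id,order_type
) t;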

Tip 4: A general fix for data skew

Symptoms of data skew: job progress sits at 99% for a long time, with only a handful of reducer tasks still running; those unfinished tasks read and write very large amounts of data, often more than 10 GB. It happens frequently during aggregations.
A general fix: set hive.groupby.skewindata=true;
This splits the single MapReduce job into two: the first job spreads the map output randomly across reducers and pre-aggregates it, and the second job aggregates those partial results by the real group-by key.
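A minimal usage sketch, reusing the orders table from Tip 1:

set hive.groupby.skewindata=true;
select user_id,count(1) as cnt
from orders
group by user_id;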

Here is a scenario I ran into. I needed to count each user's visits for a given day; the SQL was:

select t.user_id,count(*) from user_log t group by t.user_id

After running this statement, the job sat at 99% for a whole hour. Digging into the user_log table, I found that a large number of rows had a null user_id. All rows with a null user_id are sent to a single reducer, which causes the skew. There are two fixes:
1. Filter out records with a null user_id in the where clause.
2. Replace null user_ids with a random value so the rows are distributed evenly across all reducers.
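Both fixes, sketched against the user_log table (the null_ prefix in the second query is just an illustrative convention, not anything Hive requires):

-- Fix 1: drop the null keys when downstream logic does not need them
select t.user_id,count(*) as cnt
from user_log t
where t.user_id is not null
group by t.user_id;

-- Fix 2: scatter the null keys with a random suffix so they no longer pile
-- up on a single reducer (the counts for these synthetic keys are
-- meaningless, which is fine when null users are not of interest)
select t.uid as user_id,count(*) as cnt
from (
select case when user_id is null
            then concat('null_',cast(rand()*1000 as int))
            else user_id
       end as uid
from user_log
) t
group by t.uid;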