Learning Hive Together: A Summary of Common Hive Optimization Tips
Today I'm summarizing some of the optimization tricks I've picked up while working with Hive, and I hope they're useful to you. Hive optimization is where a programmer's skill really shows, and it is one of interviewers' favorite topics.
Tip 1. Control the number of reducers
The following lines are printed every time we run SQL from the Hive command line:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Many people wonder what these lines actually mean. Let's go through them one by one, starting with
set hive.exec.reducers.bytes.per.reducer=<number>
This is a Hive command that sets how many bytes of input data each reducer should process while the SQL runs; Hive uses this value to estimate the number of reducers for the query.
- Run the command set hive.exec.reducers.bytes.per.reducer=200000; so that each reducer handles at most 200000 bytes of input.
- Run the SQL:
select user_id,count(1) as cnt
from orders group by user_id limit 20;
Running the SQL above prints the following to the console:
Number of reduce tasks not specified. Estimated from input data size: 159
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0020, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0020/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 159
The first line of the console output says: Number of reduce tasks not specified. Estimated from input data size: 159. In other words, no reducer count was specified, so Hive estimates 159 reducer tasks from the size of the input data. Now look at the last line: number of mappers: 1; number of reducers: 159, which confirms that this SQL ends up with 159 reducers. So if we know the size of the data, we can control the number of reducers simply by setting how many bytes each reducer processes with set hive.exec.reducers.bytes.per.reducer.
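To make the estimate concrete, here is a minimal sketch of the arithmetic Hive is doing; the roughly 31.8 MB input size is inferred from the 159 estimate above, not measured:
-- reducers ≈ ceil(total input size / hive.exec.reducers.bytes.per.reducer)
--          ≈ ceil(31,800,000 bytes / 200,000 bytes) = 159
set hive.exec.reducers.bytes.per.reducer=200000;
select user_id, count(1) as cnt
from orders
group by user_id
limit 20;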
Next, look at
set hive.exec.reducers.max=<number>
This is also a Hive command. It sets the maximum number of reducers Hive will use; if we set the number to 50, the number of reducers can be at most 50.
Let's verify that this is true:
- Run the command set hive.exec.reducers.max=8; to cap the number of reducers at 8.
- Run the same SQL again:
select user_id,count(1) as cnt
from orders group by user_id limit 20;
The console prints the following:
Number of reduce tasks not specified. Estimated from input data size: 8
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0020, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0020/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 8
The first line of the console output is: Number of reduce tasks not specified. Estimated from input data size: 8. The number of reducers is 8, which is exactly what we claimed: the set hive.exec.reducers.max=8; command sets an upper bound on the number of reducers.
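Putting the two settings together, here is a minimal sketch of the interplay; the 159 figure carries over from the previous run, and the exact formula may differ slightly between Hive versions:
-- estimated = ceil(input size / hive.exec.reducers.bytes.per.reducer) = 159
-- actual    = min(estimated, hive.exec.reducers.max) = min(159, 8) = 8
set hive.exec.reducers.bytes.per.reducer=200000;
set hive.exec.reducers.max=8;
select user_id, count(1) as cnt
from orders
group by user_id
limit 20;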
Finally, let's look at set mapreduce.job.reduces=<number>.
- Run the command set mapreduce.job.reduces=5; to set the number of reducers to 5.
- Run the same SQL again:
select user_id,count(1) as cnt
from orders group by user_id limit 20;
The console prints the following:
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0026, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0026/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0026
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
From Number of reduce tasks not specified. Defaulting to jobconf value of: 5 and number of mappers: 1; number of reducers: 5, we can see that 5 reducers are generated.
If we change the value from 5 to 15 and run the same SQL, select user_id,count(1) as cnt from orders group by user_id limit 20;, the console prints:
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 15
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0027, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0027/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0027
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 15
As expected, the number of reducers has changed from 5 to 15.
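In other words, when mapreduce.job.reduces is set to a positive value, Hive skips its own estimate and uses that exact number, which is why the log says "Defaulting to jobconf value of". A minimal sketch; setting the value back to -1 (the default) restores the automatic estimate:
set mapreduce.job.reduces=15;   -- force exactly 15 reducers
select user_id, count(1) as cnt from orders group by user_id limit 20;

set mapreduce.job.reduces=-1;   -- hand control back to Hive's estimate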
To summarize, there are three ways to control the number of reducers in Hive:
set hive.exec.reducers.bytes.per.reducer=<number>
set hive.exec.reducers.max=<number>
set mapreduce.job.reduces=<number>
Among them, set mapreduce.job.reduces=<number> has the highest priority: once it is set, Hive uses that exact value and ignores the estimate.
More reducers is not always better. Each reducer produces its own output file, so too many reducers means too many small files, which burden HDFS and waste resources. If there are too few reducers, a single reducer may end up processing a huge amount of data (this is exactly what happens with data skew), the divide-and-conquer benefit of Hadoop is lost, and the job may even fail with an OOM error. How many reducers to use depends on the business scenario, and different scenarios call for different approaches.
Tip 2. Use map join
When a SQL statement joins several tables and one of the tables is smaller than 1 GB, using a map join can noticeably speed up the query: the small table is loaded into memory and the join happens on the map side, avoiding the shuffle and reduce phases. If even the smallest table is larger than 1 GB, a map join can fail with an OOM error.
Usage:
select /*+ MAPJOIN(table_a) */ a.*, b.* from table_a a join table_b b on a.id = b.id;
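The table named in the MAPJOIN hint (table_a here) is the one loaded into memory, so it should be the small table. As an alternative to the hint, Hive can also convert a common join into a map join automatically based on table size; a minimal sketch, assuming a Hive version where these settings are available and using an illustrative 25 MB threshold:
set hive.auto.convert.join=true;                  -- let Hive convert small-table joins to map joins
set hive.mapjoin.smalltable.filesize=25000000;    -- tables below ~25 MB are treated as "small"
select a.*, b.*
from table_a a
join table_b b on a.id = b.id;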
Tip 3. Use distinct + union all instead of union
When you need union to de-duplicate, distinct + union all usually performs better than union. The usual explanation is that every union step triggers its own de-duplication, whereas union all simply concatenates the rows and a single distinct de-duplicates them once at the end.
Usage of distinct + union all:
select count(distinct order_id, user_id, order_type)
from (
select order_id,user_id,order_type from orders where order_type='0' union all
select order_id,user_id,order_type from orders where order_type='1' union all
select order_id,user_id,order_type from orders where order_type='1'
)a;
Usage of union:
select count(*)
from(
select order_id,user_id,order_type from orders where order_type='0' union
select order_id,user_id,order_type from orders where order_type='0' union
select order_id,user_id,order_type from orders where order_type='1')t;
Tip 4. A general fix for data skew
How data skew shows up: the task progress sits at 99% for a long time because only a few reducer tasks are still unfinished, and those straggler tasks read and write a very large amount of data, often more than 10 GB. This frequently happens during aggregation operations.
A general fix: set hive.groupby.skewindata=true;
This splits the single MapReduce job into two MapReduce jobs: the first spreads the group-by keys randomly across reducers and computes partial aggregates, and the second aggregates those partial results by the real key, so no single reducer has to handle all the rows of a skewed key.
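A minimal sketch, reusing the orders table from Tip 1; the setting is simply switched on before the aggregation:
set hive.groupby.skewindata=true;   -- generate the skew-tolerant two-job plan
select user_id, count(1) as cnt
from orders
group by user_id;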
Here is a scenario I ran into: I needed to count each user's visits for a given day, with SQL like the following:
select t.user_id,count(*) from user_log t group by t.user_id
After running this statement, the job sat at 99% for as long as an hour. When I later analyzed the user_log table, I found that many rows had a null user_id. All of the rows with a null user_id are handled by a single reducer, which produces the data skew. There are two ways to fix it (both are sketched after the list):
1. Filter out the records whose user_id is null with a where condition.
2. Replace the null user_id with a random value, so that the data is distributed evenly across all reducers.
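A minimal sketch of both fixes, assuming user_id is a string column (cast as needed). Note that with fix 2 the null rows are scattered into many small ad-hoc groups instead of one huge group, so they are usually discarded or re-aggregated afterwards:
-- Fix 1: drop the null user_ids before aggregating.
select t.user_id, count(*) as cnt
from user_log t
where t.user_id is not null
group by t.user_id;

-- Fix 2: replace null with a random key so no single reducer receives all the null rows.
select t.new_user_id, count(*) as cnt
from (
  select case when user_id is null
              then concat('null_', cast(rand() as string))
              else user_id
         end as new_user_id
  from user_log
) t
group by t.new_user_id;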