Solving Data Skew in Hive

阿新 • Published: 2018-11-01

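Data skew is a frequent problem when working with Hive. It almost always arises in operations that shuffle data, such as group by and join. These operations gather rows by key, so if the key values are too concentrated, most of the rows end up on a single machine during the gather step. That is data skew.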

The typical symptom: the reduce phase of a job reaches 99% and then stalls, with the last 1% still unfinished after several hours.

Common causes of data skew:
#Null values
#Joining keys of different data types
#Non-null join keys where a single key value is repeated a huge number of times
#distinct and count(distinct)
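
Before picking a fix, it helps to confirm which key is actually hot. The query below is a minimal sketch, assuming the logs table and user_id column used in the examples later in this article; substitute your own table and join key.

```sql
-- Hypothetical check: list the 20 most frequent join keys.
-- A key (or NULL) whose row count dwarfs the others is a likely source of skew.
select user_id, count(1) as cnt
from logs
group by user_id
order by cnt desc
limit 20;
```

If NULL tops the list, case 1 below applies. If one ordinary value dominates, look at case 3.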

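1. Data skew caused by null values. Scenario:
Logs often have missing information, for example the user_id in site-wide logs. Joining the log's user_id against bmw_users then runs into data skew, because all rows with a null user_id are sent to the same reducer.
Solution 1:
Exclude rows with a null user_id from the join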

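-- Both branches of union all must return the same columns, so select a.* in each.
-- To keep columns from b, list them in the first branch and pad the second with nulls.
select a.*
from log a
join bmw_users b
on a.user_id is not null
and a.user_id = b.user_id
union all
select a.*
from log a
where a.user_id is null;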

Solution 2:
Assign a new key value to the nulls (recommended)

select *
from logs a
left join bmw_users b
-- turn a null key into a string plus a random number so null rows spread across reducers
on case when a.user_id is null then concat('dp_hive', rand())
else a.user_id end = b.user_id;

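2. Data skew caused by joining different data types. Scenario:
In the users table the user_id field is an int, while in the logs table the user_id field holds both string-style and int-style values. When the two tables are joined on user_id, the default hash is computed on the int id, so every record with a string id is sent to the same reducer.
Solution:
Cast the numeric type to string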

select *
from users a
left join logs b
-- cast the int side (users.user_id) so both sides are compared and hashed as strings
on cast(a.user_id as string) = b.user_id;
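
To gauge how many log rows are affected, one option is to count the ids that do not parse as numbers. This is a rough sketch, assuming logs.user_id is declared as string; it relies on Hive's cast returning NULL for a string that is not a valid number.

```sql
-- Hypothetical check: count non-null user_id values that are not numeric.
select count(1) as non_numeric_ids
from logs
where user_id is not null
  and cast(user_id as bigint) is null;
```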

3. The join key is non-null, but one key value is repeated a huge number of times
Solution:
Add a random number

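select a.key as key, b.pv as pv
from (
    select key
    from table1
    where dt = '2018-06-18') a
left join (
    select key, sum(pv) as pv
    from (
        select key, rnd, count(1) as pv
        from (
            -- add a random number to raise parallelism; it gets its own subquery
            -- because Hive cannot group by an alias defined in the same select
            select key, round(rand() * 1000) as rnd
            from table2
            where dt = '2018-06-18') r
        group by key, rnd) tmp
    group by key) b
on a.key = b.key;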
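
Hive also ships configuration properties aimed at the same problem, which can be tried before rewriting a query by hand. The settings below are a sketch of that alternative and are not part of the original write-up; their effect depends on the Hive version and execution engine, so treat the values as a starting point.

```sql
-- Partial aggregation on the map side, plus a two-stage group by in which
-- the first stage distributes rows randomly (the same idea as the manual random number).
set hive.map.aggr=true;
set hive.groupby.skewindata=true;

-- Runtime handling of skewed join keys: keys with more rows than the
-- threshold are processed in a follow-up job.
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
```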

4. distinct and count(distinct)
Solution:
Deduplicate with group by
#Replacing distinct:
Original SQL:

select distinct key from A;

Rewritten SQL:

select key from A group by key;

#Replacing a single-dimension count(distinct)
Original SQL:

select ship_id, count(distinct order_id) as ship_order_num
from A
where dt = '2018-06-18'
group by ship_id;

Rewritten SQL:

select ship_id, count(1) as ship_order_num
from
(select ship_id, order_id
from A
where dt = '2018-06-18'
group by ship_id, order_id) t
group by ship_id;

#Replacing a multi-dimension count(distinct): process each dimension separately, then join the results
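
The original gives no query for this case, so the following is only a sketch of what "process each dimension separately, then join" could look like. It reuses table A, ship_id and order_id from the single-dimension example; the user_id column and the ship_user_num alias are assumptions made for illustration.

```sql
-- Hypothetical starting point with two distinct counts in one pass:
--   select ship_id,
--          count(distinct order_id) as ship_order_num,
--          count(distinct user_id)  as ship_user_num
--   from A where dt = '2018-06-18' group by ship_id;

select t1.ship_id, t1.ship_order_num, t2.ship_user_num
from (
    -- dimension 1: dedupe (ship_id, order_id) pairs, then count per ship_id
    select ship_id, count(order_id) as ship_order_num
    from (select ship_id, order_id
          from A
          where dt = '2018-06-18'
          group by ship_id, order_id) o
    group by ship_id) t1
join (
    -- dimension 2: dedupe (ship_id, user_id) pairs, then count per ship_id
    select ship_id, count(user_id) as ship_user_num
    from (select ship_id, user_id
          from A
          where dt = '2018-06-18'
          group by ship_id, user_id) u
    group by ship_id) t2
on t1.ship_id = t2.ship_id;
```

count(order_id) and count(user_id) are used instead of count(1) so that a NULL value is not counted, which matches what count(distinct) does. Both subqueries read the same filtered rows, so every ship_id appears on both sides and an inner join loses nothing.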
