Hive Optimizations for DISTINCT

阿新 • Published: 2019-02-06

How to optimize count(distinct xxx) in Hive, which by default runs on a single reducer.

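0x00 Cause

Because distinct is used, the map-side combiner cannot merge duplicate records. For a global aggregate such as count(), Hive starts only one reducer even when the number of reduce tasks is set explicitly, for example with set mapred.reduce.tasks=100;. All data sent from the map side is therefore processed in a single task, which becomes the performance bottleneck.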

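0x01 Solution 1 (divide and conquer)

The advantage of this method is that separate reducers each compute their own COUNT(DISTINCT), making full use of Hadoop's parallelism, and the partial counts are then summed to reach the same result indirectly. Note that if the same value could be counted by more than one task, it would be counted twice. The grouping key must therefore be derived from the target column itself (here, a substring of it), so that identical values always fall into the same group.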

SELECT
    SUM(tmp_total) total
FROM
    (select
        substr(uid, 1, 4) tag,
        count(distinct substr(uid, 5)) tmp_total
    from
        xxtable
    group by
        substr(uid, 1, 4)
    ) t1;

In testing on about 50 million rows, the query took 5 minutes without the optimization and 3 minutes with it, a noticeable improvement.
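The partition-then-sum idea can be checked on a small scale. The following is a minimal sketch that uses Python's sqlite3 module as a stand-in for Hive (an assumption made only so the example is runnable locally); the sample uids are hypothetical. It compares a direct COUNT(DISTINCT uid) with the sum of per-prefix distinct counts.

```python
import sqlite3

# Hypothetical sample data: 8-character uids, 3 prefixes x 50 suffixes = 150 distinct values.
uids = [f"{prefix}{suffix:04d}" for prefix in ("aaaa", "bbbb", "cccc") for suffix in range(50)]
uids += uids[:40]  # add duplicate rows so that DISTINCT actually matters

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxtable (uid TEXT)")
conn.executemany("INSERT INTO xxtable VALUES (?)", [(u,) for u in uids])

# Baseline: one global distinct count.
direct = conn.execute("SELECT COUNT(DISTINCT uid) FROM xxtable").fetchone()[0]

# Divide and conquer: distinct count per uid prefix, then sum the partial counts.
partitioned = conn.execute("""
    SELECT SUM(tmp_total) AS total
    FROM (
        SELECT substr(uid, 1, 4) AS tag,
               COUNT(DISTINCT substr(uid, 5)) AS tmp_total
        FROM xxtable
        GROUP BY substr(uid, 1, 4)
    ) t1
""").fetchone()[0]

print(direct, partitioned)
```

Both numbers agree because every uid belongs to exactly one prefix group, so no value is counted in two groups. This only checks correctness; it says nothing about Hive's runtime behaviour.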

0x02 Solution 2 (random grouping)

The core idea is to replace count(distinct) with group by.

SELECT -- 3
    SUM(tc)
FROM
    (select -- 2
        count(*) tc,
        tag
    from
        (select -- 1
            cast(rand() * 100 as bigint) tag,
            user_id
        from
            xxtable
        group by
            user_id
        ) t1
    group by
        tag
    ) t2;

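Layer 1 assigns each row a random number as its group tag, and uses group by to deduplicate user_id.
Layer 2 counts the rows in each group.
Layer 3 sums the per-group counts.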

In testing on about 50 million rows, the query took 5 minutes without the optimization and 2.5 minutes with it, a further improvement.
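The three layers can be exercised on a toy table. This is a sketch under stated assumptions: sqlite3 stands in for Hive, a rand() function is registered from Python to mimic Hive's rand(), and the user_id values are hypothetical.

```python
import random
import sqlite3

random.seed(0)

# Hypothetical data: 1000 rows covering 300 distinct user_ids.
user_ids = [f"u{n % 300}" for n in range(1000)]

conn = sqlite3.connect(":memory:")
# SQLite has no rand(); register one returning a float in [0, 1) like Hive's rand().
conn.create_function("rand", 0, random.random)
conn.execute("CREATE TABLE xxtable (user_id TEXT)")
conn.executemany("INSERT INTO xxtable VALUES (?)", [(u,) for u in user_ids])

direct = conn.execute("SELECT COUNT(DISTINCT user_id) FROM xxtable").fetchone()[0]

grouped = conn.execute("""
    SELECT SUM(tc)                               -- layer 3: sum the per-tag counts
    FROM (
        SELECT COUNT(*) AS tc, tag               -- layer 2: count rows per tag
        FROM (
            SELECT CAST(rand() * 100 AS INTEGER) AS tag, user_id
            FROM xxtable
            GROUP BY user_id                     -- layer 1: one row per user_id
        ) t1
        GROUP BY tag
    ) t2
""").fetchone()[0]

print(direct, grouped)
```

The result does not depend on which tag a user_id receives: layer 1 already leaves exactly one row per user_id, so the tags only decide how the counting work is split up.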

Using two group by passes to work around count(distinct) data skew:

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=2;
from (
    select
        yw_type,
        sum(case when log_type='pv' then ct end) as pv,
        sum(case when log_type='pv' then 1 end) as uv,
        sum(case when log_type='click' then ct end) as ipv,
        sum(case when log_type='click' then 1 end) as ipv_uv
    from (
        -- first group by: one row per (yw_type, log_type, uid), with its row count
        select
            yw_type, log_type, uid, count(1) as ct
        from (
            select 'total' yw_type, 'pv' log_type, uid from pv_log
            union all
            select 'cat' yw_type, 'click' log_type, uid from click_log
        ) t group by yw_type, log_type, uid
    -- second group by: sum of ct gives the totals, the number of rows gives distinct uids
    ) t group by yw_type
) t
insert overwrite table tmp_1
select pv, uv, ipv, ipv_uv
where yw_type='total'
insert overwrite table tmp_2
select pv, uv, ipv, ipv_uv
where yw_type='cat';
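To see why two group by passes yield both the total count and the distinct-user count, here is a small sketch that again uses sqlite3 as a stand-in for Hive (an assumption). The table contents are hypothetical, and the multi-insert step is left out because SQLite has no equivalent; only the nested aggregation is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pv_log (uid TEXT)")
conn.execute("CREATE TABLE click_log (uid TEXT)")
# Hypothetical logs: 6 page views from 3 users, 3 clicks from 2 users.
conn.executemany("INSERT INTO pv_log VALUES (?)",
                 [("a",), ("a",), ("b",), ("c",), ("c",), ("c",)])
conn.executemany("INSERT INTO click_log VALUES (?)", [("a",), ("b",), ("b",)])

rows = conn.execute("""
    SELECT yw_type,
           SUM(CASE WHEN log_type = 'pv' THEN ct END)    AS pv,
           SUM(CASE WHEN log_type = 'pv' THEN 1 END)     AS uv,
           SUM(CASE WHEN log_type = 'click' THEN ct END) AS ipv,
           SUM(CASE WHEN log_type = 'click' THEN 1 END)  AS ipv_uv
    FROM (
        -- first pass: deduplicate per user and keep that user's row count
        SELECT yw_type, log_type, uid, COUNT(1) AS ct
        FROM (
            SELECT 'total' AS yw_type, 'pv' AS log_type, uid FROM pv_log
            UNION ALL
            SELECT 'cat' AS yw_type, 'click' AS log_type, uid FROM click_log
        ) u
        GROUP BY yw_type, log_type, uid
    ) g
    GROUP BY yw_type          -- second pass: aggregate the per-user rows
    ORDER BY yw_type
""").fetchall()

for row in rows:
    print(row)
```

After the first pass each user appears once per (yw_type, log_type), so summing 1 counts distinct users while summing ct recovers the raw row count. Columns that do not apply to a yw_type come back as NULL (None in Python).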

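Using random numbers to reduce data skew. Rows whose join key is NULL or empty would otherwise all be sent to the same reducer; replacing such keys with a random string spreads them out, and because the random string matches nothing in the right-hand table, the result of the left outer join is unchanged: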

select
a.uid
from big_table_a a
left outer join big_table_b b
on b.uid = case when a.uid is null or length(a.uid)=0
then concat('rd_sid',rand()) else a.uid end;
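The effect of salting empty keys can be illustrated with a rough simulation. This sketch does not use Hive's actual partitioner: the reducer count, the crc32-based hash, and the sample keys are all assumptions chosen for illustration, and NULL is treated like an empty string for hashing purposes.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 10  # hypothetical number of reduce tasks

def reducer_for(key: str) -> int:
    """Pick a reducer by hashing the join key (a stand-in for the real partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def join_key(uid):
    """Mirror the CASE expression: replace NULL/empty uids with a random key."""
    if uid is None or len(uid) == 0:
        return "rd_sid" + str(random.random())
    return uid

random.seed(0)
# Hypothetical skewed input: 10,000 rows without a usable uid, 1,000 normal rows.
uids = [None] * 5000 + [""] * 5000 + [f"user{n}" for n in range(1000)]

unsalted = Counter(reducer_for(uid or "") for uid in uids)
salted = Counter(reducer_for(join_key(uid)) for uid in uids)

print("largest reducer load without salting:", max(unsalted.values()))
print("largest reducer load with salting:   ", max(salted.values()))
```

Without salting, every NULL or empty key hashes to one reducer, so that reducer receives at least 10,000 rows. With salting, the same rows are spread across all reducers.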
