[Hive] Hive Tuning: Running Jobs in Parallel
阿新 • Published: 2019-02-10
Business Background
The Hive SQL for extract_trfc_page_kpi is as follows:
set mapred.job.queue.name=pms;
set hive.exec.reducers.max=8;
set mapred.reduce.tasks=8;
set mapred.job.name=extract_trfc_page_kpi;
insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')
select distinct
page_type_id,
pv,
uv,
'$yesterday' update_time
from
(
-- For PC and H5
select
page_type_id,
sum(pv) as pv,
sum(uv) as uv
from dw.rpt_trfc_page_kpi
where ds = '$yesterday' and stat_type = 1
group by page_type_id
union all
-- Special handling for the PC search pages
select
5 as page_type_id,
sum(pv) as pv,
sum(uv) as uv
from dw.rpt_trfc_page_kpi
where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)
union all
-- For APP
select
a.page_type_id,
sum(pv) as pv,
sum(uv) as uv
from dw.rpt_trfc_page_kpi a
left outer join (
select distinct
page_type_id,
old_page_type_id
from tandem.mobile_backend_page_url_rule
where is_delete = 0
) b on (a.page_type_id = b.old_page_type_id)
where a.ds = '$yesterday' and stat_type = 1
group by a.page_type_id
) t;
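The SQL above contains two union all operations. Executed sequentially, it takes 20 minutes.
Optimization Strategy
Looking at the SQL above, the three queries joined by union all have no direct dependency on one another, so there is no need to run them one after another. The idea is to run the three queries in parallel. Hive provides the following parameters for running jobs in parallel: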
-- Enable parallel job execution
set hive.exec.parallel=true;
-- Maximum number of threads for jobs that one SQL statement may run in parallel
set hive.exec.parallel.thread.number=8;
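One way to check that the branches really are independent is to look at the STAGE DEPENDENCIES section of Hive's EXPLAIN output. Stages that do not depend on one another are the ones that can run in parallel. The following is a minimal sketch, not the original job: the query is a reduced two-branch form and the date literal is a placeholder, both chosen for illustration.

```sql
-- Reduced, illustrative form of the job: two independent aggregations
-- combined with union all. The date '2019-02-09' is a placeholder value.
explain
select page_type_id, pv, uv
from
(
  select
    page_type_id,
    sum(pv) as pv,
    sum(uv) as uv
  from dw.rpt_trfc_page_kpi
  where ds = '2019-02-09' and stat_type = 1
  group by page_type_id
  union all
  select
    5 as page_type_id,
    sum(pv) as pv,
    sum(uv) as uv
  from dw.rpt_trfc_page_kpi
  where ds = '2019-02-09' and stat_type = 1 and page_type_id in (51, 52)
) t;
```

In the plan, each stage lists the stages it depends on. If the stages for the two aggregations do not list each other, nothing forces them to run in sequence.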
Approach 1
Add the two Hive parameters above when running the SQL, for example:
set mapred.job.queue.name=pms;
set hive.exec.reducers.max=8;
set mapred.reduce.tasks=8;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
set mapred.job.name=extract_trfc_page_kpi;
insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')
select distinct
page_type_id,
pv,
uv,
'$yesterday' update_time
from
(
-- For PC and H5
select
page_type_id,
sum(pv) as pv,
sum(uv) as uv
from dw.rpt_trfc_page_kpi
where ds = '$yesterday' and stat_type = 1
group by page_type_id
union all
-- Special handling for the PC search pages
select
5 as page_type_id,
sum(pv) as pv,
sum(uv) as uv
from dw.rpt_trfc_page_kpi
where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)
union all
-- For APP
select
a.page_type_id,
sum(pv) as pv,
sum(uv) as uv
from dw.rpt_trfc_page_kpi a
left outer join (
select distinct
page_type_id,
old_page_type_id
from tandem.mobile_backend_page_url_rule
where is_delete = 0
) b on (a.page_type_id = b.old_page_type_id)
where a.ds = '$yesterday' and stat_type = 1
group by a.page_type_id
) t;
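Values assigned with set apply to the current session only. If the same script goes on to run statements that should not be parallelized, the option can be switched off again after the heavy statement. The following is a rough sketch, assuming the default for hive.exec.parallel is false:

```sql
-- Enable parallel execution only around the heavy statement.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

-- ... the insert overwrite statement for pms.extract_trfc_page_kpi goes here ...

-- Switch it back off so later statements in this session run as before.
-- Assumption: the cluster-wide default is false.
set hive.exec.parallel=false;
```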
Approach 2
Set the parameters in hive-site.xml. First, check the configuration parameters of the current Hive installation:
hive> set -v;
...
hive.exec.orc.zerocopy=false
hive.exec.parallel=false
hive.exec.parallel.thread.number=8
hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLogger
hive.exec.rcfile.use.explicit.header=true
hive.exec.rcfile.use.sync.cache=true
hive.exec.reducers.bytes.per.reducer=1000000000
hive.exec.reducers.max=999
hive.exec.rowoffset=false
hive.exec.scratchdir=/tmp/hive-pms
hive.exec.script.allow.partial.consumption=false
hive.exec.script.maxerrsize=100000
hive.exec.script.trust=false
hive.exec.show.job.failure.debug.info=true
...
These parameters are configured in $HIVE_HOME/conf/hive-site.xml. Now add the following to that file:
<property>
<name>hive.exec.parallel</name>
<value>true</value>
</property>
<property>
<name>hive.exec.parallel.thread.number</name>
<value>16</value>
</property>
Restart Hive, and the newly configured parameters have taken effect:
hive> set -v;
...
hive.exec.orc.skip.corrupt.data=false
hive.exec.orc.zerocopy=false
hive.exec.parallel=true
hive.exec.parallel.thread.number=16
hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLogger
hive.exec.rcfile.use.explicit.header=true
hive.exec.rcfile.use.sync.cache=true
hive.exec.reducers.bytes.per.reducer=1000000000
hive.exec.reducers.max=999
hive.exec.rowoffset=false
hive.exec.scratchdir=/tmp/hive-pms
hive.exec.script.allow.partial.consumption=false
...
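Instead of scanning the full set -v listing, the two parameters can also be queried by name. In the Hive CLI, set followed by a key and no value prints that key's current setting. A short sketch:

```sql
-- Print only the two parameters of interest.
set hive.exec.parallel;
set hive.exec.parallel.thread.number;
```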
Conclusion
In testing, adding these two parameters cut the run time of the extract_trfc_page_kpi script from 20 minutes to 3 minutes.