Hive optimization methods and usage tips
Parts of this content are drawn from:
http://www.atatech.org/article/detail/5617/0
I. Introduction to UDFs
1. Basic UDFs
(1) SHOW FUNCTIONS: use this to discover unfamiliar functions.
DESCRIBE FUNCTION <function_name>;
(2) A IS NULL
A IS NOT NULL
(3) A LIKE B: ordinary SQL pattern matching, e.g. LIKE 'a%'
A RLIKE B: matching via regular expression
A REGEXP B: matching via regular expression
(4) round(double a): rounds to the nearest integer
(5) rand(), rand(int seed): returns a random number uniformly distributed in (0,1)
(6) COALESCE(pv, 0): turns rows where pv is NULL into 0; very practical
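For illustration, a minimal sketch combining several of the basic UDFs above; the table name log_table and its columns are hypothetical:
select
  coalesce(pv, 0) as pv,                 -- NULL pv becomes 0
  round(rand() * 100) as sample_bucket,  -- uniform random bucket in [0,100]
  case when referrer is null then 'direct' else referrer end as src
from log_table
where url rlike '^https?://';            -- regular-expression match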
2. Date functions
(1) datediff(string enddate, string startdate):
returns the number of days between enddate and startdate, e.g. datediff('2009-03-01', '2009-02-27') = 2
(2) date_add(string startdate, int days):
adds days to startdate: date_add('2008-12-31', 1) = '2009-01-01'
(3) date_sub(string startdate, int days):
subtracts days from startdate: date_sub('2008-12-31', 1) = '2008-12-30'
(4) date_format(date, date_pattern)
CREATE TEMPORARY FUNCTION date_format AS 'com.taobao.hive.udf.UDFDateFormat';
Formats the date/time value according to the format string and returns the result string.
date_format('2010-10-10', 'yyyy-MM-dd', 'yyyyMMdd')
(5) str_to_date(str, format)
Converts a string to a date.
CREATE TEMPORARY FUNCTION str_to_date AS 'com.taobao.hive.udf.UDFStrToDate';
str_to_date('09/01/2009', 'MM/dd/yyyy')
3. String functions
(1) length(string A): returns the length of the string
(2) concat(string A, string B...):
concatenates strings, e.g. concat('foo', 'bar') = 'foobar'. Note that this function accepts any number of arguments.
(3) substr(string A, int start), substring(string A, int start):
returns the substring starting at start, e.g. substr('foobar', 4) = 'bar'
(4) substring(string A, int start, int len):
returns a substring of limited length, e.g. substr('foobar', 4, 1) = 'b'
(5) split(string str, string pat):
splits str using the regular expression pat and returns a list, e.g. split('foobar', 'o')[2] = 'bar'.
(6) getkeyvalue(str, param):
extracts the value for the given key from a string (UDFKeyValue)
CREATE TEMPORARY FUNCTION getkeyvalue AS 'com.taobao.hive.udf.UDFKeyValue';
4. Custom functions
(1) row_number
CREATE TEMPORARY FUNCTION row_number AS 'com.taobao.ad.data.search.udf.UDFrow_number';
select ip, uid, row_number(ip, uid) from (
  select ip, uid, logtime from atpanel
  distribute by ip, uid
  sort by ip, uid, logtime desc
) a;
(2) Splitting key-value pairs
CREATE TEMPORARY FUNCTION ExplodeEX AS 'com.taobao.hive.udtf.UDTFExplodeEX';
select
  split(kvs, '-')[0] as key,
  split(kvs, '-')[1] as value
from ( select 'a-1|b-2' as kv from dual ) t
lateral view explode(split(kv, '\\|')) result as kvs;
II. New Hive features
1. COUNT(*) and COUNT DISTINCT over multiple columns
select count(distinct col1, col2) from table_name;
select count(*) from table_name;
2. Option to run Hive in local mode
Setting mapred.job.tracker=local enables local execution mode.
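A minimal session sketch, assuming a hypothetical small_table; useful for debugging on small inputs without a cluster round-trip:
set mapred.job.tracker=local;
select count(*) from small_table;   -- executes as a single local process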
3. Enhanced column-rename syntax
Adds the ALTER TABLE table_name CHANGE old_name new_name syntax.
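A sketch of the rename; note that Hive's CHANGE clause also requires restating the column type (the table and column names here are hypothetical):
alter table t change old_name new_name string;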
4. UNIQUE JOIN support (HIVE-591)
select .. from JOINTABLES (A,B,C) WITH KEYS (A.key, B.key, C.key) where ....
5. Syntax for inspecting table and partition status (HIVE-667)
Use the SHOW TABLE EXTENDED syntax to check table and partition status, including size and creation/access timestamps.
6. STRUCT support in CREATE TABLE
7. Hint for selecting the driving table
8. Adds the /*+ STREAMTABLE(tb_alias) */ hint to specify the driving (streamed) table in a join:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
With this hint specified, the default of streaming the rightmost table no longer applies.
9. Left Semi-Join (HIVE-870)
Left Semi-Join efficiently implements the semantics of IN/EXISTS subqueries. Given the following SQL semantics:
(1) SELECT a.key, a.value FROM a WHERE a.key in (SELECT b.key FROM b);
Before Left Semi-Join was available, Hive expressed this as:
SELECT t1.key, t1.value FROM a t1
left outer join (SELECT distinct key from b) t2
on t1.key = t2.key where t2.key is not null;
(2) This can be replaced with a Left Semi-Join:
SELECT a.key, a.val FROM a LEFT SEMI JOIN b on (a.key = b.key)
This implementation saves at least one MapReduce pass; note that Left Semi-Join join conditions must be equality conditions.
10. Skew Join optimization (HIVE-964), for skewed data
Converts skewed join keys into a map join. Enable hive.optimize.skewjoin=true to optimize skewed data. Skew Join optimization requires an extra map-join stage and does not save shuffle cost.
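A minimal configuration sketch; hive.skewjoin.key (the row-count threshold beyond which a key is treated as skewed) is the usual companion setting, included here as an assumption about your Hive version, and tables a and b are hypothetical:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;   -- keys with more rows than this are treated as skewed
select a.key, a.val, b.val
from a join b on a.key = b.key;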
11. Sorted merge (map) join (HIVE-1194)
(requires tables sorted on the join key)
If the tables in a MapJoin are all sorted, this feature lets the join avoid scanning the entire table, which greatly speeds up the join. Enable it with hive.optimize.bucketmapjoin.sortedmerge=true for a significant performance gain.
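A sketch of the setup this relies on, assuming hypothetical tables bucketed and sorted on the join key; the bucket count and all names are illustrative:
create table a_srt (key int, val string)
clustered by (key) sorted by (key) into 32 buckets;
create table b_srt (key int, val string)
clustered by (key) sorted by (key) into 32 buckets;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
select /*+ MAPJOIN(b_srt) */ a_srt.key, b_srt.val
from a_srt join b_srt on a_srt.key = b_srt.key;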
12. ALTER TABLE support for changing a partition's InputFormat/OutputFormat
This lets us store subsequent table partitions in a compressed format (e.g. SequenceFileInputFormat) without modifying existing partitions, i.e. a transparent switch to the compressed format.
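A sketch of the transparent switch, assuming a hypothetical partitioned table logs: changing the table-level default means newly created partitions are written as SequenceFile, while existing partitions keep their original format and remain readable:
alter table logs set fileformat sequencefile;
-- older partitions retain their original InputFormat/OutputFormat definitions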
13. Concurrent submission of independent MR stages (HIVE-549)
Previously Hive submitted MR jobs strictly in sequence. This enhancement allows MR stages with no dependency on each other (e.g. the multiple subqueries of a UNION ALL) to be submitted concurrently, which in some cases improves the latency of a single HQL command. The following parameters control concurrent submission:
hive.exec.parallel [=false]
hive.exec.parallel.thread.number [=8]
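A usage sketch with hypothetical tables; the two UNION ALL branches have no dependency on each other and can run as concurrent jobs:
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
select * from (
  select uid from pv_log
  union all
  select uid from click_log
) t;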
14. Sorted Group By (HIVE-931)
(pre-processing of intermediate tables)
A GROUP BY over already-sorted columns no longer requires an extra MR pass, improving execution efficiency.
15. UDTF support
A UDTF (user-defined table function) is a kind of UDF that can return multiple records. This change allows many current Transform scripts to be replaced by more general, more efficient, and more user-friendly UDTF implementations. A UDTF produces 1:n output and can be used for row-to-column transformations and the like.
A UDTF cannot be mixed with ordinary columns in a SELECT, cannot be nested, and cannot be combined with GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY in the same subquery.
A UDTF can be combined with LATERAL VIEW.
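A minimal LATERAL VIEW sketch using the built-in explode UDTF; the table and column names are hypothetical:
select t.uid, tag
from user_tags t
lateral view explode(split(t.tag_list, ',')) v as tag;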
16. Dynamic partitions
Dynamic partitioning (DP) is enabled by setting hive.exec.dynamic.partition=true. Usage:
INSERT OVERWRITE TABLE tbl PARTITION (col1[=value][, col2[=value] ...])
With hive.exec.dynamic.partition.mode=nonstrict, dynamic partitioning carries some risk, including small files and overwritten data. The default partition name is controlled by:
hive.exec.default.partition.name
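A usage sketch with hypothetical tables: pv_daily is assumed partitioned by ds, and one partition is written per distinct ds value in the source:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table pv_daily partition (ds)
select uid, url, ds      -- the trailing ds column feeds the dynamic partition
from pv_log;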
17. Enforced sorting on insert (HIVE-1193)
Just enable the hive.enforce.sorting option. This feature is very useful for sorted merge bucket (map) joins.
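A sketch under the assumption of a hypothetical target table declared with SORTED BY; with the option on, Hive adds the required sort at insert time:
set hive.enforce.sorting=true;
-- sorted_tbl assumed declared: clustered by (key) sorted by (key) into 32 buckets
insert overwrite table sorted_tbl
select key, val from raw_logs;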
18. View support
Useful for column-level access control; the CREATE VIEW syntax (a concrete sketch follows item 19 below):
CREATE VIEW [IF NOT EXISTS] view_name
[ (column_name [COMMENT column_comment], ... ) ]
[COMMENT 'view_comment']
AS SELECT ...
[ ORDER BY ... LIMIT ... ]
19. Cartesian-product join support (a 1.0 feature):
SELECT a.*, b.* FROM a CROSS JOIN b
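As noted under item 18, a minimal view sketch for column-level access control; the table and column names are hypothetical, and users granted access to the view see only the exposed columns:
create view user_public as
select uid, nickname      -- expose only non-sensitive columns
from user_full;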
III. Summary of Hive optimization techniques
1. Multi-table join optimization; code structure:
select .. from JOINTABLES (A,B,C) WITH KEYS (A.key, B.key, C.key) where ....
A multi-table join whose join conditions all use the same key is optimized into a single job, as the sketch below shows.
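A sketch with hypothetical tables a, b, c; because every join uses the same key, Hive compiles this into one MapReduce job:
select a.val, b.val, c.val
from a
join b on a.key = b.key
join c on a.key = c.key;   -- same join key throughout => one job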
2. Left Semi-Join efficiently implements IN/EXISTS subquery semantics
SELECT a.key, a.value FROM a WHERE a.key in (SELECT b.key FROM b);
(1) Before Left Semi-Join was available, Hive expressed this as:
SELECT t1.key, t1.value FROM a t1
left outer join (SELECT distinct key from b) t2 on t1.key = t2.key
where t2.key is not null;
(2) This can be replaced with a Left Semi-Join:
SELECT a.key, a.val FROM a LEFT SEMI JOIN b on (a.key = b.key)
This saves at least one MR pass; note that Left Semi-Join join conditions must be equality conditions.
3. Pre-sorting to reduce data scanned by map joins and GROUP BY (HIVE-1194)
(1) Pre-sort important reporting tables by enabling the hive.enforce.sorting option.
(2) If the tables in a MapJoin are all sorted, the join no longer needs to scan the entire table, which greatly speeds it up. Enable it with
hive.optimize.bucketmapjoin.sortedmerge=true for a significant performance gain.
set hive.mapjoin.cache.numrows=10000000;
set hive.mapjoin.size.key=100000;
insert overwrite table pv_users
select /*+ MAPJOIN(pv) */ pv.pageid, u.age
from page_view pv
join user u on (pv.userid = u.userid);
(3) Sorted Group By (HIVE-931)
A GROUP BY over already-sorted columns no longer requires an extra MR pass, improving execution efficiency.
4. One-pass PV/UV computation framework
(1) Batch submission of multiple MR jobs
hive.exec.parallel [=false]
hive.exec.parallel.thread.number [=8]
(2) One-pass computation framework combined with multi-group-by
If the data volume is small, multiple UNIONs are optimized into a single job;
if the computation is too heavy, batch MR submission can be enabled to relieve the pressure;
two GROUP BY stages solve the COUNT DISTINCT data-skew problem, as in the example below.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=2;
from (
  select
    yw_type,
    sum(case when log_type='pv' then ct end) as pv,
    sum(case when log_type='pv' then 1 end) as uv,
    sum(case when log_type='click' then ct end) as ipv,
    sum(case when log_type='click' then 1 end) as ipv_uv
  from (
    select
      yw_type, log_type, uid, count(1) as ct
    from (
      select 'total' yw_type, 'pv' log_type, uid from pv_log
      union all
      select 'cat' yw_type, 'click' log_type, uid from click_log
    ) t group by yw_type, log_type, uid
  ) t group by yw_type
) t
insert overwrite table tmp_1
select pv, uv, ipv, ipv_uv
where yw_type='total'
insert overwrite table tmp_2
select pv, uv, ipv, ipv_uv
where yw_type='cat';
5. Controlling the number of maps and reduces in Hive
(1) Merging small files
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
The hive.input.format setting merges small files: input larger than the 128 MB block size is split at 128 MB; pieces between 100 MB and 128 MB are split at 100 MB; everything under 100 MB (small files plus the remainders of splitting large files) is merged together. In the author's example this yielded 74 splits in the end.
(2) Increasing task parallelism for long-running jobs
set mapred.reduce.tasks=10;
(Note: this setting raises the reduce count; the map count is driven by the split-size parameters above.)
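If it is the map count you need to raise, a common approach (an assumption about your setup, not from the original) is to shrink the max split size so the same input yields more splits; big_table is hypothetical:
set mapred.max.split.size=32000000;   -- ~32 MB splits => roughly 4x more maps than 128 MB
select count(*) from big_table;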
6. Using random numbers to reduce data skew
Joins between large tables easily skew on NULL keys:
select
a.uid
from big_table_a a
left outer join big_table_b b
on b.uid = case when a.uid is null or length(a.uid)=0
then concat('rd_sid',rand()) else a.uid end;
IV. Tips
1. Null handling: replace \N in result tables with the empty string
ALTER TABLE a SET SERDEPROPERTIES('serialization.null.format' = '');
2. Avoid brute-force partition scans
today's full data = yesterday's full data + today's increment
30-day data = previous day's 30-day data - day-31 data + today's data
Applicable when requirements are stable and you need to access 30 days or a year of data.
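A sketch of the rolling full-table pattern, with hypothetical table and partition names; only the small daily increment is scanned instead of the whole history:
insert overwrite table user_full partition (ds='20121002')
select uid, max(last_visit) as last_visit
from (
  select uid, last_visit from user_full where ds='20121001'   -- yesterday's full data
  union all
  select uid, last_visit from user_incr where ds='20121002'   -- today's increment
) t
group by uid;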
3. Use dynamic partitions to reduce job execution time
V. Finding inefficient code via JobTracker metadata
1. Missing ON conditions or too many partitions scanned
For UV computation, see the one-pass PV/UV framework above; queries whose ON clause or partition filter was left out can simply be fixed.
select
  id as skynet_id, prgname as task_path, viewname as display_name, job_id, job_name, job_value,
  length(trim(inputdir))-length(replace(trim(inputdir),',',''))+1 as pathcnt
from (
  select
    t1.id, t1.prgname, t1.viewname,
    t3.job_id, t3.job_name,
    t3.job_value,
    DBMS_LOB.SUBSTR(t3.job_value, 4000) as inputdir
  from (
    select
      id, prgname, paravalue, viewname from dwa.etl_task_program t
    where priority in('xx','xxx') --## fill in your own business baseline ids here
    and appflag=0
  ) t1,
  dwa.hdp_job_map t2,
  dwa.hdp_job_conf t3
  where t1.id = t2.id
  and t2.job_id = t3.job_id
  and t2.gmtdate = trunc(sysdate-1)
  and t3.gmtdate = trunc(sysdate-1)
  and t3.job_name = 'mapred.input.dir'
)
where length(trim(inputdir))-length(replace(trim(inputdir),',','')) > 10;
2. The same table scanned multiple times in one script
Try to read all the data you need in a single pass.
select sky_id as skynet_id, viewname as display_name,
  tab_name as scanned_table, on_duty as owner, count(1) as scan_cnt
from (
  select distinct a.tab_name, c.sql_id, a.sub_sql_id, c.sky_id, e.viewname, e.on_duty
  from dwa.meta_tab a,
    dwa.meta_sqlsub b,
    (select * from
      (select sky_id, sql_id, sql_src,
        row_number() over(partition by sky_id, length(sql_src) order by sql_id) rn
      from dwa.meta_sqlfull
      ) where rn=1) c,
    dwa.meta_col d, dwa.etl_task_program e
  where e.priority in('xx','xxx') --## fill in your own business baseline ids here
  and e.appflag=0 and e.id=c.sky_id
  and a.sub_sql_id=b.sub_sql_id and a.tab_id=d.tab_id and a.sub_sql_id=d.sub_sql_id and b.sqlfull_id=c.sql_id
  and a.tab_name not like '%-%' and b.sql_type='select'
  order by c.sky_id, c.sql_id, a.sub_sql_id
) group by sky_id, viewname, tab_name, on_duty
having count(1) > 1
order by scan_cnt desc;
3. Too many jobs
Read all the required data in a single pass where possible;
use UNION to merge tasks;
LEFT OUTER JOINs whose ON conditions are identical are merged into a single job.
SELECT /*+ parallel(t,32) */
groupname,
id,
BIZ_SORTID,
ON_DUTY,
PRGNAME,
job_cnt,
JOB_TOTAL_MAPS,
JOB_TOTAL_REDUCES,
TOTAL_TIME,
HDFS_BYTES_READ,
HDFS_BYTES_WRITTEN,
TOTAL_MAP_TIME,
TOTAL_REDUCE_TIME,
MAP_INPUT_RECORDS,
MAP_OUTPUT_RECORDS,
REDUCE_INPUT_RECORDS,
REDUCE_OUTPUT_RECORDS,
time,
row_number() over(partition by groupname order by TIME desc) rn_time,
row_number() over(partition by groupname order by TOTAL_MAP_TIME+TOTAL_REDUCE_TIME desc) rn_slots
from(
select
DWA.ETL_TASK_BASELINE.name as groupname,
DWA.HDP_JOB_MAP.ID,
DWA.ETL_TASK_PROGRAM.BIZ_SORTID,
DWA.ETL_TASK_PROGRAM.ON_DUTY,
DWA.ETL_TASK_LOG.PRGNAME,
count(DWA.HDP_JOB_MAP.job_id) job_cnt, -- number of jobs in the Skynet task
sum(DWA.HDP_JOB_STAT.JOB_TOTAL_MAPS) JOB_TOTAL_MAPS,
sum(DWA.HDP_JOB_STAT.JOB_TOTAL_REDUCES) JOB_TOTAL_REDUCES,
sum(DWA.HDP_JOB_STAT.TOTAL_TIME) TOTAL_TIME,
sum(DWA.HDP_JOB_STAT.HDFS_BYTES_READ) HDFS_BYTES_READ,
sum(DWA.HDP_JOB_STAT.HDFS_BYTES_WRITTEN) HDFS_BYTES_WRITTEN,
sum(DWA.HDP_JOB_STAT.TOTAL_MAP_TIME) TOTAL_MAP_TIME,
sum(DWA.HDP_JOB_STAT.TOTAL_REDUCE_TIME) TOTAL_REDUCE_TIME,
sum(DWA.HDP_JOB_STAT.MAP_INPUT_RECORDS) MAP_INPUT_RECORDS,
sum(DWA.HDP_JOB_STAT.MAP_OUTPUT_RECORDS) MAP_OUTPUT_RECORDS, --new
sum(DWA.HDP_JOB_STAT.REDUCE_INPUT_RECORDS) REDUCE_INPUT_RECORDS,
sum(DWA.HDP_JOB_STAT.REDUCE_OUTPUT_RECORDS) REDUCE_OUTPUT_RECORDS, --new
trunc((DWA.ETL_TASK_LOG.edate-DWA.ETL_TASK_LOG.sdate)*24*60) time
FROM
DWA.HDP_JOB_MAP,
DWA.ETL_TASK_PROGRAM,
DWA.ETL_TASK_LOG,
DWA.HDP_JOB_STAT,
DWA.ETL_TASK_BASELINE
WHERE
( DWA.HDP_JOB_STAT.JOB_ID=DWA.HDP_JOB_MAP.JOB_ID )
AND ( DWA.HDP_JOB_MAP.ID=DWA.ETL_TASK_LOG.ID )
AND ( DWA.ETL_TASK_LOG.ID=DWA.ETL_TASK_PROGRAM.ID )
AND ( DWA.ETL_TASK_PROGRAM.BASELINE_ID=DWA.ETL_TASK_BASELINE.ID )
AND
(
( ( DWA.HDP_JOB_STAT.GMTDATE ) = trunc(sysdate) )
AND
( ( DWA.HDP_JOB_MAP.GMTDATE ) = trunc(sysdate) )
AND
( ( DWA.ETL_TASK_LOG.GMTDATE ) = trunc(sysdate) )
AND DWA.ETL_TASK_PROGRAM.priority in('xx','xxx') --## fill in your own business baseline ids here
)
GROUP BY
DWA.ETL_TASK_BASELINE.name,
DWA.HDP_JOB_MAP.ID,
DWA.ETL_TASK_PROGRAM.BIZ_SORTID,
DWA.ETL_TASK_PROGRAM.ON_DUTY,
DWA.ETL_TASK_LOG.PRGNAME,
(DWA.ETL_TASK_LOG.edate-DWA.ETL_TASK_LOG.sdate)*24*60
) t
where time is not null and job_cnt > 10; -- job-count threshold, define as you see fit
4. Too many tables in FROM (node in-degree too high)
select sky_id as skynet_id, viewname as display_name,
  sum(cnt) as source_table_uses, count(cnt) as source_table_cnt
from(
select sky_id,viewname,tab_name,on_duty,count(1) cnt
from(
select distinct a.tab_name,c.sql_id,a.sub_sql_id,c.sky_id,e.viewname,e.on_duty
from dwa.meta_tab a,dwa.meta_sqlsub b,
(
select *
from(
select sky_id,sql_id,sql_src,
row_number() over(partition by sky_id,length(sql_src) order by sql_id) rn
from dwa.meta_sqlfull)
where rn=1
) c,
dwa.meta_col d,dwa.etl_task_program e
where e.priority in('xx','xxx') --## fill in your own business baseline ids here
and e.appflag=0 and e.id=c.sky_id
and a.sub_sql_id=b.sub_sql_id
and a.tab_id=d.tab_id and a.sub_sql_id=d.sub_sql_id and b.sqlfull_id=c.sql_id
and a.tab_name not like '%-%' and b.sql_type='select'
order by c.sky_id,c.sql_id,a.sub_sql_id
)
group by sky_id,viewname,tab_name,on_duty
order by cnt desc
)
group by sky_id,viewname
order by sum(cnt) desc;
5. Job skew
Handling NULL keys:
(1) filter them out directly
(2) append a random number to NULL keys so they spread across different reducers
Method (1) takes two jobs; method (2) takes one.
select a11.GMTDATE as run_date,
a11.GROUP_NAME as business_line,
a11.ID as skynet_id,
a11.SORT_ID as yunti_priority,
a11.NAME as skynet_display_name,
a11.JOB_ID as job_id,
a11.KEY_FLAG as is_key_node_task,
a11.USER_NAME as user_name,
sum(a11.JOB_AVG_TIME) WJXBFS1,
sum(a11.JOB_MAX_TIME) WJXBFS2,
sum(a11.JOB_AVG_RECORDS) WJXBFS3,
sum(a11.JOB_MAX_RECORDS) WJXBFS4
from DWA.VIEW_HDP_JOB_STAT a11
where gmtdate=date'2012-09-27'
and group_name in ('xxxxx')
-- the business line name is the "project" field in the Skynet task configuration
group by a11.GMTDATE,
a11.GROUP_NAME,
a11.ID,
a11.SORT_ID,
a11.NAME,
a11.JOB_ID,
a11.KEY_FLAG,
a11.USER_NAME ;
6. Extracting and merging tasks with identical input bytes
For tasks reading the same data sources, extract the matching jobs and merge the tasks.
drop table gv_job_mapinput;
create table gv_job_mapinput as
select
id,prgname,job_id,MAP_INPUT_BYTES
from
(
select
DWA.ETL_TASK_BASELINE.name groupname,
DWA.HDP_JOB_MAP.ID,
DWA.ETL_TASK_PROGRAM.BIZ_SORTID,
DWA.ETL_TASK_PROGRAM.ON_DUTY,
DWA.ETL_TASK_LOG.PRGNAME,
DWA.HDP_JOB_MAP.job_id, -- the Skynet task's job id
sum(DWA.HDP_JOB_STAT.JOB_TOTAL_MAPS) JOB_TOTAL_MAPS,
sum(DWA.HDP_JOB_STAT.JOB_TOTAL_REDUCES) JOB_TOTAL_REDUCES,
sum(DWA.HDP_JOB_STAT.TOTAL_TIME) TOTAL_TIME,
sum(DWA.HDP_JOB_STAT.HDFS_BYTES_READ) HDFS_BYTES_READ,
sum(DWA.HDP_JOB_STAT.HDFS_BYTES_WRITTEN) HDFS_BYTES_WRITTEN,
sum(DWA.HDP_JOB_STAT.TOTAL_MAP_TIME) TOTAL_MAP_TIME,
sum(DWA.HDP_JOB_STAT.TOTAL_REDUCE_TIME) TOTAL_REDUCE_TIME,
sum(DWA.HDP_JOB_STAT.MAP_INPUT_RECORDS) MAP_INPUT_RECORDS,
sum(DWA.HDP_JOB_STAT.MAP_INPUT_BYTES) MAP_INPUT_BYTES,
sum(DWA.HDP_JOB_STAT.MAP_OUTPUT_RECORDS) MAP_OUTPUT_RECORDS, --new
sum(DWA.HDP_JOB_STAT.REDUCE_INPUT_RECORDS) REDUCE_INPUT_RECORDS,
sum(DWA.HDP_JOB_STAT.REDUCE_OUTPUT_RECORDS) REDUCE_OUTPUT_RECORDS, --new
trunc((DWA.ETL_TASK_LOG.edate-DWA.ETL_TASK_LOG.sdate)*24*60) time
FROM
DWA.HDP_JOB_MAP,
DWA.ETL_TASK_PROGRAM,
DWA.ETL_TASK_LOG,
DWA.HDP_JOB_STAT,
DWA.ETL_TASK_BASELINE
WHERE
( DWA.HDP_JOB_STAT.JOB_ID=DWA.HDP_JOB_MAP.JOB_ID )
AND ( DWA.HDP_JOB_MAP.ID=DWA.ETL_TASK_LOG.ID )
AND ( DWA.ETL_TASK_LOG.ID=DWA.ETL_TASK_PROGRAM.ID )
AND ( DWA.ETL_TASK_PROGRAM.BASELINE_ID=DWA.ETL_TASK_BASELINE.ID )
AND
(
( ( DWA.HDP_JOB_STAT.GMTDATE ) = trunc(sysdate) )
AND
( ( DWA.HDP_JOB_MAP.GMTDATE ) = trunc(sysdate) )
AND
( ( DWA.ETL_TASK_LOG.GMTDATE ) = trunc(sysdate) )
AND
DWA.ETL_TASK_PROGRAM.priority in('xx','xxx')
--## fill in your own business baseline ids here
)
GROUP BY
DWA.ETL_TASK_BASELINE.name,
DWA.HDP_JOB_MAP.ID,
DWA.ETL_TASK_PROGRAM.BIZ_SORTID,
DWA.ETL_TASK_PROGRAM.ON_DUTY,
DWA.ETL_TASK_LOG.PRGNAME,
DWA.HDP_JOB_MAP.job_id,
(DWA.ETL_TASK_LOG.edate-DWA.ETL_TASK_LOG.sdate)*24*60
)
order by MAP_INPUT_RECORDS desc ,job_id;
select * from gv_job_mapinput
where id in (
  select id from
  (select id, prgname, count(job_id) cnt from gv_job_mapinput group by id, prgname)
  where cnt = 1
)
order by MAP_INPUT_BYTES desc;
7. Multiple tasks sharing a single common parent task
drop table gvora_view_relation;
create table gvora_view_relation as
select a.id,a.viewname,a.on_duty,a.sourceid,a.priority,a.parentid,
b.viewname parentviewname,b.on_duty pon_duty,b.sourceid psourceid,b.priority p_priority
from(
select a.id,b.viewname,b.on_duty,b.sourceid,b.priority,a.parentid from
dwa.etl_task_relation a,
dwa.etl_task_program b
where a.id=b.id
) a,
dwa.etl_task_program b
where a.parentid=b.id;
select a.id as skynet_id, a.viewname as display_name, rudu, chudu -- rudu = in-degree, chudu = out-degree
from(
select id,viewname,count(1) rudu from gvora_view_relation
where priority in('xx','xxx')
--## fill in your own business baseline ids here
group by id,viewname
) a,
(
select parentid,parentviewname,count(1) chudu from gvora_view_relation
where priority in('xx','xxx')
--## fill in your own business baseline ids here
group by parentid,parentviewname
) b
where a.id=b.parentid
order by rudu + chudu desc;