生產sql調優之統計資訊分析(89天)
今天凌晨,又被電話叫醒了,說是有1個sql,現在跑的很慢。問題已經挺嚴重了,想讓我看看,能不能做點什麼。
首先就是和他們確認最近有什麼改動,他們說這個是用了很久的sql語句了,沒有任何的改動,再聽他們說,之前也沒有任何的問題。
然後就和他們確認之前這個jobl處理大概多長時間,那個哥們說他也是最近才進的這個專案,可能資料量不同,他也不清楚。但是他肯定的說這個job會跑的很快,幾個小時肯定能處理完。
大概瞭解了下,他們也確定具體的sql語句是什麼,沒有得到太多的資訊,首先通過top命令來抓一下目前消耗資源比較多的程序。
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9543 oradb1 25 0 12.2g 25m 21m R 100.0 0.0 426:03.72 oracleDB01 (LOCAL=NO)
32334 oradb1 25 0 12.2g 26m 22m R 100.0 0.0 36:46.28 oracleDB01 (LOCAL=NO)
if [ -z "$1" ]; then echo "no process has provided!" exit 0 fi sh_tmp_process=`sqlplus -silent $DB_CONN_STR@$SH_DB_SID <<END set pagesize 0 feedback off verify off heading on echo off select addr from v\$process where spid=$1; exit; END` if [ -z "$sh_tmp_process" ]; then echo "no process exists or session is not from a DB account" echo echo "####### Process Information from OS level as below ########" ps -ef|grep $1|grep -v "grep"|grep ora echo "##############################################" exit 0 else echo '*******************************************' echo "Process has found, pid: $1 , addr: $sh_tmp_process " echo echo "####### Process Information from OS level as below ########" ps -ef|grep $1|grep -v grep|grep ora echo "##############################################" sqlplus -s $DB_CONN_STR@$SH_DB_SID <<EOF col machine format a20 col terminal format a15 col osuser format a15 col process format a15 col username format a15 set linesize 150 select sid,serial#,username,osuser ,machine,process,terminal,type,to_char(LOGON_TIME,'yyyy-mm-dd hh24:mi:ss')login_time from v$session where paddr='$sh_tmp_process'; prompt . col sql_id format a30 col prev_sql_id format a30 col sql_text format a60 set linesize 150 set pages 50 select sql_id,sql_text from v$sql where sql_id in (select sql_id from v$session where paddr='$sh_tmp_process' and sql_id is not null ) and rownum<2; select sql_id prev_sql_id ,sql_text from v$sql where sql_id in (select prev_sql_id sql_id from v$session where paddr='$sh_tmp_process' ) and rownum<2; EOF fi
結果看了一下,和他們確認,就是他們的job要用的sql語句。可以看到裡面已經有一些hint,想必是之前就碰到過問題,優化過的。 這個語句也算是挺長的一個sql了,裡面用到了union all來做兩個結果的合併。
SELECT DISTINCT 'K', AR.RESOURCE_VALUE, AR.RESOURCE_TYPE, GREATEST(TO_CHAR(AR.EFFECTIVE_DATE, 'YYYYMMDDHH24MISS'), TO_CHAR(SB.EFFECTIVE_DATE, 'YYYYMMDDHH24MISS')), LEAST(NVL(TO_CHAR(AR.EXPIRATION_DATE, 'YYYYMMDDHH24MISS'), '47001231000000'), NVL(TO_CHAR(SB.EXPIRATION_DATE, 'YYYYMMDDHH24MISS'), '47001231000000')), AR.AGR_NO, SB.MEDIUM_CUS_ID, SB.SUB_STATUS, SB.BUSINESS_ENTITY_ID, SB.LANGUAGE, SB.ROUTING_POLICY_ID, SB.L9_PORT_IND, SB.L9_SPLIT_PERIOD FROM HUGE_RESOURCE AR, MEDIUM_SUB SB, (select /*+ RESULT_CACHE */ DISTINCT PARAM_NAME as PARAM_NAME from SMALL_PARAM where GUIDING_IND = 'Y') OP, MEDIUM_CUS CS WHERE AR.AGR_NO = 1056851 AND AR.AGREEMENT_KEY = MOD(1056851, 100) AND (AR.RESOURCE_STATE != 'F' OR AR.RESOURCE_STATE IS NULL) AND AR.RANGE_IND = 'N' AND SB.MEDIUM_SUB_NO = AR.AGR_NO AND EXISTS (select /*+ INDEX(OP SMALL_PARAM_1IX) */ 1 from SMALL_PARAM OP where OP.PARAM_NAME in (AR.RESOURCE_PRM_CD, AR.BASE_PARAM_NAME) ) AND SB.MEDIUM_CUS_ID = CS.MEDIUM_CUS_ID AND (SB.EXPIRATION_DATE IS NULL OR SB.EXPIRATION_DATE >= to_date('19000101000000','YYYYMMDDHH24MISS')) AND (AR.EXPIRATION_DATE IS NULL OR AR.EXPIRATION_DATE >= to_date('19000101000000','YYYYMMDDHH24MISS')) AND (AR.EFFECTIVE_DATE < SB.EXPIRATION_DATE OR SB.EXPIRATION_DATE IS NULL) AND (AR.EXPIRATION_DATE IS NULL OR AR.EXPIRATION_DATE > SB.EFFECTIVE_DATE) AND SB.SUB_STATUS != 'T' UNION ALL SELECT DISTINCT 'K', AR.RESOURCE_VALUE, AR.RESOURCE_TYPE, GREATEST(TO_CHAR(AR.EFFECTIVE_DATE, 'YYYYMMDDHH24MISS'), TO_CHAR(SH.EFFECTIVE_DATE, 'YYYYMMDDHH24MISS')), LEAST(NVL(TO_CHAR(AR.EXPIRATION_DATE, 'YYYYMMDDHH24MISS'), '47001231000000'), NVL(TO_CHAR(SH.EXPIRATION_DATE, 'YYYYMMDDHH24MISS'), '47001231000000')), AR.AGR_NO, SH.MEDIUM_CUS_ID, SH.SUB_STATUS, SH.BUSINESS_ENTITY_ID, SH.LANGUAGE, SH.ROUTING_POLICY_ID, SH.L9_PORT_IND, SH.L9_SPLIT_PERIOD FROM HUGE_RESOURCE AR, MEDIUM_SUB_HISTORY SH, (select /*+ RESULT_CACHE */ DISTINCT PARAM_NAME as PARAM_NAME from SMALL_PARAM where GUIDING_IND = 'Y') OP, MEDIUM_CUS CS WHERE AR.AGR_NO = 1056851 AND AR.AGREEMENT_KEY = MOD(1056851, 100) AND (AR.RESOURCE_STATE != 'F' OR AR.RESOURCE_STATE IS NULL) AND AR.RANGE_IND = 'N' AND SH.MEDIUM_SUB_NO = AR.AGR_NO AND EXISTS (select /*+ INDEX(OP SMALL_PARAM_1IX) */ 1 from SMALL_PARAM OP where OP.PARAM_NAME in (AR.RESOURCE_PRM_CD, AR.BASE_PARAM_NAME)) AND SH.MEDIUM_CUS_ID = CS.MEDIUM_CUS_ID AND (SH.EXPIRATION_DATE IS NULL OR SH.EXPIRATION_DATE >= to_date('19000101000000','YYYYMMDDHH24MISS')) AND (AR.EXPIRATION_DATE IS NULL OR AR.EXPIRATION_DATE >= to_date('19000101000000','YYYYMMDDHH24MISS')) AND (AR.EFFECTIVE_DATE < SH.EXPIRATION_DATE OR SH.EXPIRATION_DATE IS NULL) AND (AR.EXPIRATION_DATE IS NULL OR AR.EXPIRATION_DATE > SH.EFFECTIVE_DATE) AND SH.SUB_STATUS NOT IN ('C', 'T')
和他們確認了基本的時間後,先抓一個awr看看裡面怎麼說。然後在這個基礎上抓一個awr的sql report。看看執行計劃的情況。 從執行計劃來看,也是一個執行時間比較的sql語句,檢視指標,消耗並不高,大概執行時間為5秒,但是根據他們的反饋,需要跑幾十萬個這樣的查詢。這樣下來。這個語句得跑 8秒*80萬次/60/60=111個小時,這樣確實太長了。得跑將近5天的樣子。 首先既然他們肯定了sql語句沒有做過改動,而且也已經經過調優了,那就先看看為什麼顯示執行計劃消耗沒那麼高,實際上卻差別這麼大,首先想到的就是statistics的問題。 在查看了關聯的幾個達標之後,發現有一個大表的統計資訊誤差有10%左右,其他的都在1%以內的樣子。 這也是一個可能原因,在檢視索引的使用情況,都走索引了,沒有任何索引失效的情況。 sql語句不能隨便動,只能下手的地方就是統計資訊了,其他的效能引數一直都沒有做過改動了。 和他們確認之後,先做了一個統計資訊收集。
exec DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=> NULL, TABNAME => 'HUGE_RESOURCE', CASCADE => TRUE,METHOD_OPT =>'FOR ALL INDEXED COLUMNS SIZE 1', ESTIMATE_PERCENT => 40 , DEGREE => 8,GRANULARITY =>'ALL');
執行之後就建立了一個快照,然後過了幾分鐘之後又生成一個快照,發現明顯快了不少,從5秒降低到了2秒。也算是不少的收穫了。2秒*80000次/60/60=44.4小時,還是有一些問題,然後剩下的問題就是看看之前的執行情況了。 下面是我在最近這一個月左右的時間內現有的awr中找到的。時間浮動還是比較小。
Elapsed Time (s) |
Executions |
Elapsed Time per Exec (s) |
%Total |
%CPU |
%IO |
SQL Id |
SQL Module |
SQL Text |
|
---|---|---|---|---|---|---|---|---|---|
31st May |
305.44 |
145 |
2.11 |
25.56 |
99.69 |
0.26 |
2fjzq67jbztwv |
gext1GenericEx@ccbdbpr1 (TNS V1-V3) |
/* */ SELECT DISTINCT 'K', AR.... |
21st May |
413.88 |
614 |
0.67 |
1.50 |
99.74 |
0.17 |
8686n41y0jsnd |
gext1GenericEx@ccbdbpr1 (TNS V1-V3) |
/* */ SELECT DISTINCT 'K', AR.... |
12th May |
100.17 |
63 |
1.59 |
0.80 |
99.68 |
0.25 |
da18v3wt1u09y |
gext1GenericEx@ccbdbpr1 (TNS V1-V3) |
SELECT DISTINCT 'K', AR.RESOUR... |
9th April |
818.45 |
793 |
1.03 |
3.73 |
99.86 |
0.09 |
da18v3wt1u09y |
gext1GenericEx@ccbdbpr1 (TNS V1-V3) |
SELECT DISTINCT 'K', AR.RESOUR. |
然後我仔細比較了一下對應的sql語句。發現今天碰到sql語句和之前的有一些不同之處。看來也不能全信 偶有了他們說的。
有了基本的參照,就可以在這個基礎上分析了。