1. 程式人生 > 其它 >CPU 100%負載的效能優化分析(r7筆記第40天)

CPU 100%負載的效能優化分析(r7筆記第40天)

今天收到報警郵件,提示在短時間內DB time有了很大的抖動。報警郵件如下: ZABBIX-監控系統:

------------------------------------ 報警內容: DB time is too high ------------------------------------ 報警級別: PROBLEM ------------------------------------ 監控專案: DBtime:543 % ------------------------------------ 報警時間:2015.12.07-10:48:37 可以看到在藍色的框中是問題發生時間段的DB time情況,其實後面直接飆到了5000%這個效果是很恐怖的。

而緊接著CPU負載也開始急劇飆升,直接的反應就是機器反應開始非常慢。根據top得到的程序,可以看到cpu資源已經被耗光了。 這個從sar的結果也可以看得很明顯。 11:30:02 AM CPU %user %nice %system %iowait %steal %idle 11:31:02 AM all 95.86 0.00 0.36 0.00 0.00 3.78 11:32:02 AM all 94.76 0.00 0.25 0.00 0.00 4.99 11:33:01 AM all 94.53 0.00 0.17 0.00 0.00 5.30 11:34:01 AM all 95.83 0.00 0.19 0.00 0.00 3.98 11:35:01 AM all 95.18 0.00 0.17 0.00 0.00 4.66 11:36:01 AM all 96.47 0.00 0.96 0.02 0.00 2.55 11:37:01 AM all 97.50 0.00 0.15 0.08 0.00 2.28 11:38:01 AM all 96.08 0.00 0.17 0.00 0.00 3.75 11:39:01 AM all 95.74 0.00 0.17 0.01 0.00 4.08 11:40:01 AM all 96.24 0.00 0.17 0.00 0.00 3.58 11:41:01 AM all 94.68 0.00 0.34 0.00 0.00 4.98 11:42:01 AM all 97.30 0.00 0.15 0.00 0.00 2.54 11:43:01 AM all 96.51 0.00 0.18 0.00 0.00 3.31 11:44:01 AM all 96.14 0.00 0.14 0.00 0.00 3.71 11:45:01 AM all 95.52 0.00 0.20 0.01 0.00 4.26 11:46:01 AM all 96.67 0.00 0.94 0.01 0.00 2.38 11:47:01 AM all 96.23 0.00 0.16 0.01 0.00 3.61 11:48:01 AM all 96.59 0.00 0.16 0.00 0.00 3.25 11:49:01 AM all 96.40 0.00 0.18 0.00 0.00 3.41 11:50:02 AM all 96.15 0.00 0.21 0.00 0.00 3.64 11:51:01 AM all 96.72 0.00 0.59 0.04 0.00 2.64 11:52:01 AM all 96.51 0.00 0.16 0.00 0.00 3.32 11:53:01 AM all 94.57 0.00 0.20 0.01 0.00 5.22 11:54:01 AM all 93.82 0.00 0.14 0.00 0.00 6.04 11:55:01 AM all 94.04 0.00 0.14 0.01 0.00 5.81 11:56:01 AM all 96.01 0.00 0.85 0.00 0.00 3.14 11:57:01 AM all 95.66 0.00 0.15 0.01 0.00 4.17 11:58:01 AM all 94.93 0.00 0.14 0.00 0.00 4.93 11:59:01 AM all 97.15 0.00 0.16 0.01 0.00 2.68 12:00:01 PM all 99.76 0.00 0.23 0.00 0.00 0.01 12:01:01 PM all 99.45 0.00 0.55 0.00 0.00 0.01 12:02:01 PM all 99.76 0.00 0.24 0.00 0.00 0.00 12:03:02 PM all 99.60 0.00 0.40 0.00 0.00 0.00 12:04:02 PM all 99.76 0.00 0.23 0.00 0.00 0.00 12:05:01 PM all 99.77 0.00 0.23 0.00 0.00 0.00 12:06:05 PM all 99.37 0.00 0.62 0.00 0.00 0.01 12:07:02 PM all 99.59 0.00 0.40 0.00 0.00 0.00 12:08:03 PM all 99.57 0.00 0.43 0.00 0.00 0.00 12:09:02 PM all 99.66 0.00 0.34 0.00 0.00 0.00 12:10:01 PM all 99.80 0.00 0.20 0.00 0.00 0.01 12:11:15 PM all 99.70 0.00 0.30 0.00 0.00 0.00 12:12:07 PM all 99.87 0.00 0.12 0.00 0.00 0.00 12:13:03 PM all 99.74 0.00 0.26 0.00 0.00 0.00 12:14:02 PM all 99.79 0.00 0.21 0.00 0.00 0.00 12:15:02 PM all 99.85 0.00 0.14 0.00 0.00 0.01 12:16:04 PM all 99.58 0.00 0.42 0.00 0.00 0.00 所以問題其實變得還是挺著急的。這種負載情況下,看似問題還在惡化,著實讓人捏了一把汗。 top的情況如下: top - 13:34:40 up 469 days, 20:09, 6 users, load average: 44.01, 42.83, 43.48 Tasks: 515 total, 48 running, 453 sleeping, 0 stopped, 14 zombie Cpu(s): 99.9%us, 0.1%sy, 0.0%ni, 0.0%id, 0.0%wa

, 0.0%hi, 0.0%si, 0.0%st Mem: 32946232k total, 32753852k used, 192380k free, 209652k buffers Swap: 16771776k total, 186968k used, 16584808k free, 28973828k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 8517 oracle 18 0 10.1g 305m 300m R 64.7 0.9 263:48.43 oracletestdb (LOCAL=NO) 17796 oracle 19 0 10.1g 174m 169m R 56.7 0.5 14:54.71 oracletestdb (LOCAL=NO) 18473 oracle 17 0 10.1g 171m 166m R 39.8 0.5 14:31.95 oracletestdb (LOCAL=NO) 18153 oracle 16 0 10.1g 175m 171m R 38.1 0.5 13:55.82 oracletestdb (LOCAL=NO)
根據top的情況抓取了一個程序,看看這個程序到底在幹嘛.比如程序號是16908. 可以看都原來是一個客戶端觸發的一個select語句導致。


$ sh showpid.sh 16908
*******************************************
Process has found, pid: 16908  ,  addr: 00000002DF6199B0    
####### Process Information from OS level as below ########
oracle   16908     1 17 12:00 ?        00:14:48 oracletestdb (LOCAL=NO)
##############################################
       SID    SERIAL# USERNAME        OSUSER          MACHINE              PROCESS         TERMINAL        TYPE       LOGIN_TIME
---------- ---------- --------------- ---------------  -------------------- --------------- --------------- ----------  -------------------
       771      41269 TESTAT_NEWBG                     mbilocalhost         1234                            USER        2015-12-07 12:00:05
SQL_ID                         SQL_TEXT
------------------------------ ------------------------------------------------------------
7jywxtcmcmcgv                  select t0.stat_time stat_time,t2.consume_5cnt today,t1.consu
                               me_5cnt yestoday,t3.consume_5cnt lastweek  from   (select to
                               _char(trunc(sysdate)-level*5/60/24,'hh24:mi') as stat_time f
                               rom dual connect by level <=288  order by stat_time) t0  lef
                               t join   (select        to_char(a.stat_time,'hh24:mi') stat_
檢視執行計劃的情況如下,其實看計劃都是走了索引掃描。
$ sh showplan2.sh 7jywxtcmcmcgv
PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  7jywxtcmcmcgv, child number 0
-------------------------------------
Plan hash value: 3396463774
------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name                   | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                        |       |       |   171 (100)|          |
|   1 |  SORT ORDER BY                    |                        |     3 |   300 |   171   (2)| 00:00:03 |
|   2 |   NESTED LOOPS OUTER              |                        |     3 |   300 |   170   (1)| 00:00:03 |
|   3 |    NESTED LOOPS OUTER             |                        |     1 |    68 |     8   (0)| 00:00:01 |
|   4 |     NESTED LOOPS OUTER            |                        |     1 |    36 |     5   (0)| 00:00:01 |
|   5 |      VIEW                         |                        |     1 |     4 |     2   (0)| 00:00:01 |
|   6 |       CONNECT BY WITHOUT FILTERING|                        |       |       |            |          |
|   7 |        FAST DUAL                  |                        |     1 |       |     2   (0)| 00:00:01 |
|*  8 |      TABLE ACCESS BY INDEX ROWID  | DY_USER_ANALYSIS_MIN   |     1 |    32 |     3   (0)| 00:00:01 |
|*  9 |       INDEX RANGE SCAN            | IND_USER_MIN_DATE_GAME |     2 |       |     2   (0)| 00:00:01 |
|* 10 |     TABLE ACCESS BY INDEX ROWID   | DY_USER_ANALYSIS_MIN   |     1 |    32 |     3   (0)| 00:00:01 |
|* 11 |      INDEX RANGE SCAN             | IND_USER_MIN_DATE_GAME |     2 |       |     2   (0)| 00:00:01 |
|* 12 |    TABLE ACCESS BY INDEX ROWID    | DY_USER_ANALYSIS_MIN   |     3 |    96 |   162   (1)| 00:00:02 |
|* 13 |     INDEX RANGE SCAN              | IND_USER_MIN_DATE_GAME |   481 |       |   155   (1)| 00:00:02 |
------------------------------------------------------------------------------------------------------------
一邊分析一遍生成了一個調優報告。看看dbms_sqltune怎麼建議。
$ sh gentunerpt.sh  7jywxtcmcmcgv
 TASK_68645
GENERAL INFORMATION SECTION
-------------------------------------------------------------------------------
Tuning Task Name                  : TASK_68645
Tuning Task Owner                 : TESTDBA
Scope                             : COMPREHENSIVE
Time Limit(seconds)               : 120
Completion Status                 : COMPLETED
Started at                        : 12/07/2015 13:28:55
Completed at                      : 12/07/2015 13:30:59
Number of Index Findings          : 1
Number of Errors                  : 1
-------------------------------------------------------------------------------
Schema Name: TESTAT_NEWBG
SQL ID     : 7jywxtcmcmcgv
完整的語句如下:
SQL Text   : select t0.stat_time stat_time,t2.consume_5cnt
             today,t1.consume_5cnt yestoday,t3.consume_5cnt lastweek
              from
              (select to_char(trunc(sysdate)-level*5/60/24,'hh24:mi') as
             stat_time from dual connect by level <=288  order by stat_time)
             t0
              left join
              (select
                    to_char(a.stat_time,'hh24:mi') stat_time,
                    nvl(a.consume_5cnt,0) consume_5cnt
                 from   DY_USER_ANALYSIS_MIN a
              where
                a.game_type = :1
                and a.stat_time >= to_date(:2, 'yyyy-mm-dd')-(INTERVAL '1'
             DAY)
                and a.stat_time < trunc(to_date(:3, 'yyyy-mm-dd'))
                and a.group_id = :4
                order by a.stat_time) t1 on t0.stat_time=t1.stat_time
                left join (select
                    to_char(a.stat_time,'hh24:mi') stat_time,
                    nvl(a.consume_5cnt,0) consume_5cnt
                 from   DY_USER_ANALYSIS_MIN a
              where
                a.game_type = :5
                and a.stat_time >= to_date(:6, 'yyyy-mm-dd')
                and a.stat_time < trunc(to_date(:7, 'yyyy-mm-dd'))+(INTERVAL
             '1' DAY)
                and a.group_id = :8
                order by a.stat_time) t2  on t0.stat_time=t2.stat_time
                left join (select
                    to_char(a.stat_time,'hh24:mi') stat_time,
                    nvl(a.consume_5cnt,0) consume_5cnt
                 from   DY_USER_ANALYSIS_MIN a
              where
                a.game_type = :9
                and a.stat_time >= to_date(:10, 'yyyy-mm-dd')-(INTERVAL '7'
             DAY)
                and a.stat_time < trunc(to_date(:11, 'yyyy-mm-dd'))-(INTERVAL
             '6' DAY)
                and a.group_id = :12
                order by a.stat_time) t3  on t3.stat_time=t0.stat_time
                   order by t0.stat_time
給出的建議是建立一個索引。
-------------------------------------------------------------------------------
FINDINGS SECTION (1 finding)
-------------------------------------------------------------------------------
1- Index Finding (see explain plans section below)
--------------------------------------------------
  The execution plan of this statement can be improved by creating one or more
  indices.
  Recommendation (estimated benefit: 92.38%)
  ------------------------------------------
  - Consider running the Access Advisor to improve the physical schema design
    or creating the recommended index.
    create index TESTAT_NEWBG.IDX$$_10C250001 on
    TESTAT_NEWBG.DY_USER_ANALYSIS_MIN("GROUP_ID","GAME_TYPE","STAT_TIME");
然後在執行計劃的對比中,可以看到新的執行計劃選擇度要好很多。
- Using New Indices
Plan hash value: 2029808490
----------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                      |     1 |   100 |    13   (8)| 00:00:01 |
|   1 |  SORT ORDER BY                    |                      |     1 |   100 |    13   (8)| 00:00:01 |
|   2 |   NESTED LOOPS OUTER              |                      |     1 |   100 |    12   (0)| 00:00:01 |
|   3 |    NESTED LOOPS OUTER             |                      |     1 |    68 |     8   (0)| 00:00:01 |
|   4 |     NESTED LOOPS OUTER            |                      |     1 |    36 |     5   (0)| 00:00:01 |
|   5 |      VIEW                         |                      |     1 |     4 |     2   (0)| 00:00:01 |
|*  6 |       CONNECT BY WITHOUT FILTERING|                      |       |       |            |          |
|   7 |        FAST DUAL                  |                      |     1 |       |     2   (0)| 00:00:01 |
|   8 |      TABLE ACCESS BY INDEX ROWID  | DY_USER_ANALYSIS_MIN |     1 |    32 |     3   (0)| 00:00:01 |
|*  9 |       INDEX RANGE SCAN            | IDX$$_10C250001      |     1 |       |     2   (0)| 00:00:01 |
|  10 |     TABLE ACCESS BY INDEX ROWID   | DY_USER_ANALYSIS_MIN |     1 |    32 |     3   (0)| 00:00:01 |
|* 11 |      INDEX RANGE SCAN             | IDX$$_10C250001      |     1 |       |     2   (0)| 00:00:01 |
|  12 |    TABLE ACCESS BY INDEX ROWID    | DY_USER_ANALYSIS_MIN |     1 |    32 |     4   (0)| 00:00:01 |
|* 13 |     INDEX RANGE SCAN              | IDX$$_10C250001      |     1 |       |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

查看錶的屬性情況如下,可以看到當前存在兩個索引。


$ sh showtab.sh  TESTAT_NEWBG DY_USER_ANALYSIS_MIN|less
*******************************************
OWNER                          TABLE_NAME
------------------------------ ------------------------------
TESTAT_NEWBG                   DY_USER_ANALYSIS_MIN    
*******************************************
********** TABLE GENERAL INFO *****************
TABLE_NAME                     PAR TABLESPACE STATUS  INI_TRANS    NUM_ROWS     BLOCKS EMPTY_BLOCKS LOG MON ROW_MOVE LAST_ANALYZED
------------------------------ --- ---------- ------ ----------  ---------- ---------- ------------ --- --- -------- -------------------
DY_USER_ANALYSIS_MIN           NO  TE_DATA    VALID           1      372113       3268            0 YES YES DISABLED 2015-12-05 06:00:19
********** TABLE STORAGE INFO *****************
  INITEXT    NXTEXT     MINEXT      MAXEXT  FREELISTS  AVG_SPACE  CHAIN_CNT AVG_ROW_LEN CACHE      T DEPENDEN COMPRES
--------- --------- ---------- ----------- ---------- ---------- ---------- ----------- ---------- - -------- ----------
    65536   1048576          1  2147483645                     0          0          61     N      N DISABLED DISABLED
********** TABLE columns INFO *****************
 COLUMN_ID COLUMN_NAME                    DATA_TYPE       DATA_LENGTH NULLABLE   DATA_DEFAULT
---------- ------------------------------ --------------- ----------- ---------- --------------------
         1 ID                             NUMBER(11,0)             22 N
         2 STAT_TIME                      DATE                      7 Y
         3 GAME_TYPE                      VARCHAR2(50)             50 Y
         4 ZONE_ID                        NUMBER                   22 Y          0
         5 GROUP_ID                       NUMBER                   22 Y          0
         6 ONLINE_5CNT                    NUMBER                   22 Y          0
         7 ONLINE_20CNT                   NUMBER                   22 Y          0
....
19 rows selected.
********** INDEX DETAILS INFO *****************
INDEX_NAME                     TABLESPACE INDEX_TYPE UNIQUENES PAR  COLUMN_LIST                    TABLE_TYPE STATUS   NUM_ROWS  LAST_ANALYZED       G
------------------------------ ---------- ---------- --------- ---  ------------------------------ ---------- ------ ----------  ------------------- -
IND_USER_MIN_DATE_GAME         USERS      NORMAL     NONUNIQUE NO   STAT_TIME,GAME_TYPE            TABLE      VALID      371010 2015-12-05  06:00:21 N
SYS_C00106568                  TL_DATA    NORMAL     UNIQUE    NO   ID                             TABLE      VALID      371010 2015-12-05  06:00:20 Y

叢集因子的情況如下: TABLE_NAME INDEX_NAME CLUSTERING_FACTOR BLOCKS NUM_ROWS ------------------------------ ------------------------------ ----------------- ---------- ---------- DY_USER_ANALYSIS_MIN IND_USER_MIN_DATE_GAME 4700 3268 372113 DY_USER_ANALYSIS_MIN SYS_C00106568 4575 3268 372113 對於建立索引的如下建議。 create index TESTAT_NEWBG.IDX_DY_USER_ANA_IDTYPT on TESTAT_NEWBG.DY_USER_ANALYSIS_MIN("GROUP_ID","GAME_TYPE","STAT_TIME"); 可以通過如下的方式來做個簡單的分析。這個索引列的順序也是蠻講究,通過直方圖的資訊可以看到三個相關列的資料分佈情況。 select column_name,num_distinct,high_value,low_value ,avg_col_len,histogram from dba_tab_col_statistics where table_name='DY_USER_ANALYSIS_MIN' COLUMN_NAME NUM_DISTINCT HIGH_VALUE LOW_VALUE AVG_COL_LEN HISTOGRAM --------------------------- --------------- --------------------------- ----------- ID 503808 C33E403E C30C5164 5 NONE STAT_TIME 2992 78730C07170101 78730B1B0C3301 8 NONE GAME_TYPE 1 746C6262 746C6262 5 FREQUENCY ZONE_ID 1 3E6466 3E6466 4 FREQUENCY GROUP_ID 136 C25C44 3E6466 4 FREQUENCY 為什麼選用了GROUP_ID作為首選列呢,其中一個原因就是範圍查詢和等值查詢,在這個例子中範圍查詢就是stat_time相關的查詢,等值查詢就是 group_id相關的。這種情況下是優先選擇等值查詢的。而game_type的資料分佈很單一,所以這個列也不能作為首選列。 當然這些資訊也是在做了簡單的評估之後發現可行,所以馬上部署了。可以看到部署之後負載立馬降了下來。

可見這個變更也確實起到了立竿見影的效果。但是問題還不止於此,為什麼這個語句造成了嚴重的效能問題,經過後續和開發同事的討論,他們說這個新需求已經上 線兩週了。目前存在大量的left join的原因就是需要查詢這周,上週,上上週。。某一天的資料情況,按照這種思路,其實我還是建議他們去掉這種冗餘的left join,直接使用stat_time in (xxxx,xxxx,xxxx,xxx)其實這種更簡明直接。當然問題解決了還不是結束,還需要後續跟進,保證這種問題從根本上杜絕。