一條sql語句“導致”的資料庫宕機問題及分析 (38天)
最近測試環境需要做一些變更,把測試環境切分成兩套環境,儲存空間也需要壓縮壓縮和整理。 unix組的人已經開始做空間劃分了,然後我們需要在此基礎上重建一套環境。 有些資料檔案使用空間不大,所以準備壓縮一下。 用了下面的sql語句,結果跑了十幾秒中就拋了下面的錯誤。
SQL> set linesize 200 SQL> col name for a40 SQL> col resizecmd for a80 SQL> select a.file#,a.name,a.bytes/1024/1024 CurrentMB, 2 ceil(HWM * a.block_size)/1024/1024 ResizeTo, 3 (a.bytes - HWM * a.block_size)/1024/1024 ReleaseMB, 4 'alter database datafile '''||a.name||''' resize '|| 5 ceil(HWM * a.block_size/1024/1024) || 'M;' ResizeCMD 6 from v$datafile a, 7 (select file_id,max(block_id+blocks-1) HWM 8 from dba_extents 9 group by file_id) b 1 0 where a.file# = b.file_id(+) 11 and (a.bytes - HWM *block_size)>0 order by 5 12 ; order by 5 * ERROR at line 12: ORA-01116: error in opening database file 4 ORA-01110: data file 4: '/testPT1/oracle/TEST01/oratmp01/temp/ggs_data01.dbf' ORA-27041: unable to open file Linux-x86_64 Error: 2: No such file or directory Additional information: 3
再一檢視資料庫程序,都沒有了,看來資料庫是宕了。 我還想這條sql語句真是厲害,看看日誌裡面怎麼說。 Tue Mar 25 22:04:19 2014 End automatic SQL Tuning Advisor run for special tuning task "SYS_AUTO_SQL_TUNING_TASK" Wed Mar 26 02:00:00 2014 Closing scheduler window Closing Resource Manager plan via scheduler window Clearing Resource Manager plan via parameter Mon Mar 31 10:14:55 2014 USER (ospid: 21846): terminating the instance due to error 472 Instance terminated by USER, pid = 21846 先把庫重啟了,看有沒有什麼問題。
SQL> startup nomount ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance ORACLE instance started. Total System Global Area 1.2827E+10 bytes Fixed Size 2253880 bytes Variable Size 4211084232 bytes Database Buffers 8589934592 bytes Redo Buffers 24096768 bytes SQL> alter database mount; Database altered. 當準備Open的時候,報了下面的錯誤。說有個資料檔案丟失了。 SQL> alter database open; alter database open * ERROR at line 1: ORA-01157: cannot identify/lock data file 4 - see DBWR trace file ORA-01110: data file 4: '/testPT1/oracle/TEST01/oratmp01/temp/ggs_data01.dbf' 檢視備份情況,是否有熱備份之類的操作遺留。可以看到這個檔案確實是損壞了或者被認為刪除了。 SQL> select *from v$backup; FILE# STATUS CHANGE# TIME ---------- ------------------ ---------- --------- 1 NOT ACTIVE 1.0583E+13 22-JAN-13 2 NOT ACTIVE 1.0583E+13 22-JAN-13 3 NOT ACTIVE 1.0583E+13 22-JAN-13 4 FILE NOT FOUND 0 ... 40 NOT ACTIVE 1.0583E+13 22-JAN-13 檢視檔案的路徑。 1* select file#,name from v$datafile 4 /testPT1/oracle/TEST01/oratmp01/temp/ggs_data01.dbf 確認了下這個資料檔案goldengate用,現在測試環境上還沒有goldengate,可以刪除,於是頭腦一發熱,就準備馬上刪了,結果報了錯。 SQL> drop tablespace GGS_DATA including contents and datafiles cascade constraint; drop tablespace GGS_DATA including contents and datafiles cascade constraint * ERROR at line 1: ORA-01109: database not open 才反應過來資料庫還在mount狀態 先把資料檔案offline了 SQL> alter database datafile '/testPT1/oracle/TEST01/oratmp01/temp/ggs_data01.dbf' offline; Database altered. 這時候再檢視v$backup,那條記錄就不復存在了。 FILE# STATUS CHANGE# TIME ---------- ------------------ ---------- --------- 1 NOT ACTIVE 1.0583E+13 22-JAN-13 2 NOT ACTIVE 1.0583E+13 22-JAN-13 3 NOT ACTIVE 1.0583E+13 22-JAN-13 5 NOT ACTIVE 1.0583E+13 22-JAN-13 6 NOT ACTIVE 1.0583E+13 22-JAN-13 7 NOT ACTIVE 1.0583E+13 22-JAN-13 8 NOT ACTIVE 1.0583E+13 22-JAN-13 9 NOT ACTIVE 1.0583E+13 22-JAN-13 10 NOT ACTIVE 1.0583E+13 22-JAN-13 11 NOT ACTIVE 1.0583E+13 22-JAN-13 12 NOT ACTIVE 1.0583E+13 22-JAN-13 .... 39 rows selected. 把資料庫open起來。 SQL> alter database open; Database altered. 然後再刪除,其實這個階段也只是釋放了控制代碼和更新了資料字典。 SQL> drop tablespace ggs_data including contents and datafiles cascade constraint; Tablespace dropped. 再次檢視資料庫程序是否正常。可以看到資料庫已經正常了。 test01@ccbdbpt4:/opt/app/oracle/TEST01> ps -ef|grep smon test018420 5954 0 10:39 pts/2 00:00:00 grep smon test0110295 1 0 10:18 ? 00:00:01 ora_smon_TEST01 test01@ccbdbpt4:/opt/app/oracle/TEST01> sqlplus / as sysdba SQL*Plus: Release 11.2.0.2.0 Production on Mon Mar 31 10:40:08 2014 Copyright (c) 1982, 2010, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL> show user USER is "SYS" SQL> select status from v$instance; STATUS ------------ OPEN 我在另一個環境上碰到了類似的問題。 test01@ccbdbpt4:/opt/app/oracle/TEST01/test> sqlplus / as sysdba SQL*Plus: Release 11.2.0.2.0 Production on Thu Apr 3 17:26:43 2014 Copyright (c) 1982, 2010, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL> SQL> select dbms_metadata.get_ddl('USER',u.username) from dba_users u WHERE USER NAME in('TEST'); ERROR: ORA-03113: end-of-file on communication channel Process ID: 4056 Session ID: 4159 Serial number: 15 no rows selected 檢視日誌,錯誤還是類似。 Mon Mar 31 11:47:47 2014 USER (ospid: 26945): terminating the instance due to error 472 Instance terminated by USER, pid = 26945 這個問題的總結如下 導致資料庫宕機的原因不是因為sql語句,而是因為unix組做檔案的操作中,導致資料庫程序問題,資料庫其實已經停了,但是我們仍然可以進行簡單的操作。因為作業系統中控制代碼還在。 那個資料檔案的丟失和unix組的人確認是認為的失誤,他們在做空間切分的時候,沒有停庫。