Recovering PostgreSQL After Accidentally Deleting the Clog Without a Backup
Create the test table
postgres# create table t (n_id int primary key, c_name varchar(300));
CREATE TABLE
postgres# insert into t select id, (id*1000)::text as name from generate_series(1,1000) id;
INSERT 0 1000
postgres# delete from t where n_id = 1000;
DELETE 1
postgres# update t set c_name = 'cs' where n_id > 990;
UPDATE 9
postgres# select * from t;
-- output omitted; this step is critical
postgres# insert into t values (1001,'insert'),(1002,'insert');
INSERT 0 2
postgres# update t set c_name = 'update' where n_id = 1002;
UPDATE 1
Shut down and back up the database
$ pg_ctl stop
waiting for server to shut down .... done
server stopped
$ cp -R $PGDATA $PGDATA/../pgdata_bak1
Delete the clog file
$ cd $PGDATA/pg_xact
$ ls
0000 bak
$ rm 0000
Starting the database fails
$ pg_ctl start
pg_ctl: could not start server
Examine the log output.
Skipping the unimportant output, the log reports that the file pg_xact/0000 cannot be opened.
$ tail postgresql-2020-08-11_123341.csv
License expiry handling: notify
Days of advance notice before license expiry: 15
",,,,,,,,,""
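Create a clog file with the dd command
A single clog file is at most 256 KB, so creating one 256 KB file is enough. A file filled with zeros means every transaction it covers is in the IN_PROGRESS state.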
Status flag | Transaction status
------------+----------------------------------
0x00        | TRANSACTION_STATUS_IN_PROGRESS
0x01        | TRANSACTION_STATUS_COMMITTED
0x02        | TRANSACTION_STATUS_ABORTED
0x03        | TRANSACTION_STATUS_SUB_COMMITTED

$ dd if=/dev/zero of=$PGDATA/pg_xact/0000 bs=8K count=32
32+0 records in
32+0 records out
262144 bytes (262 kB) copied, 0.00102351 s, 256 MB/s
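To make the zero-filled file concrete, here is a minimal sketch of how a transaction ID maps to its two status bits inside a clog segment. It assumes the default 8 KB block size and 32 pages per segment; the function names are illustrative and not part of PostgreSQL. The xid 253975 is simply the one that appears in the second experiment of this article.

```python
BLCKSZ = 8192             # bytes per clog page (assumes the default block size)
XACTS_PER_BYTE = 4        # each transaction takes 2 status bits
PAGES_PER_SEGMENT = 32    # 32 pages * 8 KB = 256 KB per segment file
XACTS_PER_SEGMENT = BLCKSZ * XACTS_PER_BYTE * PAGES_PER_SEGMENT

STATUS_NAMES = {0: "IN_PROGRESS", 1: "COMMITTED", 2: "ABORTED", 3: "SUB_COMMITTED"}


def clog_location(xid):
    """Return (segment file name, byte offset in the file, bit shift) for xid."""
    segment = xid // XACTS_PER_SEGMENT
    index_in_segment = xid % XACTS_PER_SEGMENT
    byte_offset = index_in_segment // XACTS_PER_BYTE
    bit_shift = (index_in_segment % XACTS_PER_BYTE) * 2
    return f"{segment:04X}", byte_offset, bit_shift


def clog_status(segment_bytes, xid):
    """Read the 2-bit status of xid from the raw bytes of its segment file."""
    _, byte_offset, bit_shift = clog_location(xid)
    return STATUS_NAMES[(segment_bytes[byte_offset] >> bit_shift) & 0b11]


# Same content as the file produced by: dd if=/dev/zero bs=8K count=32
zeroed_segment = bytes(BLCKSZ * PAGES_PER_SEGMENT)

print(len(zeroed_segment))                   # 262144
print(clog_location(253975))                 # ('0000', 63493, 6)
print(clog_status(zeroed_segment, 253975))   # IN_PROGRESS
```

One segment therefore covers 1,048,576 transaction IDs, which is why a single file named 0000 suffices for a small test instance.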
Start again and verify the data loss
postgres# select * from t where n_id >= 990;
 n_id | c_name
------+--------
  990 | 990000
  991 | cs
  992 | cs
  993 | cs
  994 | cs
  995 | cs
  996 | cs
  997 | cs
  998 | cs
  999 | cs
 1002 | insert
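The clog is lost, yet the effects of some transactions are preserved
Changes made by a transaction in the IN_PROGRESS state should be invisible to other sessions. From the documentation we learn that an ordinary heap tuple consists of three parts: the HeapTupleHeaderData structure, the null bitmap, and the user data. The layout of HeapTupleHeaderData, as described in the PostgreSQL documentation, is as follows: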
Field       | Type            | Length  | Description
------------+-----------------+---------+-------------------------------------------------------
t_xmin      | TransactionId   | 4 bytes | insert XID stamp
t_xmax      | TransactionId   | 4 bytes | delete XID stamp
t_cid       | CommandId       | 4 bytes | insert and/or delete CID stamp (overlays with t_xvac)
t_xvac      | TransactionId   | 4 bytes | XID for VACUUM operation moving a row version
t_ctid      | ItemPointerData | 6 bytes | current TID of this or newer row version
t_infomask2 | uint16          | 2 bytes | number of attributes, plus various flag bits
t_infomask  | uint16          | 2 bytes | various flag bits
t_hoff      | uint8           | 1 byte  | offset to user data
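Of these fields, t_infomask also determines row visibility, and it takes precedence over the clog. It is 16 bits wide and can be read in four groups of 4 bits. The second group from the high end (mask 0x0F00, the HEAP_XMIN_* and HEAP_XMAX_* hint bits) is the one used to judge row visibility.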
#define HEAP_HASNULL            0x0001  /* has null attribute(s) */
#define HEAP_HASVARWIDTH        0x0002  /* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL        0x0004  /* has external stored attribute(s) */
#define HEAP_HASOID             0x0008  /* has an object-id field */
#define HEAP_XMAX_KEYSHR_LOCK   0x0010  /* xmax is a key-shared locker */
#define HEAP_COMBOCID           0x0020  /* t_cid is a combo cid */
#define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is exclusive locker */
#define HEAP_XMAX_LOCK_ONLY     0x0080  /* xmax, if valid, is only a locker */

 /* xmax is a shared locker */
#define HEAP_XMAX_SHR_LOCK  (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)

#define HEAP_LOCK_MASK  (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
                         HEAP_XMAX_KEYSHR_LOCK)
#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
#define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */
#define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */
#define HEAP_XMAX_IS_MULTI      0x1000  /* t_xmax is a MultiXactId */
#define HEAP_UPDATED            0x2000  /* this is UPDATEd version of row */
#define HEAP_MOVED_OFF          0x4000  /* moved to another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
#define HEAP_MOVED_IN           0x8000  /* moved from another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
#define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)

#define HEAP_XACT_MASK          0xFFF0  /* visibility-related bits */
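As a reading aid, the flag values above can be turned into a small decoder. This is only a sketch: the bit values are copied from the definitions above, while the helper name decode_infomask is made up for illustration. The sample value 258 is the t_infomask that heap_page_items reports in the second experiment of this article.

```python
# Single-bit t_infomask flags, copied from the #define list above.
INFOMASK_FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0008: "HEAP_HASOID",
    0x0010: "HEAP_XMAX_KEYSHR_LOCK",
    0x0020: "HEAP_COMBOCID",
    0x0040: "HEAP_XMAX_EXCL_LOCK",
    0x0080: "HEAP_XMAX_LOCK_ONLY",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x1000: "HEAP_XMAX_IS_MULTI",
    0x2000: "HEAP_UPDATED",
    0x4000: "HEAP_MOVED_OFF",
    0x8000: "HEAP_MOVED_IN",
}


def decode_infomask(value):
    """Hypothetical helper: list the flag names set in a t_infomask value."""
    return [name for bit, name in sorted(INFOMASK_FLAGS.items()) if value & bit]


print(decode_infomask(258))  # ['HEAP_HASVARWIDTH', 'HEAP_XMIN_COMMITTED']
```

For 258 (0x0102) the insert is hinted as committed, while neither HEAP_XMAX_COMMITTED nor HEAP_XMAX_INVALID is set, so the fate of the deleting transaction still has to be looked up elsewhere.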
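Based on the experiment above, we venture a guess:
t_infomask is updated on the first access after the data is written. So when the row with id 1002 was updated, the commit information of the insert of 1002 was written into its t_infomask field. That is why the experiment lost only the insert of 1001 and the update of 1002.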
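Verify against the source code
The comments show that, to avoid contention on shared transaction-state structures (the comment names the PGXACT array; the clog itself was renamed to xact in version 10 and later), the code no longer tries to set the hint bits in t_infomask as early as possible and instead leaves them to the first visitor whose snapshot sees the transaction as finished (changed in PostgreSQL 9.6, which corresponds to abase 3.6.1).
heapam_visibility.c, line 962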
/*
 * HeapTupleSatisfiesMVCC
 *      True iff heap tuple is valid for the given MVCC snapshot.
 *
 * See SNAPSHOT_MVCC's definition for the intended behaviour.
 *
 * Notice that here, we will not update the tuple status hint bits if the
 * inserting/deleting transaction is still running according to our snapshot,
 * even if in reality it's committed or aborted by now.  This is intentional.
 * Checking the true transaction state would require access to high-traffic
 * shared data structures, creating contention we'd rather do without, and it
 * would not change the result of our visibility check anyway.  The hint bits
 * will be updated by the first visitor that has a snapshot new enough to see
 * the inserting/deleting transaction as done.  In the meantime, the cost of
 * leaving the hint bits unset is basically that each HeapTupleSatisfiesMVCC
 * call will need to run TransactionIdIsCurrentTransactionId in addition to
 * XidInMVCCSnapshot (but it would have to do the latter anyway).  In the old
 * coding where we tried to set the hint bits as soon as possible, we instead
 * did TransactionIdIsInProgress in each call --- to no avail, as long as the
 * inserting/deleting transaction was still running --- which was more cycles
 * and more contention on the PGXACT array.
 */
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
                       Buffer buffer)
{
    HeapTupleHeader tuple = htup->t_data;

    Assert(ItemPointerIsValid(&htup->t_self));
    Assert(htup->t_tableOid != InvalidOid);

    if (!HeapTupleHeaderXminCommitted(tuple))
    {
        if (HeapTupleHeaderXminInvalid(tuple))
            return false;

        /* Used by pre-9.0 binary upgrades */
        if (tuple->t_infomask & HEAP_MOVED_OFF)
        {
            TransactionId xvac = HeapTupleHeaderGetXvac(tuple);

            if (TransactionIdIsCurrentTransactionId(xvac))
                return false;
            if (!XidInMVCCSnapshot(xvac, snapshot))
            {
                if (TransactionIdDidCommit(xvac))
                {
                    SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                                InvalidTransactionId);
                    return false;
                }
                SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                            InvalidTransactionId);
            }
        }
        /* Used by pre-9.0 binary upgrades */
        else if (tuple->t_infomask & HEAP_MOVED_IN)
        {
            TransactionId xvac = HeapTupleHeaderGetXvac(tuple);

            if (!TransactionIdIsCurrentTransactionId(xvac))
            {
                if (XidInMVCCSnapshot(xvac, snapshot))
                    return false;
                if (TransactionIdDidCommit(xvac))
                    SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                                InvalidTransactionId);
                else
                {
                    SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                                InvalidTransactionId);
                    return false;
                }
            }
        }
        else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
        {
            if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
                return false;   /* inserted after scan started */

            if (tuple->t_infomask & HEAP_XMAX_INVALID)  /* xid invalid */
                return true;

            if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))    /* not deleter */
                return true;

            if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
            {
                TransactionId xmax;

                xmax = HeapTupleGetUpdateXid(tuple);

                /* not LOCKED_ONLY, so it has to have an xmax */
                Assert(TransactionIdIsValid(xmax));

                /* updating subtransaction must have aborted */
                if (!TransactionIdIsCurrentTransactionId(xmax))
                    return true;
                else if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
                    return true;    /* updated after scan started */
                else
                    return false;   /* updated before scan started */
            }

            if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
            {
                /* deleting subtransaction must have aborted */
                SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
                            InvalidTransactionId);
                return true;
            }

            if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
                return true;    /* deleted after scan started */
            else
                return false;   /* deleted before scan started */
        }
        else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
            return false;
        else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
            SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                        HeapTupleHeaderGetRawXmin(tuple));
        /* ... */

heapam_visibility.c, line 113

/*
 * SetHintBits()
 *
 * Set commit/abort hint bits on a tuple, if appropriate at this time.
 *
 * It is only safe to set a transaction-committed hint bit if we know the
 * transaction's commit record is guaranteed to be flushed to disk before the
 * buffer, or if the table is temporary or unlogged and will be obliterated by
 * a crash anyway.  We cannot change the LSN of the page here, because we may
 * hold only a share lock on the buffer, so we can only use the LSN to
 * interlock this if the buffer's LSN already is newer than the commit LSN;
 * otherwise we have to just refrain from setting the hint bit until some
 * future re-examination of the tuple.
 *
 * We can always set hint bits when marking a transaction aborted.  (Some
 * code in heapam.c relies on that!)
 *
 * Also, if we are cleaning up HEAP_MOVED_IN or HEAP_MOVED_OFF entries, then
 * we can always set the hint bits, since pre-9.0 VACUUM FULL always used
 * synchronous commits and didn't move tuples that weren't previously
 * hinted.  (This is not known by this subroutine, but is applied by its
 * callers.)  Note: old-style VACUUM FULL is gone, but we have to keep this
 * module's support for MOVED_OFF/MOVED_IN flag bits for as long as we
 * support in-place update from pre-9.0 databases.
 *
 * Normal commits may be asynchronous, so for those we need to get the LSN
 * of the transaction and then check whether this is flushed.
 *
 * The caller should pass xid as the XID of the transaction to check, or
 * InvalidTransactionId if no check is needed.
 */
static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
            uint16 infomask, TransactionId xid)
{
    if (TransactionIdIsValid(xid))
    {
        /* NB: xid must be known committed here! */
        XLogRecPtr  commitLSN = TransactionIdGetCommitLSN(xid);

        if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
            BufferGetLSNAtomic(buffer) < commitLSN)
        {
            /* not flushed and no LSN interlock, so don't set hint */
            return;
        }
    }

    tuple->t_infomask |= infomask;
    MarkBufferDirtyHint(buffer, true);
}
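The interplay between hint bits and the clog can be mimicked with a toy model. The sketch below is not PostgreSQL's algorithm: it ignores snapshots, subtransactions, locks and WAL, and it assumes every transaction has already finished, so anything the clog does not report as committed is treated as aborted or crashed. ToyTuple and is_visible are made-up names; only the four flag values come from the definitions earlier in this article.

```python
from dataclasses import dataclass

HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID = 0x0200
HEAP_XMAX_COMMITTED = 0x0400
HEAP_XMAX_INVALID = 0x0800


@dataclass
class ToyTuple:
    value: str
    xmin: int
    xmax: int = 0        # 0 means the row was never deleted or updated
    infomask: int = 0


def is_visible(tup, committed_xids):
    """Toy visibility check; like the real code, it sets hint bits as it goes."""
    if not tup.infomask & HEAP_XMIN_COMMITTED:
        if tup.infomask & HEAP_XMIN_INVALID:
            return False
        if tup.xmin in committed_xids:
            tup.infomask |= HEAP_XMIN_COMMITTED
        else:
            tup.infomask |= HEAP_XMIN_INVALID   # treated as aborted or crashed
            return False
    if tup.xmax == 0 or tup.infomask & HEAP_XMAX_INVALID:
        return True
    if tup.infomask & HEAP_XMAX_COMMITTED:
        return False
    if tup.xmax in committed_xids:
        tup.infomask |= HEAP_XMAX_COMMITTED
        return False
    tup.infomask |= HEAP_XMAX_INVALID           # deleter treated as aborted
    return True


clog = {100, 101, 102}                       # xids the intact clog reports as committed
read_row = ToyTuple("inserted, then read", xmin=100)
unread_row = ToyTuple("inserted, never read", xmin=101)
deleted_row = ToyTuple("read, then deleted", xmin=100)

is_visible(read_row, clog)                   # first access sets HEAP_XMIN_COMMITTED
is_visible(deleted_row, clog)
deleted_row.xmax = 102                       # deleted afterwards, never read again

lost_clog = set()                            # zero-filled file: nothing is committed
print(is_visible(read_row, lost_clog))       # True
print(is_visible(unread_row, lost_clog))     # False
print(is_visible(deleted_row, lost_clog))    # True
```

The three results correspond to the cases seen in the experiment: a hinted insert survives, an unhinted insert disappears, and an unhinted delete is undone.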
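Summary
Backups are essential.
Row visibility is decided not only by the row's xmin and xmax together with the clog; the t_infomask on the row also determines visibility, and it takes precedence over the clog.
After a transaction modifies data and commits, the commit status is not written into the t_infomask field immediately; it is written only when the record is first accessed (abase 3.6.1 and later).
If the clog is lost and the database is started with a zero-filled file created by dd, part of the data is lost (inserted but not accessed afterwards: the insert is lost; deleted but not accessed afterwards: the data still exists and is not deleted; updated but not accessed afterwards: the old data is visible and the update is lost).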
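Recovering an Accidental Delete via the Visibility Rules, Without a Database Backup
Note: this works only in specific situations. We stress that you must take proper backups and rely on them to keep data safe. This example is mainly meant to help readers understand MVCC, vacuum, and the record visibility rules.
Create the test table
Delete all records to simulate an accidental delete
With autovacuum enabled, once all rows of the table have been deleted, the next autovacuum run that processes the table will clean up all of its data. After that the data can no longer be recovered with this method, so shut the database down as soon as possible after the delete.
Set autovacuum to off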
vi $PGDATA/postgresql.auto.conf
# Modify or add the following setting
autovacuum = 'off'
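Start the database and look up the transaction ID of the delete
# In this experiment the table is newly created, so things look simple: there is only one deleting transaction ID, 253975. get_raw_page and heap_page_items come from the pageinspect extension. The first argument of get_raw_page is the table name; the second is the page number, starting from 0.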
select t_xmax,* from heap_page_items(get_raw_page('public.t',0));
select t_xmax,* from heap_page_items(get_raw_page('public.t',1));
select t_xmax,* from heap_page_items(get_raw_page('public.t',2));
select t_xmax,* from heap_page_items(get_raw_page('public.t',3));
select t_xmax,* from heap_page_items(get_raw_page('public.t',4));
select t_xmax,* from heap_page_items(get_raw_page('public.t',5));

 t_xmax | lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |        t_data
--------+----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------
 253975 |  1 |   8152 |        1 |     33 | 253974 | 253975 |        0 | (0,1)  |        8194 |        258 |     24 |        |       | \x010000000b31303030
 253975 |  2 |   8112 |        1 |     33 | 253974 | 253975 |        0 | (0,2)  |        8194 |        258 |     24 |        |       | \x020000000b32303030
 253975 |  3 |   8072 |        1 |     33 | 253974 | 253975 |        0 | (0,3)  |        8194 |        258 |     24 |        |       | \x030000000b33303030
 253975 |  4 |   8032 |        1 |     33 | 253974 | 253975 |        0 | (0,4)  |        8194 |        258 |     24 |        |       | \x040000000b34303030
 253975 |  5 |   7992 |        1 |     33 | 253974 | 253975 |        0 | (0,5)  |        8194 |        258 |     24 |        |       | \x050000000b35303030
 253975 |  6 |   7952 |        1 |     33 | 253974 | 253975 |        0 | (0,6)  |        8194 |        258 |     24 |        |       | \x060000000b36303030
 253975 |  7 |   7912 |        1 |     33 | 253974 | 253975 |        0 | (0,7)  |        8194 |        258 |     24 |        |       | \x070000000b37303030
 253975 |  8 |   7872 |        1 |     33 | 253974 | 253975 |        0 | (0,8)  |        8194 |        258 |     24 |        |       | \x080000000b38303030
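The t_data column shows that the deleted rows are still physically present. As an illustration, the following sketch decodes one t_data value back into (n_id, c_name). It rests on several assumptions: a little-endian machine, the two-column layout of table t (int followed by varchar), no NULL values, and short strings stored with a 1-byte varlena header. decode_t_data is a hypothetical helper, not a pageinspect function.

```python
def decode_t_data(hex_text):
    """Decode a t_data value of table t into (n_id, c_name).

    Assumes little-endian storage, an int4 followed by a varchar,
    no NULLs, and a 1-byte (short) varlena header.
    """
    raw = bytes.fromhex(hex_text[2:] if hex_text.startswith("\\x") else hex_text)
    n_id = int.from_bytes(raw[:4], "little", signed=True)

    header = raw[4]
    if header & 0x01 == 0 or header == 0x01:
        raise ValueError("only inline 1-byte varlena headers are handled")
    total_len = header >> 1              # length includes the header byte itself
    c_name = raw[5:4 + total_len].decode("utf-8")
    return n_id, c_name


print(decode_t_data(r"\x010000000b31303030"))  # (1, '1000')
print(decode_t_data(r"\x080000000b38303030"))  # (8, '8000')
```

The decoded values match what the initial insert wrote (c_name is id*1000 as text), which confirms that the delete only stamped t_xmax and left the row contents in place.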
Shut down the database
$ pg_ctl stop
waiting for server to shut down .... done
server stopped
Use pg_resetwal to set the next transaction ID to the ID of the accidental delete found above
$ pg_resetwal $PGDATA -x 253975
Write-ahead log reset
Start the database
$ pg_ctl start
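server started
Back up the accidentally deleted data to a temporary table
The database's next transaction ID is now 253975, so the accidental delete performed by transaction 253975 is not visible. The 1000 deleted rows can still be queried from the table; their xmax is 253975.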
postgres# create table t_del as select * from t where xmax=253975;
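Query table t again: the data is no longer visible
The preceding CREATE TABLE AS statement consumed transaction ID 253975 and committed, advancing the next transaction ID by one. Transaction 253975 now counts as a finished, committed transaction, so the delete stamped on the rows has become visible.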
postgres# select * from t;
 n_id | c_name
------+--------
(0 rows)
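Why the rows were visible right after pg_resetwal and then vanished can be summarized in a few lines. This is a deliberately simplified sketch: it ignores xid wraparound, hint bits and concurrently running transactions, and delete_is_visible is a made-up name. The idea is that a deleting xid at or beyond the next transaction ID lies in the future of every new snapshot, so its delete is ignored no matter what the clog says.

```python
def delete_is_visible(t_xmax, next_xid, committed_xids):
    """True when the delete stamped as t_xmax hides the row from a new snapshot."""
    if t_xmax == 0:
        return False                  # row was never deleted
    if t_xmax >= next_xid:
        return False                  # deleter is "in the future" of the snapshot
    return t_xmax in committed_xids   # otherwise the clog decides


committed = {253974, 253975}

# Right after pg_resetwal -x 253975: the next xid equals the deleting xid.
print(delete_is_visible(253975, 253975, committed))  # False, rows can be copied out

# After CREATE TABLE AS used xid 253975: the next xid is 253976.
print(delete_is_visible(253975, 253976, committed))  # True, rows are hidden again
```

This is also why the copy into t_del has to be the very first transaction after the restart: any other statement that consumes a transaction ID would close the window.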
Insert the data back into table t to complete the recovery
postgres# insert into t select * from t_del;
INSERT 0 1000
postgres# select count(*) from t;
 count
-------
  1000
(1 row)
Run the following SQL so that the configuration change takes effect:
select pg_reload_conf();
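Summary
Backups are essential.
Because of abase's vacuum mechanism, deleted data is not removed immediately; it is only flagged as deleted. Once vacuum has cleaned up that data, it can no longer be recovered.
The minimum interval between autovacuum runs is controlled by the autovacuum_naptime parameter, 1 minute by default.
When autovacuum runs, whether a table is vacuumed is decided jointly by the autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold parameters. A table is vacuumed only when the number of dead tuples exceeds autovacuum_vacuum_scale_factor * reltuples (the number of rows in the table) + autovacuum_vacuum_threshold.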
postgres# show autovacuum_naptime;
 autovacuum_naptime
--------------------
 1min
postgres# show autovacuum_vacuum_scale_factor;
 autovacuum_vacuum_scale_factor
--------------------------------
 0.2
(1 row)
postgres# show autovacuum_vacuum_threshold;
 autovacuum_vacuum_threshold
-----------------------------
 50
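With the values shown above, the trigger condition can be worked through for the 1000-row test table. The sketch below only restates the formula from the summary; the function names are illustrative, and the strict comparison follows the PostgreSQL documentation, which describes the dead-tuple count as having to exceed the threshold.

```python
def vacuum_threshold(reltuples, base_threshold=50, scale_factor=0.2):
    """autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples."""
    return base_threshold + scale_factor * reltuples


def needs_vacuum(dead_tuples, reltuples, base_threshold=50, scale_factor=0.2):
    """True when the dead-tuple count exceeds the vacuum threshold."""
    return dead_tuples > vacuum_threshold(reltuples, base_threshold, scale_factor)


print(vacuum_threshold(1000))      # about 250 for a 1000-row table
print(needs_vacuum(1000, 1000))    # deleting all 1000 rows qualifies the table
print(needs_vacuum(200, 1000))     # 200 dead rows stay below the threshold
```

Deleting every row of the test table therefore makes it a vacuum candidate at the next autovacuum cycle, which can start within autovacuum_naptime, so the window for this recovery method may be as short as a minute.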