In PostgreSQL, an index only scan does not always scan only the index
PostgreSQL has supported index only scans since version 9.2. Unfortunately, not every index only scan avoids visiting the table.
postgres=# create table t1(a int,b int,c int);
CREATE TABLE
postgres=# insert into t1 select a.*,a.*,a.* from generate_series(1,1000000) a;
INSERT 0 1000000
postgres-# \d+ t1
                                    Table "public.t1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
 b      | integer |           |          |         | plain   |              |
 c      | integer |           |          |         | plain   |              |

postgres-#
A query like the following, for which no index is available, has to read the whole table to get its data:
postgres=# explain (analyze,buffers,costs off) select a from t1 where b = 5;
                                QUERY PLAN
---------------------------------------------------------------------------
 Gather (actual time=1.069..70.557 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=5406
   ->  Parallel Seq Scan on t1 (actual time=11.805..34.050 rows=0 loops=3)
         Filter: (b = 5)
         Rows Removed by Filter: 333333
         Buffers: shared hit=5406
 Planning Time: 0.414 ms
 Execution Time: 70.612 ms
(10 rows)

postgres=#
Here PostgreSQL is right to choose a parallel sequential scan. Without an index, the only other option would be a serial sequential scan. Normally, though, we would create an index on the table.
postgres=# create index i1 on t1(b);
CREATE INDEX
postgres=# \d t1
                 Table "public.t1"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | integer |           |          |
Indexes:
    "i1" btree (b)
Now the index can be used to return the data:
postgres=# explain (analyze,buffers,costs off) select a from t1 where b = 5;
                             QUERY PLAN
---------------------------------------------------------------------
 Index Scan using i1 on t1 (actual time=0.066..0.068 rows=1 loops=1)
   Index Cond: (b = 5)
   Buffers: shared hit=1 read=3
 Planning Time: 0.773 ms
 Execution Time: 0.128 ms
(5 rows)

postgres=#
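The execution plan shows that the index is used, but PostgreSQL still has to visit the table to get the value of column a. We can also create an index that contains all the columns the query needs: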
postgres=# create index i2 on t1(b,a);
CREATE INDEX
postgres=# \d+ t1
                                    Table "public.t1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
 b      | integer |           |          |         | plain   |              |
 c      | integer |           |          |         | plain   |              |
Indexes:
    "i1" btree (b)
    "i2" btree (b, a)

postgres=#
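Since the file paths in this test point to a PostgreSQL 11 installation, a covering index could also be declared with the INCLUDE clause, which became available in that release. The following is only a sketch: the index name i3 is hypothetical and not part of the original test, and the rest of this walkthrough keeps using i2.

```sql
-- Hypothetical alternative to i2 (assumes PostgreSQL 11 or later).
-- b is the search key; a is stored only as a non-key payload column.
create index i3 on t1(b) include (a);

-- Drop it again so that the plans that follow still choose i2.
drop index i3;
```

With INCLUDE, column a is carried in the index purely so that an index only scan can return it. It does not take part in the ordering of the index.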
Let's look at how the same query executes now:
postgres=# explain (analyze,buffers,costs off) select a from t1 where b = 5;
                                QUERY PLAN
--------------------------------------------------------------------------
 Index Only Scan using i2 on t1 (actual time=0.346..0.353 rows=1 loops=1)
   Index Cond: (b = 5)
   Heap Fetches: 1
   Buffers: shared hit=1 read=3
 Planning Time: 0.402 ms
 Execution Time: 0.401 ms
(6 rows)

postgres=#
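The plan is now an index only scan, but it still reports Heap Fetches: 1.
Why? To answer that, let's first look at the files that make up table t1 on disk: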
postgres=# select pg_relation_filepath('t1');
 pg_relation_filepath
----------------------
 base/13878/74982
(1 row)

postgres=# \! ls -l /pg/11/data/base/13878/74982*
-rw------- 1 postgres postgres 44285952 Oct 31 15:12 /pg/11/data/base/13878/74982
-rw------- 1 postgres postgres    32768 Oct 31 15:08 /pg/11/data/base/13878/74982_fsm
postgres=#
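The table has a free space map file, but no visibility map file yet. Without a visibility map, PostgreSQL cannot know whether all rows on a page are visible to the current transaction, so it has to visit the table to check. Once the visibility map has been created: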
postgres=# vacuum t1;
VACUUM
postgres=# \! ls -l /pg/11/data/base/13878/74982*
-rw------- 1 postgres postgres 44285952 Oct 31 15:12 /pg/11/data/base/13878/74982
-rw------- 1 postgres postgres    32768 Oct 31 15:08 /pg/11/data/base/13878/74982_fsm
-rw------- 1 postgres postgres     8192 Oct 31 15:39 /pg/11/data/base/13878/74982_vm
postgres=# explain (analyze,buffers,costs off) select a from t1 where b = 5;
                                QUERY PLAN
--------------------------------------------------------------------------
 Index Only Scan using i2 on t1 (actual time=0.044..0.045 rows=1 loops=1)
   Index Cond: (b = 5)
   Heap Fetches: 0
   Buffers: shared hit=4
 Planning Time: 0.230 ms
 Execution Time: 0.102 ms
(6 rows)

postgres=#
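This time the plan reports Heap Fetches: 0.
No data was fetched from the table, so this really is an index only scan (although the visibility map is still consulted).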
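One way to see how much of the table the visibility map covers, without looking at files on disk, is the relallvisible counter in pg_class. This is a minimal sketch assuming the same table t1. The counter is an estimate maintained by vacuum and analyze, so it can lag behind the real state.

```sql
-- relpages:      estimated number of pages in the table
-- relallvisible: estimated number of pages marked all-visible in the visibility map
select relname, relpages, relallvisible
from pg_class
where relname = 't1';
```

The closer relallvisible is to relpages, the fewer heap visits an index only scan on this table should need.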
To make this clearer, let's look at the physical location of the row:
postgres=# select ctid,* from t1 where b=5;
 ctid  | a | b | c
-------+---+---+---
 (0,5) | 5 | 5 | 5
(1 row)

postgres=#
The row is in block 0, at item 5. Let's check whether all rows in that block are visible to all transactions:
postgres=# create extension pg_visibility;
CREATE EXTENSION
postgres=# select pg_visibility_map('t1'::regclass, 0);
 pg_visibility_map
-------------------
 (t,f)
(1 row)

postgres=#
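The first value, t, means that every row in the block is visible to all transactions (the second value is the all-frozen flag). What happens if we update a row in another session?
In session 2, run: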
postgres=# update t1 set a=8 where b=5;
UPDATE 1
postgres=#
Back in the original session, check again:
postgres=# select pg_visibility_map('t1'::regclass, 0);
 pg_visibility_map
-------------------
 (f,f)
(1 row)

postgres=#
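Two things follow from this:
1. Modifying the page cleared its all-visible bit in the visibility map.
2. The index only scan has to go back to the table.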
postgres=# explain (analyze,buffers,costs off) select a from t1 where b = 5;
                                QUERY PLAN
--------------------------------------------------------------------------
 Index Only Scan using i2 on t1 (actual time=0.080..0.082 rows=1 loops=1)
   Index Cond: (b = 5)
   Heap Fetches: 2
   Buffers: shared hit=6 dirtied=3
 Planning Time: 0.132 ms
 Execution Time: 0.120 ms
(6 rows)

postgres=#
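To see which pages lost their all-visible bit after the update, the set-returning form of pg_visibility_map can be filtered. This is a sketch that assumes the pg_visibility extension created earlier. Which block numbers come back depends on where the new row version was placed.

```sql
-- One row per heap block; keep only blocks that are no longer all-visible.
select blkno, all_visible, all_frozen
from pg_visibility_map('t1'::regclass)
where not all_visible;

-- Aggregate view: number of all-visible and all-frozen pages in the table.
select * from pg_visibility_map_summary('t1'::regclass);
```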
The question now is: why Heap Fetches: 2?
First of all, every update in PostgreSQL creates a new row version:
postgres=# select ctid,* from t1 where b=5;
   ctid    | a | b | c
-----------+---+---+---
 (5405,76) | 8 | 5 | 5
(1 row)

postgres=#
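The current version of the row now lives in a different block (even if it had stayed in the same block, it would occupy a different item). This also affects the index entries that point to the row. A HOT update, which avoids creating new index entries, is only possible when no indexed column changes and the new version fits on the same page. Neither holds here: column a is part of i2, and the new version landed in block 5405. The index therefore contains one entry for the old version and one for the new version, and since neither block is marked all-visible any more, the scan has to visit the table for both entries. Hence two heap fetches.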
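The old row version can be inspected directly with the pageinspect extension. This is a sketch that assumes superuser access and that the old version still sits at item 5 of block 0, as the earlier ctid output showed. Once the page has been pruned or vacuumed, the item may show up as dead or unused, with NULL header fields.

```sql
create extension if not exists pageinspect;

-- lp:     item number within the page
-- t_xmax: transaction that updated or deleted this row version
-- t_ctid: location of the next version of the row (its own location if it is the latest)
select lp, lp_flags, t_xmin, t_xmax, t_ctid
from heap_page_items(get_raw_page('t1', 0))
where lp = 5;
```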
On the next execution, the table is visited only once:
postgres=# explain (analyze,buffers,costs off) select a from t1 where b = 5;
                                QUERY PLAN
--------------------------------------------------------------------------
 Index Only Scan using i2 on t1 (actual time=0.039..0.042 rows=1 loops=1)
   Index Cond: (b = 5)
   Heap Fetches: 1
   Buffers: shared hit=5
 Planning Time: 0.112 ms
 Execution Time: 0.071 ms
(6 rows)

postgres=#
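The plan itself does not say why the count dropped to one. A likely explanation is that the previous scan found the old row version dead and marked its index entry as killed, so this scan skips it. The remaining fetch is for the new version, whose block stays unmarked in the visibility map until the next vacuum.
The point to take away is that index only scans do not always scan only the index.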
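As a sketch of how to get back to zero heap fetches in this test: running vacuum again lets PostgreSQL clean up the dead row version and set the all-visible bits once more. Block 5405 is taken from the ctid output above. The exact plan numbers will vary, and a long-running transaction in another session can prevent the bits from being set.

```sql
vacuum t1;

-- Both blocks touched by the update should be reported as all-visible again
-- (assumes no other session still holds an old snapshot).
select blkno, all_visible
from pg_visibility_map('t1'::regclass)
where blkno in (0, 5405);

explain (analyze,buffers,costs off) select a from t1 where b = 5;
```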