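PostgreSQL vs. MSSQL: comparing IN-list behavior
Background
While tuning a SQL Server query, I found that a statement with 500 values in its IN list was very slow, yet after deleting most of the ids from the list it went back to a fast index seek. So where is the threshold? And how does abase (ArteryBase) behave in the same situation?
SQL Server 2008
Vary the number of values in the IN list and check how the execution plan changes.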
Table row count: CEDCLASSDTL: 549444

--1. Execution plan with 1000 ids in the IN list:
SET STATISTICS TIME ON
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
......(1000 ids in total, not all listed)

  |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1008],0)))
       |--Stream Aggregate(DEFINE:([Expr1008]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1004], [Expr1007]) WITH UNORDERED PREFETCH)
                 |--Constant Scan(VALUES:(({guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'}),({guid'2AD6A41F-2C42-4BA5-BF96-0042AEE35817'}),({guid'...
                 |--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]=[Expr1004]) ORDERED FORWARD)

--2. Execution plan with 65 ids in the IN list:
SET STATISTICS TIME ON
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
......(65 ids in total, not all listed)

  |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1008],0)))
       |--Stream Aggregate(DEFINE:([Expr1008]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1004], [Expr1007]) WITH UNORDERED PREFETCH)
                 |--Constant Scan(VALUES:(({guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'}),({guid'2AD6A41F-2C42-4BA5-BF96-0042AEE35817'}),({guid'70CC515F-34D8-4D06-B3EA-00476D3DCE01'}),({guid'...
                 |--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]=[Expr1004]) ORDERED FORWARD)

--3. Execution plan with 64 ids in the IN list:
SET STATISTICS TIME ON
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
......(64 ids in total, not all listed)

  |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1004],0)))
       |--Stream Aggregate(DEFINE:([Expr1004]=Count(*)))
            |--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]={guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'} OR [ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK..

--4. Execution plan with 10 ids in the IN list:
SET STATISTICS TIME ON
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
......(10 ids in total, not all listed)

  |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1004],0)))
       |--Stream Aggregate(DEFINE:([Expr1004]=Count(*)))
            |--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]={guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'} OR [ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK...
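Plans 1 and 2 show that once the IN clause holds more than 64 values, SQL Server builds an internal constant table (the Constant Scan) and joins it to the index with Nested Loops, performing one Index Seek on cedclassfk per value. The more values the IN list contains, the more expensive the nested loops become. The optimizer may also sort the constant table and use a Merge Join instead, which is not very efficient either.
Plans 3 and 4 show that with 64 or fewer values, the list is expanded into OR'ed predicates inside a single Index Seek, with no join at all.
Repeated tests confirmed the threshold: with 64 or fewer values the plan is a single Index Seek, and with more than 64 it switches to the Constant Scan plus Nested Loops form. The index is still used in that form, but once per value rather than in one seek.
(For very large lists, for example more than 10,000 values in the IN list, merely parsing the statement is expensive, and a temporary table is preferable.)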
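The temporary-table alternative mentioned above can be sketched as follows. This is only an illustration, not the setup used in the tests: the table name #ids is hypothetical, and the uniqueidentifier column type is an assumption based on the guid literals shown in the plans.

```sql
-- Hypothetical temp table holding the ids that would otherwise go in the IN list.
-- The PRIMARY KEY keeps ids unique, so the join returns the same count as IN would.
CREATE TABLE #ids (id uniqueidentifier NOT NULL PRIMARY KEY);

-- A single INSERT ... VALUES accepts at most 1000 rows,
-- so a longer list has to be loaded in batches.
INSERT INTO #ids (id) VALUES
    ('882EDE67-F19E-4E3A-B75E-003BF34FB2C9'),
    ('2AD6A41F-2C42-4BA5-BF96-0042AEE35817'),
    ('70CC515F-34D8-4D06-B3EA-00476D3DCE01');  -- ...remaining ids

-- Join against the temp table instead of listing the values inline.
SELECT COUNT(*)
FROM CEDCLASSDTL AS d
JOIN #ids AS i ON i.id = d.cedclassfk;

DROP TABLE #ids;
```

Because the temp table carries its own statistics and index, the optimizer can choose the join strategy from real row counts instead of parsing and costing thousands of literals.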
ArteryBase 3.5.3
--Sync the data to abase
--Execution plan with 1000 ids in the IN list
explain analyze select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
'09C34234-7D19-4501-AB1A-0049AD22E76F',
...--not all listed)

Aggregate  (cost=9822.67..9822.68 rows=1 width=0) (actual time=31.761..31.761 rows=1 loops=1)
  ->  Index Only Scan using ix_cedclassfk on cedclassdtl  (cost=0.42..9446.54 rows=150454 width=0) (actual time=0.278..26.161 rows=120662 loops=1)
        Index Cond: (cedclassfk = ANY ('{882EDE67-F19E-4E3A-B75E-003BF34FB2C9,2AD6A41F-2C42-4BA5-BF96-0042AEE35817,70CC515F-34D8-4D06-B3EA-00476D3DCE01,09C34234-7D19-4501-AB1A-0049AD22E76F,E30BE907-C42E-4DD8-A9A7-004BBE65962E,96375C84-CFDA-450E-A95A-0061B0398D17,2A8EF0A8-DC91-4D1A-B7D2-0083651305B6,43A07E25-6F0B-4157-A1CA-008DFF8EB402,2CF8262F-3DBF-4201-A3AC-009143E3BBFC,....(1000 ids)}'::bpchar[]))
        Heap Fetches: 0
Planning time: 55.979 ms
Execution time: 35.019 ms
With 1000 ids, abase still uses the index (an Index Only Scan).
How many ids can an IN list hold in abase before that stops being true?
--Create the table
db_ntjers=# create table tab(id int);
CREATE TABLE
--Insert 100,000 rows
db_ntjers=# insert into tab select generate_series(1,100000);
INSERT 0 100000
db_ntjers=# create index i_tab_id on tab(id);
CREATE INDEX
--Build the id string
select string_agg(id::varchar,',') from (select id from tab limit 1000)ta

--Execution plan with 1,000 ids
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....1000)
Index Only Scan using i_tab_id on tab  (cost=0.29..1422.00 rows=1000 width=4) (actual time=0.251..1.097 rows=1000 loops=1)
  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34...1000}'::integer[]))
  Heap Fetches: 1000
Planning time: 0.822 ms
Execution time: 1.184 ms

--Execution plan with 10,000 ids
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....10000)
Index Only Scan using i_tab_id on tab  (cost=0.29..4252.00 rows=10000 width=4) (actual time=0.018..8.568 rows=10000 loops=1)
  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36...9999,10000}'::integer[]))
  Heap Fetches: 10000
Planning time: 5.449 ms
Execution time: 9.110 ms

--Execution plan with 100,000 ids
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....100000)
Index Only Scan using i_tab_id on tab  (cost=0.29..32550.00 rows=100000 width=4) (actual time=0.024..110.276 rows=100000 loops=1)
  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36...99999,100000}'::integer[]))
  Heap Fetches: 100000
Planning time: 67.080 ms
Execution time: 116.627 ms

--Execution plan with 1,000,000 ids (the plan returns 1,000,000 rows, so tab must have been reloaded with more rows; that step is not shown)
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....1000000)
Index Only Scan using i_tab_id on tab  (cost=0.42..452602.21 rows=632121 width=4) (actual time=0.300..1443.868 rows=1000000 loops=1)
  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48(...)}')
  Heap Fetches: 1000000
Planning time: 713.382 ms
Execution time: 1507.180 ms

--The same holds for a char column, 100,000 ids
--(the single quotes are escaped by doubling them, which works regardless of the standard_conforming_strings setting)
db_ntjers=# select string_agg(''''||c_bh||'''',',') from (select c_bh from tabl_uuid limit 100000)ta
db_ntjers=# explain analyze select * from tabl_uuid where c_bh in('da370980-559e-4d7c-af1a-59ac381e5bdd','2415c46d-8ccc-4b21-9901-8248023edab6',...)
Index Only Scan using i_tabl_uuid_c_bh on tabl_uuid  (cost=0.42..49731.99 rows=100000 width=37) (actual time=0.239..723.233 rows=100000 loops=1)
  Index Cond: (c_bh = ANY ('{da370980-559e-4d7c-af1a-59ac381e5bdd,2415c46d-8ccc-4b21-9901-8248023edab6,ad55941d-e9fa-4e0b-be39-dd0855514ebd,..'::bpchar[]))
  Heap Fetches: 100000
Planning time: 86.628 ms
Execution time: 1286.896 ms
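As the plans show, abase uses the index no matter how many values the IN list contains. They also show the cost difference between int and char keys: for the same 100,000 ids, the int query is roughly 10 times faster than the char one (116.627 ms vs. 1286.896 ms execution time).
In these tests, however, the ids were all in order. What happens if they are unordered?
Unordered id string
--Build an unordered id string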
select string_agg((random()*100000)::int::text, ',') into arr from generate_series(1, 100000);
--Execution plan with 100,000 unordered ids
db_ntjers=#explain analyze select * from tab where id in(16008,48047,80229,42403,86136,85790,15910,47880,34498,58849,88691,69997,49239,46005....)
Index Only Scan using i_tab_id on tab (cost=0.29..32550.00 rows=100000 width=4) (actual time=0.179..63.821 rows=63232 loops=1)
Index Cond: (id = ANY ('{16008,48047,80229,42403,86136,85790,15910,47880,34498,58849,88691,69997,49239,46005,....85832}'::integer[]))
Heap Fetches: 63232
Planning time: 76.747 ms
Execution time: 99.788 ms
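Unordered ids also use the index. Some readers may notice that the unordered case is a little faster. That is because the list was built with random(), so it contains duplicate ids, and the query ends up returning only 63,232 rows instead of 100,000.
Why does abase behave this way?
An IN list is actually converted to = ANY(array), so writing the = ANY form directly is more efficient. With fewer than 10,000 values in the list, the difference between IN and = ANY() is small. At 100,000 and 1,000,000 values it becomes noticeable, and it keeps growing with the size of the list.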
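The duplicate rate can be checked directly. Drawing n values uniformly from about n possibilities yields roughly n(1 - 1/e), or about 63% of n, distinct values, which is consistent with the 63,232 rows returned. The query below is a minimal sketch that repeats the same random() expression and counts the distinct results; the aliases drawn, distinct_ids and t are only illustrative, and the exact count differs on every run.

```sql
-- Draw 100000 random ids the same way as the test above
-- and compare the number drawn with the number of distinct values.
select count(*)           as drawn,
       count(distinct id) as distinct_ids
from (
    select (random() * 100000)::int as id
    from generate_series(1, 100000)
) t;
```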
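The two forms can be compared side by side on the tab table created earlier. This is a minimal sketch with a three-element list for readability; the prepared-statement name q is hypothetical, and the timing gap described above would only show up with far longer lists.

```sql
-- IN list: every element is parsed as a separate constant,
-- and the plan then shows it as id = ANY ('{...}'::integer[]).
explain analyze select * from tab where id in (1, 2, 3);

-- Explicit form: the whole list is passed as a single array literal.
explain analyze select * from tab where id = any ('{1,2,3}'::int[]);

-- From an application, the list can be bound as one array parameter.
prepare q(int[]) as select * from tab where id = any ($1);
execute q('{1,2,3}');
deallocate q;
```

Both queries should show the same Index Cond in the plan; what differs is how much work the parser does before planning starts.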
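Conclusion
1. In SQL Server, an IN list with 64 or fewer values runs as a single Index Seek. Beyond 64 values the plan switches to Constant Scan plus Nested Loops, and efficiency drops as the list grows.
2. In abase, IN is converted to = ANY(array), and the plan depends little on the number of values in the list.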