Comparing IN lists in PostgreSQL and MSSQL

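Background

While tuning a SQL Server query, I found that an IN list with 500 values ran very slowly, yet once most of the ids were removed from the list the query could use the index efficiently again. So where is the threshold? And how does abase (ArteryBase) behave in the same situation?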

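SQL Server 2008

Vary the number of values in the IN list and check how the execution plan uses the index.

Table row count: CEDCLASSDTL: 549444

--1. Execution plan with 1000 ids in the IN list: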
SET STATISTICS TIME ON 
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
...... -- 1000 ids in total, not all listed
)
  |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1008],0)))
       |--Stream Aggregate(DEFINE:([Expr1008]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1004], [Expr1007]) WITH UNORDERED PREFETCH)
                 |--Constant Scan(VALUES:(({guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'}),({guid'2AD6A41F-2C42-4BA5-BF96-0042AEE35817'}),({guid'...
                 |--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]=[Expr1004]) ORDERED FORWARD)
--2. Execution plan with 65 ids in the IN list:
SET STATISTICS TIME ON 
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
...... -- 65 ids in total, not all listed
)
    |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1008],0)))
       |--Stream Aggregate(DEFINE:([Expr1008]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1004], [Expr1007]) WITH UNORDERED PREFETCH)
                 |--Constant Scan(VALUES:(({guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'}),({guid'2AD6A41F-2C42-4BA5-BF96-0042AEE35817'}),({guid'70CC515F-34D8-4D06-B3EA-00476D3DCE01'}),({guid'...
                 |--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]=[Expr1004]) ORDERED FORWARD)
--3. Execution plan with 64 ids in the IN list:
SET STATISTICS TIME ON 
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
...... -- 64 ids in total, not all listed
)
   |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1004],0)))
       |--Stream Aggregate(DEFINE:([Expr1004]=Count(*)))
            |--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]={guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'} OR [ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK..
--4. Execution plan with 10 ids in the IN list:
SET STATISTICS TIME ON
go
SET STATISTICS IO ON
go
SET STATISTICS PROFILE ON
GO
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
...... -- 10 ids in total, not all listed
)
    |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1004],0)))
       |--Stream Aggregate(DEFINE:([Expr1004]=Count(*)))
            |--Index Seek(OBJECT:([ntjers].[dbo].[CEDCLASSDTL].[IX_CEDCLASSFK]), SEEK:([ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK]={guid'882EDE67-F19E-4E3A-B75E-003BF34FB2C9'} OR [ntjers].[dbo].[CEDCLASSDTL].[CEDCLASSFK...

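Plans 1 and 2 show that once the IN list holds more than 64 values, SQL Server no longer expands it into seek predicates. It builds an internal constant table (the Constant Scan operator) and joins it to the index with Nested Loops, running one Index Seek on IX_CEDCLASSFK per value. The more values the list holds, the longer the nested loop takes. The optimizer may also sort the constant table and use a Merge Join instead, which is not very efficient either.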

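Plans 3 and 4 show that with 64 or fewer values the whole list is folded into a single Index Seek on IX_CEDCLASSFK whose seek predicate is a chain of OR conditions.

Repeated tests put the threshold at exactly 64: up to 64 values the query runs as that single Index Seek; from 65 values on, the index is still used, but only through the Constant Scan plus Nested Loops plan shown in 1 and 2.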
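
As a rough illustration of the 64-value threshold, the Python sketch below splits a long id list into batches of at most 64 and builds one parameterized IN query per batch. The helper names, the `?` placeholder style (as used by ODBC drivers) and the idea of batching itself are assumptions for illustration; the tests above did not measure whether batching beats one long IN list.

```python
from typing import Iterator, List, Sequence, Tuple

# Threshold observed in the SQL Server 2008 tests above; other versions may differ.
MAX_IN_VALUES = 64


def chunk_ids(ids: Sequence[str], size: int = MAX_IN_VALUES) -> Iterator[List[str]]:
    """Yield consecutive slices of ids, each holding at most `size` values."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def build_in_queries(ids: Sequence[str]) -> List[Tuple[str, List[str]]]:
    """Return one (sql, params) pair per batch of at most 64 ids."""
    queries = []
    for batch in chunk_ids(ids):
        # One "?" placeholder per id keeps the values out of the SQL text.
        placeholders = ", ".join("?" for _ in batch)
        sql = f"select count(*) from CEDCLASSDTL where cedclassfk in ({placeholders})"
        queries.append((sql, batch))
    return queries
```

The caller would run each query and sum the counts. Duplicate ids that land in different batches would be counted twice, so the list should be de-duplicated first.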

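(For very large lists, for example more than 10,000 values in the IN list, parsing alone is very expensive, and loading the values into a temporary table is preferable.)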
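
The temporary-table alternative could look roughly like the Python sketch below, which only generates the T-SQL statements and their parameters. The temp table name `#ids`, the helper name, and the `uniqueidentifier` column type (inferred from the `{guid'...'}` constants in the plans) are assumptions. Each INSERT is capped at 1000 rows because SQL Server limits a single `INSERT ... VALUES` list to 1000 rows.

```python
from typing import List, Sequence, Tuple

# SQL Server accepts at most 1000 rows in one INSERT ... VALUES list.
ROWS_PER_INSERT = 1000


def temp_table_statements(ids: Sequence[str]) -> List[Tuple[str, List[str]]]:
    """Return (sql, params) pairs: create #ids, fill it in batches, then join."""
    # Assumed column type: the plans above show the ids as guid constants.
    stmts: List[Tuple[str, List[str]]] = [
        ("create table #ids (id uniqueidentifier primary key)", [])
    ]
    for start in range(0, len(ids), ROWS_PER_INSERT):
        batch = list(ids[start:start + ROWS_PER_INSERT])
        values = ", ".join("(?)" for _ in batch)
        stmts.append((f"insert into #ids (id) values {values}", batch))
    # The join replaces the long IN list.
    stmts.append(
        ("select count(*) from CEDCLASSDTL d join #ids i on d.cedclassfk = i.id", [])
    )
    return stmts
```

All statements have to run on the same connection, since a `#` temp table is only visible to the session that created it. The primary key also means the id list must not contain duplicates.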

ArteryBase 3.5.3

--Sync the data to abase
--Execution plan with 1000 ids in the IN list
explain analyze 
select count(*) from CEDCLASSDTL where cedclassfk in (
'882EDE67-F19E-4E3A-B75E-003BF34FB2C9',
'2AD6A41F-2C42-4BA5-BF96-0042AEE35817',
'70CC515F-34D8-4D06-B3EA-00476D3DCE01',
'09C34234-7D19-4501-AB1A-0049AD22E76F',
  ... -- not all listed
)
Aggregate  (cost=9822.67..9822.68 rows=1 width=0) (actual time=31.761..31.761 rows=1 loops=1)
  ->  Index Only Scan using ix_cedclassfk on cedclassdtl  (cost=0.42..9446.54 rows=150454 width=0) (actual time=0.278..26.161 rows=120662 loops=1)
        Index Cond: (cedclassfk = ANY ('{882EDE67-F19E-4E3A-B75E-003BF34FB2C9,2AD6A41F-2C42-4BA5-BF96-0042AEE35817,70CC515F-34D8-4D06-B3EA-00476D3DCE01,09C34234-7D19-4501-AB1A-0049AD22E76F,E30BE907-C42E-4DD8-A9A7-004BBE65962E,96375C84-CFDA-450E-A95A-0061B0398D17,2A8EF0A8-DC91-4D1A-B7D2-0083651305B6,43A07E25-6F0B-4157-A1CA-008DFF8EB402,2CF8262F-3DBF-4201-A3AC-009143E3BBFC,....(1000 ids)}'::bpchar[]))
        Heap Fetches: 0
Planning time: 55.979 ms
Execution time: 35.019 ms

With 1000 ids, abase still uses the index (an Index Only Scan).

How many ids can an abase IN list handle?

--Create the table
db_ntjers=# create table tab(id int);
CREATE TABLE
--Insert 100,000 rows
db_ntjers=# insert into tab select generate_series(1,100000);  
INSERT 0 100000
db_ntjers=# create index i_tab_id on tab(id);
CREATE INDEX
--Build the id string
select string_agg(id::varchar,',') from (select id from tab limit 1000) ta;
--Execution plan with 1000 ids
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....1000)
Index Only Scan using i_tab_id on tab  (cost=0.29..1422.00 rows=1000 width=4) (actual time=0.251..1.097 rows=1000 loops=1)
  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34...1000}'::integer[]))
  Heap Fetches: 1000
Planning time: 0.822 ms
Execution time: 1.184 ms
--Execution plan with 10,000 ids
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....10000)
Index Only Scan using i_tab_id on tab  (cost=0.29..4252.00 rows=10000 width=4) (actual time=0.018..8.568 rows=10000 loops=1)
  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36...9999,10000}'::integer[]))
  Heap Fetches: 10000
Planning time: 5.449 ms
Execution time: 9.110 ms
--Execution plan with 100,000 ids
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....100000)
Index Only Scan using i_tab_id on tab  (cost=0.29..32550.00 rows=100000 width=4) (actual time=0.024..110.276 rows=100000 loops=1)
  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36...99999,100000}'::integer[]))
  Heap Fetches: 100000
Planning time: 67.080 ms
Execution time: 116.627 ms
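--Execution plan with 1,000,000 ids (the plan returns 1,000,000 rows, so tab must hold at least that many rows by this point)
db_ntjers=# explain analyze select * from tab where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....1000000)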
Index Only Scan using i_tab_id on tab  (cost=0.42..452602.21 rows=632121 width=4) (actual time=0.300..1443.868 rows=1000000 loops=1)
  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48(...)}')
  Heap Fetches: 1000000
Planning time: 713.382 ms
Execution time: 1507.180 ms
               
--The same holds for a char column, 100,000 ids
db_ntjers=# select string_agg('''' || c_bh || '''', ',') from (select c_bh from tabl_uuid limit 100000) ta;
db_ntjers=# explain analyze select * from tabl_uuid where c_bh in('da370980-559e-4d7c-af1a-59ac381e5bdd','2415c46d-8ccc-4b21-9901-8248023edab6',...)
Index Only Scan using i_tabl_uuid_c_bh on tabl_uuid  (cost=0.42..49731.99 rows=100000 width=37) (actual time=0.239..723.233 rows=100000 loops=1)
  Index Cond: (c_bh = ANY ('{da370980-559e-4d7c-af1a-59ac381e5bdd,2415c46d-8ccc-4b21-9901-8248023edab6,ad55941d-e9fa-4e0b-be39-dd0855514ebd,..'::bpchar[]))
  Heap Fetches: 100000
Planning time: 86.628 ms
Execution time: 1286.896 ms

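These tests show abase using the index no matter how many values the IN list holds, all the way up to 1,000,000. They also show the cost gap between int and char keys: for the same 100,000 ids, the int query (116.627 ms) runs roughly 11 times faster than the char query (1286.896 ms).

In these tests, however, the ids were all in sorted order. What happens when they are not?

Unordered id list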
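
The arithmetic behind that comparison can be checked with a few lines of Python. The figures are the Execution time values copied from the EXPLAIN ANALYZE output above; the variable and function names are made up for this sketch.

```python
# Execution times in ms, copied from the EXPLAIN ANALYZE output above.
INT_RUNS = {1_000: 1.184, 10_000: 9.110, 100_000: 116.627, 1_000_000: 1507.180}
CHAR_100K_MS = 1286.896


def per_id_microseconds(runs):
    """Convert total execution time (ms) into microseconds per id in the list."""
    return {n: ms * 1000.0 / n for n, ms in runs.items()}


per_id = per_id_microseconds(INT_RUNS)
char_vs_int = CHAR_100K_MS / INT_RUNS[100_000]

for n, us in per_id.items():
    print(f"{n:>9} ids: {us:.3f} us per id")
print(f"char / int at 100000 ids: {char_vs_int:.1f}x")
```

On these numbers the int runs stay at roughly 0.9 to 1.5 microseconds per id across three orders of magnitude, which is what "still using the index" looks like in practice, while the char run at 100,000 ids is about 11 times slower than its int counterpart.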

--Build an unordered id string
select string_agg((random()*100000)::int::text, ',') into arr from generate_series(1, 100000);  
--Execution plan with 100,000 unordered ids
db_ntjers=# explain analyze select * from tab where id in(16008,48047,80229,42403,86136,85790,15910,47880,34498,58849,88691,69997,49239,46005....)
Index Only Scan using i_tab_id on tab  (cost=0.29..32550.00 rows=100000 width=4) (actual time=0.179..63.821 rows=63232 loops=1)
  Index Cond: (id = ANY ('{16008,48047,80229,42403,86136,85790,15910,47880,34498,58849,88691,69997,49239,46005,....85832}'::integer[]))
  Heap Fetches: 63232
Planning time: 76.747 ms
Execution time: 99.788 ms

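Unordered ids use the index as well. You may notice that the unordered run is slightly faster. That is because the list was generated with random(), so it contains duplicate ids, and the query returns only 63,232 rows instead of 100,000.

Why does abase behave this way?

An IN list is actually rewritten to = ANY(array), as the Index Cond lines in the plans above show, so writing = ANY with an array directly is more efficient. Below roughly 10,000 values there is little difference between IN and = ANY(); at 100,000 and 1,000,000 values the gap becomes noticeable, and it widens as the list grows.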
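
The 63,232 figure is close to what probability predicts. Drawing n times uniformly from n values leaves about n(1 - (1 - 1/n)^n), or roughly n(1 - 1/e), distinct values. The Python sketch below computes that expectation and runs a small simulation that mimics `(random()*100000)::int`; the function names and the fixed seed are arbitrary choices for illustration.

```python
import random


def expected_distinct(n_draws: int, n_values: int) -> float:
    """Expected number of distinct values after n_draws uniform draws from n_values."""
    return n_values * (1.0 - (1.0 - 1.0 / n_values) ** n_draws)


def simulate_distinct(n_draws: int, n_values: int, seed: int = 0) -> int:
    """Mimic (random()*n_values)::int and count distinct results in 1..n_values."""
    rng = random.Random(seed)
    # The cast to int rounds, so results range over 0..n_values.
    drawn = {round(rng.random() * n_values) for _ in range(n_draws)}
    # Only ids 1..n_values exist in tab, so 0 never matches a row.
    return sum(1 for v in drawn if 1 <= v <= n_values)


N = 100_000
print(f"expected distinct ids:  {expected_distinct(N, N):.0f}")
print(f"simulated distinct ids: {simulate_distinct(N, N)}")
```

The expectation works out to about 63,212, so the 63,232 rows returned above are in line with a list of 100,000 random ids that contains many repeats.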
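
A minimal Python sketch of building the = ANY form directly is shown below. The helper names are hypothetical. The `%s` placeholder assumes a driver in the style of psycopg2, which adapts a Python list to a PostgreSQL array; whether that driver works against ArteryBase is an assumption, not something tested in this article.

```python
from typing import List, Sequence, Tuple


def build_any_query(table: str, column: str, ids: Sequence[int]) -> Tuple[str, Tuple[List[int]]]:
    """Return (sql, params) that passes the whole id list as one array parameter."""
    # table and column are interpolated as-is, so they must be trusted identifiers.
    sql = f"select * from {table} where {column} = any(%s)"
    return sql, (list(ids),)


def int_array_literal(ids: Sequence[int]) -> str:
    """Render ids as the '{...}'::integer[] literal seen in the Index Cond lines."""
    body = ",".join(str(int(i)) for i in ids)
    return f"'{{{body}}}'::integer[]"


sql, params = build_any_query("tab", "id", [1, 2, 3])
print(sql)                            # select * from tab where id = any(%s)
print(int_array_literal([1, 2, 3]))   # '{1,2,3}'::integer[]
```

With a single array parameter the statement text stays the same length however many ids are passed, so the server no longer has to parse a list of 100,000 separate constants.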

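Conclusion

1. In SQL Server, an IN list with 64 or fewer values runs as a single Index Seek; above 64 the plan switches to Constant Scan plus Nested Loops and gets slower as the list grows.

2. In abase, IN is rewritten to = ANY(array), and the plan depends little on the number of values in the list.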