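A Hidden Danger in Data Warehouse Design: Scalar Subqueries
First, what is a scalar subquery? A subquery that sits after SELECT and before FROM is called a scalar subquery. For example: select num1, cal, (select name from t2 where t2.id = t1.id) from t1; This example is only meant to make the meaning of "scalar" easier to grasp. More formally, a scalar subquery is a subquery that returns exactly one column and at most one row.
The drawback of a scalar subquery is obvious: the driving table is always the outer table t1, and every row returned from t1 is passed to t2 to fetch the result. So if t1 is large (or grows large over time), it causes serious performance problems. [Data warehouse batch jobs should ban scalar subqueries]
Today this SQL ran for 5 hours without returning a result... (I have to admire the patience; I usually wait ten minutes at most.)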
SELECT /*+ NO_USE_HASH(C,B)*/
 C.CUST_ACCT_NO, C.PRIM_ACCT, ACCOUNT_SYSTEM, CUSTOMER_TYPE, CUSTOMER_STATUS,
 CREATE_DT, HOME_BRANCH_NO, COMPANY_SIZE, NOTICE_IND, NOTICE_CUST_NO,
 STMT_FREQUENCY, STMT_CYCLE, STMT_DAY, ID_NO, ID_TYPE, SHORT_NAME,
 EMAIL_ADD1, EMAIL_ADD2, CREDIT_RANKING, TITLE_CODE, NAME1, ADD1, POSTCODE,
 PHONE_NO_RES, PHONE_RES_EXT, PHONE_NO_BUS, PHONE_BUS_EXT, FAX_NO, TELEX_NO,
 PCODE_RGSTER, REGSTR_ADD1, REGSTR_ADD2, PHONE_RGSTR_NO, PHONE_RGSTR_EXT,
 BIRTH_DATE_1, SEX_CODE, EMPLOYER_NAME, EMPLOYED_FROM, EMPLOYER_ADDR,
 OCCUP_DESCRIP, OCCUPATION_CODE, INCOME, INCOME_WMY, COMPANY_NO, BUSINESS_NO,
 LICENCE_NO, BOSS_NAME, BOSS_BDAY, BUS_RGSTR_DATE, CAPITAL_AMT, CONTACT_REL_1,
 PHONE_NO_1, ADD2, ADD3, ADD4, MOBILE_NO, FXSP_TYPE, INDUSTRY_CODE,
 BUS_SECTOR_CODE, CUST_SUB_TYPE, DEP_STMT_TYPE, ID_ISSUE_DATE, ID_EXP_DATE,
 REGISTRY_ADD, ID_ISSUE_PLAC, LST_MNT_DATE,
 B.BRANCH_NO
 FROM CUSM_T C
 INNER JOIN (SELECT DISTINCT CUSTOMER_NO,
 (SELECT SJJGM FROM JGDY H WHERE H.JGM = CB_ACCT.BRANCH_NO) BRANCH_NO
 FROM CB_ACCT) B ON C.CUST_ACCT_NO = B.CUSTOMER_NO;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
The scalar subquery is easy to spot from both the SQL and the plan.
Plan hash value: 2079508004
----------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |           |    18M|    14G|       |  1793K  (1)| 00:04:41 |
|*  1 |  INDEX SKIP SCAN               | JGDY_IDX3 |     1 |     8 |       |     1   (0)| 00:00:01 |
|   2 |  MERGE JOIN                    |           |    18M|    14G|       |  1793K  (1)| 00:04:41 |
|   3 |   SORT JOIN                    |           |    18M|   397M|       |   147K  (1)| 00:00:24 |
|   4 |    VIEW                        |           |    18M|   397M|       |   147K  (1)| 00:00:24 |
|   5 |     HASH UNIQUE                |           |    18M|   380M|  1107M|   147K  (1)| 00:00:24 |
|   6 |      TABLE ACCESS STORAGE FULL | CB_ACCT2  |    36M|   760M|       | 71431   (1)| 00:00:12 |
|*  7 |   SORT JOIN                    |           |    19M|    14G|    35G|  1645K  (1)| 00:04:18 |
|   8 |    TABLE ACCESS STORAGE FULL   | CUSM_T    |    19M|    14G|       |   306K  (1)| 00:00:48 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("H"."JGM"=:B1)
       filter("H"."JGM"=:B1)
   7 - access("C"."CUST_ACCT_NO"="B"."CUSTOMER_NO")
       filter("C"."CUST_ACCT_NO"="B"."CUSTOMER_NO")
1. It sits after SELECT and before FROM. In this SQL, of course, the scalar subquery is hidden inside the inline view:
(SELECT DISTINCT CUSTOMER_NO, (SELECT SJJGM FROM JGDY H WHERE H.JGM = CB_ACCT.BRANCH_NO) BRANCH_NO FROM CB_ACCT) B
2. Steps Id=1 and Id=2 in the plan have the same indentation, and there is no parent node with a join operation above them:
|*  1 |  INDEX SKIP SCAN               | JGDY_IDX3 |     1 |     8 |       |     1   (0)| 00:00:01 |
|   2 |  MERGE JOIN                    |           |    18M|    14G|       |  1793K  (1)| 00:04:41 |
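Either of these two clues is enough to tell that the SQL contains a scalar subquery. If the SQL is very long, just read the plan.
Whether a scalar subquery causes a performance problem depends mainly on how many rows the main (outer) table returns. We all know that tables used in data warehouse batch jobs cannot be small, but let's check anyway.
I posted the script for this on my blog earlier: http://blog.csdn.net/skybig1988/article/details/71125223 You can also write your own; it is simple.
The result shows that the table has a very large number of rows (> 10,000), so a scalar subquery is not suitable.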
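The linked script is not reproduced here. As a rough stand-in, assuming optimizer statistics are reasonably fresh and the tables live in the current schema, the recorded row counts can be read from the data dictionary. The table names below are taken from the SQL above; the plan shows CB_ACCT2, so adjust the list to the real object names.

```sql
-- Sketch only, not the author's script.
-- NUM_ROWS is the value stored by the last statistics gathering,
-- so it is an estimate rather than a live count.
SELECT table_name,
       num_rows,
       last_analyzed
  FROM user_tables
 WHERE table_name IN ('CUSM_T', 'CB_ACCT', 'JGDY')
 ORDER BY num_rows DESC NULLS LAST;
```

If the statistics are stale, a plain SELECT COUNT(*) on the outer table gives the exact figure at the cost of a full scan.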
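The only fix for a scalar subquery is a rewrite. [A scalar subquery can be equivalently rewritten as an outer join]
The rewrite here is very simple. For more complex scalar subqueries, such as those involving aggregates, non-equality predicates, or hierarchical queries, take great care to verify that the SQL is equivalent before and after the rewrite.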
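To make the equivalence caveat concrete, here is a sketch that reuses the t1/t2 tables from the opening example. The COUNT(*) case and the remarks about duplicate values in t2.id are illustrative assumptions, not part of the original SQL.

```sql
-- Simple lookup. The outer join matches the scalar subquery only if
-- t2.id is unique: with duplicate ids the scalar subquery raises
-- ORA-01427 (single-row subquery returns more than one row), while the
-- join silently returns extra rows.
SELECT t1.num1,
       (SELECT t2.name FROM t2 WHERE t2.id = t1.id) AS name
  FROM t1;

SELECT t1.num1,
       t2.name
  FROM t1
  LEFT JOIN t2 ON t2.id = t1.id;

-- Aggregate lookup. Aggregate first, then outer join.
-- COUNT(*) in a scalar subquery yields 0 when nothing matches, whereas
-- the outer join yields NULL, so NVL is needed to keep the same result.
SELECT t1.num1,
       (SELECT COUNT(*) FROM t2 WHERE t2.id = t1.id) AS cnt
  FROM t1;

SELECT t1.num1,
       NVL(a.cnt, 0) AS cnt
  FROM t1
  LEFT JOIN (SELECT id, COUNT(*) AS cnt
               FROM t2
              GROUP BY id) a
    ON a.id = t1.id;
```

In the case at hand, the DISTINCT in the inline view removes any duplicates a non-unique JGM would introduce, which is part of why this particular rewrite is straightforward.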
SELECT /*+ NO_USE_HASH(C,B)*/
C.CUST_ACCT_NO,
C.PRIM_ACCT,
ACCOUNT_SYSTEM,
CUSTOMER_TYPE,
CUSTOMER_STATUS,
CREATE_DT,
HOME_BRANCH_NO,
COMPANY_SIZE,
NOTICE_IND,
NOTICE_CUST_NO,
STMT_FREQUENCY,
STMT_CYCLE,
STMT_DAY,
ID_NO,
ID_TYPE,
SHORT_NAME,
EMAIL_ADD1,
EMAIL_ADD2,
CREDIT_RANKING,
TITLE_CODE,
NAME1,
ADD1,
POSTCODE,
PHONE_NO_RES,
PHONE_RES_EXT,
PHONE_NO_BUS,
PHONE_BUS_EXT,
FAX_NO,
TELEX_NO,
PCODE_RGSTER,
REGSTR_ADD1,
REGSTR_ADD2,
PHONE_RGSTR_NO,
PHONE_RGSTR_EXT,
BIRTH_DATE_1,
SEX_CODE,
EMPLOYER_NAME,
EMPLOYED_FROM,
EMPLOYER_ADDR,
OCCUP_DESCRIP,
OCCUPATION_CODE,
INCOME,
INCOME_WMY,
COMPANY_NO,
BUSINESS_NO,
LICENCE_NO,
BOSS_NAME,
BOSS_BDAY,
BUS_RGSTR_DATE,
CAPITAL_AMT,
CONTACT_REL_1,
PHONE_NO_1,
ADD2,
ADD3,
ADD4,
MOBILE_NO,
FXSP_TYPE,
INDUSTRY_CODE,
BUS_SECTOR_CODE,
CUST_SUB_TYPE,
DEP_STMT_TYPE,
ID_ISSUE_DATE,
ID_EXP_DATE,
REGISTRY_ADD,
ID_ISSUE_PLAC,
LST_MNT_DATE,
B.BRANCH_NO
FROM CUSM_T C
INNER JOIN (SELECT DISTINCT CUSTOMER_NO,
sjjgm BRANCH_NO
FROM CB_ACCT LEFT JOIN jgdy ON cb_acct.branch_no=jgm
) B ON C.CUST_ACCT_NO = B.CUSTOMER_NO;
Plan hash value: 2285049241
-----------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |           |    36M|    28G|       |  1834K  (1)| 00:04:47 |
|   1 |  MERGE JOIN                     |           |    36M|    28G|       |  1834K  (1)| 00:04:47 |
|   2 |   SORT JOIN                     |           |    36M|   829M|       |   188K  (1)| 00:00:30 |
|   3 |    VIEW                         |           |    36M|   829M|       |   188K  (1)| 00:00:30 |
|   4 |     HASH UNIQUE                 |           |    36M|  1037M|  1384M|   188K  (1)| 00:00:30 |
|*  5 |      HASH JOIN RIGHT OUTER      |           |    36M|  1037M|       | 71501   (1)| 00:00:12 |
|   6 |       INDEX FULL SCAN           | JGDY_IDX3 |  1241 |  9928 |       |     1   (0)| 00:00:01 |
|   7 |       TABLE ACCESS STORAGE FULL | CB_ACCT2  |    36M|   760M|       | 71431   (1)| 00:00:12 |
|*  8 |   SORT JOIN                     |           |    19M|    14G|    35G|  1645K  (1)| 00:04:18 |
|   9 |    TABLE ACCESS STORAGE FULL    | CUSM_T    |    19M|    14G|       |   306K  (1)| 00:00:48 |
-----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("CB_ACCT"."BRANCH_NO"="JGM"(+))
   8 - access("C"."CUST_ACCT_NO"="B"."CUSTOMER_NO")
       filter("C"."CUST_ACCT_NO"="B"."CUSTOMER_NO")
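After the rewrite the scalar subquery is gone, and the SQL returns its result in 7 minutes. But this SQL contains no non-equality join, so a MERGE JOIN makes no sense here. A hash join is clearly the best choice.
I never understood the purpose of the /*+ NO_USE_HASH(C,B)*/ hint on this SQL. The developer eventually replied that the hint was there to make the SQL use nested loops, because NL is faster. I could only laugh at that reason.
Here is a brief guide to choosing among NL, HASH, and SMJ in real work: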
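Nested loops:
Check how many rows the SQL returns. If the number is very large, NL is usually the wrong choice.
Check how many rows the driving table returns. Generally it should not exceed 10,000 and is best kept under 1,000 (this depends on server performance; on a powerful server a threshold above 200,000 may still be workable).
Check whether the join column of the driven table is covered by an index (it must be).
When you see DISTINCT, GROUP BY, or SUM(), nested loops are usually not the choice (people only aggregate when there is a lot of data), although NL is fine when the data volume is small.
Hash join: can only be used for equality joins.
Sort merge join: in practice its only real use is non-equality joins.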
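When comparing join methods on a test system, the Oracle hints USE_NL, USE_HASH, and USE_MERGE can force each method so that the plans can be compared side by side. The sketch below reuses the aliases C and B from the SQL above with a shortened column list. It is meant for experimentation, not as a recommendation to leave hints in production code.

```sql
-- Force a hash join between C and B and display the resulting plan.
EXPLAIN PLAN FOR
SELECT /*+ USE_HASH(C B) */
       C.CUST_ACCT_NO,
       B.BRANCH_NO
  FROM CUSM_T C
 INNER JOIN (SELECT DISTINCT CUSTOMER_NO,
                    SJJGM BRANCH_NO
               FROM CB_ACCT
               LEFT JOIN JGDY ON CB_ACCT.BRANCH_NO = JGM) B
    ON C.CUST_ACCT_NO = B.CUSTOMER_NO;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Swap the hint for USE_NL(C B) or USE_MERGE(C B) to see the other
-- join methods together with their estimated cost.
```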
After removing /*+ NO_USE_HASH(C,B)*/, the SQL returned its result in 30 seconds.
SELECT
C.CUST_ACCT_NO,
C.PRIM_ACCT,
ACCOUNT_SYSTEM,
CUSTOMER_TYPE,
CUSTOMER_STATUS,
CREATE_DT,
HOME_BRANCH_NO,
COMPANY_SIZE,
NOTICE_IND,
NOTICE_CUST_NO,
STMT_FREQUENCY,
STMT_CYCLE,
STMT_DAY,
ID_NO,
ID_TYPE,
SHORT_NAME,
EMAIL_ADD1,
EMAIL_ADD2,
CREDIT_RANKING,
TITLE_CODE,
NAME1,
ADD1,
POSTCODE,
PHONE_NO_RES,
PHONE_RES_EXT,
PHONE_NO_BUS,
PHONE_BUS_EXT,
FAX_NO,
TELEX_NO,
PCODE_RGSTER,
REGSTR_ADD1,
REGSTR_ADD2,
PHONE_RGSTR_NO,
PHONE_RGSTR_EXT,
BIRTH_DATE_1,
SEX_CODE,
EMPLOYER_NAME,
EMPLOYED_FROM,
EMPLOYER_ADDR,
OCCUP_DESCRIP,
OCCUPATION_CODE,
INCOME,
INCOME_WMY,
COMPANY_NO,
BUSINESS_NO,
LICENCE_NO,
BOSS_NAME,
BOSS_BDAY,
BUS_RGSTR_DATE,
CAPITAL_AMT,
CONTACT_REL_1,
PHONE_NO_1,
ADD2,
ADD3,
ADD4,
MOBILE_NO,
FXSP_TYPE,
INDUSTRY_CODE,
BUS_SECTOR_CODE,
CUST_SUB_TYPE,
DEP_STMT_TYPE,
ID_ISSUE_DATE,
ID_EXP_DATE,
REGISTRY_ADD,
ID_ISSUE_PLAC,
LST_MNT_DATE,
B.BRANCH_NO
FROM CUSM_T C
INNER JOIN (SELECT DISTINCT CUSTOMER_NO,
sjjgm BRANCH_NO
FROM CB_ACCT LEFT JOIN jgdy ON cb_acct.branch_no=jgm
) B ON C.CUST_ACCT_NO = B.CUSTOMER_NO
Plan hash value: 967350049
----------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |           |    36M|    28G|       |  1059K  (1)| 00:02:46 |
|*  1 |  HASH JOIN                     |           |    36M|    28G|  1244M|  1059K  (1)| 00:02:46 |
|   2 |   VIEW                         |           |    36M|   829M|       |   188K  (1)| 00:00:30 |
|   3 |    HASH UNIQUE                 |           |    36M|  1037M|  1384M|   188K  (1)| 00:00:30 |
|*  4 |     HASH JOIN RIGHT OUTER      |           |    36M|  1037M|       | 71501   (1)| 00:00:12 |
|   5 |      INDEX FULL SCAN           | JGDY_IDX3 |  1241 |  9928 |       |     1   (0)| 00:00:01 |
|   6 |      TABLE ACCESS STORAGE FULL | CB_ACCT2  |    36M|   760M|       | 71431   (1)| 00:00:12 |
|   7 |   TABLE ACCESS STORAGE FULL    | CUSM_T    |    19M|    14G|       |   306K  (1)| 00:00:48 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("C"."CUST_ACCT_NO"="B"."CUSTOMER_NO")
   4 - access("CB_ACCT"."BRANCH_NO"="JGM"(+))
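In fact this SQL can be tuned further. The INDEX FULL SCAN at Id=5 uses single-block reads; changing it to a full table scan could speed that step up by 100 times or more, and the appliance's own full-scan optimization (TABLE ACCESS STORAGE FULL) would add even more on top.
As the above shows, the scalar subquery is a truly frightening construct. While the outer table returns only a small amount of data it causes no performance problem at all, but the hidden danger has already been planted.
As the outer table grows, the scalar subquery's performance slowly degrades, and once it crosses a threshold the drop is steep and alarming. So in a data warehouse, use outer joins instead of scalar subqueries and avoid planting hidden dangers in your programs.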
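The further tuning mentioned for the INDEX FULL SCAN step could be tried along the following lines. This is only a sketch: the aliases A and J are added for illustration, the column list is shortened, and whether the optimizer honors the hint and how much it actually helps would have to be confirmed from the plan on the real system.

```sql
-- Ask for a full table scan of JGDY instead of the INDEX FULL SCAN
-- on JGDY_IDX3. FULL is a standard Oracle access-path hint; the alias
-- in the hint must match the alias used in the FROM clause.
SELECT C.CUST_ACCT_NO,
       B.BRANCH_NO
  FROM CUSM_T C
 INNER JOIN (SELECT /*+ FULL(J) */
                    DISTINCT A.CUSTOMER_NO,
                    J.SJJGM BRANCH_NO
               FROM CB_ACCT A
               LEFT JOIN JGDY J ON A.BRANCH_NO = J.JGM) B
    ON C.CUST_ACCT_NO = B.CUSTOMER_NO;
```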