A Hidden Danger in Data Warehouse Design: Scalar Subqueries

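First, what is a scalar subquery? A subquery that sits after SELECT and before FROM is called a scalar subquery. For example: select num1, cal, (select name from t2 where t2.id = t1.id) from t1; This example is only meant to make the idea of "scalar" easy to grasp. More precisely, a scalar subquery is a subquery used as an expression that returns a single column and at most one row, that is, one value.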
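
To make the definition concrete, here is a minimal sketch that runs the example query above. It assumes SQLite (through Python's standard `sqlite3` module) as a stand-in for Oracle, and the sample rows in `t1` and `t2` are invented for illustration. Note how an outer row with no match gets NULL rather than being dropped.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, num1 INTEGER, cal INTEGER);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, name TEXT);
    -- Invented sample data: t1.id = 3 has no matching row in t2.
    INSERT INTO t1 VALUES (1, 10, 100), (2, 20, 200), (3, 30, 300);
    INSERT INTO t2 VALUES (1, 'alpha'), (2, 'beta');
""")

# The scalar subquery sits between SELECT and FROM and yields one value
# per outer row; a missing match yields NULL (None in Python).
scalar_rows = conn.execute("""
    SELECT num1, cal,
           (SELECT name FROM t2 WHERE t2.id = t1.id) AS name
      FROM t1
     ORDER BY num1
""").fetchall()

print(scalar_rows)
# [(10, 100, 'alpha'), (20, 200, 'beta'), (30, 300, None)]
```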

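The drawback of a scalar subquery is obvious: the driving table is always the outer table t1, and every row that t1 returns is passed to t2 to fetch a result. If t1 is large (or will grow large over time), this causes serious performance problems. [Scalar subqueries should be banned from data warehouse batch jobs.]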

Today this SQL ran for 5 hours without producing a result... (I have to admire the patience; I usually give up after ten minutes.)

SELECT /*+ NO_USE_HASH(C,B)*/
 C.CUST_ACCT_NO,
 C.PRIM_ACCT,
 ACCOUNT_SYSTEM,
 CUSTOMER_TYPE,
 CUSTOMER_STATUS,
 CREATE_DT,
 HOME_BRANCH_NO,
 COMPANY_SIZE,
 NOTICE_IND,
 NOTICE_CUST_NO,
 STMT_FREQUENCY,
 STMT_CYCLE,
 STMT_DAY,
 ID_NO,
 ID_TYPE,
 SHORT_NAME,
 EMAIL_ADD1,
 EMAIL_ADD2,
 CREDIT_RANKING,
 TITLE_CODE,
 NAME1,
 ADD1,
 POSTCODE,
 PHONE_NO_RES,
 PHONE_RES_EXT,
 PHONE_NO_BUS,
 PHONE_BUS_EXT,
 FAX_NO,
 TELEX_NO,
 PCODE_RGSTER,
 REGSTR_ADD1,
 REGSTR_ADD2,
 PHONE_RGSTR_NO,
 PHONE_RGSTR_EXT,
 BIRTH_DATE_1,
 SEX_CODE,
 EMPLOYER_NAME,
 EMPLOYED_FROM,
 EMPLOYER_ADDR,
 OCCUP_DESCRIP,
 OCCUPATION_CODE,
 INCOME,
 INCOME_WMY,
 COMPANY_NO,
 BUSINESS_NO,
 LICENCE_NO,
 BOSS_NAME,
 BOSS_BDAY,
 BUS_RGSTR_DATE,
 CAPITAL_AMT,
 CONTACT_REL_1,
 PHONE_NO_1,
 ADD2,
 ADD3,
 ADD4,
 MOBILE_NO,
 FXSP_TYPE,
 INDUSTRY_CODE,
 BUS_SECTOR_CODE,
 CUST_SUB_TYPE,
 DEP_STMT_TYPE,
 ID_ISSUE_DATE,
 ID_EXP_DATE,
 REGISTRY_ADD,
 ID_ISSUE_PLAC,
 LST_MNT_DATE,
 B.BRANCH_NO
  FROM CUSM_T C
 INNER JOIN (SELECT
             DISTINCT CUSTOMER_NO,
                      (SELECT SJJGM
                         FROM JGDY H
                        WHERE H.JGM = CB_ACCT.BRANCH_NO) BRANCH_NO
               FROM CB_ACCT) B ON C.CUST_ACCT_NO = B.CUSTOMER_NO;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 2079508004
 
---------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |           |    18M|    14G|       |  1793K  (1)| 00:04:41 |
|*  1 |  INDEX SKIP SCAN              | JGDY_IDX3 |     1 |     8 |       |     1   (0)| 00:00:01 |
|   2 |  MERGE JOIN                   |           |    18M|    14G|       |  1793K  (1)| 00:04:41 |
|   3 |   SORT JOIN                   |           |    18M|   397M|       |   147K  (1)| 00:00:24 |
|   4 |    VIEW                       |           |    18M|   397M|       |   147K  (1)| 00:00:24 |
|   5 |     HASH UNIQUE               |           |    18M|   380M|  1107M|   147K  (1)| 00:00:24 |
|   6 |      TABLE ACCESS STORAGE FULL| CB_ACCT2  |    36M|   760M|       | 71431   (1)| 00:00:12 |
|*  7 |   SORT JOIN                   |           |    19M|    14G|    35G|  1645K  (1)| 00:04:18 |
|   8 |    TABLE ACCESS STORAGE FULL  | CUSM_T    |    19M|    14G|       |   306K  (1)| 00:00:48 |
---------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("H"."JGM"=:B1)
       filter("H"."JGM"=:B1)
   7 - access("C"."CUST_ACCT_NO"="B"."CUSTOMER_NO")
       filter("C"."CUST_ACCT_NO"="B"."CUSTOMER_NO")
The scalar subquery is easy to spot in both the SQL and the plan:

1. It sits after SELECT and before FROM. In this SQL the scalar subquery is tucked away inside an inline view:

(SELECT
             DISTINCT CUSTOMER_NO,
                      (SELECT SJJGM
                         FROM JGDY H
                        WHERE H.JGM = CB_ACCT.BRANCH_NO) BRANCH_NO
               FROM CB_ACCT) B

2. In the plan, the steps with Id=1 and Id=2 have the same indentation, yet they have no parent node that is a join operation:

|*  1 |  INDEX SKIP SCAN              | JGDY_IDX3 |     1 |     8 |       |     1   (0)| 00:00:01 |
|   2 |  MERGE JOIN                   |           |    18M|    14G|       |  1793K  (1)| 00:04:41 |

Either clue is enough to tell that the SQL contains a scalar subquery. If the SQL is very long, just look at the plan.
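
The plan-based check can be sketched as a small script. This is only a heuristic under stated assumptions: it expects the DBMS_XPLAN table layout shown above, reads the depth of each step from the leading spaces in the Operation column, and flags every depth-1 step except the last one as a suspected scalar subquery. The function names `parse_plan` and `suspected_scalar_subqueries` are hypothetical, and the sample plan is the one from this post trimmed to its first three columns.

```python
import re

# Matches a plan row such as "|*  1 |  INDEX SKIP SCAN   | JGDY_IDX3 |".
# Group 1: step id, group 2: leading spaces (depth + 1), group 3: operation.
PLAN_ROW = re.compile(r"^\|\*?\s*(\d+)\s*\|( +)(\S.*?)\s*\|")

def parse_plan(lines):
    """Return (id, depth, operation) for every step row in the plan text."""
    ops = []
    for line in lines:
        m = PLAN_ROW.match(line)
        if m:
            ops.append((int(m.group(1)), len(m.group(2)) - 1, m.group(3)))
    return ops

def suspected_scalar_subqueries(ops):
    """Heuristic: several siblings directly under the statement root mean
    the earlier ones are probably scalar subqueries, the last the main query."""
    top_level = [op for op in ops if op[1] == 1]
    return top_level[:-1] if len(top_level) > 1 else []

# The first plan from this post, trimmed to Id / Operation / Name.
PLAN = """\
| Id  | Operation                     | Name      |
|   0 | SELECT STATEMENT              |           |
|*  1 |  INDEX SKIP SCAN              | JGDY_IDX3 |
|   2 |  MERGE JOIN                   |           |
|   3 |   SORT JOIN                   |           |
|   4 |    VIEW                       |           |
|   5 |     HASH UNIQUE               |           |
|   6 |      TABLE ACCESS STORAGE FULL| CB_ACCT2  |
|*  7 |   SORT JOIN                   |           |
|   8 |    TABLE ACCESS STORAGE FULL  | CUSM_T    |
"""

ops = parse_plan(PLAN.splitlines())
suspects = suspected_scalar_subqueries(ops)
print(suspects)  # [(1, 1, 'INDEX SKIP SCAN')]
```

Treat a hit as a prompt to read the SQL, not as proof; other plan shapes may need a different rule.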

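Whether a scalar subquery causes a performance problem depends mainly on the number of rows returned by the main (outer) table. We all know that tables in data warehouse batch jobs are never small, but let's check anyway.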


I posted the script for this in an earlier blog post: http://blog.csdn.net/skybig1988/article/details/71125223. You can also write your own; it is very simple.

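The table turns out to have a very large number of rows, so a scalar subquery is a poor fit (it stops being suitable beyond roughly 10,000 rows).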

The only fix for a scalar subquery is a rewrite [a scalar subquery can be rewritten equivalently as an outer join].

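The rewrite in this case is very simple. For more complex scalar subqueries, such as those involving aggregates, non-equality predicates, or hierarchical queries, take great care to verify that the query is equivalent before and after the rewrite.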
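
A quick way to check a rewrite is to run both forms on a small data set and compare the results. The sketch below does this for a cut-down version of the query in this post, again assuming SQLite as a stand-in for Oracle; the sample rows and the helper `run_both` are invented for illustration. It also shows one case where the two forms diverge: if JGM is not unique in JGDY, the outer join multiplies rows. SQLite's scalar subquery silently keeps the first match, whereas Oracle raises ORA-01427 (single-row subquery returns more than one row).

```python
import sqlite3

# Scalar-subquery form and its outer-join rewrite (DISTINCT omitted so
# that row-count differences stay visible).
SCALAR_SQL = """
    SELECT a.customer_no,
           (SELECT h.sjjgm FROM jgdy h WHERE h.jgm = a.branch_no) AS branch_no
      FROM cb_acct a
"""
JOIN_SQL = """
    SELECT a.customer_no, h.sjjgm AS branch_no
      FROM cb_acct a
      LEFT JOIN jgdy h ON h.jgm = a.branch_no
"""

def run_both(jgdy_rows):
    """Load invented sample data and return (scalar_result, join_result)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cb_acct (customer_no INTEGER, branch_no TEXT)")
    conn.execute("CREATE TABLE jgdy (jgm TEXT, sjjgm TEXT)")
    conn.executemany("INSERT INTO cb_acct VALUES (?, ?)",
                     [(1, "B01"), (2, "B02"), (3, "B99")])  # B99 has no match
    conn.executemany("INSERT INTO jgdy VALUES (?, ?)", jgdy_rows)
    scalar = sorted(conn.execute(SCALAR_SQL).fetchall(), key=repr)
    joined = sorted(conn.execute(JOIN_SQL).fetchall(), key=repr)
    conn.close()
    return scalar, joined

# Case 1: JGM is unique, so the two forms agree (unmatched rows give NULL).
scalar_ok, joined_ok = run_both([("B01", "P1"), ("B02", "P2")])
print(scalar_ok == joined_ok)

# Case 2: JGM "B01" appears twice, so the outer join returns an extra row.
scalar_dup, joined_dup = run_both([("B01", "P1"), ("B01", "P9"), ("B02", "P2")])
print(len(scalar_dup), len(joined_dup))
```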

SELECT /*+ NO_USE_HASH(C,B)*/
 C.CUST_ACCT_NO,
 C.PRIM_ACCT,
 ACCOUNT_SYSTEM,
 CUSTOMER_TYPE,
 CUSTOMER_STATUS,
 CREATE_DT,
 HOME_BRANCH_NO,
 COMPANY_SIZE,
 NOTICE_IND,
 NOTICE_CUST_NO,
 STMT_FREQUENCY,
 STMT_CYCLE,
 STMT_DAY,
 ID_NO,
 ID_TYPE,
 SHORT_NAME,
 EMAIL_ADD1,
 EMAIL_ADD2,
 CREDIT_RANKING,
 TITLE_CODE,
 NAME1,
 ADD1,
 POSTCODE,
 PHONE_NO_RES,
 PHONE_RES_EXT,
 PHONE_NO_BUS,
 PHONE_BUS_EXT,
 FAX_NO,
 TELEX_NO,
 PCODE_RGSTER,
 REGSTR_ADD1,
 REGSTR_ADD2,
 PHONE_RGSTR_NO,
 PHONE_RGSTR_EXT,
 BIRTH_DATE_1,
 SEX_CODE,
 EMPLOYER_NAME,
 EMPLOYED_FROM,
 EMPLOYER_ADDR,
 OCCUP_DESCRIP,
 OCCUPATION_CODE,
 INCOME,
 INCOME_WMY,
 COMPANY_NO,
 BUSINESS_NO,
 LICENCE_NO,
 BOSS_NAME,
 BOSS_BDAY,
 BUS_RGSTR_DATE,
 CAPITAL_AMT,
 CONTACT_REL_1,
 PHONE_NO_1,
 ADD2,
 ADD3,
 ADD4,
 MOBILE_NO,
 FXSP_TYPE,
 INDUSTRY_CODE,
 BUS_SECTOR_CODE,
 CUST_SUB_TYPE,
 DEP_STMT_TYPE,
 ID_ISSUE_DATE,
 ID_EXP_DATE,
 REGISTRY_ADD,
 ID_ISSUE_PLAC,
 LST_MNT_DATE,
 B.BRANCH_NO
  FROM CUSM_T C
 INNER JOIN (SELECT DISTINCT CUSTOMER_NO,
                      sjjgm BRANCH_NO
               FROM CB_ACCT  LEFT JOIN jgdy  ON  cb_acct.branch_no=jgm
               ) B ON C.CUST_ACCT_NO = B.CUSTOMER_NO;
Plan hash value: 2285049241
 
----------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |           |    36M|    28G|       |  1834K  (1)| 00:04:47 |
|   1 |  MERGE JOIN                    |           |    36M|    28G|       |  1834K  (1)| 00:04:47 |
|   2 |   SORT JOIN                    |           |    36M|   829M|       |   188K  (1)| 00:00:30 |
|   3 |    VIEW                        |           |    36M|   829M|       |   188K  (1)| 00:00:30 |
|   4 |     HASH UNIQUE                |           |    36M|  1037M|  1384M|   188K  (1)| 00:00:30 |
|*  5 |      HASH JOIN RIGHT OUTER     |           |    36M|  1037M|       | 71501   (1)| 00:00:12 |
|   6 |       INDEX FULL SCAN          | JGDY_IDX3 |  1241 |  9928 |       |     1   (0)| 00:00:01 |
|   7 |       TABLE ACCESS STORAGE FULL| CB_ACCT2  |    36M|   760M|       | 71431   (1)| 00:00:12 |
|*  8 |   SORT JOIN                    |           |    19M|    14G|    35G|  1645K  (1)| 00:04:18 |
|   9 |    TABLE ACCESS STORAGE FULL   | CUSM_T    |    19M|    14G|       |   306K  (1)| 00:00:48 |
----------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("CB_ACCT"."BRANCH_NO"="JGM"(+))
   8 - access("C"."CUST_ACCT_NO"="B"."CUSTOMER_NO")
       filter("C"."CUST_ACCT_NO"="B"."CUSTOMER_NO")
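After the rewrite the scalar subquery is gone, and the SQL returned results in 7 minutes. However, this SQL contains no non-equality join, so a MERGE JOIN is clearly pointless here. A HASH JOIN is obviously the best choice.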

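I never understood the purpose of the /*+ NO_USE_HASH(C,B)*/ hint on this SQL. The developer eventually explained that the hint was there to make the SQL use nested loops, "because NL is faster". I could only laugh at that reasoning! (The hint merely rules out a hash join; as the plans above show, the optimizer then chose a sort-merge join, not nested loops.)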

Here is a brief summary of how to choose among NL, HASH, and SMJ in practice:

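Nested loops:
     Check how many rows the SQL returns: if the number is very large, NL is usually the wrong choice.
     Check how many rows the driving table returns: generally no more than 10,000, ideally under 1,000 (this depends on server performance; on a powerful server the threshold may exceed 200,000).
     Check whether the join column of the driven table is covered by an index (it must be).
     If you see DISTINCT, GROUP BY, or SUM(), NL is usually not appropriate (people normally aggregate only when there is a great deal of data), although NL is fine when the data volume is small.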

Hash joins can only be used for equality joins.

Sort-merge joins are really only useful for one thing: non-equality joins.
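
The rules of thumb above follow from how the algorithms work, which a toy model can illustrate. The sketch below is not how Oracle implements its joins; it is a simplified in-memory model with invented data. `hash_join` builds a lookup table on one input and probes it by key, which is why it needs an equality condition, while `nested_loop_join` (here without an index on the inner input) tests every pair and so accepts any predicate at a cost of outer rows times inner rows.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equality join: hash one input once, then probe it once per row."""
    buckets = defaultdict(list)
    for b in build_rows:
        buckets[build_key(b)].append(b)
    return [(b, p) for p in probe_rows for b in buckets.get(probe_key(p), [])]

def nested_loop_join(outer_rows, inner_rows, predicate):
    """General join: test every (outer, inner) pair; count the comparisons."""
    comparisons = 0
    matches = []
    for o in outer_rows:
        for i in inner_rows:
            comparisons += 1
            if predicate(o, i):
                matches.append((o, i))
    return matches, comparisons

# Invented sample data: (jgm, sjjgm) branches and (customer_no, branch_no) accounts.
branches = [("B01", "P1"), ("B02", "P2")]
accounts = [(1, "B01"), (2, "B02"), (3, "B01"), (4, "B99")]

# Equality join both ways: same matches, very different amounts of work.
hashed = hash_join(branches, accounts, lambda b: b[0], lambda a: a[1])
looped, comparisons = nested_loop_join(accounts, branches,
                                       lambda a, b: a[1] == b[0])
print(len(hashed), len(looped), comparisons)

# A non-equality predicate cannot be answered by a hash lookup, but the
# pairwise loop handles it unchanged.
greater, _ = nested_loop_join(accounts, branches, lambda a, b: a[1] > b[0])
```

With 4 outer rows and 2 inner rows the loop makes 8 comparisons; at 36M by 19M rows that product is what makes an unindexed nested loop unworkable.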

After removing /*+ NO_USE_HASH(C,B)*/, the SQL returned results in 30 seconds.

SELECT
 C.CUST_ACCT_NO,
 C.PRIM_ACCT,
 ACCOUNT_SYSTEM,
 CUSTOMER_TYPE,
 CUSTOMER_STATUS,
 CREATE_DT,
 HOME_BRANCH_NO,
 COMPANY_SIZE,
 NOTICE_IND,
 NOTICE_CUST_NO,
 STMT_FREQUENCY,
 STMT_CYCLE,
 STMT_DAY,
 ID_NO,
 ID_TYPE,
 SHORT_NAME,
 EMAIL_ADD1,
 EMAIL_ADD2,
 CREDIT_RANKING,
 TITLE_CODE,
 NAME1,
 ADD1,
 POSTCODE,
 PHONE_NO_RES,
 PHONE_RES_EXT,
 PHONE_NO_BUS,
 PHONE_BUS_EXT,
 FAX_NO,
 TELEX_NO,
 PCODE_RGSTER,
 REGSTR_ADD1,
 REGSTR_ADD2,
 PHONE_RGSTR_NO,
 PHONE_RGSTR_EXT,
 BIRTH_DATE_1,
 SEX_CODE,
 EMPLOYER_NAME,
 EMPLOYED_FROM,
 EMPLOYER_ADDR,
 OCCUP_DESCRIP,
 OCCUPATION_CODE,
 INCOME,
 INCOME_WMY,
 COMPANY_NO,
 BUSINESS_NO,
 LICENCE_NO,
 BOSS_NAME,
 BOSS_BDAY,
 BUS_RGSTR_DATE,
 CAPITAL_AMT,
 CONTACT_REL_1,
 PHONE_NO_1,
 ADD2,
 ADD3,
 ADD4,
 MOBILE_NO,
 FXSP_TYPE,
 INDUSTRY_CODE,
 BUS_SECTOR_CODE,
 CUST_SUB_TYPE,
 DEP_STMT_TYPE,
 ID_ISSUE_DATE,
 ID_EXP_DATE,
 REGISTRY_ADD,
 ID_ISSUE_PLAC,
 LST_MNT_DATE,
 B.BRANCH_NO
  FROM CUSM_T C
 INNER JOIN (SELECT DISTINCT CUSTOMER_NO,
                      sjjgm BRANCH_NO
               FROM CB_ACCT  LEFT JOIN jgdy  ON  cb_acct.branch_no=jgm
               ) B ON C.CUST_ACCT_NO = B.CUSTOMER_NO

Plan hash value: 967350049
 
---------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |           |    36M|    28G|       |  1059K  (1)| 00:02:46 |
|*  1 |  HASH JOIN                    |           |    36M|    28G|  1244M|  1059K  (1)| 00:02:46 |
|   2 |   VIEW                        |           |    36M|   829M|       |   188K  (1)| 00:00:30 |
|   3 |    HASH UNIQUE                |           |    36M|  1037M|  1384M|   188K  (1)| 00:00:30 |
|*  4 |     HASH JOIN RIGHT OUTER     |           |    36M|  1037M|       | 71501   (1)| 00:00:12 |
|   5 |      INDEX FULL SCAN          | JGDY_IDX3 |  1241 |  9928 |       |     1   (0)| 00:00:01 |
|   6 |      TABLE ACCESS STORAGE FULL| CB_ACCT2  |    36M|   760M|       | 71431   (1)| 00:00:12 |
|   7 |   TABLE ACCESS STORAGE FULL   | CUSM_T    |    19M|    14G|       |   306K  (1)| 00:00:48 |
---------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("C"."CUST_ACCT_NO"="B"."CUSTOMER_NO")
   4 - access("CB_ACCT"."BRANCH_NO"="JGM"(+))
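This SQL could actually be optimized further. The INDEX FULL SCAN at Id=5 uses single-block reads; changing it to a full table scan could make that step 100 or more times faster. Add the appliance's own full-table-scan optimization (TABLE ACCESS STORAGE FULL) and the gain would be even bigger!!!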

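As the above shows, the scalar subquery is a truly frightening construct. While the outer table returns only a small amount of data, it causes no performance problem at all, yet the hidden danger has already been planted.

As the outer table grows, the scalar subquery's performance gradually degrades, and once a critical threshold is crossed the drop is dramatic and alarming. So in a data warehouse you should use outer joins instead of scalar subqueries, to avoid planting a hidden danger in your programs.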