
When Oracle Data Pump exports with the PARALLEL parameter, is a single table truly exported in parallel?

Oracle Data Pump (expdp/impdp) is a server-side logical export/import tool, commonly used by DBAs for data migration: from database A to database B, from schema A to schema B, and so on.

So here is a question:

On Oracle 11.2.0.4, when exporting a single table with the PARALLEL parameter, does the export genuinely run in parallel? With PARALLEL=2, are there really two processes, each responsible for half of the data?

1. Test: export two different tables with PARALLEL=2

$ expdp scott/tiger directory=dump dumpfile=D%U.dmp parallel=2 cluster=n tables=dept,emp
Total estimation using BLOCKS method: 128 KB
. . exported "SCOTT"."DEPT"                              5.929 KB       4 rows
. . exported "SCOTT"."EMP"                               8.562 KB      14 rows
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
  /home/oracle/D01.dmp
  /home/oracle/D02.dmp
One of the two files is large and the other small; they are by no means equal halves. And since dump files are binary, there is no way to see what each one actually contains, so inspecting the dump files directly is a dead end.

2. Read the MOS note

Parallel Capabilities of Oracle Data Pump (Doc ID 365459.1)    
  
For every export operation, Data Pump estimates how much disk space each table data object in the export job will consume (in bytes). This is done whether or not the user uses the ESTIMATE parameter. The estimate is printed in the log file and displayed on the client's standard output device. The estimate is for table row data only; it does not include metadata. This estimate is used to determine how many PX processes will be applied to the table data object, if any.

The columns of the tables are examined to determine if direct path, external tables, or both methods can be used. For direct path, the parallel number for the table data object is always one, since direct path does not support parallel unload of a table data object. PX processes are only used with external tables.

If the external tables method is chosen, Data Pump will determine the maximum number of PX processes that can work on a table data object. It does this by dividing the estimated size of the table data object by 250 MB and rounding the result down. If the result is zero or one, then PX processes are not used to unload the table.

If a job is not big enough to make use of the maximum parallel number, then the user will not see the maximum number of active workers and Parallel Execution Processes. For example, if there is one 800 MB table, and it has been determined that external tables will be used, there will be one worker for the metadata, one worker for the data, and three PX processes. As mentioned above, the worker process for the data acts as the coordinator for the PX processes and does not count toward the parallel total. So, if a user specifies PARALLEL=10, the degree of parallelism is actually four. The user will only see one active worker in the STATUS display. Data Pump is working optimally; the job is too small for the specified degree of parallelism.

In short: the estimate covers row data only; direct path is always serial and only the external-tables method can use PX processes; and the PX count is capped at floor(estimated size / 250 MB), where a result of zero or one means no PX processes are used at all.

On first read this passage is a bit confusing, so let's verify it with tests.
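The 250 MB rule quoted above can be sketched as a quick calculation (a minimal illustration of the MOS formula using its 800 MB example; the variable names are mine, not Oracle's):

```shell
# floor(estimate / 250 MB) = max PX processes; a result of 0 or 1 means
# the table is unloaded without any PX processes at all
estimate_mb=800                   # the "Total estimation using BLOCKS method" value
px=$((estimate_mb / 250))         # shell integer division already rounds down
if [ "$px" -le 1 ]; then
  px=0                            # too small for PX processes
fi
echo "max PX processes: $px"      # prints 3 for an 800 MB table
```

So an 800 MB table gets at most 3 PX processes, which is exactly the "three PX processes" figure in the MOS example.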

3. Simulate an 800 MB non-partitioned table: what happens when we export it with PARALLEL=10?

A question first: does the export size estimate come from the table statistics, or from DBA_SEGMENTS?

SQL> select sum(bytes)/1024/1024 from dba_segments where owner='SCOTT' and segment_name='A';
SUM(BYTES)/1024/1024
--------------------
                 824

SQL> select owner,table_name,NUM_ROWS,BLOCKS*8/1024,SAMPLE_SIZE from dba_tables where owner='SCOTT' and table_name='A';

OWNER      TABLE_NAME                       NUM_ROWS BLOCKS*8/1024 SAMPLE_SIZE
---------- ------------------------------ ---------- ------------- -----------
SCOTT      A

(The statistics columns are empty: the table has never been analyzed.)

Export:

$ expdp scott/tiger directory=dump dumpfile=D%U.dmp parallel=10 tables=a
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 824 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
. . exported "SCOTT"."A"                                 708.3 MB 7315680 rows
******************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
  /u01/dump/D01.dmp
  /u01/dump/D02.dmp
  /u01/dump/D03.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:03:58 2020 elapsed 0 00:00:11
$ ls -lrt D*.dmp
-rw------- 1 oracle oinstall 245936128 Aug  5 06:03 D03.dmp
-rw------- 1 oracle oinstall 248098816 Aug  5 06:03 D02.dmp
-rw------- 1 oracle oinstall 248860672 Aug  5 06:03 D01.dmp


$expdp \'/ as sysdba\' attach=SYS_EXPORT_TABLE_01 
Job: SYS_EXPORT_TABLE_01
  Owner: SCOTT                          
  Operation: EXPORT                         
  Creator Privs: TRUE                           
  GUID: AC156DF4AC940F1DE053453CA8C0F1FA
  Start Time: Wednesday, 05 August, 2020 6:03:48
  Mode: TABLE                          
  Instance: tt
  Max Parallelism: 10
  EXPORT Job Parameters:
  Parameter Name      Parameter Value:
     CLIENT_COMMAND        scott/******** directory=dump dumpfile=D%U.dmp parallel=10 tables=a 
  State: EXECUTING                      
  Bytes Processed: 0
  Current Parallelism: 10
  Job Error Count: 0
  Dump File: /u01/dump/D01.dmp
    bytes written: 4,096
  Dump File: /u01/dump/D%u.dmp
  
Worker 1 Status:
  Process Name: DW00
  State: EXECUTING                      
  Object Schema: SCOTT
  Object Name: A
  Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
  Completed Objects: 1
  Total Objects: 1
  Worker Parallelism: 1
  
Worker 2 Status:
  Process Name: DW01
  State: EXECUTING                      
  Object Schema: SCOTT
  Object Name: A
  Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
  Completed Objects: 1
  Total Objects: 1
  Worker Parallelism: 3

Answering my own question: Data Pump calls this value an estimate, but it is in fact taken straight from DBA_SEGMENTS:

Total estimation using BLOCKS method: 824 MB

Combining this with the last paragraph of the MOS note: although PARALLEL=10 was requested, effectively only two worker processes run. One worker exports the metadata and, once that is done, helps write some data into the dump; the other worker drives the data unload and, per the algorithm, coordinates 3 PX helper processes (824/250 = 3, rounded down) to speed up the export.
This confirms that for a non-partitioned table, a parallel export does not spawn extra independent worker processes.
Which raises the next question: for a single-table export, does the PARALLEL parameter actually make things faster, or slower?
With PARALLEL=10, the job ran in 11 seconds.
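Following the MOS arithmetic, the effective degree of parallelism for this 824 MB table works out to four, not the requested ten (again just a sketch; the variable names are mine):

```shell
estimate_mb=824
px=$((estimate_mb / 250))     # floor(824/250) = 3 PX processes
[ "$px" -le 1 ] && px=0
# one metadata worker plus the PX processes; the data worker only
# coordinates the PX slaves and does not count toward the total
effective_dop=$((1 + px))
echo "effective DOP: $effective_dop"   # prints 4
```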

$ expdp scott/tiger directory=dump dumpfile=D1%U.dmp tables=a

******************************************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
/u01/dump/D101.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:12:40 2020 elapsed 0 00:00:15

15 seconds versus 11 seconds: not much of a gap. So for a single table, the PARALLEL parameter does not meaningfully speed up the export.

4. Simulate a partitioned table with 800 MB of data: how fast is a parallel export?

create table RANGE_PART_TAB(id number,
    deal_date date, area_code number, contents varchar2(4000))
 partition by range(deal_date)
 (
 partition p1 values less  than(to_date('2020-02-01','yyyy-mm-dd')),
 partition p2 values less  than(to_date('2020-03-01','yyyy-mm-dd')),
 partition p3 values less  than(to_date('2020-04-01','yyyy-mm-dd')),
 partition p4 values less  than(to_date('2020-05-01','yyyy-mm-dd')),
 partition p5 values less  than(to_date('2020-06-01','yyyy-mm-dd')),
 partition p6 values less  than(to_date('2020-07-01','yyyy-mm-dd')),
 partition p7 values less  than(to_date('2020-08-01','yyyy-mm-dd')),
 partition p8 values less  than(to_date('2020-09-01','yyyy-mm-dd')),
 partition p9 values less  than(to_date('2020-10-01','yyyy-mm-dd')),
 partition p10 values less than(to_date('2020-11-01','yyyy-mm-dd')));
 
 insert into range_part_tab (id,deal_date,area_code,contents)
 select rownum,
          to_date(to_char(to_date('20200101','yyyymmdd'),'J')+ trunc(dbms_random.value(0,300)),'J'),
ceil(dbms_random.value(590,599)),
 rpad('*',400,'*')
 from dual
 connect by rownum <= 100000;
Run the self-insert repeatedly to double the data each time:

SQL> r
  1* insert into range_part_tab select * from range_part_tab
800000 rows created.
SQL> commit;
 SQL> select sum(bytes)/1024/1024 from dba_segments where owner='SCOTT' and segment_name='RANGE_PART_TAB';
SUM(BYTES)/1024/1024
--------------------
                 792
SQL> select PARTITION_NAME,sum(bytes)/1024/1024 from dba_segments where owner='SCOTT' and segment_name='RANGE_PART_TAB' group by PARTITION_NAME order by 1;

PARTITION_NAME                 SUM(BYTES)/1024/1024
------------------------------ --------------------
P1                                               80
P10                                              72
P2                                               80
P3                                               80
P4                                               80
P5                                               80
P6                                               80
P7                                               80
P8                                               80
P9                                               80

10 rows selected.

[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D2%U.dmp tables=RANGE_PART_TAB parallel=10

Worker 1 Status:
  Process Name: DW00
  State: EXECUTING
  Object Schema: SCOTT
  Object Name: RANGE_PART_TAB
  Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
  Completed Objects: 10
  Total Objects: 10
  Completed Rows: 138,592
  Completed Bytes: 58,754,176
  Percent Done: 77
  Worker Parallelism: 1

Worker 2 Status:
  Process Name: DW01
  State: WORK WAITING

******************************************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
  /u01/dump/D201.dmp
  /u01/dump/D202.dmp
  /u01/dump/D203.dmp
  /u01/dump/D204.dmp
  /u01/dump/D205.dmp
  /u01/dump/D206.dmp
  /u01/dump/D207.dmp
  /u01/dump/D208.dmp
  /u01/dump/D209.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:26:04 2020 elapsed 0 00:00:22

[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D3%U.dmp tables=RANGE_PART_TAB
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:28:14 2020 elapsed 0 00:00:20

The test was an anticlimax: instead of the imagined ten worker processes, each handling one partition, a single worker serially exported every partition. Naive of me.

Could this be the 250 MB per-segment threshold at work again?

Let's grow the test partitioned table several times over.

Sure enough, it is the 250 MB threshold: once each partition segment exceeds 250 MB, parallelism genuinely kicks in, and ten worker processes export at the same time.


SQL> select PARTITION_NAME,sum(bytes)/1024/1024 from dba_segments where owner='SCOTT' and segment_name='RANGE_PART_TAB' group by PARTITION_NAME;
PARTITION_NAME SUM(BYTES)/1024/1024
------------------------------ --------------------
P7 309
P4 304
P1 312
P2 288
P3 304
P6 296
P10 264
P9 300
P8 312
P5 312

10 rows selected.
[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D4%U.dmp tables=RANGE_PART_TAB parallel=10
Worker 1 Status:
Process Name: DW00
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 2 Status:
Process Name: DW01
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 3 Status:
Process Name: DW02
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 4 Status:
Process Name: DW03
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 5 Status:
Process Name: DW04
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 6 Status:
Process Name: DW05
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 7 Status:
Process Name: DW06
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 8 Status:
Process Name: DW07
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 9 Status:
Process Name: DW08
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 10 Status:
Process Name: DW09
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
******************************************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
/u01/dump/D401.dmp
/u01/dump/D402.dmp
/u01/dump/D403.dmp
/u01/dump/D404.dmp
/u01/dump/D405.dmp
/u01/dump/D406.dmp
/u01/dump/D407.dmp
/u01/dump/D408.dmp
/u01/dump/D409.dmp
/u01/dump/D410.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:37:20 2020 elapsed 0 00:00:40
[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D5%U.dmp tables=RANGE_PART_TAB

Compare the elapsed time against a serial export:

Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:39:14 2020 elapsed 0 00:01:21

PARALLEL=10 versus serial: 40 s versus 81 s, roughly 1:2. That is a substantial saving, and the larger the data, the more pronounced the effect.

Summary:

1. For a single table smaller than 250 MB, PARALLEL is meaningless; the effective degree is always 1.

2. For a single table with PARALLEL=2, the first worker exports the metadata and the second exports the data; once the metadata is done, the first worker helps write data into the dump, but there is almost no real parallel effect.

3. For a large non-partitioned table, a high PARALLEL value only adds a few PX helper processes; efficiency was not rigorously measured here, but the rough comparison showed little benefit.

4. For a partitioned table, PARALLEL only pays off when individual partitions exceed 250 MB; if all partitions are small, it is effectively no different from a serial export. The more data there is, the more of the requested parallelism is actually used.
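Putting the findings together, here is a rough predictor of how many partitions of the enlarged RANGE_PART_TAB would each get a dedicated worker, based on the observed 250 MB threshold (this is my own heuristic distilled from the tests above, not a documented Oracle rule):

```shell
threshold_mb=250
requested_parallel=10
eligible=0
# per-partition segment sizes (MB) taken from the enlarged RANGE_PART_TAB
for size_mb in 309 304 312 288 304 296 264 300 312 312; do
  if [ "$size_mb" -gt "$threshold_mb" ]; then
    eligible=$((eligible + 1))    # partition big enough to keep a worker busy
  fi
done
# cannot use more workers than there are eligible partitions
workers=$(( eligible < requested_parallel ? eligible : requested_parallel ))
echo "partitions likely to get a dedicated worker: $workers"   # prints 10
```

All ten partitions exceed 250 MB, so all ten requested workers stay busy, matching the status output in the test above.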