Park's Oracle Tuning: Performance overhead in parallel direct path insert with non-partitioned index

Recently I performed a simple test that in case of partitioned or non-partitioned index exists how much the performance difference will be there when a parallel direct path insert job was performed.
To be precise for the test, of course, there must be a exact calculation on the numbers of statistics of v$sysstat, but I choose Deacon Session monitor to just briefly monitor the operation. Because that will be enough to distinguish the difference.

-------------------------
--Create the Test Tables
-------------------------
--source table
drop table emp_src;
create table emp_src as
select * from emp
;
insert into emp_src
select * from emp_src
;
--execute several times

commit;

--confirm the segment size
SQL > @lseg emp_src

SEGMENT_NAME SEGMENT_TYPE BLOCKS [MB/8192]
------------------------------ -------------------- ------------ ----------
EMP_SRC TABLE 97,792 764

--target table
drop table emp_tg1;
drop table emp_tg2;
drop table emp_tg3;

create table emp_tg1 partition by hash(empno) partitions 8 nologging as
select * from emp_src where rownum <= 1000000
;
create table emp_tg2 partition by hash(empno) partitions 8 nologging as
select * from emp_src where rownum <= 1000000
;
create table emp_tg3 partition by hash(empno) partitions 8 nologging as
select * from emp_src where rownum <= 1000000
;

-------------------------------------------
--Create the Test Indexes and execute load
-------------------------------------------
--case 1. partitioned local index
drop index emp_tg1_n1;
create index emp_tg1_n1 on emp_tg1(empno)
local
nologging;

alter session enable parallel dml;

insert /*+parallel(8) */ into emp_tg1
select * from emp_src;
commit;

--case 2. non-partitioned index
drop index emp_tg2_n1;
create index emp_tg2_n1 on emp_tg2(empno)
nologging;

alter session enable parallel dml;

insert /*+parallel(8) */ into emp_tg2
select * from emp_src;
commit;

--case 3. partitioned global index
drop index emp_tg3_n1;
create index emp_tg3_n1 on emp_tg3(empno)
global partition by hash(empno) partitions 8
nologging;

alter session enable parallel dml;

insert /*+parallel(8) */ into emp_tg3
select * from emp_src;
commit;

--All 3 cases are same in execution plan except the segment names

Execution Plan
-----------------------------------------------------------
0 INSERT STATEMENT Optimizer=FIRST_ROWS (Cost=3 Card=572 Bytes=49K)
1 0 PX COORDINATOR
2 1 PX SEND (QC (RANDOM)) OF 'SYS.:TQ10002' (Cost=3 Card=572 Bytes=49K) (PARALLEL_TO_SERIAL) (QC (RANDOM))
3 2 INDEX MAINTENANCE OF 'EMP_TG1' (PARALLEL_COMBINED_WITH_PARENT)
4 3 PX RECEIVE (Cost=3 Card=572 Bytes=49K) (PARALLEL_COMBINED_WITH_PARENT)
5 4 PX SEND (RANGE) OF 'SYS.:TQ10001' (Cost=3 Card=572 Bytes=49K) (PARALLEL_TO_PARALLEL) (RANGE)
6 5 LOAD AS SELECT OF 'EMP_TG1' (PARALLEL_COMBINED_WITH_PARENT)
7 6 PX RECEIVE (Cost=3 Card=572 Bytes=49K) (PARALLEL_COMBINED_WITH_PARENT)
8 7 PX SEND (ROUND-ROBIN) OF 'SYS.:TQ10000' (Cost=3 Card=572 Bytes=49K) (PARALLEL_FROM_SERIAL) (ROUND-ROBIN)
9 8 TABLE ACCESS (FULL) OF 'EMP_SRC' (TABLE) (Cost=3 Card=572 Bytes=49K)
-----------------------------------------------------------

----------------------
Results of Monitoring
----------------------
Note. Below are the results of Deacon session monitor that was bit edited for better view.
And individual lines are printed in every 5 seconds.

--case 1. partitioned local index
--------------------------------------------------------------------------
PXs | Exec Logrds Phyrds Drrds Phywts Drwts PGA[m] Rdo[m] Lfspl
-------------m-------m-------m--------------------------------------------
0 | 0 0 0 0 0 0 3 0 0
0 | 0 0 0 0 0 0 3 0 0
0 | 0 0 0 0 0 0 3 0 0
16 | 5 1037 0 0 213 213 106 0 0
16 | 9 5669 0 0 3613 3613 88 0 0
16 | 9 5903 0 0 3633 3633 102 0 0
16 | 9 5738 0 0 3713 3713 109 0 0
16 | 0 6 14541 14540 4394 4394 227 0 0
16 | 0 12 3510 3510 3592 3592 76 0 0
0 | 58 201014 6122 6105 0 0 3 166.5 6559
0 | 0 0 0 0 0 0 3 0 0
0 | 0 0 0 0 0 0 3 0 0
0 | 0 0 0 0 0 0 3 0 0

--case 2. non-partitioned index
--------------------------------------------------------------------------
PXs | Exec Logrds Phyrds Drrds Phywts Drwts PGA[m] Rdo[m] Lfspl
-------------m-------m-------m--------------------------------------------
0 | 0 0 0 0 0 0 3 0 0
0 | 0 0 0 0 0 0 3 0 0
16 | 10 4218 1 0 2401 2401 101 0 0
16 | 9 5740 0 0 3663 3663 93 0 0
16 | 7 4644 0 0 2880 2880 104 0 0
16 | 9 5794 0 0 3662 3662 99 0 0
16 | 0 2 2255 2255 1530 1530 57 0 0
16 | 0 4 1200 1200 1189 1189 67 0 0
16 | 0 4 1275 1275 859 859 136 0 0
16 | 0 5 1410 1410 1871 1871 85 0 0
16 | 16 31360 12544 12526 11736 11736 55 28.3 1115
0 | 20 99631 2782 2716 0 0 3 81.9 3235
0 | 0 0 0 0 0 0 3 0 0
0 | 0 0 0 0 0 0 3 0 0

--case 3. partitioned global index
--------------------------------------------------------------------------
PXs | Exec Logrds Phyrds Drrds Phywts Drwts PGA[m] Rdo[m] Lfspl
-------------m-------m-------m--------------------------------------------
0 | 0 0 0 0 0 0 3 0 0
0 | 0 0 0 0 0 0 3 0 0
16 | 7 2369 2 0 1159 1159 92 0 0
16 | 8 5534 0 0 3622 3622 97 0 0
16 | 8 5346 0 0 3351 3351 105 0 0
16 | 9 5883 0 0 3874 3874 96 0 0
16 | 4 2131 17886 17886 17163 17163 59 0 0
16 | 0 16 5266 5266 5323 5323 79 0 0
16 | 71 107435 14597 14411 11131 11131 57 98.5 3845
0 | 3 11793 378 332 0 0 3 10.1 398
0 | 0 0 0 0 0 0 3 0 0

For this is a direct path insert and the target table is Nologging mode, Redo are generated only when index 'leaf node splits' (Lfspl) occur in Index Maintenance phase. And the sort operation must be completed before this write operation. Probably the sort operation started when 'Physical reads direct' (Drrds) begins to be appear in the monitoring.

As expected, case. 2 is the longest. Probably because the number of the index segment is only one, so sort and write operation have to be done with one process, whereas in other two test cases the same jobs could be done with 8 processes.

Sunday, November 23, 2014

Performance overhead in parallel direct path insert with non-partitioned index

No comments:

Post a Comment