Oracle DBA – A lifelong learning experience


Baselines – session creating privs v session running privs

Posted by John Hallas on November 21, 2017

A colleague, Richard Wilkinson, was telling me about an issue he had come across with baselines, and I asked him to write it up as it was an interesting experience.

The following MERGE SQL runs once a day. It always uses the same plan and takes roughly between 30 and 80 minutes:

SQL_ID / PHV: 6zs5dk6t6pkfs / 3064471754

 MERGE INTO DWB_X_PLNGRM_ATTR_DAY t
 USING (
 SELECT ilv.bsns_unit_cd,
 ilv.sku_item_nbr,
 ilv.cntnt_cd ,
 ilv.sel_units_cntnt_status,
 ilv.plngrm_item_grp_cd,
 dwr.bay_cnt_key,
 dwr.bay_cnt
 FROM (
 SELECT DISTINCT bsns_unit_cd,
 sku_item_nbr,
 cntnt_cd ,
 sel_units_cntnt_status,
 plngrm_item_grp_cd,
 --bay_cnt_key,
 max(bay_cnt_key) bay_cnt_key
 --bay_cnt
 FROM dwb_x_sel_item_units_day
 WHERE (1=1)
 AND day_key >= :v_daywkey
 GROUP BY bsns_unit_cd, sku_item_nbr, cntnt_cd, sel_units_cntnt_status, plngrm_item_grp_cd
 ) ilv,
 dwr_x_plngrm_loc_bay_cnt dwr
 WHERE ilv.bay_cnt_key = dwr.bay_cnt_key
 ) s
 ON (
 t.CNTNT_CD = s.CNTNT_CD
 AND t.SKU_ITEM_NBR = s.SKU_ITEM_NBR
 AND t.BSNS_UNIT_CD = s.BSNS_UNIT_CD
 AND t.PLNGRM_CNTNT_STATUS = s.SEL_UNITS_CNTNT_STATUS
 AND t.PLNGRM_ITEM_GRP_CD = s.PLNGRM_ITEM_GRP_CD
 --and t.DAY_KEY >= :v_daywkey
 )
 WHEN MATCHED
 THEN UPDATE SET
 t.bay_cnt_key = s.bay_cnt_key,
 t.bay_cnt = s.bay_cnt;
Plan hash value: 3064471754 
 
---------------------------------------------------------------------------------- 
| Id | Operation | Name | 
---------------------------------------------------------------------------------- 
| 0 | MERGE STATEMENT | | 
| 1 | PX COORDINATOR | | 
| 2 | PX SEND QC (RANDOM) | :TQ10007 | 
| 3 | INDEX MAINTENANCE | DWB_X_PLNGRM_ATTR_DAY | 
| 4 | PX RECEIVE | | 
| 5 | PX SEND RANGE | :TQ10006 | 
| 6 | MERGE | DWB_X_PLNGRM_ATTR_DAY | 
| 7 | PX RECEIVE | | 
| 8 | PX SEND HYBRID (ROWID PKEY) | :TQ10005 | 
| 9 | VIEW | | 
| 10 | HASH JOIN BUFFERED | | 
| 11 | PX RECEIVE | | 
| 12 | PX SEND HASH | :TQ10003 | 
| 13 | HASH JOIN BUFFERED | | 
| 14 | PX RECEIVE | | 
| 15 | PX SEND HASH | :TQ10001 | 
| 16 | PX BLOCK ITERATOR | | 
| 17 | TABLE ACCESS STORAGE FULL | DWR_X_PLNGRM_LOC_BAY_CNT | 
| 18 | PX RECEIVE | | 
| 19 | PX SEND HASH | :TQ10002 | 
| 20 | VIEW | | 
| 21 | SORT GROUP BY | | 
| 22 | PX RECEIVE | | 
| 23 | PX SEND HASH | :TQ10000 | 
| 24 | PX BLOCK ITERATOR | | 
| 25 | TABLE ACCESS STORAGE FULL| DWB_X_SEL_ITEM_UNITS_DAY | 
| 26 | PX RECEIVE | | 
| 27 | PX SEND HASH | :TQ10004 | 
| 28 | PX BLOCK ITERATOR | | 
| 29 | TABLE ACCESS STORAGE FULL | DWB_X_PLNGRM_ATTR_DAY | 
----------------------------------------------------------------------------------

One day this SQL changed execution plan and ran for over 5 hours before being killed.

The new plan is 4185795484:

----------------------------------------------------------------------------- 
| Id | Operation | Name | 
----------------------------------------------------------------------------- 
| 0 | MERGE STATEMENT | | 
| 1 | MERGE | DWB_X_PLNGRM_ATTR_DAY | 
| 2 | PX COORDINATOR | | 
| 3 | PX SEND QC (RANDOM) | :TQ10005 | 
| 4 | VIEW | | 
| 5 | HASH JOIN BUFFERED | | 
| 6 | PX RECEIVE | | 
| 7 | PX SEND HASH | :TQ10003 | 
| 8 | PX BLOCK ITERATOR | | 
| 9 | TABLE ACCESS STORAGE FULL | DWR_X_PLNGRM_LOC_BAY_CNT | 
| 10 | PX RECEIVE | | 
| 11 | PX SEND HASH | :TQ10004 | 
| 12 | HASH JOIN BUFFERED | | 
| 13 | PX RECEIVE | | 
| 14 | PX SEND HASH | :TQ10001 | 
| 15 | VIEW | | 
| 16 | SORT GROUP BY | | 
| 17 | PX RECEIVE | | 
| 18 | PX SEND HASH | :TQ10000 | 
| 19 | PX BLOCK ITERATOR | | 
| 20 | TABLE ACCESS STORAGE FULL| DWB_X_SEL_ITEM_UNITS_DAY | 
| 21 | PX RECEIVE | | 
| 22 | PX SEND HASH | :TQ10002 | 
| 23 | PX BLOCK ITERATOR | | 
| 24 | TABLE ACCESS STORAGE FULL | DWB_X_PLNGRM_ATTR_DAY | 
-----------------------------------------------------------------------------

The main difference I can see is that the tables are being joined in a different order.

3064471754 (good): DWB_X_SEL_ITEM_UNITS_DAY joined to DWR_X_PLNGRM_LOC_BAY_CNT then on to the target of the MERGE, DWB_X_PLNGRM_ATTR_DAY.
4185795484 (bad): DWB_X_SEL_ITEM_UNITS_DAY joined to DWB_X_PLNGRM_ATTR_DAY then to DWR_X_PLNGRM_LOC_BAY_CNT.

The fix is simple: use a SQL plan baseline to fix the good plan that has always been used (3064471754).

It turns out this statement already has a couple of baselined plans; neither is fixed.

sql handle: SQL_0e4f45d7832dd5ab
SQL_PLAN_0wmu5uy1kvpdb8c1d711e 10-NOV-17 10.38.07.000000000 MANUAL-LOAD
SQL_PLAN_0wmu5uy1kvpdb1eaf5716 29-SEP-17 23.16.31.000000000 AUTO-CAPTURE
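For reference, a listing like the one above can be produced from the documented SPM view (a sketch; extra columns such as FIXED are worth including):

```sql
-- List the plans captured for this statement's baseline
SELECT sql_handle, plan_name, origin, enabled, accepted, fixed, created
FROM   dba_sql_plan_baselines
WHERE  sql_handle = 'SQL_0e4f45d7832dd5ab'
ORDER  BY created;
```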

Looking at the detailed view of the baseline shows that one of the plans is the bad plan and the other has no history of use in ASH:

select * 
 from table(dbms_xplan.display_sql_plan_baseline(sql_handle => 'SQL_0e4f45d7832dd5ab',
 format => 'basic')); --detailed view


PLAN_TABLE_OUTPUT 
----------------------------------------------------------------------------- 
SQL handle: SQL_0e4f45d7832dd5ab 
SQL text: MERGE INTO DWB_X_PLNGRM_ATTR_DAY T USING ( SELECT ILV.BSNS_UNIT_CD, 
ILV.SKU_ITEM_NBR, ILV.CNTNT_CD , ILV.SEL_UNITS_CNTNT_STATUS, 
ILV.PLNGRM_ITEM_GRP_CD, DWR.BAY_CNT_KEY, DWR.BAY_CNT FROM ( SELECT 
DISTINCT BSNS_UNIT_CD, SKU_ITEM_NBR, CNTNT_CD , SEL_UNITS_CNTNT_STATUS, 
PLNGRM_ITEM_GRP_CD, MAX(BAY_CNT_KEY) BAY_CNT_KEY FROM 
DWB_X_SEL_ITEM_UNITS_DAY WHERE (1=1) AND DAY_KEY >= :B1 GROUP BY 
BSNS_UNIT_CD, SKU_ITEM_NBR, CNTNT_CD, SEL_UNITS_CNTNT_STATUS, 
PLNGRM_ITEM_GRP_CD ) ILV, DWR_X_PLNGRM_LOC_BAY_CNT DWR WHERE 
ILV.BAY_CNT_KEY = DWR.BAY_CNT_KEY ) S ON ( T.CNTNT_CD = S.CNTNT_CD AND 
T.SKU_ITEM_NBR = S.SKU_ITEM_NBR AND T.BSNS_UNIT_CD = S.BSNS_UNIT_CD AND 
T.PLNGRM_CNTNT_STATUS = S.SEL_UNITS_CNTNT_STATUS AND 
T.PLNGRM_ITEM_GRP_CD = S.PLNGRM_ITEM_GRP_CD ) WHEN MATCHED THEN UPDATE 
SET T.BAY_CNT_KEY = S.BAY_CNT_KEY, T.BAY_CNT = S.BAY_CNT 
-------------------------------------------------------------------------------- 
 
-------------------------------------------------------------------------------- 
Plan name: SQL_PLAN_0wmu5uy1kvpdb1eaf5716 Plan id: 514807574 
Enabled: YES Fixed: NO Accepted: NO Origin: AUTO-CAPTURE 
-------------------------------------------------------------------------------- 
 
Plan hash value: 4185795484 /******BAD PLAN*******/ 
 
----------------------------------------------------------------------------- 
| Id | Operation | Name | 
----------------------------------------------------------------------------- 
| 0 | MERGE STATEMENT | | 
| 1 | MERGE | DWB_X_PLNGRM_ATTR_DAY | 
| 2 | PX COORDINATOR | | 
| 3 | PX SEND QC (RANDOM) | :TQ10005 | 
| 4 | VIEW | | 
| 5 | HASH JOIN BUFFERED | | 
| 6 | PX RECEIVE | | 
| 7 | PX SEND HASH | :TQ10003 | 
| 8 | PX BLOCK ITERATOR | | 
| 9 | TABLE ACCESS STORAGE FULL | DWR_X_PLNGRM_LOC_BAY_CNT | 
| 10 | PX RECEIVE | | 
| 11 | PX SEND HASH | :TQ10004 | 
| 12 | HASH JOIN BUFFERED | | 
| 13 | PX RECEIVE | | 
| 14 | PX SEND HASH | :TQ10001 | 
| 15 | VIEW | | 
| 16 | SORT GROUP BY | | 
| 17 | PX RECEIVE | | 
| 18 | PX SEND HASH | :TQ10000 | 
| 19 | PX BLOCK ITERATOR | | 
| 20 | TABLE ACCESS STORAGE FULL| DWB_X_SEL_ITEM_UNITS_DAY | 
| 21 | PX RECEIVE | | 
| 22 | PX SEND HASH | :TQ10002 | 
| 23 | PX BLOCK ITERATOR | | 
| 24 | TABLE ACCESS STORAGE FULL | DWB_X_PLNGRM_ATTR_DAY | 
----------------------------------------------------------------------------- 
 
-------------------------------------------------------------------------------- 
Plan name: SQL_PLAN_0wmu5uy1kvpdb8c1d711e Plan id: 2350739742 
Enabled: YES Fixed: NO Accepted: YES Origin: MANUAL-LOAD 
-------------------------------------------------------------------------------- 
 
Plan hash value: 1932271691 /**********no record of use in ASH*********/ 
 
----------------------------------------------------------------------------- 
| Id | Operation | Name | 
----------------------------------------------------------------------------- 
| 0 | MERGE STATEMENT | | 
| 1 | MERGE | DWB_X_PLNGRM_ATTR_DAY | 
| 2 | PX COORDINATOR | | 
| 3 | PX SEND QC (RANDOM) | :TQ10005 | 
| 4 | VIEW | | 
| 5 | HASH JOIN BUFFERED | | 
| 6 | PX RECEIVE | | 
| 7 | PX SEND HASH | :TQ10003 | 
| 8 | HASH JOIN BUFFERED | | 
| 9 | PX RECEIVE | | 
| 10 | PX SEND HASH | :TQ10001 | 
| 11 | PX BLOCK ITERATOR | | 
| 12 | TABLE ACCESS STORAGE FULL | DWR_X_PLNGRM_LOC_BAY_CNT | 
| 13 | PX RECEIVE | | 
| 14 | PX SEND HASH | :TQ10002 | 
| 15 | VIEW | | 
| 16 | HASH GROUP BY | | 
| 17 | PX RECEIVE | | 
| 18 | PX SEND HASH | :TQ10000 | 
| 19 | PX BLOCK ITERATOR | | 
| 20 | TABLE ACCESS STORAGE FULL| DWB_X_SEL_ITEM_UNITS_DAY | 
| 21 | PX RECEIVE | | 
| 22 | PX SEND HASH | :TQ10004 | 
| 23 | PX BLOCK ITERATOR | | 
| 24 | TABLE ACCESS STORAGE FULL | DWB_X_PLNGRM_ATTR_DAY | 
-----------------------------------------------------------------------------

Plan 1932271691 joins the tables in the same order as 3064471754 and looks logically the same, so I think it is preferable to the bad plan (which is also in the baseline).
I'm not sure why the optimizer isn't choosing plan 1932271691.

I created an SQL Tuning Set from a snapshot range containing yesterday’s good run and confirmed that the only plan in there was 3064471754.

I loaded the STS into the SQL plan baseline; however, it did not add a new plan. The baseline still shows:
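The STS creation and load went roughly like this (a sketch only: the STS name and snapshot range are made up, and the sql_id filter is the one from earlier in the post):

```sql
DECLARE
  l_cur   DBMS_SQLTUNE.sqlset_cursor;
  l_plans PLS_INTEGER;
BEGIN
  DBMS_SQLTUNE.create_sqlset(sqlset_name => 'STS_MERGE_GOOD');  -- hypothetical name

  -- Pull yesterday's good run from AWR (snapshot IDs are illustrative)
  OPEN l_cur FOR
    SELECT VALUE(p)
    FROM   TABLE(DBMS_SQLTUNE.select_workload_repository(
                   begin_snap   => 1234,
                   end_snap     => 1240,
                   basic_filter => 'sql_id = ''6zs5dk6t6pkfs''')) p;

  DBMS_SQLTUNE.load_sqlset(sqlset_name     => 'STS_MERGE_GOOD',
                           populate_cursor => l_cur);

  -- Load the captured plan(s) into the SPM baseline
  l_plans := DBMS_SPM.load_plans_from_sqlset(sqlset_name => 'STS_MERGE_GOOD');
END;
/
```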

sql handle: SQL_0e4f45d7832dd5ab
SQL_PLAN_0wmu5uy1kvpdb8c1d711e
SQL_PLAN_0wmu5uy1kvpdb1eaf5716

The only difference is that the timestamp of SQL_PLAN_0wmu5uy1kvpdb8c1d711e is now current so it looks like it’s been updated.

When querying the baseline in detail I expected SQL_PLAN_0wmu5uy1kvpdb8c1d711e to now contain 3064471754; however, it doesn't.
It still shows 1932271691. Nothing has changed except the timestamp, so it looks like we can't fix the good plan that has been used every day for months.

I fixed the potentially good plan and flushed the bad plan from all instances, however upon retrying the MERGE the bad plan was still chosen.
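Fixing a plan is done through DBMS_SPM; a sketch of the two steps (the address/hash placeholders are deliberately left unfilled, they come from V$SQLAREA):

```sql
DECLARE
  l_plans PLS_INTEGER;
BEGIN
  -- Mark the plan as FIXED so the optimizer gives it preference
  l_plans := DBMS_SPM.alter_sql_plan_baseline(
               sql_handle      => 'SQL_0e4f45d7832dd5ab',
               plan_name       => 'SQL_PLAN_0wmu5uy1kvpdb8c1d711e',
               attribute_name  => 'fixed',
               attribute_value => 'YES');
END;
/

-- Then purge the existing cursor so the next parse re-evaluates the baseline
-- (address and hash_value come from V$SQLAREA for sql_id 6zs5dk6t6pkfs)
-- exec DBMS_SHARED_POOL.PURGE('<address>,<hash_value>', 'C');
```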

As an experiment I ran an explain plan of the MERGE in SQL Developer replacing the bind variable with the literal that was being passed.
The explain plan returns the good plan (same as 3064471754).

As a workaround I ask the team to run the MERGE from SQL Developer with the literal (after confirming their explain plan also returns the good plan).

Upon monitoring the SQL Developer session it is confirmed that it is using the good plan with the exact same plan ID (3064471754) and completes in 1 hour 15 minutes.
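One way to monitor a session like this (assuming the statement qualifies for Real-Time SQL Monitoring) is a sketch along these lines:

```sql
-- Text-format SQL Monitoring report for the most recent monitored statement;
-- sql_id/session_id parameters can narrow it to the SQL Developer session
SELECT DBMS_SQLTUNE.report_sql_monitor(type => 'TEXT') FROM dual;
```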

As it is Friday afternoon we agree on this workaround for the weekend (the support team works 24-hour shifts).

The next day the batch job MERGE picks a different bad plan.
The support team kill it and run the workaround.

The workaround then stops working as well, as it picks the same bad plan as the batch job.
The SQL Developer MERGE runs for 5 hours before being killed.

The new bad plan is 1501257745.

1501257745 looks logically the same as the good plan (3064471754) and the potentially good plan in the baseline (1932271691) as it has the same join order.

However, we now have a breakthrough – the new bad plan gives us a new angle to look at.

1501257745 starts by spawning many parallel processes, however it soon drops off to one serial process.
This one serial process is performing the top-level MERGE operation which flatlines for hours before being killed.

I decide to retry fixing the original good plan.
I start with a clean slate and drop the SQL baseline and both plans contained within.
I create a new STS from a different snapshot period than before which only contains 3064471754.
I load the STS into the baseline and a new baseline and plan are created.
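The clean-slate drop can be done with DBMS_SPM; a sketch (dropping by sql_handle alone removes the baseline and every plan in it):

```sql
DECLARE
  l_plans PLS_INTEGER;
BEGIN
  -- Drops the whole baseline for this statement, all plans included
  l_plans := DBMS_SPM.drop_sql_plan_baseline(
               sql_handle => 'SQL_0e4f45d7832dd5ab');
END;
/
```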

When querying the contents of the baseline it still just contains the potentially good plan from before (1932271691).

Oracle is now deliberately and consistently changing my plan in transit from 3064471754 to 1932271691.

With the new insight given by the new bad plan (1501257745) I now look at the top-level DML parts of the plans.

Logically the potentially good plan (1932271691) and the new bad plan (1501257745) are the same statement as 3064471754 but without the parallel MERGE.

Plan hash value: 1932271691 and 1501257745 
----------------------------------------------------------------------------- 
 | Id | Operation | Name | 
----------------------------------------------------------------------------- 
 | 0 | MERGE STATEMENT | | 
 | 1 | MERGE | DWB_X_PLNGRM_ATTR_DAY |


Plan hash value: 3064471754 
---------------------------------------------------------------------------------- 
| Id | Operation | Name | 
 ---------------------------------------------------------------------------------- 
| 0 | MERGE STATEMENT | | 
| 1 | PX COORDINATOR | | 
| 2 | PX SEND QC (RANDOM) | :TQ10007 | 
| 3 | INDEX MAINTENANCE | DWB_X_PLNGRM_ATTR_DAY | 
| 4 | PX RECEIVE | | 
| 5 | PX SEND RANGE | :TQ10006 | 
| 6 | MERGE | DWB_X_PLNGRM_ATTR_DAY |

Now that we are concentrating on the parallelism of the DML, I remember working at a previous site and having to enable parallel DML at the session level.

I ran an explain plan of the MERGE before and after issuing the following statement:

ALTER SESSION ENABLE PARALLEL DML;

Afterwards the plan changed to include parallel steps for the MERGE.
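A quick way to confirm the session state before and after is to check V$SESSION (a sketch):

```sql
ALTER SESSION ENABLE PARALLEL DML;

-- PDML_STATUS should now show ENABLED for the current session
SELECT pdml_status, pddl_status, pq_status
FROM   v$session
WHERE  sid = SYS_CONTEXT('USERENV', 'SID');
```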

I told the support team to do this in their SQL Developer session, and the MERGE began running with the usual good plan, even with the same plan hash value (3064471754).

Now that PDML is enabled in my session, I reloaded the same STS as before into the same baseline.

Once again the only high-level change was the updated timestamp on the plan; however, when querying the detail of the plan it no longer contained 1932271691 but a new plan, 2197295520.

Plan hash value: 2197295520 
 
---------------------------------------------------------------------------------- 
| Id | Operation | Name | 
---------------------------------------------------------------------------------- 
| 0 | MERGE STATEMENT | | 
| 1 | PX COORDINATOR | | 
| 2 | PX SEND QC (RANDOM) | :TQ10007 | 
| 3 | INDEX MAINTENANCE | DWB_X_PLNGRM_ATTR_DAY | 
| 4 | PX RECEIVE | | 
| 5 | PX SEND RANGE | :TQ10006 | 
| 6 | MERGE | DWB_X_PLNGRM_ATTR_DAY | 
| 7 | PX RECEIVE | | 
| 8 | PX SEND HYBRID (ROWID PKEY) | :TQ10005 | 
| 9 | VIEW | | 
| 10 | HASH JOIN BUFFERED | | 
| 11 | PX RECEIVE | | 
| 12 | PX SEND HASH | :TQ10003 | 
| 13 | HASH JOIN BUFFERED | | 
| 14 | PX RECEIVE | | 
| 15 | PX SEND HASH | :TQ10001 | 
| 16 | PX BLOCK ITERATOR | | 
| 17 | TABLE ACCESS STORAGE FULL | DWR_X_PLNGRM_LOC_BAY_CNT | 
| 18 | PX RECEIVE | | 
| 19 | PX SEND HASH | :TQ10002 | 
| 20 | VIEW | | 
| 21 | HASH GROUP BY | | /****only difference from original good plan (which has SORT GROUP BY instead)****/ 
| 22 | PX RECEIVE | | 
| 23 | PX SEND HASH | :TQ10000 | 
| 24 | PX BLOCK ITERATOR | | 
| 25 | TABLE ACCESS STORAGE FULL| DWB_X_SEL_ITEM_UNITS_DAY | 
| 26 | PX RECEIVE | | 
| 27 | PX SEND HASH | :TQ10004 | 
| 28 | PX BLOCK ITERATOR | | 
| 29 | TABLE ACCESS STORAGE FULL | DWB_X_PLNGRM_ATTR_DAY | 
----------------------------------------------------------------------------------

The tables are joined in the correct order and the MERGE is parallel.
The only difference is line 21 which has HASH GROUP BY instead of SORT GROUP BY.

I fixed this plan, hoping that the batch version of the MERGE would pick it up tomorrow.
It doesn't; however, it now runs with the original good plan (3064471754) every day.

Whilst I’m still looking into this, a key point to take away is that when using SQL plan baselines, the environment of the session creating the baseline and that of the session using the baseline both count toward the eligibility of a plan.

