We have a Peoplesoft database that manages HR for ~80,000 staff. Oracle 10.2.03. Peoplesoft Tools 8.49 PS HR 9 HPUX 11.31 Itanium
There has been long running problems caused by I/O bottlenecks writing down to Hitachi XP SAN disks. We were using filesystem storage and disk_asynch_io was set to false
The root cause of our problem was that the Time and Labour batches contend when they need to acquire exclusive access to write dirty blocks to data files during truncate operations.
David Kurtz has written about a specific Peoplesoft feature around truncate http://www.go-faster.co.uk/docs.htm#local_write_wait
We migrated to using ASM disk and achieved significant benefits.
The testing stages were
1) Prepare a 2 hour Time and labour batch run which would generate significant throughput
2) Capture a baseline at 10.2.0.3
3) Upgrade to 10.2.0.4 and retest
4) Upgrade to 10.2.0.4 and migrate to ASM (220.127.116.11) and retest
5) Upgrade to 18.104.22.168 and migrate to ASM (22.214.171.124) and retest
Test 3 did not show any difference at all. We were still blocked on disk, which was expected. We probably would have remained at 10.2.0.3 but gone to ASM but there are problems with seeing ASM disk when using 10.2.0.3 (Metalink Note 434500, ORA-15059) so we had the option of applying a fix to get ASM working with 10.2.0.3 or upgrading to 10.2.0.4. We decided that the benefits of getting onto a terminal release and suggestions from Oracle that small changes in the truncate process had been made between 10.2.0.3 and 10.2.0.4 out-weighed the overhead of any application testing that we needed to do.
Test 4 gave us big performance benefits. Test 5 was not as clear cut but we knew some code had to be optimised and we did not have the luxury of time to address the issues. Remaining on 10g minimised the application testing, certainly compared to the testing require to upgrade to 11g.
The benefits can be seen quite clearly in the graph below
This shows a 2 hour period where a constant throughput of T&L jobs were being processed. In parallel a constant online load was being simulated
The black line shows throughput and elapsed time. The run consists of a number of jobs which take 2 hours to complete and batch time is around the 450 – 600 second mark
The same processing is then run with 10.2.0.4 database and ASM. disk_aysnch_io is set to FALSE. The jobs complete 40 minutes faster and elapsed time is around the 300-400 second mark but showing spikes and inconsistent performance
Finally the orange line shows 10.2.0.4 database and ASM. disk_aysnch_io =TRUE. The processing time has now been reduced to just over an hour and the elapsed time is significantly better at an average of 250 seconds and with a very consistent profile.
As with all things in life we have now moved onto the next bottleneck which is the wait flashback buf free caused because we have enabled flashback logging and as the redo log buffer fills up we wait for flashback to be written out. However this is a miniscule issue compared to the original problems. We have taken flashback off some tablespaces primarily used for PS temporary tables (aka working storage tables) although that does give us an issue when when rebuilding the primary database after a DataGuard failover using flashback logs
I hope this real world example is of use