Oracle DBA – A lifelong learning experience

New ASM power levels in 11.2.0.2 and beyond

Posted by John Hallas on February 13, 2015

I recently saw the following command in a script that was due to be run and assumed it contained a typo: surely the power level should have been 5, not 500.

ALTER DISKGROUP DATA REBALANCE POWER 500;

After some research I found that it was not a typo but a new method of disk rebalancing that was introduced in 11.2.0.2.

Previously, the power limit ranged from 0 to 11, and setting it caused a matching number of ARBx processes to be created; these were removed once the rebalance had finished.

That was a nice simple situation which I had no problem with. The range was adequate and I normally used between 4 and 7, depending on how busy the system was and any performance impact that might be caused. The impact was easy to monitor using a variety of tools such as top or glance on a *nix platform.

From 11gR2 (11.2.0.2) onwards, when a disk group has its ASM compatibility set to 11.2.0.2 or greater, the operational range of values for the rebalance power is 0 to 1024. Note that if the POWER clause specifies a value larger than 11 for a disk group with ASM compatibility set to less than 11.2.0.2, a warning is displayed and a POWER value of 11 is used for the rebalance. A second point to note is that once a disk group's compatibility setting (ASM or RDBMS) has been advanced to a higher value, the change cannot be reversed.
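The capping rule described above can be written down as a few lines of Python. This is only a sketch of the rule as stated in this post, not Oracle's implementation; the function name `effective_power` and the way the compatibility string is parsed are my own assumptions.

```python
def version_tuple(version: str, width: int = 5) -> tuple:
    """Turn '11.2.0.2' into (11, 2, 0, 2, 0) so versions compare numerically."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def effective_power(requested: int, compatible_asm: str) -> int:
    """Return the power a rebalance would run at, per the rule in the text.

    Hypothetical helper: 0-1024 is accepted when ASM compatibility is
    11.2.0.2 or higher; otherwise anything above 11 is capped at 11.
    """
    if not 0 <= requested <= 1024:
        raise ValueError("power must be between 0 and 1024")
    if version_tuple(compatible_asm) >= version_tuple("11.2.0.2"):
        return requested
    return min(requested, 11)


if __name__ == "__main__":
    for compat in ("11.1", "11.2.0.2", "12.1"):
        print(compat, effective_power(500, compat))
```

So the POWER 500 from the script at the top of this post would only really mean 500 on a disk group whose ASM compatibility is at least 11.2.0.2; on anything older it would run at 11.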

So what does that mean in practice? On the face of it this is a simple change, but it is now very hard, well-nigh impossible, to see the impact that the rebalance is having on the server. Consequently I do not see the advantage of it, other than possibly on massively high-end systems.

Firstly, how did you know what the power level was before 11.2.0.2?

  • Look in the V$ASM_OPERATION view for the power value
  • Run ps -ef | grep -i ARB | wc -l and see how many processes were running
  • Look in the ASM trace log

However, finding the power level in use was not that important; monitoring what the rebalance was doing was the key. I would use Glance or top to see whether the ARB processes were major consumers of resource. I would look at disk and HBA throughput and take a view on how busy the system was, and if it looked like the rebalance might be affecting performance I would drop the power level.

From 11gR2 onwards, when a rebalance operation is started the RBAL process spawns only one ARB process. This ARB process uses OS-level parallel threads, and the number of threads corresponds to the power limit setting. The number allocated can be seen in the ASM alert log, as in the example below (using a power level of 5):

Tue Dec 23 14:17:48 2014
ARB0 started with pid=30, OS id=22462
NOTE: assigning ARB0 to group 3/0xa1488c8 (TSTDATA) with 5 parallel I/Os
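Since the alert log is the only place this number shows up, it can be handy to pull it out with a script. The following is a minimal sketch that assumes the message keeps exactly the wording shown above; the regular expression and the function name `arb_assignments` are mine, not anything Oracle provides.

```python
import re

# Matches lines such as:
# NOTE: assigning ARB0 to group 3/0xa1488c8 (TSTDATA) with 5 parallel I/Os
ASSIGN_RE = re.compile(
    r"NOTE: assigning (ARB\d+) to group (\d+)/0x[0-9a-fA-F]+ "
    r"\(([^)]+)\) with (\d+) parallel I/Os"
)


def arb_assignments(log_text: str) -> list:
    """Return (process, group number, disk group name, parallel I/Os) tuples."""
    return [
        (m.group(1), int(m.group(2)), m.group(3), int(m.group(4)))
        for m in ASSIGN_RE.finditer(log_text)
    ]


# The excerpt quoted above, used as sample input.
SAMPLE = """Tue Dec 23 14:17:48 2014
ARB0 started with pid=30, OS id=22462
NOTE: assigning ARB0 to group 3/0xa1488c8 (TSTDATA) with 5 parallel I/Os
"""

if __name__ == "__main__":
    for proc, group, name, ios in arb_assignments(SAMPLE):
        print(f"{proc}: group {group} ({name}) running with {ios} parallel I/Os")
```

This only reports what ASM says it requested; it says nothing about how many I/Os are actually in flight at the OS level.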

This is a section from the ASM alert log after running a rebalance with a power of 40:

NOTE: starting rebalance of group 1/0xc04488bb (DATA) at power 40
Starting background process ARB0
Mon Feb 02 11:59:24 2015
WARNING: failed to retrieve ASM password file location (unable to communicate with CRSD/OHASD)
Mon Feb 02 11:59:24 2015
ARB0 started with pid=27, OS id=22535
Mon Feb 02 11:59:25 2015
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: assigning ARB0 to group 1/0xc04488bb (DATA) with 40 parallel I/Os
Mon Feb 02 12:02:41 2015
Mon Feb 02 12:04:23 2015
NOTE: ASMB process exiting due to lack of ASM file activity for 301 seconds
NOTE: ASMB clearing idle groups before exit
Mon Feb 02 12:04:24 2015
Mon Feb 02 12:04:24 2015
NOTE: client +ASM:+ASM deregistered
Mon Feb 02 12:06:05 2015
NOTE: requesting all-instance membership refresh for group=1
Mon Feb 02 12:06:05 2015
GMON updating for reconfiguration, group 1 at 895 for pid 21, osid 23281
Mon Feb 02 12:06:05 2015
NOTE: group 1 PST updated.
SUCCESS: grp 1 disk DATA_0012 emptied
NOTE: erasing header (replicated) on grp 1 disk DATA_0012
NOTE: erasing header on grp 1 disk DATA_0012
NOTE: process _x000_+asm (23281) initiating offline of disk 12.1503951013 (DATA_0012) with mask 0x7e in group 1 (DATA) without client assisting
NOTE: initiating PST update: grp 1 (DATA), dsk = 12/0x59a478a5, mask = 0x6a, op = clear
Mon Feb 02 12:06:06 2015
GMON updating disk modes for group 1 at 896 for pid 21, osid 23281
Mon Feb 02 12:06:06 2015
NOTE: PST update grp = 1 completed successfully
NOTE: initiating PST update: grp 1 (DATA), dsk = 12/0x59a478a5, mask = 0x7e, op = clear
Mon Feb 02 12:06:06 2015
GMON updating disk modes for group 1 at 897 for pid 21, osid 23281
Mon Feb 02 12:06:06 2015
NOTE: cache closing disk 12 of grp 1: DATA_0012
Mon Feb 02 12:06:06 2015
NOTE: PST update grp = 1 completed successfully
Mon Feb 02 12:06:08 2015
GMON updating for reconfiguration, group 1 at 898 for pid 21, osid 23281
Mon Feb 02 12:06:08 2015
NOTE: cache closing disk 12 of grp 1: (not open) DATA_0012
Mon Feb 02 12:06:08 2015
NOTE: group 1 PST updated.
Mon Feb 02 12:06:09 2015
NOTE: membership refresh pending for group 1/0xc04488bb (DATA)
Mon Feb 02 12:06:09 2015
GMON querying group 1 at 899 for pid 14, osid 7263
GMON querying group 1 at 900 for pid 14, osid 7263
Mon Feb 02 12:06:09 2015
NOTE: Disk DATA_0012 in mode 0x0 marked for de-assignment
SUCCESS: refreshed membership for 1/0xc04488bb (DATA)
NOTE: Attempting voting file refresh on diskgroup DATA
Mon Feb 02 12:06:43 2015
NOTE: stopping process ARB0
Mon Feb 02 12:06:45 2015
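The timestamps in an excerpt like this are enough to work out how long ARB0 was running. Below is a rough sketch that assumes the timestamp format shown above (a timestamp on its own line, preceding the messages it applies to); the helper names are hypothetical, and alert logs from other versions may format timestamps differently.

```python
from datetime import datetime

TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Mon Feb 02 11:59:24 2015"


def parse_ts(line: str):
    """Return a datetime if the line is a bare alert-log timestamp, else None."""
    try:
        return datetime.strptime(line.strip(), TS_FORMAT)
    except ValueError:
        return None


def arb_elapsed(log_lines, arb: str = "ARB0"):
    """Time between '<arb> started' and 'stopping process <arb>', or None."""
    current = start = None
    for line in log_lines:
        ts = parse_ts(line)
        if ts is not None:
            current = ts  # remember the most recent timestamp seen
        elif f"{arb} started" in line:
            start = current
        elif f"stopping process {arb}" in line and start and current:
            return current - start
    return None


# A few lines lifted from the excerpt above.
SAMPLE = """Mon Feb 02 11:59:24 2015
ARB0 started with pid=27, OS id=22535
Mon Feb 02 11:59:25 2015
NOTE: assigning ARB0 to group 1/0xc04488bb (DATA) with 40 parallel I/Os
Mon Feb 02 12:06:43 2015
NOTE: stopping process ARB0
"""

if __name__ == "__main__":
    print("ARB0 ran for", arb_elapsed(SAMPLE.splitlines()))
```

It tells you how long the rebalance took, which is easy; what it did to the rest of the system in that time is the part that remains hard to see.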

What interested me was how to decide which power level to use, and how to determine the impact of each level and the difference between them. Some sample numbers are below.

Power level   Minutes
1             175
5             27
11            15
20            14
40            13
400           12
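To put those numbers in perspective, here is a small sketch that works out the speed-up relative to power 1 and the minutes saved by each step up. The figures are the ones from the table above; the function name `marginal_gains` is just for illustration.

```python
# Rebalance timings from the table above: power level -> minutes.
TIMINGS = {1: 175, 5: 27, 11: 15, 20: 14, 40: 13, 400: 12}


def marginal_gains(timings: dict) -> list:
    """Return (power, minutes, speed-up vs lowest power, minutes saved vs previous)."""
    rows = []
    powers = sorted(timings)
    baseline = timings[powers[0]]
    previous = None
    for power in powers:
        minutes = timings[power]
        saved = 0 if previous is None else previous - minutes
        rows.append((power, minutes, baseline / minutes, saved))
        previous = minutes
    return rows


if __name__ == "__main__":
    print(f"{'Power':>5} {'Minutes':>8} {'Speed-up':>9} {'Saved':>6}")
    for power, minutes, speedup, saved in marginal_gains(TIMINGS):
        print(f"{power:>5} {minutes:>8} {speedup:>8.1f}x {saved:>6}")
```

Almost all of the gain comes from the old 0 to 11 range: going from power 1 to 11 saves 160 minutes, while going from 11 all the way to 400 saves only 3 more.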

 

If I can find a suitable system I might do some more testing in this area, but what I really wanted to identify was whether the overall system was affected and by how much. I could not find any metrics to show me that: the HBA and disk activity was very busy, but there did not seem to be much variance whichever power level I plugged in. I asked our Unix team to investigate, using both HP-UX and OEL servers, and they too could not find anything that showed them what was happening under the covers or what the line “with xx parallel I/Os” actually meant.

I finally raised a call with Oracle and kept asking the question until I was finally told that “there is no way of identifying how many parallel I/O threads are in operation or what the impact is”.

I must admit I find this totally unsatisfactory. I cannot see how I could consider using a very high power level (anything more than ~20) on a production or other important system where a performance impact might be noticed, and I think I will be tempted to stick to around 10 in future. Like many other DBAs, I do not have detailed access to storage metrics, and I generally do not know how many disks have been mapped to a LUN or what type of striping is on them; that is the domain of the SAN team. Given the lack of information available about the new power levels, I cannot really see what benefits the much higher limits provide.

I am sure this is a change for the better, but I do think it could have been documented much better than it has been.

4 Responses to “New ASM power levels in 11.2.0.2 and beyond”

  1. I totally agree with you. The new power limits are a mystery; I’m still using the old values.

  2. jkstill said

    Hi John,

    Did you use ps -m or -H to see threads at the OS level?
