Large audit trail table causes high db activity – especially when using OEM
Posted by John Hallas on June 24, 2015
On various databases, apparently unrelated we have noticed high activity that seems to be associated with the query below. The quieter the database the more the query stands out.
SELECT TO_CHAR(current_timestamp AT TIME ZONE 'GMT', 'YYYY-MM-DD HH24:MI:SS TZD') AS curr_timestamp, COUNT(username) AS failed_count, TO_CHAR(MIN(timestamp), 'yyyy-mm-dd hh24:mi:ss') AS first_occur_time, TO_CHAR(MAX(timestamp), 'yyyy-mm-dd hh24:mi:ss') AS last_occur_time FROM sys.dba_audit_session WHERE returncode != 0 AND timestamp >= current_timestamp - TO_DSINTERVAL('0 0:30:00')
The detail of the audit table is that it contains 73M records and takes 8:07 minutes to run that query against.
The query is from an OEM metric and we found a bug in MoS – High Cpu Utilization from agent perl process running failedLogin.pl for audit_failed_logins metric (Doc ID 1265699.1)
Two reasons are shown for the cause
This is caused by unpublished BUG’s 9827824, 9556885, 7313103, and 7633167
There are 2 sides to this issue as follows:
1. The query run by the EM agent uses a TO_CHAR conversion in the where clause. Since this is converting the dates to characters, any index or optimization done at the table level will be ignored by the Oracle CBO (cost based optimizer). This has been reported as unpublished bug 9556885. This bug is fixed in the 188.8.131.52 cloud control agent, and in the Grid Control 184.108.40.206 PSU3 agent. Or there is a one off patch that can be applied to the 10.2.0.5 grid control agent (or 11.1.0 grid control agent)
2. If there is a large number of xml audit file in adump, the SQL will take longer due to unpublished BUG’s 7633167 and 7313103. These are database bugs, which are manifested in EM. This means that it would still be possible to see this behaviour in 12c. (but only step 2 of the workaround below would apply to 12c).
A patch is provided as part of Apply the 220.127.116.11 PSU3 agent patch. Or alternatively apply one off Patch:9556885 to the Grid Control management Agent that is monitoring the respective database. per the patch readme. This will fix the query that GRID uses for the metric.
However we are at agent 18.104.22.168 so the patch is not applicable. I therefore can think of 4 ways we could address this
1) Apply the individual patch 9556885 to each agent which could be a bit of a nightmare when we have 1100 agents
2) Index the audit table (aud$). Not particular standard and could cause an issue in a future upgrade so a good solution but not perfect
3) Remove the metric from the OEM template. Not ideal as it is one that is useful to us and required by security standards
4) Reduce the number of records in the table. We have a site standard to maintain 366 days worth of audit data for every database and we purge down daily to keep on top of that. However we have several databases with a high frequency of login/logout activity which is application driven where we see large audit tables and that seems the best way to manage them. It should be noted that high audit activity does not seem to always correlate with large volume databases
In this case reducing the data to 90 days (and keeping the previous 9 months worth in a parallel table) reduced the records down to 18M and the query took less than 2 minutes. If we put a process in place to move aud$ records from the aud$ table to another table every 3 months we should be able to manage the problem.
So whilst this issue was referenced in MoS as a bug the real issue is that we were keeping too much data in the audit table and we have to take steps to manage that data or suffer the consequences.
The lesson learnt from this is to monitor the growth of audit data and manage it so that it does not have unforseen impact on normal usage of the system.