Oracle DBA – A lifelong learning experience

Grid control sends false alerts with “Agent to OMS communication broken” message

Posted by John Hallas on July 12, 2011

We have been seeing an increasing number of alerts stating that OEM cannot ping an agent. These generate incidents and potential callouts. We had put it down to a busy network and the fact that we have a lot of distributed agents, but the situation kept getting worse, so we started to investigate.

The error message is Message=Agent is unable to communicate with the OMS. (REASON = Agent is Unreachable (REASON : Agent to OMS Communication is broken ). Severity=Unreachable Start

We are on GC 10.2.0.5. We came across Note 9276193.8 which highlights bug 9276193 –  gc sends false alerts with “agent to oms communication broken” message

There are two workarounds suggested :-

Turn off alerts notification – which is a bit of a joke really
Increase max_inactive_time in emd_ping table to a large value – the table name is actually mgmt_emd_ping.

The default value is 120 seconds; we upped it to 240 and that resolved our problems.
Below is a test case showing a selection of agents and their target GUIDs, and how we proved the fix.

Find details of suitable xxx targets

-- upper() returns upper case, so the pattern must be upper case to match
SELECT b.target_guid, a.target_name, b.max_inactive_time
FROM mgmt_targets a, mgmt_emd_ping b
WHERE a.target_guid = b.target_guid
AND upper(a.target_name) LIKE '%XXX%';

TARGET_GUID                      TARGET_NAME MAX_INACTIVE_TIME
-------------------------------- ----------- -----------------
FFC8DEFF86B5C750588391B57FD43214 target A                  240
917D961D82D8235108D2D3FA18322BBE target B                  240

Update target A back to 120

update mgmt_emd_ping set max_inactive_time = 120
where target_guid = 'FFC8DEFF86B5C750588391B57FD43214';

****Nothing. Maybe it was too much to ask for just 1 target to send an alert****

Update all xxx targets back to 120

-- upper() returns upper case, so the pattern must be upper case to match
SELECT b.target_guid
FROM mgmt_targets a, mgmt_emd_ping b
WHERE a.target_guid = b.target_guid
AND upper(a.target_name) LIKE '%XXX%';

TARGET_GUID
--------------------------------
C431649F98806B0DD3F47F9FBADC8E2E
E3204B4D77350D68095E29B6E7A7218F
60212915037A6D5D226A8F63715EE7C5
ACEE6F5536B726CFF040C8DD3F234E27
1EE675CE1CE275728C673EE1241481FD
34B7FEB8D7AB6FED56AF801CCF4AB967
233DD134EA4F805B8037D1DCFB1BA3F0
96183E9A4AA24E82A57AF4D4BA704C16
814F3EDA7CE206B38716483ABA9DB003
917D961D82D8235108D2D3FA18322BBE
EC2758A767CE1588D6C3325FD72300FE
99D16DB95152675563132A060C56267C
1E098EB594F123BD40F85FE0C65094F7
5799865D9B7E287F78B3764104FA0A07
FFC8DEFF86B5C750588391B57FD43214
699913BCC4F207B117FA3A8DBFB4926A
8DAEF1FEA2C257DB22CDB140BC14B613
521521F6EE88BFF52E119694D912937D
E7BA1AFCFE831927E5FBF6B0E22E454B
9BF87C8EBCB4C62FBABDDBA25DA8B12C
7B46F1DFFE7E7F34634FD9399E4496E9

Update targets

update mgmt_emd_ping
set max_inactive_time = 120
where target_guid in (
'C431649F98806B0DD3F47F9FBADC8E2E',
'E3204B4D77350D68095E29B6E7A7218F',
'60212915037A6D5D226A8F63715EE7C5',
'ACEE6F5536B726CFF040C8DD3F234E27',
'1EE675CE1CE275728C673EE1241481FD',
'34B7FEB8D7AB6FED56AF801CCF4AB967',
'233DD134EA4F805B8037D1DCFB1BA3F0',
'96183E9A4AA24E82A57AF4D4BA704C16',
'814F3EDA7CE206B38716483ABA9DB003',
'917D961D82D8235108D2D3FA18322BBE',
'EC2758A767CE1588D6C3325FD72300FE',
'99D16DB95152675563132A060C56267C',
'1E098EB594F123BD40F85FE0C65094F7',
'5799865D9B7E287F78B3764104FA0A07',
'FFC8DEFF86B5C750588391B57FD43214',
'699913BCC4F207B117FA3A8DBFB4926A',
'8DAEF1FEA2C257DB22CDB140BC14B613',
'521521F6EE88BFF52E119694D912937D',
'E7BA1AFCFE831927E5FBF6B0E22E454B',
'9BF87C8EBCB4C62FBABDDBA25DA8B12C',
'7B46F1DFFE7E7F34634FD9399E4496E9');
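The GUID list does not have to be pasted by hand. A minimal sketch of the same update driven by a subquery is shown here, assuming the same mgmt_targets and mgmt_emd_ping tables and the same xxx name pattern as the query above. It is an illustration, not the statement that was actually run in this test.

```sql
-- Sketch only: reset every xxx target to 120 without listing GUIDs by hand.
-- '%XXX%' stands in for the real (redacted) name pattern.
UPDATE mgmt_emd_ping
SET    max_inactive_time = 120
WHERE  target_guid IN (
         SELECT a.target_guid
         FROM   mgmt_targets a
         WHERE  upper(a.target_name) LIKE '%XXX%'
       );

-- Review the reported row count before making the change permanent.
COMMIT;
```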

Results

****2 alerts appeared 10 minutes after the update, with an unreachable and a clear within 1 minute****

Set back to 240

update mgmt_emd_ping set max_inactive_time = 240 where target_guid in (list of guids.....);

New targets are created with the default value of 120, and we have not yet found out how to change that default, so we have a scheduled grid job running to change anything with 120 to 240.
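The statement behind such a job could look something like the sketch below. The exact job definition is not shown in this post, so treat the statement, and the assumption that it runs as a user able to update the SYSMAN repository tables, as illustrative only.

```sql
-- Sketch only: move any target still on the 120 second default up to 240.
-- The sysman schema prefix is assumed; drop it if connected as SYSMAN.
UPDATE sysman.mgmt_emd_ping
SET    max_inactive_time = 240
WHERE  max_inactive_time = 120;

COMMIT;
```

Because the WHERE clause only touches rows still at 120, repeated runs are harmless and any target deliberately set to another value is left alone.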

7 Responses to “Grid control sends false alerts with “Agent to OMS communication broken” message”

  1. Turn off alerts notification – which is a bit of a joke really!!!🙂

    Nice post, thanks for sharing!

  2. Hi John

    Thanks for pointing this out – we have a client that wakes our on call guys once a week with this issue. The default is set where it should be, in the table definition.

    CREATE TABLE "SYSMAN"."MGMT_EMD_PING"
    ( "TARGET_GUID" RAW(16) NOT NULL ENABLE,
    "STATUS" NUMBER DEFAULT 1,
    "LAST_HEARTBEAT_TS" DATE DEFAULT NULL NOT NULL ENABLE,
    "LAST_HEARTBEAT_UTC" DATE DEFAULT NULL NOT NULL ENABLE,
    "CLEAN_HEARTBEAT_UTC" DATE NOT NULL ENABLE,
    "STATUS_SYNC_UTC" DATE NOT NULL ENABLE,
    "EMD_UPTIME_UTC" DATE NOT NULL ENABLE,
    "UNRCH_START_TS" DATE DEFAULT NULL,
    "MAX_INACTIVE_TIME" NUMBER DEFAULT 120,
    "DOWN_REASON_CODE" NUMBER DEFAULT 0,
    "DOWN_REASON_MSG" VARCHAR2(1024 BYTE) DEFAULT ' ',
    "HEARTBEAT_RECORDER_URL" VARCHAR2(256 BYTE) DEFAULT ' ',
    "PING_JOB_NAME" VARCHAR2(64 BYTE) DEFAULT NULL,
    "JOB_SUBMIT_TIME" DATE DEFAULT NULL,
    "BOUNCE_CTR" NUMBER DEFAULT 0,
    "HB_RECEIVED_UTC" DATE DEFAULT NULL,
    CONSTRAINT "MGMT_EMD_PING_PK" PRIMARY KEY ("TARGET_GUID")

    • John Hallas said

      Glad it was useful for you. Good spot that the table DDL sets the default value; I had just figured it was in a config file somewhere. We can remove our scheduled job and just change the default value to 240 for any new targets that we register.
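
      A minimal sketch of that change is below, assuming the column default shown in the DDL above is what new rows pick up. Altering a SYSMAN repository table is probably not a supported change, and nothing in this thread confirms it was tested, so try it on a non-production repository first.

      ```sql
      -- Sketch only: change the column default so new rows get 240.
      ALTER TABLE sysman.mgmt_emd_ping MODIFY (max_inactive_time DEFAULT 240);

      -- Check the default now recorded in the data dictionary.
      SELECT data_default
      FROM   dba_tab_columns
      WHERE  owner = 'SYSMAN'
      AND    table_name = 'MGMT_EMD_PING'
      AND    column_name = 'MAX_INACTIVE_TIME';
      ```

      A changed default only applies to rows inserted afterwards and only when the insert does not supply a value, so existing targets would still need the update shown earlier.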

  3. Thanks for this post… quite useful!
    Cheers!

  4. שלומי said

    Just understand that changing that default to 240 (4 minutes) means you will miss true alerts as well.
    I did it on my systems, and while failing over all of my systems during a scheduled downtime I did not get any alert except for the one DB that took more than 4 minutes to fail over.

    We decided for now to live with false alerts, rather than not getting true alerts.

  5. Mike said

    Hi John,
    I’m seeing the same kind of error occasionally generated from my OEM 12c system. The tables and columns in your SQL for increasing max_inactive_time still exist but are not populated with a default value. Research hasn’t found any further info on this. Do you know if there’s a different way of checking & defining this value for 12c? Many thanks.
