Oracle DBA – A lifelong learning experience

The power of emdiag

Posted by John Hallas on August 20, 2010

I am currently looking at emdiag and finding it more and more useful as I come to understand its capabilities. To copy a comment from a Metalink note: EMDIAG is a diagnostics and troubleshooting kit which can help with a health assessment of a site. It is a set of scripts developed by Werner De Gruyter, and instructions for download and usage are in Note 421053.1, EMDiagkit download and master index. I will not go into the installation instructions here but will just show a few of the commands that I am finding useful and an example of an issue that it has highlighted. Note that I have set my Oracle Home to be the OMS home, and repvfy is found in OH/bin. However, the output is written under wherever you have installed the emdiag software, which in my case is OH/emdiag/log.

repvfy dump health -pwd password 

This gives a very good overview of repository DB specific information, database performance statistics, installed OMS patchsets and EM monitoring targets.

repvfy -pwd password 

This loops through a list of modules and runs specific tests against each module. The output is a list of the tests run and the errors found. The first few lines of my current output show the following errors:

verifyAGENTS
001. Agents without a monitored host target: 2
101. Active Agents with clock-skew problems: 21
113. Agents not uploading any data: 63
verifyASLM
verifyAVAILABILITY
100. Broken targets marked as UP: 1
verifyBLACKOUTS
verifyCA
verifyCREDENTIALS
700. Orphaned target credentials: 1
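
Since the summary is plain text, it can be post-processed to pick out the tests worth drilling into. The following is a minimal sketch, assuming the output keeps the layout shown above (a verify<MODULE> line followed by "NNN. description: count" lines). The function name parse_repvfy_summary is hypothetical and is not part of emdiag; the sample text is copied from the output above.

```python
import re

# Sample copied from the repvfy output shown above.
SAMPLE = """\
verifyAGENTS
001. Agents without a monitored host target: 2
101. Active Agents with clock-skew problems: 21
113. Agents not uploading any data: 63
verifyASLM
verifyAVAILABILITY
100. Broken targets marked as UP: 1
verifyBLACKOUTS
verifyCA
verifyCREDENTIALS
700. Orphaned target credentials: 1
"""

# Assumed line layouts, inferred only from the sample above.
MODULE_RE = re.compile(r"^verify([A-Z_]+)$")
TEST_RE = re.compile(r"^(\d+)\.\s+(.*):\s+(\d+)$")


def parse_repvfy_summary(text):
    """Return {module: [(test_number, description, count), ...]}.

    Hypothetical helper: test numbers stay as strings so that
    leading zeros such as '001' are preserved.
    """
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        module = MODULE_RE.match(line)
        if module:
            current = module.group(1)
            results[current] = []
            continue
        test = TEST_RE.match(line)
        if test and current is not None:
            results[current].append(
                (test.group(1), test.group(2), int(test.group(3)))
            )
    return results


summary = parse_repvfy_summary(SAMPLE)
for module, tests in summary.items():
    for number, description, count in tests:
        print(f"{module:<14}{number}  {count:>4}  {description}")
```

Modules that report no errors, such as ASLM, end up with an empty list, so they are easy to filter out.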

The complete list of modules available can be found by using the repvfy -h4 command

repvfy -h4

repvfy 2010.0514 - EMDIAG - Repository verification
Usage:
 repvfy [-{h}] [-i] [-t <trace lvl>] [-zip] <commands>
      [-usr <user>] [-pwd <pwd>] [-tns <tns alias>]
      [-module <module name>] [-test <test number>] [-level <level>] [-detail]
      [-name <obj name>] [-type <obj type>] [-col <obj col>] [-owner <obj owner>] [-guid <obj guid>]
      [-stime <start time>] [-etime <end time>] [-id <obj id>] [-vers <obj version>]
      [/{d|o} <home>] [/log <dir>] [/{sid} <name>] [/u {<env>,...}] [/v {<env>=<var>,...}]

-- Available modules for VERIFY --

  AGENTS           Grid Control Agents
  ASLM             Application Server Level Monitoring
  AVAILABILITY     Availbility sub-system
  BLACKOUTS        Blackout sub-system
  CA               Corrective Actions
  CREDENTIALS      Credentials
  DEVELOPMENT      Development/Test (internal only)
  ECM              Configuration Management
  EVENT            Event sub-system
  JOBS             Job sub-system
  LOADERS          Loader
  METRICS          Metrics
  NOTIFICATIONS    Notification sub-system
  PLUGINS          Plugins and extentions
  POLICIES         Policies and violations
  PROVISIONING     Provisioning setup and configuration
  RCA              Root Cause Analysis Engine
  REPORTS          Reporting framework
  REPOSITORY       Repository
  ROLES            Roles and privileges
  TARGETS          Targets
  TEMPLATES        Templates
  USERS            User sub-system

If we want to focus on a particular test that is indicating problems, we can get more information by running that test in isolation and gathering both the SQL used and the problems identified. Test 101 is showing 21 agents whose clock differs from that of the OMS server by more than 120 seconds in either direction.

verifyAGENTS
101. Active Agents with clock-skew problems: 21
repvfy verify agents -test 101 -pwd password -detail 

Two files have been created, a sql file and a detail file. In my view one of the best features is that the command above produces the SQL query that it is running. This is a very good way to find out which tables are being used and where data within the repository is stored. The sql file contains the following.

SELECT agent, timezone_region, difference "seconds",
       DECODE(SIGN(difference),-1,'-','+')||
               TRIM(TO_CHAR(MOD(FLOOR(ABS(difference)/3600),24),'09'))||'h'||
               TRIM(TO_CHAR(MOD(FLOOR(ABS(difference)/60),60),'09'))||'m'||
               TRIM(TO_CHAR(MOD(ABS(difference),60),'09'))||'s' clock_skew
FROM   (SELECT t.target_name agent, t.timezone_region,
               (p.last_heartbeat_utc-(MGMT_GLOBAL.TO_UTC(p.last_heartbeat_ts,t.timezone_region)))*86400 difference
        FROM   mgmt_emd_ping p, mgmt_targets t
        WHERE  p.target_guid = t.target_guid
          AND  p.status = 1
          AND  p.max_inactive_time > 0)
WHERE  difference NOT BETWEEN -120 AND 120
ORDER BY difference
;
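
To make the clock_skew expression and the 120 second filter easier to follow, here is a minimal sketch of the same arithmetic outside the database. It assumes the difference is a whole number of seconds (the real column can be fractional, and TO_CHAR would round it). The function names and the sample values are hypothetical and are not taken from the repository.

```python
def format_clock_skew(difference):
    """Mirror the clock_skew expression for a whole number of seconds."""
    # DECODE(SIGN(difference),-1,'-','+'): zero counts as '+'
    sign = "-" if difference < 0 else "+"
    magnitude = abs(difference)
    hours = (magnitude // 3600) % 24   # MOD(FLOOR(ABS(d)/3600),24)
    minutes = (magnitude // 60) % 60   # MOD(FLOOR(ABS(d)/60),60)
    seconds = magnitude % 60           # MOD(ABS(d),60)
    # TO_CHAR(...,'09') pads each part to two digits
    return f"{sign}{hours:02d}h{minutes:02d}m{seconds:02d}s"


def is_skewed(difference, threshold=120):
    """Mirror WHERE difference NOT BETWEEN -120 AND 120."""
    return not (-threshold <= difference <= threshold)


# Hypothetical differences in seconds, for illustration only.
for diff in (-500, -90, 0, 120, 121, 240):
    print(diff, format_clock_skew(diff), is_skewed(diff))
```

Because BETWEEN is inclusive, a difference of exactly 120 seconds is not reported. The hours part is taken modulo 24, so a skew of more than a day would show only the remainder.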

The log file details which agents are out of sync with the OMS server, which for us raised an interesting question.

Within OEM there is a pre-built and locked report called “Agents Clock synchronization offset”, which we have been using for a long time. That report shows that we have no agents that are more than a few seconds out, and yet the emdiag query shows we have 21 targets that differ by between 240 and 500 seconds. The OEM report is a locked-down query, so I have an SR open with Oracle to try to determine why we see the differences. Just for information, the emdiag report is correct and the clocks are out on 21 servers. It might be worth trying this out on your own environments.

So that is a brief overview of how I am using emdiag and no doubt I will post more as I delve deeper.
