The power of emdiag
Posted by John Hallas on August 20, 2010
I am currently looking at emdiag and finding it more and more useful as I come to fully understand its capabilities. To copy a comment from a Metalink note: EMDIAG is a diagnostics and troubleshooting kit which can help with a health assessment of a site. It is a set of scripts developed by Werner De Gruyter, and instructions for download and usage are in Note 421053.1, "EMDiagkit download and master index".
I will not go into the installation instructions here but will just show a few of the commands that I am finding useful and an example of an issue that it has highlighted. Note that I have set my Oracle Home to be the OMS home, and repvfy is found in OH/bin. However, the output is written under wherever you have installed the emdiag software, which in my case is OH/emdiag/log.
repvfy dump health -pwd password
This gives a very good overview of repository DB specific information, database performance statistics, installed OMS patchsets and EM monitoring targets.
repvfy -pwd password
This loops through a list of modules and runs specific tests against each module. The output is a list of the tests run and the errors found. The first few lines of my current output show the following errors:
verifyAGENTS
001. Agents without a monitored host target: 2
101. Active Agents with clock-skew problems: 21
113. Agents not uploading any data: 63
verifyASLM
verifyAVAILABILITY
100. Broken targets marked as UP: 1
verifyBLACKOUTS
verifyCA
verifyCREDENTIALS
700. Orphaned target credentials: 1
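With many modules the summary gets long, so it can help to pull out only the tests that reported something. The following is a minimal Python sketch of that idea, not part of emdiag itself. It assumes the summary is printed with one module header (verifyXXX) or one test result per line, as in the excerpt above; the names parse_summary and SAMPLE_OUTPUT are mine and purely illustrative.

```python
import re

# Sample taken from the excerpt in this post, one item per line (assumed layout).
SAMPLE_OUTPUT = """\
verifyAGENTS
001. Agents without a monitored host target: 2
101. Active Agents with clock-skew problems: 21
113. Agents not uploading any data: 63
verifyASLM
verifyAVAILABILITY
100. Broken targets marked as UP: 1
verifyBLACKOUTS
verifyCA
verifyCREDENTIALS
700. Orphaned target credentials: 1
"""

MODULE_RE = re.compile(r"^verify([A-Z_]+)$")      # e.g. verifyAGENTS
TEST_RE = re.compile(r"^(\d+)\.\s+(.*):\s+(\d+)$")  # e.g. 101. <description>: 21


def parse_summary(text):
    """Return {module: [(test_number, description, count), ...]}.

    Modules that reported no problems map to an empty list.
    """
    findings = {}
    module = None
    for raw in text.splitlines():
        line = raw.strip()
        header = MODULE_RE.match(line)
        if header:
            module = header.group(1)
            findings[module] = []
            continue
        result = TEST_RE.match(line)
        if result and module is not None:
            number, description, count = result.groups()
            findings[module].append((number, description, int(count)))
    return findings


if __name__ == "__main__":
    for module, tests in parse_summary(SAMPLE_OUTPUT).items():
        for number, description, count in tests:
            print(f"{module:<14} {number:>4}  {count:>4}  {description}")
```

The test number is kept as a string so that leading zeros such as 001 survive, which matters if you later pass it back to repvfy with -test.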
The complete list of modules available can be found by using the repvfy -h4 command
repvfy -h4
repvfy 2010.0514 - EMDIAG - Repository verification
Usage: repvfy [-{h}] [-i] [-t <trace lvl>] [-zip]
              <commands>
              [-usr <user>] [-pwd <pwd>] [-tns <tns alias>]
              [-module <module name>] [-test <test number>] [-level <level>] [-detail]
              [-name <obj name>] [-type <obj type>] [-col <obj col>] [-owner <obj owner>] [-guid <obj guid>]
              [-stime <start time>] [-etime <end time>] [-id <obj id>] [-vers <obj version>]
              [/{d|o} <home>] [/log <dir>] [/{sid} <name>]
              [/u {<env>,...}] [/v {<env>=<var>,...}]

-- Available modules for VERIFY --
AGENTS         Grid Control Agents
ASLM           Application Server Level Monitoring
AVAILABILITY   Availbility sub-system
BLACKOUTS      Blackout sub-system
CA             Corrective Actions
CREDENTIALS    Credentials
DEVELOPMENT    Development/Test (internal only)
ECM            Configuration Management
EVENT          Event sub-system
JOBS           Job sub-system
LOADERS        Loader
METRICS        Metrics
NOTIFICATIONS  Notification sub-system
PLUGINS        Plugins and extentions
POLICIES       Policies and violations
PROVISIONING   Provisioning setup and configuration
RCA            Root Cause Analysis Engine
REPORTS        Reporting framework
REPOSITORY     Repository
ROLES          Roles and privileges
TARGETS        Targets
TEMPLATES      Templates
USERS          User sub-system
If we want to focus on a particular test that is indicating problems, we can get more information by running that test in isolation and gathering both the SQL used and the problems identified. Test 101 is showing 21 agents whose clock differs from that of the OMS server by more than 120 seconds in either direction.
verifyAGENTS 101. Active Agents with clock-skew problems: 21
repvfy verify agents -test 101 -pwd password -detail
Two files have been created: a SQL file and a detail file. In my view one of the best features is that the command above produces the SQL query that it is running. This is a very good way to find out which tables are being used and where data within the repository is stored. The SQL file contains the following:
SELECT agent, timezone_region, difference "seconds",
       DECODE(SIGN(difference),-1,'-','+')||
       TRIM(TO_CHAR(MOD(FLOOR(ABS(difference)/3600),24),'09'))||'h'||
       TRIM(TO_CHAR(MOD(FLOOR(ABS(difference)/60),60),'09'))||'m'||
       TRIM(TO_CHAR(MOD(ABS(difference),60),'09'))||'s' clock_skew
FROM   (SELECT t.target_name agent, t.timezone_region,
               (p.last_heartbeat_utc-(MGMT_GLOBAL.TO_UTC(p.last_heartbeat_ts,t.timezone_region)))*86400 difference
        FROM   mgmt_emd_ping p, mgmt_targets t
        WHERE  p.target_guid = t.target_guid
          AND  p.status = 1
          AND  p.max_inactive_time > 0)
WHERE  difference NOT BETWEEN -120 AND 120
ORDER BY difference
;
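The clock_skew expression in that query is dense, so here is a rough Python sketch of the same arithmetic: a sign, then hours (modulo 24), minutes and seconds, each padded to two digits, plus the NOT BETWEEN -120 AND 120 filter. The function names are my own, and the sketch assumes the difference is a whole number of seconds; it does not try to reproduce how Oracle's TO_CHAR rounds fractional values.

```python
def clock_skew(difference):
    """Format a skew in whole seconds the way the query's clock_skew column does.

    Mirrors DECODE(SIGN(difference),-1,'-','+') followed by the
    hours (mod 24), minutes (mod 60) and seconds (mod 60) parts.
    Assumes `difference` is an integer number of seconds.
    """
    sign = "-" if difference < 0 else "+"   # zero is shown as '+', as in the DECODE
    magnitude = abs(difference)
    hours = (magnitude // 3600) % 24        # wraps after a full day, like MOD(...,24)
    minutes = (magnitude // 60) % 60
    seconds = magnitude % 60
    return f"{sign}{hours:02d}h{minutes:02d}m{seconds:02d}s"


def outside_tolerance(difference, limit=120):
    """True when the skew fails `BETWEEN -limit AND limit` (bounds inclusive)."""
    return not (-limit <= difference <= limit)


if __name__ == "__main__":
    for skew in (-245, 90, 120, 500):
        flag = "reported" if outside_tolerance(skew) else "ignored"
        print(f"{skew:>6}  {clock_skew(skew)}  {flag}")
```

Note that because BETWEEN is inclusive, an agent exactly 120 seconds out is not reported; only differences beyond 120 seconds in either direction make it into the detail file.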
The log file details which agents are out of sync with the OMS server, which for us raised an interesting question.
Within OEM there is a pre-built and locked report called “Agents Clock synchronization offset” which we have been using for a long time. That report shows that we have no agents that are more than a few seconds out, and yet the emdiag query shows we have 21 targets that differ by between 240 and 500 seconds. The OEM report is a locked-down query, so I have an SR open with Oracle to try and determine why we see the differences. Just for information, the emdiag report is correct and the clocks are out on 21 servers. It might be worth trying this out on your environments.
So that is a brief overview of how I am using emdiag and no doubt I will post more as I delve deeper.