Exchange 2010 Management Pack for SCOM


Post by Messaging Tech 1 on Jul 13, 2012 8:27:36 GMT 5.5

Guidance, Tuning and Known Issues for the Exchange 2010 Management Pack for System Center Operations Manager

SUMMARY
This article is intended to give some best practice guidance along with workarounds to known issues involving the Exchange 2010 Management Pack (MP) running on System Center Operations Manager (SCOM). Please look through this document before calling for support or posting to the forums as the issue may be covered below. If you find that these issues are particularly troublesome or find additional issues that you want fixed please call into Microsoft Support and raise a Request For Hotfix with the Exchange group.

MORE INFORMATION

The Exchange 2010 Management Pack introduced a Correlation Engine. The Correlation Engine is a stand-alone Windows service that uses the Operations Manager SDK interface to first retrieve the health model (or instance space) and then process state change events. By maintaining the health model in memory and processing state change events, the Correlation Engine is able to determine when to raise an alert based on the state of the system.

In response to a problem, several monitors change state, and the corresponding state change events are forwarded by the agent to the Root Management Server (RMS). Once received by the RMS, they are processed by the Correlation Engine, which may raise an alert via the RMS’s Software Development Kit (SDK) interface. This alert is then visible on the Operations Manager Console.

Correlation Factors

The actions taken by the Correlation Engine are determined by several factors.

Monitor state change events: Monitors, which watch for specific diagnostics from Exchange such as event log messages, performance counter thresholds, and PowerShell task output events, register state change events when they detect that a problem has occurred or cleared (green to red or red to green), or when agents become unavailable or are placed in maintenance mode (and subsequently become available or are removed from maintenance mode).

Typically, alert rules are configured to fire when green to red state changes occur. In the Exchange Server 2010 Management Pack, this is not the case: alerts are not automatically raised by monitor state changes. Instead, the Correlation Engine determines the best alert to raise.

Health Model: The class hierarchy imported into Operations Manager by the Exchange Server 2010 Management Pack is extensive. The class hierarchy includes class relationships that define component dependencies throughout the system. By defining these component dependencies in the object representation of the product, the Exchange Server 2010 Management Pack is able to better understand the health of the Exchange organization. For example, if the Exchange Server 2010 Management Pack identifies Active Directory as offline, it will also report that Exchange messaging is not fully functional.

Timing: The Correlation Engine works in 90-second intervals. When state change events for multiple monitors come in at the same time, it waits to see whether anything else potentially related to the failure is detected so that it can make the most effective determination of the root cause.

Correlation Algorithm
Overview of the Correlation Engine process

1. First, it connects to the Operations Manager SDK service to download the Health Model hierarchy and instance state (on service startup only, or as needed if errors require it).

2. Next, it queries Operations Manager for the latest state change events related to entities in the Exchange Management Pack.

3. If new Non-Service Impacting (NSI) state changes are detected, then it raises alerts for them.

4. Key Health Indicator (KHI) monitors are then evaluated, and "chains" of red KHI monitors are isolated. These "chains" indicate issues where a dependency has failed and is impacting dependent processes. Recognizing these relationships is the key step.

5. Alerts are raised for the root cause monitor in the KHI chain.

6. It then waits 90 seconds, and then starts over at step 2 above.
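Steps 4 and 5 above can be illustrated with a small sketch. This is not the Correlation Engine's actual code; it assumes a simplified health model in which each monitor lists the monitors it depends on, and it treats a red monitor as the root cause when none of its own dependencies are red. The monitor names and the find_root_causes function are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def find_root_causes(depends_on: Dict[str, List[str]], red: Set[str]) -> Dict[str, List[str]]:
    """Group red KHI monitors into chains keyed by their root-cause monitor.

    depends_on maps a monitor name to the monitors it depends on.
    A red monitor is treated as a root cause when none of its own
    dependencies are red (an assumption made for this sketch).
    """

    def roots_of(monitor: str, path: FrozenSet[str]) -> Set[str]:
        path = path | {monitor}
        # Follow only red dependencies; skip anything already on this path
        # so that a malformed (cyclic) model cannot recurse forever.
        red_deps = [d for d in depends_on.get(monitor, []) if d in red and d not in path]
        if not red_deps:
            return {monitor}
        found: Set[str] = set()
        for dep in red_deps:
            found |= roots_of(dep, path)
        return found

    chains: Dict[str, List[str]] = {}
    for monitor in sorted(red):
        for root in sorted(roots_of(monitor, frozenset())):
            chains.setdefault(root, []).append(monitor)
    return chains


# Hypothetical health model: OWA depends on CAS, CAS depends on Active Directory.
model = {"OWA": ["CAS"], "CAS": ["ActiveDirectory"], "Transport": ["ActiveDirectory"]}
red_monitors = {"OWA", "CAS", "ActiveDirectory", "Mailbox"}

for root, chain in find_root_causes(model, red_monitors).items():
    print("alert on", root, "- chain:", chain)
```

In this example the sketch raises one alert against ActiveDirectory, with CAS and OWA in its chain, and a separate alert for Mailbox, which has no red dependency.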

Additional points of interest regarding the correlation engine process

· If the "chain" of KHIs includes both error and warning monitors, then the alert is raised as an error, regardless of the class of the root cause monitor. For example, if a top-level process defines an error monitor to catch failure cases, and if it is correlated to a warning monitor in a dependency, then the alert will be raised against the dependency, but it will be marked as an error instead of a warning.

· Not every class relationship is used for alert correlation. See the Appendix: Class Hierarchy later in this guide for the specific relationships used by the Correlation Engine.

· The KHI chain, including any forensic monitors, is included in the Alert Context field available in the properties of the final alert. This allows inspection of the monitors correlated to the given alert and, in the case of alerts firing from dependency monitors, is required to determine the specific failure referenced by the alert.

· Monitors in maintenance mode are simply skipped when evaluating the health model.
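The severity rule and the maintenance mode rule from the points above can be sketched as follows. This is only an illustration under assumptions: the chain is represented as a plain mapping of monitor name to "error" or "warning", and the chain_alert_severity function is hypothetical, not part of the Management Pack.

```python
from typing import Dict, Iterable, Optional


def chain_alert_severity(chain: Dict[str, str], maintenance: Iterable[str] = ()) -> Optional[str]:
    """Return the severity of the alert raised for a KHI chain.

    chain maps each red monitor in the chain to "error" or "warning".
    Monitors in maintenance mode are skipped. Returns None when no
    monitors remain after skipping.
    """
    skipped = set(maintenance)
    severities = [sev for name, sev in chain.items() if name not in skipped]
    if not severities:
        return None
    # Any error monitor in the chain makes the whole alert an error,
    # regardless of the class of the root cause monitor.
    return "error" if "error" in severities else "warning"


# A top-level error monitor correlated to a warning monitor in a dependency.
print(chain_alert_severity({"TopLevelProcess": "error", "Dependency": "warning"}))
```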

What is and is not Affected by Alert Correlation

A key point to understand about the Exchange Server 2010 Management Pack, and the Correlation Engine in particular, is what the Correlation Engine affects, and what it doesn’t affect.

The following items are different due to the Correlation Engine:

· Monitors are configured not to alert automatically on state change events. This allows the Correlation Engine to determine the best alert to raise (as described above).

· The Exchange Server 2010 Management Pack doesn't raise Exchange alerts that correspond to the health of your environment when the Correlation Engine is stopped. If the Correlation Engine is stopped, a general alert is raised to notify you that the Correlation Engine is not running.

The following items are not different due to the presence of the Correlation Engine:

· Overrides still work as expected; you can change certain values or disable monitors just as you do today.

· Monitors/objects in maintenance mode are skipped by the Correlation Engine. No special consideration is required since the monitors don’t raise state change events for consumption by the Correlation Engine.

· Per-monitor alert rules were added to the Exchange Server 2010 Management Pack. Per-monitor alert rules allow monitoring personnel to enter company-specific notes for a given alert into the Company Knowledge field, even when the alert rules aren’t used to raise alerts for their corresponding monitors.

· Other management packs are not affected by the presence of the Correlation Engine.

In summary, keep in mind that it’s just the "monitor state change to alert" step that is enhanced by correlation.

Operational Notes

Since the Correlation Engine needs to maintain the instance space of the management group in memory to determine related monitors/alerts, its memory footprint is proportional to the number of instances in the management group. In plain terms, the more Exchange servers and databases you have, the more memory it will require.

In observing environments at Microsoft, the Correlation Engine scales at roughly 5 megabytes per monitored Exchange server. There are factors that can drive this number up or down, but it’s a good starting point toward understanding the resource impact on the server hosting the service.

As stated above, the preferred location for the service is on the RMS role given the close SDK interaction and core functionality of raising alerts.

While SCOM 2007 is not limited by the number of managed servers, it is limited by the number of managed objects and the relationships between them. SCOM is by design an object-model-based solution, and every managed object defined in a management pack is tracked individually. The more unique managed objects and relationships there are, the harder SCOM has to work to track their health and process their workflows.

The maximum number of tested objects and relationships per SCOM Management Group works out as such:

Maximum number of Managed Objects: 800,000 - which is based on 10,000 Agents each with 80 instances

Maximum number of Relationships: ~1,000,000

These are only the maximum tested numbers from the SCOM development team. SCOM 2007 can manage more than this, but performance starts to suffer and monitoring may be impaired once these numbers are exceeded.

Major Note:
The Exchange Correlation Engine may not process alerting if there are too many managed objects or relationships, or if there are groups containing a large number of objects. The observed limits for relationships and group object members are:

Relationships: 600,000
Group object members: 1,000,000

This is a known hard limit. Beyond it, the Correlation Engine takes so long to gather this information that it hits a timeout, which causes the process to restart; it then hits the same timeout again and restarts continuously.
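As a rough planning aid, the figures quoted in this section can be checked against the counts in your own management group. The sketch below is only an illustration: the scale_warnings function is hypothetical, the counts must come from your own environment, and treating each figure as a strict "greater than" threshold is an assumption.

```python
from typing import List

# Figures quoted in this article.
TESTED_MAX_MANAGED_OBJECTS = 800_000
TESTED_MAX_RELATIONSHIPS = 1_000_000
CORRELATION_MAX_RELATIONSHIPS = 600_000
CORRELATION_MAX_GROUP_MEMBERS = 1_000_000


def scale_warnings(managed_objects: int, relationships: int, group_members: int) -> List[str]:
    """Return a warning for every quoted figure that the given counts exceed."""
    warnings: List[str] = []
    if managed_objects > TESTED_MAX_MANAGED_OBJECTS:
        warnings.append("managed objects above tested maximum")
    if relationships > TESTED_MAX_RELATIONSHIPS:
        warnings.append("relationships above tested maximum")
    if relationships > CORRELATION_MAX_RELATIONSHIPS:
        warnings.append("relationships above Correlation Engine limit")
    if group_members > CORRELATION_MAX_GROUP_MEMBERS:
        warnings.append("group members above Correlation Engine limit")
    return warnings


# Example counts (made up): within SCOM's tested maximums, but past the
# relationship limit observed for the Correlation Engine.
print(scale_warnings(managed_objects=500_000, relationships=650_000, group_members=10_000))
```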

The Exchange 2010 MP creates a lot of managed objects because of its design around the Correlation Engine. As a result, the number of managed objects and relationships in SCOM increases rapidly with each Exchange server added to the Management Group. Here are some typical managed object and relationship counts by server role:

Common across all Exchange Servers:
20 Managed Objects
25 Relationships

CAS:

40 Managed Objects
40 Relationships

Transport:

20 Managed Objects
30 Relationships

Unified Messaging:

15 Managed Objects
20 Relationships

Mailbox:

Per Database Copy:
40 Managed Objects
65 Relationships

Note: These are approximate numbers, and every environment's setup will be different.

Using these numbers, a simple 4-server Exchange 2010 installation comes to roughly this many managed objects and relationships:

Total Managed Objects: 340

Total Relationships: 710

Additionally, if more Database copies are added to the environment, these numbers increase rather quickly even without adding any more servers. Let’s say we added 2 more databases requiring a total of 4 database copies. We’ve now added 160 new managed objects and 260 relationships. That’s almost 50% more managed objects than before and a third more relationships without adding any new servers.
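The growth figures in the paragraph above follow directly from the per-database-copy counts. The short sketch below reproduces that arithmetic; it takes the 340 and 710 totals as given from this article, and the growth_from_db_copies function is a hypothetical helper.

```python
from typing import Tuple

# Approximate per-database-copy counts quoted in this article.
OBJECTS_PER_DB_COPY = 40
RELATIONSHIPS_PER_DB_COPY = 65


def growth_from_db_copies(copies: int, base_objects: int, base_relationships: int) -> Tuple[int, int, float, float]:
    """Return added objects, added relationships, and both as a fraction of the base."""
    added_objects = copies * OBJECTS_PER_DB_COPY
    added_relationships = copies * RELATIONSHIPS_PER_DB_COPY
    return (
        added_objects,
        added_relationships,
        added_objects / base_objects,
        added_relationships / base_relationships,
    )


objs, rels, obj_growth, rel_growth = growth_from_db_copies(4, 340, 710)
print(objs, rels)  # 160 260
print(f"{obj_growth:.0%} more managed objects, {rel_growth:.0%} more relationships")
```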

Because of this kind of growth, the maximum tested numbers for the management group are reached quickly. In fact, depending on the environment, a single SCOM management group can effectively manage only about 400-500 Exchange 2010 servers in a larger-scale installation.

Make sure to look at this scale when designing the SCOM monitoring environment.

Get SCOM Prepped
Besides the scale of the objects injected into SCOM, this Management Pack has a high data flow rate: there are not only potentially hundreds of thousands of managed objects, but also health state, performance, and event data flowing for them. To allow SCOM to handle this high data flow, there are a few things to do to prep SCOM:

On the Root Management Server (RMS):
SCOM CU3+ is highly recommended as there are quite a few performance based fixes including setting the standard agent queue size at 100 MB instead of the old 15 MB. This is required for Exchange 2010 agents as the amount of data to submit can at times grow quickly and the small queue can cause the agent to drop data or even stop functioning.

Additionally, there are Registry Keys to update to allow the RMS to more effectively utilize the server resources and reduce additional unneeded churn. The table below covers some of these keys:

Registry Hive: HKLM\Software\Microsoft\Microsoft Operations Manager\3.0
Key: GroupCalcPollingIntervalMilliseconds
Type: DWord
Value: 000dbba0
Description: Changes the Group Calculation processing to 15 minutes.

Registry Hive: HKLM\Software\Microsoft\Microsoft Operations Manager\3.0\Config Service
Key: Polling Interval Seconds
Type: Dword
Value: 00000078
Description: Changes the Config Service Polling to 2 minutes.
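The two values above are hexadecimal, which is easy to misread when entering them in Registry Editor. The following sketch simply converts them to decimal to confirm that they match the described intervals; the dword_to_int helper is a hypothetical name.

```python
def dword_to_int(hex_text: str) -> int:
    """Convert a DWORD written as hexadecimal text to an integer."""
    return int(hex_text, 16)


group_calc_ms = dword_to_int("000dbba0")   # GroupCalcPollingIntervalMilliseconds
config_poll_s = dword_to_int("00000078")   # Polling Interval Seconds

print(group_calc_ms, "ms =", group_calc_ms / 60000, "minutes")  # 900000 ms = 15.0 minutes
print(config_poll_s, "s =", config_poll_s / 60, "minutes")      # 120 s = 2.0 minutes
```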

Finally, for the RMS, ensure wherever possible that no agents report directly to the RMS. The Exchange 2010 MP hosts a lot of “Non-Hosted” managed objects on the RMS, which has to process a lot of health states, and all alerting occurs from the RMS as well. Having the RMS also handle agent processing and data flow can hinder this work and should be avoided if at all possible.

On all Management Server(s):

For all Management Servers including the RMS, there are a few more registry keys to update to allow for better resource utilization for SCOM processing. The table below covers some of these keys:

Registry Key: HKLM\System\CurrentControlSet\Services\HealthService\Parameters\Persistence Cache Maximum
Type: Dword
Value: 00019000
Description: Allows more memory usage for the Health Service’s data store on the local system.

Registry Key: HKLM\System\CurrentControlSet\Services\HealthService\Parameters\Persistence Version Store Maximum
Type: Dword
Value: 00002800

Registry Key: HKLM\System\CurrentControlSet\Services\HealthService\Parameters\Persistence Checkpoint Depth Maximum
Type: Dword
Value: 06400000

Registry Key: HKLM\System\CurrentControlSet\Services\HealthService\Parameters\State Queue Items
Type: Dword
Value: 00005000
Description: Allows more data to be stored in the Health Service’s data store on the local system.

Note: These updates do not apply to gateway servers.

SCOM Data warehouse: (Where applicable.)

The Exchange 2010 MP adds some new Datasets to the SCOM DW for custom reporting. These new datasets have their own set of aggregations that can take a bit more time to complete than normal. Thus, you need to increase the timeout for DW processing to allow these aggregations to finish.

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse]

"Command Timeout Seconds"=dword:00000384

This updates the DW processing timeout from 5 minutes to 15 minutes (hexadecimal 384 is 900 seconds).

Note: This needs to be done on every Management Server (including the RMS).

Note: The timeout may need to be as high as 30 minutes, depending on how much data flow there is (mainly event and performance collection).
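Because the timeout is stored in seconds as a hexadecimal DWORD, a different timeout needs a different hexadecimal value. The sketch below formats the line for a given number of minutes; timeout_reg_line is a hypothetical helper, and it assumes the value name has no trailing space.

```python
def timeout_reg_line(minutes: int) -> str:
    """Format the Data Warehouse command timeout as a .reg style line."""
    seconds = minutes * 60
    if not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("timeout does not fit in a DWORD")
    return '"Command Timeout Seconds"=dword:{:08x}'.format(seconds)


print(timeout_reg_line(15))  # "Command Timeout Seconds"=dword:00000384
print(timeout_reg_line(30))  # "Command Timeout Seconds"=dword:00000708
```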

MP Changes

The sheer amount of monitoring in the Exchange 2010 MP provides a range of alerting criteria not previously seen in SCOM. This is by far the largest MP to date from Microsoft, and it provides a massive amount of visibility into Exchange issues. However, some things in the Management Pack simply don’t work, and some of them can hinder or even stop all SCOM alert processing because of the nature of the check or alert. These issues become much more visible as you scale up. For this reason, the disables and overrides below cover the most common faulty monitoring in the Exchange MP. They are designed to provide more accurate monitoring for highly critical issues and to reduce the chance of impacting SCOM monitoring.

Monitors:

Disable the following Monitors

· KHI: Database dismounted or service degraded.

· KHI: The Database copy is not mounted on the local server. Another database copy may be mounted on a different server.

Optional Overrides:

All non-reporting-based Performance Collections should be disabled; there are 504 Performance Collection rules to disable. After disabling them, review these performance collections for each environment and re-enable whichever ones matter there.

Other Considerations

Due to how the MP is designed, the Correlation Engine has to cache the Exchange infrastructure to be able to determine if all is healthy. It is highly important that all Exchange servers in an Exchange Site are in the same SCOM management group. Having only part of the whole Site in a single SCOM management group will cause a lot of noise as the Correlation Engine is expecting to see all these servers in the site, but does not see them in SCOM. Please plan accordingly to ensure that all active Exchange servers in each site are properly monitored in the same SCOM management group. (Best example is the North America Site is managed in one SCOM management group, while the South America Site is managed in another SCOM management group.)

Additionally, if any monitoring does not seem to be correct and is causing churn or noise, turn it off by disabling the corresponding monitor. Once it is disabled, review the criteria and determine whether it’s actionable and whether any additional tuning is needed. It’s better to stop the alerting for a short time to ensure SCOM isn’t about to break than to allow a noisy alert that will mask other potential issues.

Finally, make sure that your SCOM agents are healthy. Put a remediation process in place for agents that don’t report in. (You can attach a recovery process to “Health Service Heartbeat Failure”.) A lot of the Exchange monitoring is based on the best health of n servers. If the one healthy server is not reporting in, SCOM thinks they are all unhealthy and can generate a flood of alerts in the process, as this MP most likely has 50+ monitoring criteria associated with that one healthy server. Keeping the agents reporting in is key to ensuring monitoring is accurate.

Common Errors

Symptom
MicrosoftExchangeServerRoleDisovery.js returns empty discovery data for non-domain servers. The server is not discovered as an Exchange server role in the Operations Manager database.

Cause
The MicrosoftExchangeServerRoleDisovery.js script creates a property bag that returns the role of every Exchange server. The script does not return any error or give any indication of why it failed unless an override is placed on the discovery to set VerboseLogging=True.

The script looks for the following parameters to be populated:

· Computer Principal Name

· Computer Netbios name

· Computer Active Directory Site

· Computer DNS name

· Install Path

· Version

If any of these parameters are not populated, the script will return an empty property bag and the server will not be discovered as an Exchange server role in the Operations Manager database.

If an Exchange Edge server is in a workgroup, the ComputerActiveDirectorySite parameter is not populated. Because of this, the server will not be discovered by Operations Manager. Apparently, it is quite common for Edge servers to be workgroup machines, so monitoring them in Operations Manager is not possible without faking some value for this parameter.
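The all-or-nothing behavior described above can be sketched as follows. This is not the discovery script itself (which is JScript); it is a minimal Python illustration, and apart from ComputerActiveDirectorySite the parameter names, the helper functions, and the sample values are assumptions.

```python
from typing import Dict, List

# Parameter names are assumed spellings of the list above.
REQUIRED_PARAMETERS = [
    "ComputerPrincipalName",
    "ComputerNetbiosName",
    "ComputerActiveDirectorySite",
    "ComputerDnsName",
    "InstallPath",
    "Version",
]


def missing_parameters(params: Dict[str, str]) -> List[str]:
    """Return the required parameters that are absent or empty."""
    return [name for name in REQUIRED_PARAMETERS if not params.get(name)]


def build_property_bag(params: Dict[str, str]) -> Dict[str, str]:
    """Return an empty bag if anything is missing, otherwise the full set."""
    if missing_parameters(params):
        return {}
    return {name: params[name] for name in REQUIRED_PARAMETERS}


# Hypothetical workgroup Edge server: no Active Directory site is populated.
edge_server = {
    "ComputerPrincipalName": "edge01",
    "ComputerNetbiosName": "EDGE01",
    "ComputerActiveDirectorySite": "",
    "ComputerDnsName": "edge01.example.test",
    "InstallPath": "C:\\Program Files\\Microsoft\\Exchange Server\\V14",
    "Version": "14.1",
}

print(missing_parameters(edge_server))   # ['ComputerActiveDirectorySite']
print(build_property_bag(edge_server))   # {}
```

Once the site parameter holds any non-empty string, the same input produces a populated property bag, which is the effect the resolution below relies on.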

Resolution
Add a registry key to the server so that it returns a non-NULL value for Active Directory Site:

1. Open Registry Editor

2. Navigate to HKLM\System\CurrentControlSet\Services\NetLogon\Parameters

3. Find the SiteName value for this key. Populate this value with any non-NULL string (such as "perimeter", "DMZ", "Edge", etc.).

Note: Do not update the DynamicSiteName value, as the NetLogon service can overwrite this data. The SiteName value is not automatically updated.

Symptom
Alerts raised by SCOM have 10 custom fields, which are commonly used for integration with ticketing systems. The Exchange Management Pack (the Correlation Engine in particular) uses some of the same fields for its internal needs and therefore overwrites their values.

For example, it stores the CorrelatedProblemId in CustomField10. If that field is overwritten, alerts cannot be closed.

Cause
The Exchange MP uses some custom fields for critical values. Connectors that use any of the following fields will experience this issue:

CustomField4

CustomField5

CustomField7

CustomField8

CustomField9

CustomField10

Resolution
Because the Exchange MP uses these fields, the workaround is to change the custom fields that the connector uses.
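When planning a connector, the overlap with the fields listed above can be checked up front. The sketch below is only an illustration; the function names are hypothetical, and the connector's field numbers are whatever your ticketing integration is configured to use.

```python
from typing import Iterable, List

# Custom field numbers that this article lists as used by the Exchange MP.
EXCHANGE_MP_CUSTOM_FIELDS = {4, 5, 7, 8, 9, 10}


def conflicting_custom_fields(connector_fields: Iterable[int]) -> List[int]:
    """Return the connector's custom fields that the Exchange MP also uses."""
    return sorted(set(connector_fields) & EXCHANGE_MP_CUSTOM_FIELDS)


def free_custom_fields() -> List[int]:
    """Return the custom fields (1-10) not listed as used by the Exchange MP."""
    return [n for n in range(1, 11) if n not in EXCHANGE_MP_CUSTOM_FIELDS]


# Hypothetical connector that writes a ticket ID to CustomField1 and a
# ticket status to CustomField10.
print(conflicting_custom_fields([1, 10]))  # [10]
print(free_custom_fields())                # [1, 2, 3, 6]
```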

Symptom
AD integration breaks after installing the Exchange 2010 Management Pack. As soon as the Exchange 2010 MP is installed on a server running the SCOM agent, AD integration breaks for that agent. The memberships in the primary and secondary groups all show up correctly, but the agent reads everything in AD and tries to connect to all the management servers, including the RMS (even where the RMS is not configured for AD integration).

Cause
This happens because when Exchange 2010 is installed on a machine, the machine account is added to the following three additional domain groups that are created:

· Exchange Trusted Subsystem (read and special)

· Exchange Servers (only special)

· Exchange Windows Permissions (only special)

These three groups have permissions at the domain level itself, so the permissions are inherited by the “OperationsManager” container and the Management Group containers and SCPs beneath it. When the agent's health service runs under “local system”, it is able to read everything in AD under the OperationsManager container when it starts up.

Resolution
The issue is fixed by removing these three groups from the “OperationsManager” container thereby stopping inheritance.

Symptoms
Using System Center Operations Manager 2007, you import the Exchange 2010 Management Pack. As per the Exchange 2010 MP guide, all the object discovery rules are enabled by default and should automatically discover all Exchange 2010 roles and start monitoring them. This does not happen, thus you face a problem where none of the Exchange 2010 Server roles are getting discovered or getting monitored. It also does not log any error or throw any alert saying that discovery failed.

Causes
· This can occur if you install the 32-bit (x86) agent on a 64-bit (x64) based operating system or platform.
· This can happen if your Exchange 2010 Server roles are clustered. For example, the Mailbox server role or CAS server role is installed on Windows Cluster Server.

Resolutions

· Install the proper agent for the platform or OS hosting the Exchange Server roles.
· Make sure that the OpsMgr 2007 R2 agent is installed on all clustered nodes. Then, from the OpsMgr Console -> Administration pane -> Device Management -> Agent Managed, open each agent computer and, on the Security tab, enable the Agent Proxy check box. Restart the System Center Management service on each agent computer after doing this; within a few minutes all Exchange 2010 Server roles should be discovered and monitored as expected.

Symptom
When attempting to override an alert priority on an Exchange 2010 rule in the Exchange 2010 MP, the override does not take effect. The default value is defined as $Data/EventData/CorrelatedContext/RootCause/Priority$.

Cause
Alerts are generated by the Correlation Engine, so the overrides do not take effect. The override scope is probably the issue.

Resolution
Scope the override to “All objects of Class: Root Management Server”.

Symptoms
After deploying the Exchange 2010 Management Pack in a System Center Operations Manager environment, the Exchange 2010 MP may set the RMS in a critical state with the following error:

Failed to deploy Data Warehouse component. The operation will be retried. Exception 'DeploymentException': Failed to perform Data Warehouse component deployment operation: Install; Component: Script, Id: '0672dd6a-1e36-2336-b1f0-f701fe67f8a2', Management Pack Version-dependent Id: 'ab06eb14-eaf1-0f0b-04b8-f1cdd33f4acc'; Target: Database, Server name: 'serverName', Database name: 'OperationsManagerDW'. Batch ordinal: 15; Exception: Must declare the scalar variable "@splitvalue". Must declare the scalar variable "@splitvalue". One or more workflows were affected by this. Workflow name: Microsoft.SystemCenter.DataWarehouse.Deployment.Component Instance name: <FQDN> Instance ID: {05432A69-69F6-2B53-2D79-52BD1AC6E289} Management group: groupName

If the Exchange 2010 MP is removed then the health will return to normal (green).

Cause
This can occur if the database collation is set to be case sensitive.

Resolution
Change the database collation to be case insensitive to resolve this issue.

Symptoms

When you drill down to the "Top Alerts" sub-report of the SLA report in the Exchange 2010 SP1 Management Pack for System Center Operations Manager 2007 R2, you get the following error:

An error has occurred during report processing, Query execution Failed For dataset TopAlerts.
Cannot Find either column "Exchange2010" or the user-defined Function or aggregate
Exchange2010.GetserverRole, or the name is ambiguous.

Resolution
Don't use this report.

Symptoms
High amounts of Config Churn are noticed after importing the Exchange 2010 Management Pack

Cause
There are two discoveries in the Exchange 2010 RTM MP version 14.0.650.8 that cause some significant config churn. The two discoveries target the “Mailbox” class – and are:

· Microsoft.Exchange.2010.Mailbox.MdbOwningServerLocalEntityDiscoveryRule

· Microsoft.Exchange.2010.Mailbox.MdbOwningServerRemoteEntityDiscoveryRule

These run every 14400 seconds by default (4 hours). The problem is they collect some properties that change with every run of the discovery. The commonly churning properties are:

· DatabaseSize

· DbFreeSpace

· LogDriveFreeSpace

Resolution
Upgrade to the latest version of the Exchange 2010 MP for SP1, which changes these values to NULL. Otherwise, consider overriding these discoveries to run once per day (86400 seconds, the maximum supported for this particular discovery) until the condition is resolved with an updated MP. This will not eliminate the config churn; it will simply reduce the amount caused by this specific workflow. If you try to set the interval to more than 86400, you will get a synch time error from the Scheduler Data Source Module.
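The effect of the interval override can be put in numbers. The sketch below only illustrates the arithmetic and the 86400-second ceiling described above; discovery_runs_per_day is a hypothetical helper, not part of the Management Pack.

```python
DEFAULT_INTERVAL_SECONDS = 14400   # 4 hours, the default quoted above
MAX_INTERVAL_SECONDS = 86400       # once per day, the maximum supported


def discovery_runs_per_day(interval_seconds: int) -> float:
    """Return how often a discovery runs per day for a given interval."""
    if interval_seconds <= 0 or interval_seconds > MAX_INTERVAL_SECONDS:
        raise ValueError("interval must be between 1 and 86400 seconds")
    return 86400 / interval_seconds


print(discovery_runs_per_day(DEFAULT_INTERVAL_SECONDS))  # 6.0
print(discovery_runs_per_day(MAX_INTERVAL_SECONDS))      # 1.0
```

Moving from the default to the maximum interval cuts the runs of each discovery from six per day to one, which is why the override reduces the churn without removing it.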

Symptoms
Account lockout: some customers who have enabled account lockout policies in their environment have reported issues with the test user being locked out.

Resolution
If you experience lockout problems in your environment, see Microsoft Knowledge Base article 2022687, Exchange Test CAS Connectivity user gets locked out when using Exchange 2010 MP
(http://go.microsoft.com/fwlink/?linkid=3052&kbid=2022687).

Symptoms
Event Messages Concerning MSExchange Management Event Log. If the Exchange Server 2010 Service Pack 1 (SP1) version of the Management Pack is imported before all Exchange servers are upgraded to Exchange Server 2010 Service Pack 1 (SP1), the event log message below may be logged regularly.

Log Name: Operations Manager
Source: Health Service Modules
Event ID: 26004
Level: Error
Description:
The Windows Event Log Provider is still unable to open the MSExchange Management event log on computer 'server'. The Provider has been unable to open the MSExchange Management event log for 565200 seconds.
Most recent error details: The specified channel could not be found. Check channel configuration.
One or more workflows were affected by this.

Cause
The logging of this event is expected behavior when servers that have the RTM version of Exchange 2010 installed use the Exchange 2010 SP1 Management Pack. The Exchange Server 2010 Service Pack 1 (SP1) version of the Management Pack will still monitor Exchange computers that are running Exchange Server 2010 Service Pack 1 (SP1) and Exchange Server 2010 RTM while this event is being logged.

Symptoms
Size of the Operations Database grows out of control after disabling rules in the Exchange 2010 MP.

Cause
The Correlation Engine checks to see whether alerts have been created before creating a new event in the PendingSDKDatasource table. The basic components are monitors and their matching rules: each monitor has a corresponding rule. The monitors change state and the Correlation Engine picks up on the changes. If it is a new issue, a new event is created via the SDK, and the event description contains the whole chain of monitors. The rules then look for the SDK events and create an alert.

If the alert rule is disabled, the Correlation Engine will continue to insert events. Depending on the timing of some of the monitors and the type of failure, the number of events will continue to grow. As side effects, the PendingSDKDatasource table grows quite large and the rules have trouble keeping up with the number of events. This may cause the MonitoringHost process running those workflows to consume very large amounts of memory (Private Bytes), which can hurt overall performance on the RMS if that is where the Correlation Engine resides. The PendingSDKDatasource table is groomed once a day, but depending on how unhealthy the Exchange environment is, this may be too long an interval.

The main takeaway is: DO NOT disable an Exchange alerting rule unless you also disable the corresponding monitor.

Resolution
Disable the monitors that correspond with the alerting rules.
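One way to audit for this condition is to compare the list of disabled alert rules with the list of disabled monitors. The sketch below assumes you have already exported those lists and a rule-to-monitor mapping yourself; the function and the rule and monitor names are hypothetical.

```python
from typing import Dict, Iterable, List


def rules_missing_monitor_disable(
    rule_to_monitor: Dict[str, str],
    disabled_rules: Iterable[str],
    disabled_monitors: Iterable[str],
) -> List[str]:
    """Return disabled alert rules whose corresponding monitor is still enabled."""
    disabled_monitor_set = set(disabled_monitors)
    return sorted(
        rule
        for rule in set(disabled_rules)
        if rule in rule_to_monitor and rule_to_monitor[rule] not in disabled_monitor_set
    )


# Hypothetical names, for illustration only.
mapping = {"Rule.DatabaseDismounted": "Monitor.DatabaseDismounted",
           "Rule.CopyNotMounted": "Monitor.CopyNotMounted"}

print(rules_missing_monitor_disable(
    mapping,
    disabled_rules=["Rule.DatabaseDismounted", "Rule.CopyNotMounted"],
    disabled_monitors=["Monitor.CopyNotMounted"],
))
```

Any rule reported by a check like this is one where the Correlation Engine would keep inserting events, so its monitor should be disabled as well.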

Symptoms
When you drill into the “Top Alerts” sub-report on the SLA report, you get the following error:
An error has occurred during report processing, Query execution Failed For dataset TopAlerts.
Cannot Find either column "Exchange2010" or the user-defined Function or aggregate Exchange2010.GetserverRole, or the name is ambiguous

Cause
This report is deprecated and not to be used.

Resolution
Do not use this report.

Note This is a "FAST PUBLISH" article created directly from within the Microsoft support organization. The information contained herein is provided as-is in response to emerging issues. As a result of the speed in making it available, the materials may include typographical errors and may be revised at any time without notice. See Terms of Use
(http://go.microsoft.com/fwlink/?LinkId=151500) for other considerations.
