ZCS Operational Best Practices - Monitoring and Operational Actions

{{TabHeader}}
{{Tab2|[[ZCS Operational Best Practices - Scope|Introduction and Scope]]}}
{{Tab2|[[ZCS Operational Best Practices - Operational Structure and Guidelines|Operational Structure and Guidelines]]}}
{{Tab2|[[ZCS Operational Best Practices - Management Practices|Management Practices]]}}
{{Tab1|[[ZCS Operational Best Practices - Monitoring and Operational Actions|Monitoring and Operational Actions]]}}
{{TabFooter}}


This section details the actions recommended in the preceding sections. These recommendations contain the exact commands and methods of procedure needed to perform the operations. In many cases, however, the expected results of the operations are not detailed: because there are many variables in how a system may be configured and what equipment is deployed, each site’s acceptable values must be established by its own operations team. Where possible, a ‘reasonable’ value has been suggested in this document.

This section also documents the minimum level of operational activity, and the minimum set of ‘key performance indicators’ (KPIs), required to manage and operate the ZCS systems. All the items in Section 3 must be addressed to achieve a ‘Best Practice’ level of operations; the items specified herein are the minimum acceptable levels.

Minimum Key Performance Indicators

If a starting point for the site’s Key Performance Indicators is required, the following factors are the smallest set of items that should be monitored. They provide a ‘basic’ level of service figures covering user accessibility to the site.

  • Core ZCS accessibility
    • MBS (Mail Box Server)
    • POP
    • HTTP
    • IMAP
  • MTA accessibility
  • Proxy accessibility
    • IMAP
    • POP
    • HTTP
  • MTA response times for the initial banner and also sending a message
  • POP message retrieval time
  • MTA round trip timings
  • HTTP response times
  • HTTP load times
  • IMAP response and sync times
  • Directory / Authentication Services / Provisioning

This information can be obtained via interactive monitoring scripts, which is the recommended method. It can also be obtained afterwards by processing the server log files.
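
As an illustration of such a script, the following sketch times the SMTP banner of a single MTA and records the result for later trend analysis. It is only a sketch: the host name, threshold, log path and alert address are placeholder assumptions, and it presumes that nc and a local mail command are available; a production site would normally build probes like this into its monitoring platform, with one probe per KPI.

#!/bin/sh
# Illustrative SMTP banner/timing probe (a sketch, not a supported Zimbra tool).
# MTA_HOST, MAX_SECONDS, LOG and ALERT_ADDR are placeholder values for the site.
MTA_HOST=mta1.example.com
MAX_SECONDS=5
LOG=/var/log/probes/smtp-banner.csv
ALERT_ADDR=oncall@example.com

START=$(date +%s)
# Open a connection, capture the greeting line, then disconnect politely.
BANNER=$(printf 'QUIT\r\n' | nc -w "$MAX_SECONDS" "$MTA_HOST" 25 | head -n 1)
ELAPSED=$(( $(date +%s) - START ))

# Keep a timestamped record so the figures can be graphed and trended later.
echo "$(date '+%Y-%m-%d %H:%M:%S'),$MTA_HOST,$ELAPSED,\"$BANNER\"" >> "$LOG"

# Alarm if the banner is missing, is not a 220, or arrives too slowly.
case "$BANNER" in
  220*) [ "$ELAPSED" -le "$MAX_SECONDS" ] && exit 0 ;;
esac
echo "SMTP probe on $MTA_HOST: banner='$BANNER' elapsed=${ELAPSED}s" |
  mail -s "SMTP service level alarm: $MTA_HOST" "$ALERT_ADDR"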

Minimum Automated System Monitoring

At a minimum, the logger service should be identifying critical events, in conjunction with SNMP traps, to monitor the ZCS server, disk space and the MySQL database. Email and pager alarms should be triggered from this utility.
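
Where a lightweight supplement to the logger and SNMP traps is wanted, a cron-driven check such as the sketch below can watch disk space directly. The threshold and alert address are assumptions, and the built-in Zimbra logger/SNMP facilities remain the primary mechanism.

#!/bin/sh
# Illustrative disk space alarm to supplement the logger/SNMP monitoring.
# THRESHOLD (percent used) and ALERT_ADDR are placeholders; adjust per site.
THRESHOLD=85
ALERT_ADDR=oncall@example.com

# Check every locally mounted filesystem and alarm on anything above the threshold.
df -lP | awk -v limit="$THRESHOLD" 'NR > 1 { use = $5; sub(/%/, "", use);
  if (use + 0 >= limit) printf "%s is %s%% full (%s)\n", $6, use, $1 }' |
while read -r line; do
  echo "$line" | mail -s "Disk space alarm on $(hostname)" "$ALERT_ADDR"
done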

Log monitoring only reports on whether the system is running, and only on issues that occurred within the time window in which it was run. Although the time period can be managed, it is important to note that this is a reactive monitoring approach.

Thus, to provide the Best Practice solution, additional automated checks should be used. These automatically test an aspect of the user experience of the system. These extra checks should be targeted at particular aspects of the system and should gather the required statistics. A good policy is to have one monitor for each of the Key Performance Indicators (KPIs) that are to be monitored.

All scripts should log and/or make the information available for historical and trend gathering purposes. It is recommended that the information be presented in a graphical manner, as this is the easiest format for operators to visually inspect. Obviously, alarms and reports should also be triggered if the service levels are abnormal.

With this in mind, the following monitoring processes are recommended. These processes should be triggered often, for example every few minutes; the exact rate depends upon the site.

  1. Automated scripts should send a predefined message to several mail accounts (at least one on each of the active Mail Servers in the system). The time taken for the initial banner to be returned, and then for the message to be sent, should be checked.
    These or other scripts should then retrieve the sent message via the POP, IMAP and/or HTTP proxy servers. The initial banner message and the total time for the operation should be checked.
    If the time for any of the operations is above the level acceptable for the site, service level alarms should be triggered and on-duty operators informed that an issue is occurring.
  2. An automated script should also access an account on each MBS via HTTP. This script should request the default login page, log into the interface and log out. Basic checks must be performed upon the returned pages to ensure the operations were successful.
    As with the other services, the timing of the operations should be checked and, if it is outside acceptable limits, service level alarms should be triggered and on-duty operators informed that an issue is occurring.
  3. An automated script should also interface with the provisioning system. This should change a value (the password, for example) on a special test account. Checks should then be made to ensure that the change has been made to the user’s account (an illustrative sketch of such a probe follows this list).
    This means that at least one provisioning change will occur in any period. This can be useful for other purposes, and the propagation of the change to all the directory caches in the system can also be checked. If the change is not pushed to all directory caches within the expected time period, service level alarms should be triggered and on-duty operators informed that an issue is occurring.
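
The following is a minimal sketch of the provisioning probe described in item 3. It assumes a dedicated test account and uses the standard zmprov utility, changing the free-form description attribute rather than the password so that the test account’s credentials stay stable; the account name, wait time and alert address are placeholders, and the script should be run as the zimbra user.

#!/bin/sh
# Illustrative provisioning probe: change an attribute on a test account and
# verify the change is visible both via the mailbox server and directly in LDAP.
# TEST_ACCOUNT and ALERT_ADDR are placeholders; run as the zimbra user.
TEST_ACCOUNT=probe-user@example.com
ALERT_ADDR=oncall@example.com
STAMP="probe-$(date +%Y%m%d%H%M%S)"

# Make a provisioning change through the normal provisioning interface.
zmprov ma "$TEST_ACCOUNT" description "$STAMP"

# Allow time for the change to propagate to the directory caches.
sleep 30

# Read the attribute back through the normal (SOAP) path and directly from LDAP.
SOAP_VALUE=$(zmprov ga "$TEST_ACCOUNT" description | awk -F': ' '/^description:/ {print $2}')
LDAP_VALUE=$(zmprov -l ga "$TEST_ACCOUNT" description | awk -F': ' '/^description:/ {print $2}')

if [ "$SOAP_VALUE" != "$STAMP" ] || [ "$LDAP_VALUE" != "$STAMP" ]; then
  echo "Provisioning probe: wrote $STAMP, read SOAP='$SOAP_VALUE' LDAP='$LDAP_VALUE'" |
    mail -s "Provisioning service level alarm" "$ALERT_ADDR"
fi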

Additional Automated System Monitoring

Sections 4.1 and 4.2 above list a minimum set of operational monitoring factors and processes. These allow problems to be identified quickly and ensure the system is monitored so that a good level of service is being delivered to end customers.

However, most of the above actions monitor the system reactively. To employ Best Practices, the system should also be monitored proactively in order to identify, and possibly mitigate, issues that may occur. Thus, scripts and/or probes should be created to monitor the factors outlined below.

Some of the metrics below are more useful than others depending on the individual site, and not all sites need to monitor all factors. This list is not definitive because, as stated in Section 3.2, other factors may be identified as equally or more important depending on the requirements of a specific site.

It is recommended that the monitoring results be processed into a visual/graphical format. The graphs and/or tables should contain an optimal or expected plot so that the current data can be easily judged to be within normal boundaries.

Trend analysis is also recommended for most of these factors. This information can be used to create the User and Machine profiles for the site. For this reason, some of this information can also be obtained from the historical values in the server .log and .stat files.

The ‘Recommended Period between checks’ is shown in the tables below. This is a very subjective area and is included as a guideline. It should be tailored to the components of each individual environment.

The titles used in this column are intended to provide a frame of reference for how often it is useful to interactively check the value of each statistic in a running system.

“mins” – Check every few minutes, as this directly monitors the service being offered to customers.

“hour” – Check this once or several times an hour. This metric is useful for understanding the system usage and for capacity planning.

“day” – It is sufficient on most systems to check this once a day.

General Linux Factors

The following indicators should be monitored on each machine in the system.
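
As an illustration, the basic Linux indicators (load, memory, swap and disk usage) can be sampled with standard commands. The snippet below is a minimal collection sketch; the output path is an assumption, and a real deployment would normally feed these figures into a monitoring system rather than a flat file.

#!/bin/sh
# Illustrative collection of basic Linux health indicators on one machine.
# OUT is a placeholder path.
OUT=/var/log/probes/linux-health.log
{
  date '+== %Y-%m-%d %H:%M:%S =='
  uptime        # load averages
  free -m       # memory and swap usage in MB
  df -hP        # filesystem usage
  vmstat 1 3    # CPU, run queue and swapping activity over a few seconds
} >> "$OUT"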


General System Services

The following items should be monitored for the whole system.
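
For the ZCS services themselves, zmcontrol can form the basis of a simple per-host check. The sketch below, run as the zimbra user, alarms if any managed service is reported as stopped or not running; the alert address is a placeholder, and the exact wording of zmcontrol status output may vary slightly between releases.

#!/bin/sh
# Illustrative check that all ZCS services on this host report as running.
# Run as the zimbra user; ALERT_ADDR is a placeholder.
ALERT_ADDR=oncall@example.com

PROBLEMS=$(zmcontrol status | grep -Ei 'stopped|not running')

if [ -n "$PROBLEMS" ]; then
  { echo "One or more services are not running on $(hostname):"; echo "$PROBLEMS"; } |
    mail -s "ZCS service alarm on $(hostname)" "$ALERT_ADDR"
fi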

MTA Services

The following lists the areas that should be monitored on all machines that run MTA Servers.


POP Services

The following lists the areas that should be monitored on all machines that run Proxy Servers and Mail Box Servers.
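
As an example, POP availability and response time on a proxy or mailbox server can be probed with a scripted POP3 session against a dedicated test account. In the sketch below the host, account, password and log path are placeholders, nc is assumed to be available, and a TLS-only service would need openssl s_client instead of a plain connection.

#!/bin/sh
# Illustrative POP3 availability/timing probe against a dedicated test account.
# POP_HOST, TEST_USER, TEST_PASS and LOG are placeholders for site values.
POP_HOST=proxy1.example.com
TEST_USER=probe-user@example.com
TEST_PASS=changeme
LOG=/var/log/probes/pop-probe.csv

START=$(date +%s)
# Log in, ask for the mailbox status, and quit; capture the server responses.
RESPONSES=$(printf 'USER %s\r\nPASS %s\r\nSTAT\r\nQUIT\r\n' "$TEST_USER" "$TEST_PASS" |
  nc -w 10 "$POP_HOST" 110)
ELAPSED=$(( $(date +%s) - START ))

# Record the timing for later trend analysis, and flag any error responses.
echo "$(date '+%Y-%m-%d %H:%M:%S'),$POP_HOST,$ELAPSED" >> "$LOG"
echo "$RESPONSES" | grep -q '^-ERR' &&
  echo "POP probe on $POP_HOST returned an error response" >&2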


IMAP Services

The following lists the areas that should be monitored on all machines that run Proxy and Mail Servers.


WebMail Services

The following lists the areas that should be monitored on all machines that run Proxy and Mail Servers.


MBS Services

The following lists the areas that should be monitored on all machines that run Mail Servers.


Directory Services

Both the replica and the master should be monitored.
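
One simple way to confirm that both the master and each replica are answering queries is an anonymous search of the LDAP root DSE, as sketched below. The host list and alert address are placeholders; the example assumes that the OpenLDAP client tools are on the path and that anonymous root DSE queries are permitted by the directory servers.

#!/bin/sh
# Illustrative LDAP availability check against the master and each replica.
# LDAP_HOSTS and ALERT_ADDR are placeholders.
LDAP_HOSTS="ldap-master.example.com ldap-replica1.example.com"
ALERT_ADDR=oncall@example.com

for host in $LDAP_HOSTS; do
  # Query the root DSE anonymously; any answer shows the server is responding.
  if ! ldapsearch -x -H "ldap://$host:389" -s base -b "" '(objectClass=*)' namingContexts >/dev/null 2>&1; then
    echo "LDAP server $host did not answer a root DSE query" |
      mail -s "Directory service alarm: $host" "$ALERT_ADDR"
  fi
done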


MTA Queue Services

The following lists the areas that should be monitored on all machines that run MTA Servers.

Checking message queues and destinations

The queues in the system should be checked to ensure that there is no unusual buildup or backlogs of email. This will also identify if there have been local delivery issues, as messages will be queued rather than delivered to the Mail Server.

Large queues, and queue saturation as a result of spam, viruses or worms, are the number one operational support issue facing messaging systems. Automated probes, as outlined above, should alert support staff when thresholds are reached. Rapid file system growth is usually an indication of message delivery problems.
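
A minimal sketch of such a probe is shown below: it counts the entries in the Postfix mail queue using the standard postqueue command and alarms above a site-defined threshold. The threshold and alert address are assumptions, and the script should be run as the zimbra user so that the Zimbra-bundled Postfix tools are on the path.

#!/bin/sh
# Illustrative mail queue size probe for the Zimbra MTA (Postfix).
# MAX_QUEUE and ALERT_ADDR are placeholder values; run as the zimbra user.
MAX_QUEUE=500
ALERT_ADDR=oncall@example.com

# postqueue -p lists queued messages; each entry starts with a hex queue ID.
QUEUED=$(postqueue -p | grep -c '^[0-9A-F]')

if [ "$QUEUED" -gt "$MAX_QUEUE" ]; then
  echo "$QUEUED messages queued on $(hostname) (threshold $MAX_QUEUE)" |
    mail -s "MTA queue alarm on $(hostname)" "$ALERT_ADDR"
fi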

If the queues are larger than expected, or contain domains that are not expected to have large amounts of queued messages, this should be investigated. Depending upon the circumstances, the investigation can be carried out in various ways. However, the following steps are recommended starting points for the investigation.

  1. Report this issue in the internal issue reporting system – so that it can be tracked and known about by the other operators.
  2. Use commands on the server that capture the unusual output to a file. The output should include a histogram of when the messages were received by the system and the reasons for the messages being deferred. For Postfix MTA installations this information can also be found in the admin console (see the command-line sketch after the SMTP example below).
  3. Check to see if any other issues such as NIO errors etc. are being reported.
  4. Alternatively, search the zimbra.log or mailbox.log for reports of messages being deferred; these entries can be investigated to help identify the reason for the deferrals.
  5. If the reason for messages being deferred appears to be that the peer system is not accepting them, then the following commands can be used to test that the remote domain is accepting messages.

Ensure that from the MTA machines the domain can be telnet’ed to on port 25 via the following command:

telnet domain 25

The response to this should be a 220 status code followed by the SMTP banner. If this occurs, an SMTP session can be performed to ensure that the remote domain will accept messages from this domain; in other words, the following commands can be sent. The numeric reply lines shown should be returned by the peer in normal operation:

helo example.com
250 text
mail from:<postmaster@example.com>
250 text
rcpt to:<postmaster@domain.com>
250 text
data
354 text
This is a test please ignore this message
.
250 text
quit
221 text

Other tools and methods can be used to investigate this issue. However, the key is that abnormal activity is identified and an investigation is started as soon as possible.
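
To supplement the methods above, the deferred queue can also be summarised directly from the command line. The commands below are a sketch only: they assume the qshape tool distributed with Postfix is available and that the MTA logs to the default /var/log/zimbra.log location; exact paths may differ between installations.

# Summarise the deferred queue by recipient domain and message age (a histogram).
qshape deferred

# Summarise by sender domain instead, which helps spot spam or compromised accounts.
qshape -s deferred

# Pull recent deferral reasons out of the MTA log for further investigation.
grep 'status=deferred' /var/log/zimbra.log | tail -n 50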

Note: The MTA or ZCS system will normally clear the queues itself, and little operational involvement is usually required. However, should an issue be encountered on a major ISP domain, or should it persist for a long time, other action may be required. In these cases, the normal Zimbra support escalation procedures should be used.

Per-shift activities

The operators of the system should perform the following activities at least once every 8 hours.

Some of these activities may be performed automatically; however, the procedures to perform the operations are detailed below. Note: Even though some of the operations can be performed automatically, the results need to be checked manually.

Scripts can contain detailed automated conditions so that alarms are raised when issues occur. However, it is recommended that the cause of each alarm be reported, so that the first action of operators is to verify the reason for the alarm.

All investigations and commands specified in the section below should be performed as the zimbra user, after ensuring that the zimbra profile has been sourced in the shell.
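
For example, switching to the zimbra user with a login shell sources the profile automatically; the profile path shown below is the default for a standard installation.

# Switch to the zimbra user with a login shell so the Zimbra profile is sourced.
su - zimbra

# If already running a shell as zimbra without the environment, source the profile manually.
source /opt/zimbra/.bash_profile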



Verified Against: ZCS 6.0 and 5.0
Date Created: 3/24/2010
Article ID: https://wiki.zimbra.com/index.php?title=ZCS_Operational_Best_Practices_-_Monitoring_and_Operational_Actions
Date Modified: 2010-03-26


