ZCS Operational Best Practices - Monitoring and Operational Actions


   KB 3525        Last updated on 2015-08-03





This section details the actions that need to be performed as recommended in the preceding sections. These recommendations contain the exact commands and methods of procedure for performing the operations. However, in many cases the expected results of the operations are not detailed. Since there are many variables in how a system may be configured and what equipment is deployed, each site's acceptable values must be established by its operations team. In most cases a 'reasonable' value has been detailed in this document.

This section also documents the minimum level of operational activities and ‘key performance indicators’ (KPIs) that should be performed to manage and operate the ZCS systems. All the items in Section 3 must be addressed to achieve a ‘Best Practice’ level of operations; the items specified herein are the very minimum acceptable levels.

Minimum Key Performance Indicators

If a starting point for the Key Performance Indicators is required for the site, then the following factors form the minimum set of items that should be monitored. These will provide a 'basic' level of service figures about user accessibility to the site.

  • Core ZCS accessibility
    • MBS (Mail Box Server)
    • POP
    • HTTP
    • IMAP
  • MTA accessibility
  • Proxy accessibility
    • IMAP
    • POP
    • HTTP
  • MTA response times for the initial banner and also sending a message
  • POP message retrieval time
  • MTA round trip timings
  • HTTP response times
  • HTTP load times
  • IMAP response and sync times
  • Directory / Authentication Services / Provisioning

This information can be obtained via interactive monitoring scripts, which is the recommended method. It can also be obtained afterwards by processing the server log files.

Minimum Automated System monitoring

At a minimum, the ZCS logger should identify critical events, in conjunction with SNMP traps, to monitor the ZCS servers, disk space and the MySQL database. Email and pager alarms should be triggered from this utility.

Log monitoring only reports on whether the system is running, and only on issues that occurred within the time window in which it was run. Although the time period can be managed, it is important to note that this is a passive, after-the-fact monitoring approach.

Thus, to provide a Best Practice solution, additional automated systems should be used that automatically test aspects of the user experience in the system. These extra items should be targeted at particular aspects of the system and gather the required statistics. A good policy is to have a monitor for each of the Key Performance Indicators (KPIs) that are to be monitored.

All scripts should log and/or make the information available for historical and trend gathering purposes. It is recommended that the information be presented in a graphical manner, as this is the easiest format for operators to visually inspect. Obviously, alarms and reports should also be triggered if the service levels are abnormal.

With this in mind the following monitoring processes are recommended. These processes should be triggered often, for example, every several minutes. The exact rate depends upon the site.

  1. Automated scripts should send a predefined message to several mail accounts (at least one on each of the active Mail Servers in the system). The time taken for the initial banner to be returned, and then for the message to be sent, should be checked.
    These or other scripts should then retrieve the sent message via the POP, IMAP and/or HTTP proxy servers. The initial banner message and the total time for the operation should be checked. (A minimal probe sketch follows this list.)
    If the time for any of the operations is above the level acceptable for the site, service level alarms should be triggered and on-duty operators informed that an issue is occurring.
  2. An automated script should also access an account on each MBS via HTTP. This script should request the default login page, log in to the interface and log out. Basic checks must be performed on the returned pages to ensure the operations were successful.
    As with the other services, the timings for the operations should be checked and, if they are outside acceptable limits, service level alarms should be triggered and on-duty operators informed that an issue is occurring.
  3. An automated script should also interface with the provisioning system. This should change a value (the password, for example) on a special test account. Checks should then be made to ensure that the change has been applied to the user's account.
    This means that at least one provisioning change will occur in any period. This can be useful for other purposes as well: the propagation of the change to all the directory caches in the system can also be checked. If the change is not pushed to all directory caches within the expected time period, service level alarms should be triggered and on-duty operators informed that an issue is occurring.
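
The exact tooling is site specific, but the probe processes above can be approximated with standard utilities. The following is a minimal sketch, assuming hypothetical placeholder hosts and a dedicated probe account (mta.example.com, pop.example.com, probe@example.com); replace these with site-specific values and add the site's own thresholds and alarm handling:

#!/bin/bash
# Minimal KPI probe sketch (illustrative only). Placeholder values: the
# hosts and the probe account/password must be replaced per site.
MTA=mta.example.com
POP=pop.example.com
PROBE_ACCT=probe@example.com
PROBE_PASS=secret

# Build a small predefined test message.
printf 'Subject: KPI probe %s\n\nAutomated probe message.\n' "$(date +%s)" > /tmp/probe.msg

# Time the SMTP submission: time_connect approximates the wait for the
# initial banner, time_total covers the whole transaction.
curl -sS -o /dev/null \
     -w "smtp connect=%{time_connect}s total=%{time_total}s\n" \
     --url "smtp://$MTA" --mail-from "$PROBE_ACCT" --mail-rcpt "$PROBE_ACCT" \
     -T /tmp/probe.msg

# Time retrieval of message 1 from the probe mailbox via POP3.
curl -sS -o /dev/null -u "$PROBE_ACCT:$PROBE_PASS" \
     -w "pop3 connect=%{time_connect}s total=%{time_total}s\n" \
     "pop3://$POP/1"

The provisioning check in item 3 can be driven in a similar way, for example by changing the test account's password with the zmprov setPassword (sp) command and then confirming that the new credentials authenticate against each proxy.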

Additional Automated System Monitoring

Sections 4.1 and 4.2 above list a minimum set of operational monitoring factors and processes. These allow problems to be identified quickly and allow the system to be monitored to ensure that a good level of service is being delivered to end customers.

However, most of the above actions monitor the system reactively. To employ Best Practices, the system should also be monitored proactively in order to identify, and possibly mitigate, issues that may occur. Thus, scripts and/or probes should be created to monitor the factors outlined below.

Some of the metrics below are more useful than others depending on each individual site, and not all sites need to monitor all factors. This list is not definitive because, as stated in Section 3.2, other factors may be identified to be equally or more important depending on the requirements of a specific site.

It is recommended that the monitoring results be processed into a visual/graphical format. The graphs and/or tables should contain either an optimal or expected plot, so that it is immediately clear whether the current values fall within normal boundaries.

Trend analysis is also recommended for most of these factors. This information can be used to create the User and Machine profiles for the site. For this reason, some of this information can be obtained from the historic values in the server .log and .stat files.

The ‘Recommended Period between checks’ is shown in the tables below. This is a very subjective area and is included as a guideline. It should be tailored to the components of each individual environment.

The headings used in this column are meant to provide a frame of reference as to how often it is useful to interactively check the value of each statistic on a running system.

“mins” Check every few minutes as this directly monitors the service being offered to customers.

“hour” Check this once or several times an hour. This metric is useful for understanding the system usage and for capacity planning.

“day” It is sufficient on most systems to check this once a day.

General Linux Factors

The following indicators should be monitored on each machine in the system.

Metric | Statistic Source or Monitoring Tool | Recommended Period between Checks | Site Interval
Free memory | Linux kernel | Hours | Every ___ hours
Used memory | Linux kernel | Hours | Every ___ hours
Amount of swapping | Linux kernel | Minutes | Every ___ minutes
Amount of paging | Linux kernel | Minutes | Every ___ minutes
Total CPU usage | Linux kernel | Minutes | Every ___ minutes
Amount of CPU used for tasks | Linux kernel | Minutes | Every ___ minutes
Network link availability | External probe | Minutes | Every ___ minutes
Network packet turn-around times | External probe | Minutes | Every ___ minutes
Network errors | Linux kernel | Hours | Every ___ hours
Network traffic rates | Linux kernel | Hours | Every ___ hours
Free file system space | Linux kernel | Days | Every ___ days
Available inodes in a file system | Linux kernel | Days | Every ___ days
Disk usage | Linux kernel | Minutes | Every ___ minutes
Disk errors | Linux kernel | Minutes | Every ___ minutes
Hardware errors | Linux kernel | Days | Every ___ days
Defunct processes | Linux kernel | Hours | Every ___ hours
Syslog and messages | Linux kernel | Minutes | Every ___ minutes
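
Most of these metrics can be sampled with standard Linux tools; a minimal sketch (package names such as sysstat, the syslog path and the <peer-host> placeholder vary by site and distribution):

free -m                                        # free and used memory, swap totals
vmstat 5 3                                     # swapping, paging and CPU usage over three 5-second samples
df -h && df -i                                 # free file system space and available inodes
iostat -dx 5 2                                 # disk usage and utilisation (sysstat package)
netstat -i                                     # per-interface traffic and error counters
ping -c 3 <peer-host>                          # network link availability and packet turn-around time
ps -eo stat,pid,ppid,comm | awk '$1 ~ /^Z/'    # defunct (zombie) processes
tail -n 50 /var/log/messages                   # recent syslog entries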

General System Services

The following items should be monitored for the whole system.

Metric | Statistic Source or Monitoring Tool | Recommended Period between Checks | Site Interval
Mail Relay Roundtrip Time | External probe | Minutes | Every ___ minutes
DNS Availability | External probe | Minutes | Every ___ minutes
DNS Query Time | External probe | Minutes | Every ___ minutes
Intra system network availability | External probe | Minutes | Every ___ minutes
Intra site network availability | External probe | Minutes | Every ___ minutes
Configuration update errors | Log file | Hours | Every ___ hours
Data Center Environment (Temp) | External probe | Days | Every ___ days
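
DNS availability and query time can be probed with dig; a small sketch (the name server ns1.example.com and the record looked up are placeholders):

dig @ns1.example.com mail.example.com A +tries=1 +time=2 | grep 'Query time'
# A non-zero exit status or a missing "Query time" line indicates a DNS
# availability problem; the reported value (in msec) is the query time.
# Mail relay round-trip time can be measured with the SMTP/POP probe shown
# under "Minimum Automated System monitoring", pointed at the relay host.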

MTA Services

The following lists the areas that should be monitored on all machines that run MTA Servers.

Metric | Statistic Source or Monitoring Tool | Recommended Period between Checks | Site Interval
MTA Availability | External probe | Minutes | Every ___ minutes
Connect Response Time | External probe | Minutes | Every ___ minutes
SMTP Transmission Time | External probe | Minutes | Every ___ minutes
Active Connection Count | Kernel - netstat | Minutes | Every ___ minutes
Cumulative Connection | Stat file | Hours | Every ___ hours
Deferred Message Count | Log file | Hours | Every ___ hours
Processed Message Count | Stat file | Hours | Every ___ hours
Stored Message Count | File system | Hours | Every ___ hours
Received Volume | Stat file | Days | Every ___ days
Transmitted Volume | Log file | Days | Every ___ days
Delivered Message Count | Stat file | Hours | Every ___ hours
Received Message Count | Stat file | Hours | Every ___ hours
Amount of Memory Used | Stat file | Hours | Every ___ hours
Server Start Time | Log file | Hours | Every ___ hours
Bounced Message Count | Stat file | Hours | Every ___ hours
Thread Usage | Log file | Hours | Every ___ hours
Server Log | Log file | Minutes | Every ___ minutes
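
The externally probed MTA metrics can be approximated with a small banner-timing check; a sketch in bash (mta.example.com is a placeholder, and bc is assumed to be installed):

MTA=mta.example.com
START=$(date +%s.%N)
exec 3<>"/dev/tcp/$MTA/25"        # bash built-in TCP connection
read -r BANNER <&3                # wait for the initial 220 banner
END=$(date +%s.%N)
printf 'QUIT\r\n' >&3
exec 3<&- 3>&-
echo "banner: $BANNER"
echo "banner wait: $(echo "$END - $START" | bc) seconds"
# Active SMTP connection count on the MTA itself:
netstat -ant | awk '$4 ~ /:25$/ && $6 == "ESTABLISHED"' | wc -l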

POP Services

The following lists the areas that should be monitored on all machines that run Proxy Servers and Mail Box Servers.

Metric | Statistic Source or Monitoring Tool | Recommended Period between Checks | Site Interval
POP Availability | External probe | Minutes | Every ___ minutes
Connect Time | External probe | Minutes | Every ___ minutes
Authenticate Time | External probe | Minutes | Every ___ minutes
Query Time | External probe | Minutes | Every ___ minutes
Download Time | External probe | Minutes | Every ___ minutes
Active Connection Count | Kernel - netstat | Hours | Every ___ hours
Total Connection Count | Stat/log file | Hours | Every ___ hours
Failed Connection Count | Stat/log file | Hours | Every ___ hours
Rejected Connection Count | Stat/log file | Days | Every ___ days
Timed-out Connection Count | Stat/log file | Days | Every ___ days
Retrieved Message Count | Stat/log file | Hours | Every ___ hours
Failed Retrieved Msg Count | Stat/log file | Hours | Every ___ hours
Average Message Age | Stat/log file | Days | Every ___ days
Server Start Time | Log file | Hours | Every ___ hours
Amount of Memory Used | ps command | Hours | Every ___ hours
Server Log | Log file | Minutes | Every ___ minutes
Null Connection Count | Log file | Days | Every ___ days
% Unique Connection Count | Log file | Hours | Every ___ hours
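
The probe-based POP metrics can be gathered with curl, which reports connection and total transfer times; a sketch using hypothetical placeholders (pop.example.com, the probe account and its password):

POP=pop.example.com
PROBE_ACCT=probe@example.com
PROBE_PASS=secret
# Connect, authenticate and list the mailbox (query time).
curl -sS -o /dev/null -u "$PROBE_ACCT:$PROBE_PASS" \
     -w "pop3 list: connect=%{time_connect}s total=%{time_total}s\n" \
     "pop3://$POP/"
# Download message 1 (retrieval time).
curl -sS -o /dev/null -u "$PROBE_ACCT:$PROBE_PASS" \
     -w "pop3 retr: total=%{time_total}s\n" \
     "pop3://$POP/1"
# Active POP connections on the server itself:
netstat -ant | awk '$4 ~ /:110$/ && $6 == "ESTABLISHED"' | wc -l

The same approach can cover the externally probed IMAP metrics in the next table by substituting imap:// URLs (for example imap://imap.example.com/INBOX).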

IMAP Services

The following lists the areas that should be monitored on all machines that run Proxy and Mail Servers.

Metric | Statistic Source or Monitoring Tool | Recommended Period between Checks | Site Interval
IMAP Availability | External probe | Minutes | Every ___ minutes
Connect Time | External probe | Minutes | Every ___ minutes
Authenticate Time | External probe | Minutes | Every ___ minutes
Query/Select Time | External probe | Minutes | Every ___ minutes
Download Time | External probe | Minutes | Every ___ minutes
Logout Time | External probe | Minutes | Every ___ minutes
Active Connection Count | Kernel | Hours | Every ___ hours
Total Connection Count | Stat/log file | Hours | Every ___ hours
Failed Connection Count | Stat/log file | Hours | Every ___ hours
Rejected Connection Count | Stat/log file | Hours | Every ___ hours
Timed-out Connection Count | Stat/log file | Hours | Every ___ hours
Fetch count | Log file | Days | Every ___ days
Delete count | Log file | Days | Every ___ days
Search count | Log file | Days | Every ___ days
Server Start Time | Log file | Hours | Every ___ hours
Amount of Memory Used | ps command | Hours | Every ___ hours
Server Log | Log file | Minutes | Every ___ minutes
Null Connection Count | Log file | Days | Every ___ days
% Unique Connection Count | Log file | Hours | Every ___ hours

WebMail Services

The following lists the areas that should be monitored on all machines that run Proxy and Mail Servers.

Metric | Statistic Source or Monitoring Tool | Recommended Period between Checks | Site Interval
WebMail Availability | External probe | Minutes | Every ___ minutes
Connect Time | External probe | Minutes | Every ___ minutes
Authenticate Time | External probe | Minutes | Every ___ minutes
Select Time | External probe | Minutes | Every ___ minutes
Download Time | External probe | Minutes | Every ___ minutes
Compose and Send Time | External probe | Minutes | Every ___ minutes
Logout Time | External probe | Minutes | Every ___ minutes
Server Start Time | Log file | Hours | Every ___ hours
Amount of Memory Used | ps command | Hours | Every ___ hours
Server Log | Log file | Minutes | Every ___ minutes
Null Connection Count | Log file | Days | Every ___ days
% Unique Connection Count | Log file | Hours | Every ___ hours
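
Basic WebMail availability and response-time figures can be gathered with curl against the login page; a sketch (the URL is a placeholder, and -k is only appropriate if the probe should ignore certificate validation):

URL=https://mail.example.com/
curl -sSk -o /dev/null \
     -w "http=%{http_code} connect=%{time_connect}s first-byte=%{time_starttransfer}s total=%{time_total}s\n" \
     "$URL"

A full login/compose/logout check requires scripting the web client's login flow (or the SOAP interface) and performing basic checks on the returned pages, as described under Minimum Automated System monitoring.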

MBS Services

The following lists the areas that should be monitored on all machines that run Mail Servers.

Metric | Statistic Source or Monitoring Tool | Recommended Period between Checks | Site Interval
MBS Availability | ZCS Utility | Minutes | Every ___ minutes
Received Message Count | Stat/log file | Minutes | Every ___ minutes
Received Message Volume | Stat/log file | Minutes | Every ___ minutes
Stored Message Count | Stat/log file | Minutes | Every ___ minutes
Stored Message Volume | Stat/log file | Minutes | Every ___ minutes
MBS Start-up Time | Log file | Minutes | Every ___ minutes
Amount of Memory Used | ps command | Minutes | Every ___ minutes
MySQL table/index sizes | mysql | Hours | Every ___ hours
MySQL index check | mysql | Hours | Every ___ hours
MySQL Alert Log | mysql/log | Minutes | Every ___ minutes
Server Log | Log file | Minutes | Every ___ minutes
% Unique Connections (POP-HTTP-IMAP) | Log file | Hours | Every ___ hours
Average Mailbox Size in MB | Script | Days | Every ___ days
Average Message Size | ZCS Tool | Days | Every ___ days
Messages per MBS | Script | Days | Every ___ days
Histogram Message Size | ZCS Tool | Days | Every ___ days
Histogram Msg per Mailbox | Script/ZCS Tool | Days | Every ___ days
Histogram Mailbox Last Login | Script/ZCS Tool | Days | Every ___ days
Java garbage collection | Zmstat/ZCS Tool - jstat | Hours | Every ___ hours
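
Several of the MBS figures can be gathered with standard ZCS commands run as the zimbra user; a sketch (the awk column used for quota usage is an assumption and should be verified against the site's ZCS release):

zmcontrol status                      # service availability on this MBS
MBS=$(zmhostname)
# Rough average mailbox size from the per-account quota usage report.
zmprov gqu "$MBS" | awk '{used += $3; n++} END {if (n) printf "average mailbox size: %.1f MB across %d accounts\n", used/n/1048576, n}'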

Directory Services

Both the replica and the master should be monitored.

Metric | Statistic Source or Monitoring Tool | Recommended Period between Checks | Site Interval
Total Processed Operation Count | Stat/log file | Minutes | Every ___ minutes
Failed Operation Count | Stat/log file | Minutes | Every ___ minutes
Successful Operation Count | Stat/log file | Minutes | Every ___ minutes
Number of changes committed | Stat/log file | Minutes | Every ___ minutes
Number of changes failed | Stat/log file | Minutes | Every ___ minutes
Unknown User Queries Count | Stat/log file | Hours | Every ___ hours
Bad Password Queries Count | Stat/log file | Hours | Every ___ hours
Read through Failed | Stat/log file | Hours | Every ___ hours
Cached Item locks | Stat/log file | Hours | Every ___ hours
Updates processed | Stat/log file | Hours | Every ___ hours
Table/index sizes | LDAP | Days | Every ___ days
Synchronization | LDAP | Days | Every ___ days
Alert Log | LDAP | Minutes | Every ___ minutes
Server Log | Log file | Minutes | Every ___ minutes
Last Change Number | Log file | Hours | Every ___ hours
LDAP Availability | External probe | Minutes | Every ___ minutes
LDAP Connect Time | External probe | Minutes | Every ___ minutes
LDAP Authenticate Time | External probe | Minutes | Every ___ minutes
LDAP Query Time | External probe | Minutes | Every ___ minutes
Server Start Time | Log file | Hours | Every ___ hours
Amount of Memory Used | ps command | Hours | Every ___ hours
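
LDAP availability and query time can be probed with an anonymous root-DSE search; a minimal sketch (host and port are placeholders, and bc is assumed to be installed):

LDAP_URL=ldap://ldap.example.com:389
START=$(date +%s.%N)
ldapsearch -x -LLL -H "$LDAP_URL" -s base -b "" namingContexts > /dev/null
RC=$?
END=$(date +%s.%N)
echo "ldap exit=$RC query time=$(echo "$END - $START" | bc) seconds"
# A non-zero exit status indicates an availability problem.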

MTA Queue Services

The following lists the areas that should be monitored on all machines that run MTA Servers.

Metric | Statistic Source or Monitoring Tool | Recommended Period between Checks | Site Interval
Queue Disk utilization | df -k | Minutes | Every ___ minutes
Deferred (Outbound) Queue Size | fastls/ls commands | Minutes | Every ___ minutes
Error Queue Size | fastls/ls commands | Minutes | Every ___ minutes
No. Sidelined Messages | fastls/ls commands | Minutes | Every ___ minutes
Message (Inbound) Queue Size | fastls/ls commands | Minutes | Every ___ minutes
No. Domains in Queue | fastls/ls commands | Minutes | Every ___ minutes
Largest Outbound Queues | fastls/ls commands | Minutes | Every ___ minutes
Server Log | Log file | Minutes | Every ___ minutes
Accumulated Connection | Stat file | Hours | Every ___ hours
Server Start Time | Log file | Hours | Every ___ hours
Amount of Memory Used | ps command | Hours | Every ___ hours
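
On Postfix-based ZCS MTAs the queue figures can be sampled from the command line; a sketch run as the zimbra user (the file-system path shown is an assumption and should be adjusted for the site):

df -k /opt/zimbra                         # queue disk utilisation
postqueue -p | tail -n 1                  # summary line: "-- N Kbytes in M Requests."
postqueue -p | grep -c '^[0-9A-F]'        # rough count of queued messages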

Checking message queues and destinations

The queues in the system should be checked to ensure that there is no unusual buildup or backlogs of email. This will also identify if there have been local delivery issues, as messages will be queued rather than delivered to the Mail Server.

Large queues, and queue saturation as a result of spam, viruses or worms, are the number one operational support issue facing messaging systems. Automated probes as outlined above should alert support staff when thresholds are reached. Rapid file system growth is usually an indication of message delivery problems.

If the queues are larger than expected, or contain domains that are not expected to have large numbers of queued messages, this should be investigated. Depending upon the circumstances, the investigation can be carried out in various ways; however, the following steps are recommended starting points.

  1. Report this issue in the internal issue reporting system – so that it can be tracked and known about by the other operators.
  2. Use commands on the server that capture the unusual output to a file. The output should include a histogram of when the messages were received by the system and the reason for the messages being deferred. For Postfix MTA installations this information can be found in the admin console.
  3. Check to see if any other issues such as NIO errors etc. are being reported.
  4. Another method is to search zimbra.log or mailbox.log for reports of messages being deferred; these can be investigated to help identify the reason for the deferrals.
  5. If the reason for the messages being deferred appears to be that the peer system is not accepting messages, then the following commands can be used to test that the remote domain is accepting mail.

Ensure that from the MTA machines the domain can be telnet’ed to on port 25 via the following command:

telnet domain 25

The response to this should be a 220 status code followed by an SMTP banner. If this occurs, an SMTP session can be performed to ensure that the remote domain will accept messages from this site's domain; in other words, the following commands can be sent. The response lines shown (250, 354 and 221) should be returned by the peer in normal operation:

helo example.com
250 text
mail from:<postmaster@example.com>
250 text
rcpt to:<postmaster@domain.com>
250 text
data
354 text
This is a test please ignore this message...
.
250 text
quit
221 text

Other tools and methods can be used to investigate this issue. However, the key is that abnormal activity is identified and an investigation is started as soon as possible.
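
For example, on Postfix-based MTAs a per-domain age histogram of the deferred queue can be produced with qshape, if it is installed (a sketch; the output columns are message counts per age bucket):

qshape deferred | head -n 20

Domains with unusually large or old buckets are candidates for the telnet test described above.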

Note: The MTA or ZCS system will normally clear its queues itself and little operational involvement is usually required. However, should an issue be encountered with a major ISP domain, or should it persist for a long time, other action may be required. In these cases, the normal Zimbra support escalation procedures should be used.

Per shift activities

The operators of the system should perform the following activities at least once every 8 hours.

Some of these activities may be performed automatically. However, the procedures to perform the operations are detailed below.

Note: Even though some of the operations can be performed automatically the results need to be manually checked.

Scripts can contain detailed automated conditions so that alarms are raised when issues occur. However, it is recommended that the causes for the alarm are reported so that the first action of operators is to verify the reason for the alarm.

All investigation and commands specified in the sections below should be performed as the zimbra user, after ensuring that the zimbra profile has been sourced in the shell.

Checking ‘standard’ accounts

Using either a dedicated email client or other mail-reading software, the Zimbra admin, root and postmaster email accounts should be read daily. This assumes that all root and zimbra email on the machines in the ZCS installation is automatically forwarded to the ZCS or admin account.

The site should have a retention policy to ensure that all emails sent to these accounts can be retained.

The following actions should be taken towards the various emails in the system:

  • Emails that are reporting customer problems or complaints should be forwarded to the correct internal systems
  • Emails reporting an issue on internal systems (backups or cron jobs generating errors, etc.) should be raised in the same manner as operational issues. These should then be investigated.
  • Emails from other site operators reporting potential issues should likewise be followed up and processed as required by the site's rules.
  • Junk and Spam email should be deleted.

Checking Key Performance Indicator (KPI) statistics

The results of tools that check the responsiveness and other statistics of the mail platform should be visually inspected to ensure that no abnormal values are being reported. Depending upon the monitoring tools being used, and the values being monitored, this can take a variety of forms and display a wide range of information.

However, at a minimum the factors detailed in Section 4.1 should be inspected, as these are factors that are directly related to the user experience in the system.

Operators should review these stats and have them displayed in the background so that new issues can quickly be identified. Automated systems can be put in place to generate alarms if abnormal traffic peaks occur, so investigation of new issues can be performed.

The investigation actions of a new issue will depend upon the exact issue. However, the following steps are recommended as a starting point for the investigation:

  1. Report the issue in the internal issue reporting system – so that it can be tracked and known about by the other operators.
  2. Check to see if all the servers are running - See Section 4.4.4 for details.
  3. Check to ensure that no server has restarted - See Section 4.4.5 for details.
  4. Check the file systems on the system where the issue is occurring to ensure that they have free space - See Section 4.4.7 for details.
  5. Check the processes that are running on the system and the resources that they are using. This can be achieved by using the top or sar commands.
  6. Check the log file of the service that is having a problem. The output should be inspected to see if any unusual messages have been reported which could affect the operation of the system.
  7. If the issue is that a larger number of operations have been performed, or that the server is responding more slowly than is acceptable, then an investigation should be made to see if the service is under attack. Further investigation should include checking the number of open connections and looking for repeated sender addresses or source IP addresses.

Checking status of all ZCS servers

The status of all the ZCS servers will need to be checked. At a minimum, the status of the servers should be checked using the tools detailed in this section. This method should be used, if possible, to ensure that all the servers are responding on the correct ports.
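
For example, assuming the checks are run as the zimbra user, the service status on each server can be verified as follows:

$ zmcontrol status
$ zmprov gas

zmcontrol status must be run on each server in turn; zmprov gas (getAllServers) lists the servers to visit.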

Checking that no server has restarted

Because ZCS processes can auto-restart, depending upon the configuration of the system, manual checks should be performed to ensure that no unexpected restarts have occurred.

The following command is an alternative to the command provided in the Administration documentation, and the two should be used in conjunction. It shows the start time of each currently running process.

$ ps -o 'stime comm' -u zimbra
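
As a convenience, recently started processes can also be flagged automatically; a sketch that assumes a procps ps supporting the etimes (elapsed seconds) output column:

ps -o etimes=,pid=,comm= -u zimbra | awk '$1 < 3600 {print "started within the last hour:", $2, $3}'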

Checking Directory Replica Synchronization

Although rare, it is possible for a directory replica to fall out of sync with the master database. This is a concern, as the directory replica will then not receive provisioning changes. One method of checking this is to compare the change sequence number (contextCSN) of the master database and all the replicas. The following command returns this value, which includes a timestamp, for each LDAP server queried:

$ ldapsearch -x -LLL -H ldap://<server>:<port> -s base contextCSN

Significant out-of-sync indications usually occur when change logs have expired on the Master Directory Server before a Replica Server has applied them locally. This typically happens because the update thread on the Directory Replica Server has fallen behind due to load, or because the Replica Server has been down for longer than the Master Directory Server's sync expiration interval.
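
A small wrapper can compare the value across servers; a sketch with placeholder host names (on some sites a bind DN, password and base DN must be added to the ldapsearch call):

SERVERS="ldap-master.example.com ldap-replica1.example.com ldap-replica2.example.com"
for s in $SERVERS; do
    csn=$(ldapsearch -x -LLL -H "ldap://$s:389" -s base contextCSN 2>/dev/null \
          | awk '/^contextCSN:/ {print $2}')
    echo "$s $csn"
done
# All servers should report (nearly) the same contextCSN; a replica that lags
# persistently, or returns nothing, should be investigated.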

Checking the log messages which are being generated

Once per shift, one MTA, one Proxy, one Directory replica and one MBS should be checked to ensure that no unknown and/or unusual error messages have been generated. This check should be rotated around the various servers in the system over time. (See the Zimbra Admin Guide, Monitoring Zimbra Servers.)

To look for mail delivery problems:

$ cat mailbox.log | grep -i "exception occurred"

To look for events that happened during the progress of an activity:

$ cat mailbox.log | grep -i "handler exception"

The reported errors should be checked to ensure that all the log messages are expected and occur in ratios that are understood to be acceptable for the system. For example, the ratio of unknown addresses to valid addresses should be known, so it can be compared against the values shown in the current log file.

Unusual log messages or number of messages should be used as a starting point for investigation. The reported log message and the documentation about that error will point to the place to start the investigation.

Checking Free Disk Space

The zmstat utility creates log entries in the $ZIMBRA/zmstat directory. Various servers also create files in directories around the system, and the MySQL instance that ZCS uses also creates files in various other file systems.

In most systems there are automated scripts that copy (as required) and then clean temporary file systems.

However, for various reasons the file systems on the servers can start to fill up. Thus, it is recommended that the file systems on the servers be monitored to ensure that they are not growing without a known cause. Tools like zmstat perform checks on the ZCS-related file systems, but general checks should be performed to ensure that no file system fills up.

If a file system's usage changes dramatically in a short period of time, the cause should be investigated. The following command lists the files under the current directory that have been modified within the last day:

$ find . -mtime -1 -print

Also, the command zmlocalconfig can be used to set thresholds for warning and critical alerts.
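
A simple stand-alone free-space alert can supplement zmstat; a sketch with placeholder threshold and recipient (it assumes a working local mail command):

THRESHOLD=85
ALERT_TO=ops@example.com
df -kP | awk -v t="$THRESHOLD" 'NR > 1 {gsub("%", "", $5); if ($5 + 0 >= t) print $6, $5}' |
while read -r fs pct; do
    echo "File system $fs is ${pct}% full" | mail -s "Disk space warning on $(hostname)" "$ALERT_TO"
done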

Per Day Activities

These checks should be performed at least once a day to ensure that no issues have been encountered and to ensure that the systems are running as expected.

Analysis of the daily KPI factors

The statistics that are generated from the tools that check the factors listed in Section 4.3 should be checked at least once a day. These should be visually inspected to ensure that no abnormal values were generated. Depending upon the monitoring tools being used, and the values being monitored and checked, this can take a variety of forms and display a wide range of information.

If possible the values should be compared with the expected values and if discrepancies occur, then these should be investigated to find the cause. This allows the expected values to be modified and also allows the operations team to understand any changes in user behavior. This information is also useful for trend and peak capacity planning.

At a minimum checks should be made on the machine CPU usage, disk IO, user access and message flows for the last day. Compare those with previous days to ensure that they follow the normal pattern. ZCS commands like the ones below can be used to generate a basic level of information. However, further analysis of the server log files can generate more specific information.
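
For example, with the sysstat package installed and the standard zmstat collectors running, yesterday's CPU, IO and ZCS statistics can be pulled as follows (the sar file naming and the zmstat directory layout may differ between distributions and ZCS versions):

sar -u -f /var/log/sa/sa$(date -d yesterday +%d)       # CPU usage for the previous day
sar -b -f /var/log/sa/sa$(date -d yesterday +%d)       # IO transfer rates for the previous day
ls /opt/zimbra/zmstat/$(date -d yesterday +%Y-%m-%d)/  # per-day ZCS stat CSV files (chart with zmstat-chart)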

Checking the Backup Process

A key component of a reliable system is the state of the backups. The backup system should be checked to ensure that the backup process has completed successfully and has stored the data in its correct location.

Note: Global notification emails that report success are not recommended as they are easily ignored. However, notifications should be sent to users who are responsible for the backup and recovery. They have a vested interest in ensuring that good backups are performed, as they may have to use them.

Per Week Activities

The checks below should be performed once a week to ensure that no issues have been encountered. They may be performed more often if required.

The actions in this section should be performed in off peak times, as some of the commands use extra system resources and thus may slightly affect system performance.

Check MySQL and File System Sizing

Within any database environment, over time the database may grow or become fragmented. Either of these situations can cause performance and/or service issues.

1. The zmdbintegrityreport should be run weekly, or manually at a given time (depending on the site requirements), on the MBS (Mailbox Server) system. This will report any corruption issues.

$ zmdbintegrityreport

2. Below are the main items you can use to monitor your MySQL database performance:

  • mysqladmin extended (absolute values)
  • mysqladmin extended -i10 -r (relative values)
  • mysqladmin processlist
  • mysql -e "show innodb status"
  • MySQL error log
  • InnoDB tablespace info
  • mysqladmin extended (absolute values)
    The most common values to monitor are:
    • Slave_running: If the system is a slave replication server, this is an indication of the slave's health.
    • Threads_connected: This shows the number of clients currently connected. This should be less than some preset value (like 200), but you can also monitor that it is larger than some value to ensure that clients are active.
    • Threads_running: If the database is overloaded, the number of queries running increases. This should also be less than some preset value (for example, 20). It is OK for the value to exceed the limit for very short periods; it is more useful to alert only when Threads_running stays above the preset value and does not fall back within a few seconds (for example, 5 seconds).
  • mysqladmin extended (counters)
    The idea is to store the performance counter values and compute the difference with newly sampled values; the interval between recordings should be more than 10 seconds (a sketch of this sampling approach follows this list). The following values are good candidates for checking:
    • Aborted_clients: The number of clients that were aborted (because they did not properly close the connection to the MySQL server). For some applications this can be OK, but for other applications you might want to track the value, as aborted connects may indicate some sort of application failure.
    • Questions: The total number of queries the server has received. Note that this is a cumulative total, not a per-second rate; to get queries per second, divide Questions by Uptime or compute the difference between successive samples.
    • Handler_*: If you want to monitor low-level database load, these are good values to track. If the value of Handler_read_rnd_next is abnormal relative to the value that you normally would expect, it may indicate some optimization or index problems. Handler_rollback will show the number of queries that have been rolled back. You might want to investigate them.
    • Opened_tables: Number of table cache misses. If the value is large, you probably need to increase table_cache. Typically you would want this to be less than 1 or 2 opened tables per second.
    • Select_full_join: Joins performed without keys. This should be zero. This is a good way to catch development errors, as just a few such queries can decrease the system's performance.
    • Select_scan: Number of queries that performed a full table scan. In some cases these are OK, but their ratio to all queries should be constant. If this value is growing, it could be a problem with the optimizer, lack of indexes or some other problem.
    • Slow_queries: Number of queries longer than --long-query-time or that are not using indexes. These should be a small fraction of all queries. If it grows, the system will have performance problems.
    • Threads_created: This should be low. Higher values may mean that you need to increase the value of thread_cache or that the number of connections is increasing, which also indicates a potential problem.
  • mysqladmin processlist or "SHOW FULL PROCESSLIST" command
    You can get the number of threads that are connected and running from other statistics, but this is a good way to check how long running queries are taking. If there are some very long-running queries (e.g. because they are badly formulated), the admin should be informed. You might also want to check how many queries are in the "Locked" state: these are not counted as running but are inactive, i.e. a client is waiting on the database to respond.
  • mysql -e "SHOW INNODB STATUS"
    This statement produces a great deal of information, from which you can extract the parts that interest you. The first thing to check is the line "Per second averages calculated from the last xx seconds", which shows the window over which the per-second figures were averaged; note that InnoDB rounds these statistics each minute.
    • Pending normal aio reads: These are InnoDB IO request queue sizes. If they are bigger than 10-20, you might have an IO bottleneck.
    • reads/s, avg bytes/read, writes/s, fsyncs/s: These are IO statistics. Large values for reads/writes will mean the IO subsystem is being loaded. Proper values for these depend on your system configuration.
    • Buffer pool hit rate: The hit rate also depends on your application. Check your hit rate when there are problems.
    • inserts/s, updates/s, deletes/s, reads/s: These are the low-level row operations that InnoDB performs. Use them to check whether your load is in the expected range.
  • MySQL error log
    Nothing should be written to the error log after the server has completed its initialization sequence, so everything appearing in the log should be brought to the admin’s attention immediately.
  • InnoDB tablespace info
    With InnoDB the only danger is that the tablespace becomes full - the logs cannot get full. The best way to check this is with the following statement: SHOW TABLE STATUS
    Any InnoDB table can be used for monitoring the free space remaining in the InnoDB tablespace.
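
The counter-sampling approach mentioned above can be implemented very simply; a sketch (connection options such as socket, user and password usually need to be added for the ZCS MySQL instance, and the counters chosen here are only examples):

INTERVAL=60
snap() {
    mysqladmin extended-status |
    awk -F'|' '/Questions|Slow_queries|Threads_created|Select_full_join/ {
        gsub(/ /, "", $2); gsub(/ /, "", $3); print $2, $3 }'
}
snap > /tmp/mysql.before
sleep "$INTERVAL"
snap > /tmp/mysql.after
join /tmp/mysql.before /tmp/mysql.after |
awk -v i="$INTERVAL" '{printf "%-20s %10.2f per second\n", $1, ($3 - $2) / i}'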

Ensure cleanup process is working

It is recommended that, at least once a week, all the machines are checked to ensure that application log files and redolog archives have been trimmed as site policy dictates. This check should also include the syslog, to ensure that no system-level issues have been encountered.

Other Activities

The following tasks should also be performed at regular intervals. These tasks do affect users, so their impact must be understood and communicated to the end users.

Mailbox and Message Aging

Depending upon the space needs and the system policies, old messages and mailboxes on the MBS system should be removed periodically.

Refer to the ZCS documentation for details of how to accomplish this, and if questions persist, they should be addressed to Zimbra Technical Product Support.

Perform Cleanup of the Spam and Trash Folder

A common occurrence is that ZCS users delete messages or move them into the Trash folder but never actually purge them. This means that these deleted messages can be retained on the Mailbox Server for long periods, using up the customer's mail quota without the user being aware of it.

Note: The email retention policy for the spam and trash folders is set in the administration console. However, purging messages is done one mailbox at a time; that is, the purge thread runs continuously, purging messages that have exceeded the retention lifetime on a per-mailbox basis. If there are a significant number of mailboxes per server, make sure the email retention lifetime is set long enough that there is time to run through all the mailboxes before the first mailbox's threshold is reached again.
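
For reference, the relevant retention and purge-interval settings can also be inspected from the command line as the zimbra user; the attribute names below are believed to apply to the ZCS releases covered by this article, but should be verified against the site's version:

$ zmprov gc default zimbraMailTrashLifetime zimbraMailSpamLifetime
$ zmprov gs $(zmhostname) zimbraMailPurgeSleepInterval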


Verified Against: ZCS 8.0, 7.0 and 6.0
Date Created: 3/24/2010
Date Modified: 2015-08-03
Article: https://wiki.zimbra.com/index.php?title=ZCS_Operational_Best_Practices_-_Monitoring_and_Operational_Actions


