ZCS Operational Best Practices - Management Practices
Startup and Shutdown
It is important that startup and shutdown procedures are well documented and tailored to specific site environment standards. The Operations Manual should clearly define where the scripts reside, the normal process flow, and how to manipulate the components manually, if needed.
Equally important is the process and documentation around the database and any supporting processes. Startup, shutdown, and system dependencies should be clearly documented, outlining normal and manual process flow.
ZCS Key Performance Indicators
The recommended method of achieving a stable system is to identify and monitor a set of Key Performance Indicators (KPIs) and associated target values. Policies and procedures are then created to achieve these target levels. This involves first identifying, and then monitoring with alarms, the factors that impact these targets. Until the targets are defined and monitoring is in place, quality of service cannot be assured.
The aim is to set achievable but challenging goals. When stated goals are consistently met, it is appropriate to raise the target values. Some key goals may be missed in the first iteration; however, as time progresses these goals will be tuned and managed to the specific needs of the site.
Periodically, and when issues occur, these KPIs are refined to focus on the requirements of the site. The same is true of the monitoring policies, which can also be refined to meet specific site requirements.
This is a process of continuous improvement; consequently, there is no such thing as a set of 'perfect operational procedures'. Operational policies are continuously improved and refined.
The remainder of this document details the most common improvement areas in a ZCS installation.
Areas to Target
The following areas are critical for most sites and require monitoring.
- Core ZCS accessibility
- MBS (Mailbox Server)
- MTA accessibility
- Proxy accessibility
- MTA response times for the initial banner and also for sending a message
- POP message retrieval time
- MTA round trip timings
- HTTP response times
- HTTP load times
- IMAP response and sync times
- Directory / Authentication services
In addition, there are items that can cause a failure and create issues in one of the areas listed above. These should also be monitored.
- Swap/memory usage
- CPU usage
- Network usage/errors
- TCP connection queues
- File system free space
- Available file system inodes
- Disk usage on all servers
- MySQL table/index sizes and growth
- MySQL index fragmentation (bloat)
- Disk I/O levels
- Number and destinations of messages queued
- Number and destinations of messages processed by the MTAs
- Regular and correct backups and cleanup of temporary files
Best Practices recommend that all of the above be monitored and alarms be set at reasonable levels appropriate for the site. Document Section 4, Monitoring and Operational Actions, provides a recommended set of values and intervals to be monitored. Typically, the operations team at each site will have its own view on selecting the factors to be monitored and the monitoring frequency. Each operations team may also elect to add additional factors not listed herein. Such site-specific elections should be made with due reference to the recommendations contained in this document.
Each operations team will also determine site-specific definitions of desired monitoring target levels. Best Practice suggests that, in normal operation, no resource should be over 75% used.
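As an illustrative sketch of the 75% guideline, the following checks local filesystem usage against that target. The threshold value and the output location are site assumptions, not ZCS defaults:

```shell
#!/bin/sh
# Flag any locally mounted filesystem whose usage exceeds a target level.
# The 75% threshold is the Best Practice target discussed above; adjust per site.
THRESHOLD=75

df -P -l | awk -v limit="$THRESHOLD" 'NR > 1 {
    used = $5            # e.g. "42%"
    sub(/%/, "", used)   # strip the percent sign
    if (used + 0 > limit)
        printf "WARNING: %s is %s%% full (mounted on %s)\n", $1, used, $6
}' > /tmp/fs_usage_warnings.txt

# In production this output would be fed to the alerting system
# (email, pager, Nagios passive check, etc.).
cat /tmp/fs_usage_warnings.txt
```

A check of this shape is easy to schedule from cron and to wire into whichever alarm channel the site already uses.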
Sources of Information and Alarms
There are a number of commercial and open source solutions that can be used to effectively monitor the integrity of host environments. For example, HP OpenView, Tivoli, Nagios, and BMC Patrol can be used to monitor a ZCS installation and provide information using SNMP. These applications are very effective in providing alert and threshold notifications. By default, these applications report on core-level services of the operating-system and database environment; unless tailored or customized, they do not provide insight into application-level reporting. Within a ZCS environment, these monitors (if available) are used in conjunction with ZCS utilities and/or OS-provided commands to compile and monitor all the components of the environment. Best Practices dictate the use of a combination of all of these sources to acquire the operational statistics.
ZCS zmstat Utility
zmstat provides time-specific server monitoring. It checks the overall health of the ZCS system and reports status to the zimbrastats.csv log file, discovering events of a user-definable severity level. By customizing it to optionally send an email message, a page, or both to a list of specified addresses, zmstat can support near real-time monitoring in a 24x7 environment by monitoring the library of .csv files found in /opt/zimbra/zmstat.
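As a sketch of consuming these .csv files, the following extracts one metric column and alarms on a threshold. The two-column layout and the 'loadavg1' header used here are simplified assumptions for illustration; the real files under /opt/zimbra/zmstat contain many more columns:

```shell
#!/bin/sh
# Sketch: extract one metric column from a zmstat-style CSV and alarm on a
# threshold. The sample data below is fabricated; real zimbrastats.csv files
# have a wider layout.
cat > /tmp/sample-stats.csv <<'EOF'
timestamp,loadavg1
01/01/2010 00:00:00,0.42
01/01/2010 00:00:30,3.10
01/01/2010 00:01:00,0.65
EOF

# Report every sample whose load average exceeds 2.0
awk -F, 'NR > 1 && $2 + 0 > 2.0 { print "ALERT:", $1, "load =", $2 }' \
    /tmp/sample-stats.csv
```

In practice the awk filter would run over the current day's file and feed its output to the paging/email mechanism described above.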
OS / Linux Tools
The zmstat monitoring tool provides monitoring and alarms for some of the factors listed in Section 3.2, but it does not check all of them. Monitoring agents or tools such as vmstat and pmstat, along with other OS performance-monitoring tools, are needed to provide information about how the Linux machines are running. Linux commands and utilities can also be used to check the hardware and network.
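For example, swap utilization (one of the factors in Section 3.2) can be derived directly from /proc/meminfo on Linux; the 75% alarm level below follows the general Best Practice target discussed earlier:

```shell
#!/bin/sh
# Sketch: compute swap utilization from /proc/meminfo (Linux-specific).
# The 75% alarm level is the general Best Practice target; adjust per site.
swap_pct=$(awk '
    /^SwapTotal:/ { total = $2 }
    /^SwapFree:/  { free  = $2 }
    END {
        if (total > 0)
            printf "%d", (total - free) * 100 / total
        else
            print 0     # no swap configured
    }' /proc/meminfo)

echo "Swap used: ${swap_pct}%"
[ "$swap_pct" -lt 75 ] || echo "ALERT: swap usage above 75%"
```

The same pattern (parse a /proc file, compare against a target, emit an alarm line) applies to most of the OS-level factors listed above.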
ZCS Administrative Commands
The ZCS system provides command line utilities with a powerful command set. This command set allows operations teams to perform tasks such as:
- Starting and stopping ZCS servers/components
- Reporting statistics on server usage
- Retrieving account information
- Modifying account and mailbox (message store) information
- Analyzing and fixing corrupted messages
Administrative commands can be scheduled to run at various times throughout the day, capturing results to obtain trends and extended tracking. These commands are documented in the ZCS Administration Guide Appendix A, Command Line Utilities.
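As an example of such scheduling, a small wrapper script can capture zmcontrol status output for trend tracking. zmcontrol is a real ZCS command, but the directory locations below are hypothetical site choices, not ZCS defaults:

```shell
#!/bin/sh
# Hypothetical wrapper: capture periodic server status snapshots for trend
# tracking. The /tmp/zcs-trends location is a placeholder; a real site would
# choose a managed path.
STATDIR=/tmp/zcs-trends
mkdir -p "$STATDIR"

cat > "$STATDIR/collect-status.sh" <<'EOF'
#!/bin/sh
# Append a timestamped snapshot of ZCS server status to a daily file.
day=$(date +%Y%m%d)
{
    echo "=== $(date) ==="
    su - zimbra -c 'zmcontrol status'
} >> /tmp/zcs-trends/status-$day.log
EOF
chmod +x "$STATDIR/collect-status.sh"
# The script would then be scheduled from cron, e.g. hourly, and the daily
# files compared over time to spot trends.
```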
ZCS local configuration file
The ZCS configuration data offers a set of configuration keys that support a multitude of messaging environments. It is important that the local ZCS configuration is consistent across all hosts in the ZCS environment. This should be managed through Change / Configuration Management procedures, and Best Practices around the parameter formats within the zmlocalconfig database.
The configuration editing utility, zmlocalconfig, resides on all ZCS hosts and can be used to view and/or modify the local ZCS configuration. Every entry in the configuration database is a key/value pair, and a key can be edited with:
zmlocalconfig -e <keyName>=<value>
ZCS Logging
ZCS has a mature logging system. Most of the site's performance, failure, and capacity information can be generated from these logs. The log files ZCS generates are described below.
The log files are generated based on a configurable period of time, after which the logs are rolled over and the server starts a new log. Depending on specific site standards, processes should be created to trim (delete) and/or archive old log files to ensure adequate disk space is maintained.
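A trimming process of this kind can be sketched with find. The retention period and log names below are illustrative, and the sketch runs against a scratch directory rather than the real ZCS log directory:

```shell
#!/bin/sh
# Sketch: delete rolled-over log files older than a retention period.
# Demonstrated on a scratch directory; on a real system the target would be
# the ZCS log directory and the retention period a site standard.
LOGDIR=/tmp/demo-logs
RETENTION_DAYS=30
mkdir -p "$LOGDIR"

touch -d '45 days ago' "$LOGDIR/mailbox.log.2010-01-01"   # simulate an old rolled log
touch "$LOGDIR/mailbox.log"                               # current log, must survive

# Remove only rolled-over logs past retention; never touch the active log.
find "$LOGDIR" -name 'mailbox.log.*' -mtime +"$RETENTION_DAYS" -delete

ls "$LOGDIR"
```

An archive step (e.g. compressing to longer-term storage before deletion) would slot in ahead of the find command where site standards require it.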
Log files are an important tool used to monitor the system. They are also very useful in capturing performance, growth, and diagnostic information, as they allow operations teams to retrace the steps that led to an event or problem. The following logs are available:
- mailbox.log - jetty mail services
- audit.log - authentication
- clamd.log - antivirus db
- convertd.log - attachment conversion
- freshclam.log - clam antivirus updates
- logger_myslow.log - slow logger db queries
- myslow.log - slow db queries
- spamtrain.log - spam/ham training
- sync.log - zimbra mobile
- zimbrastats.csv - server performance statistics
- zmconvertd.log - conversion server monitor
In addition to the ZCS Logging system, Linux system log files also provide significant information about issues that the system may be encountering. These should not be forgotten when monitoring systems.
API and tool generation
ZCS provides C and Perl APIs. With these APIs, it is possible to build programs and scripts to automate many operations tasks.
ZCS uses MySQL as the database for the Message Store. At this time, only Sleepycat (Berkeley DB) is supported for the Master Directory, and it should only be accessed using the utilities supplied with the LDAP server. Most of the standard ZCS commands, such as zmprov, hide this detail behind a simple-to-use CLI.
ZCS documentation is extensive, and in ZCS 6.0 it has been significantly improved and centralized. It is Best Practice for operations teams to review the ZCS manuals, wikis, and support sites for information on the workings of ZCS. Zimbra is always improving and refining its product documentation. Complete published documentation sets are available through the customer portal.
Standard monitoring tasks
The Best Practice model has a series of automated monitoring tools, as well as a list of tasks or activities that should be performed periodically. Most of these manual activities consume little time and can be automated to some extent. However, the data obtained requires human interpretation to effectively identify and mitigate issues that occur. These monitoring tasks can be grouped into several lists as shown below, each to be performed at different intervals of time.
Note: Operations staff should review and modify these lists according to specific site requirements.
Automated systems and tools should be used to monitor every system where possible. The zmstat tool performs a series of periodic checks based upon a cron schedule, and can page and send email reports to operators.
Best Practice recommends the use of this tool in conjunction with the other system management and monitoring tools used on the system. Tools that monitor port availability, such as Nagios, should also be used. There are many Linux operating-system monitoring tools available, and some sites use their own solutions in this area. However, with automated checks there is a risk that the information will not be picked up and/or reviewed by operators, especially if they are provided with too much information. To contain this risk, these systems should work on an alarm basis, as zmstat does; otherwise, important information may be inadvertently missed.
Per shift activities
These activities should be performed at least once a ‘shift’ by operators. Some of these activities may be performed automatically. Standard activities include the following actions:
- Read the email in the Zimbra admin, root, and postmaster email accounts. Take appropriate action and/or further investigate the cause of these messages. Old invalid emails should be cleared.
- Check the MTA queue sizes, number of messages, and the destinations.
  - Other tools can be used to display this information, but regardless of the tool, the key is that abnormal activity is identified and an investigation is started as soon as possible.
- Check the graphs/information on the MTA response times, message flows, etc.
  - This is to ensure the transaction rates since the last check have been within normal boundaries. Operators should review these stats periodically and have them displayed so that new issues can quickly be identified.
- Check the status of all the ZCS servers.
  - Tools like the Zimbra admin console or other monitoring systems such as Nagios should be used to ensure that all the servers are running correctly on all the nodes.
- Check that no ZCS server restarted itself without operator knowledge.
  - ZCS servers restart themselves automatically if they fail. The log files should be checked to ensure that any restarts are known about and identified. Simple scripts can be written to identify this type of occurrence.
- Check that all the directory replicas are still in sync with the master database.
  - Rarely, a directory replica will become out of sync with the master. When this happens it should be identified and re-synced as soon as possible.
- Check the Error directories of the queue servers.
  - These should be checked to ensure that only 'normal' email errors are occurring. A policy should also be in place so that these directories are cleaned and do not take up disk space.
- Check the free disk space on all the servers.
  - The MTA spools and MBS message storage areas should especially be checked to ensure that the usage size is known.
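The restart check above lends itself to simple scripting. The following sketch scans a log for startup markers; the 'Starting mailboxd' marker and log layout are simplified assumptions and should be matched to the messages your ZCS version actually writes:

```shell
#!/bin/sh
# Sketch: detect server restarts since the last check by scanning a log.
# The sample log and its "Starting mailboxd" marker are fabricated for
# illustration; use the real startup message from your ZCS logs.
cat > /tmp/sample-mailbox.log <<'EOF'
2010-01-05 02:11:04 INFO  Starting mailboxd...
2010-01-05 09:30:12 INFO  Routine message delivery
2010-01-05 14:02:55 INFO  Starting mailboxd...
EOF

restarts=$(grep -c 'Starting mailboxd' /tmp/sample-mailbox.log)
echo "Restarts found: $restarts"
if [ "$restarts" -gt 1 ]; then
    echo "ALERT: server restarted during the shift - investigate"
fi
```

A production version would track the log offset (or the timestamp of the last check) so that only restarts since the previous shift are counted.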
At least once per day activities
These checks should be performed once a day to ensure that no issues have been encountered. The operations staff may perform some of these automatically or more often than once a day as required.
- Check machine CPU usage, disk I/O, and disk usage graphs for the last day. Compare these with previous days to ensure that they follow a normal and desired pattern. This can be done using the zmstat-chart utility and its graphing output.
- Check that backups completed correctly and that all redo logs were transferred to another location and stored for longer-term retention. Also ensure that redolog archiving and log trimming occurred on the local machine.
At least once per week activities
These checks should be performed once a week to ensure that no issues have been encountered. They may be performed more often if required.
- Check the MySQL table sizes and extents.
- Check for the existence of multiple errors of similar types in the mailbox.log. Investigate them if they exist.
- Check for database integrity and schedule operations to fix any issues where necessary.
- Ensure general log file and temporary file rotation is performed as expected.
- Check that Java garbage collection is working without error and in a relatively consistent manner.
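As a sketch of the mailbox.log error check, similar errors can be grouped by message text so repeated error types stand out. The sample log lines below are fabricated for illustration:

```shell
#!/bin/sh
# Sketch: group similar errors in a mailbox.log-style file by message text,
# so repeated error types stand out. Sample lines are fabricated.
cat > /tmp/sample-errors.log <<'EOF'
2010-01-04 10:00:01 ERROR Connection refused to LDAP
2010-01-04 10:05:09 WARN  slow query
2010-01-04 11:13:22 ERROR Connection refused to LDAP
2010-01-04 12:44:51 ERROR Disk quota exceeded
EOF

# Strip the timestamp, keep only ERROR lines, count occurrences per message.
grep ' ERROR ' /tmp/sample-errors.log \
    | cut -d' ' -f4- \
    | sort | uniq -c | sort -rn
```

Error types that appear many times in a week are the ones worth investigating first.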
The following tasks should also be performed at regular intervals. These do have an effect on the user community; therefore, any potential impacts must be understood and communicated to the end users.
- Perform clean up of messages and mailboxes older than the retention periods.
- Perform clean up of Trash folder messages.
Field experience in large-scale ZCS environments has shown some architectural structure and planning of machine usages to be highly beneficial.
If possible, different servers should be deployed to handle accounts with different service level agreements (this usually relates to certain types of users or to CoS). This may even be carried through to suspended and deleted accounts, but that depends upon provisioning and the SLA agreements with respect to this type of account. Within reason, active accounts should be on high quality production mail store servers that reflect the service level agreements with the customers.
Low Penalty Servers
Having one MBS that contains a small number of active accounts, a Proxy server that is not in rotation, and perhaps an MTA server that is not in rotation can be very beneficial. These servers can be used to field test new software, configuration changes, and procedures before trying them on heavily loaded 'production' servers. They can also be pushed immediately into full production should there be an emergency need. The MBS, especially, may be loaded with 'friendly' users who can provide feedback when changes occur.
This may actually be a separate system that is automatically checked to ensure it has a similar configuration to the production systems and is run by the same operations staff.
Depending on an individual site configuration, the Front-end servers (MTA, Anti-Abuse, and Proxy) may be deployed in one of several load-balancing configurations (Paired, Pooling, DNS Round-Robin). Be aware that when the total throughput into the Front-end exceeds the maximum throughput of a single server, fail-over capability and/or front-end integrity may be lost.
Best Practice suggests reviewing and possibly adding more capacity to accommodate projected and future capacity demands with an N+1 setup as the minimum.
The capacity of the core ZCS mailbox servers should be monitored carefully. When the total throughput of all the MBS servers is greater than the total solution can handle, the integrity of the system is lost. Targets should be established to understand the requirements from a performance perspective. Capacity planning is of the utmost importance; this includes periodically analyzing user profiles and changing usage patterns (see 3.5). Message storage, index, and database disk space should be monitored, and when they grow above a fixed target level, extra capacity should be added. Best Practice suggests reviewing capacity regularly and adding more to accommodate projected and future demands.
Operational Best Practices and Suggestions
Best Practices recommend the operating practices noted in the sub-sections immediately below.
Develop, maintain, and use documented checklists/methods of procedure for field operations. Require others who may impact or modify platform servers to do the same.
Do not rely on anyone to type a long sequence of commands during a field operation. Create a script to run the procedure. Test and document it.
When available, have the next release of software running in a lab or on a low penalty server.
Plan upgrades and keep on top of which patches exist and what they fix. It is not necessary to upgrade as soon as a patch or next version is created, but knowing about it, understanding the upgrade and having experience within the team allows for better planning and smoother operations.
Know the users who have access to the systems
Keep close track of who has access to the servers and manage that access closely. Ensure that the access method used provides logs, so that historic data can be found showing who was on a server at a given time.
All mail servers, especially those providing Web mail, will be targets for attacks. Firewalls and other protective software should be used to reduce this risk and to ensure that if an attack occurs it is known about.
Cold backups of the installation servers should be taken and kept available should a machine have to be restored.
Experience and industry information has shown that the largest number of security issues originate from internal users rather than external sources. Security measures must be taken against attacks generated from the internal as well as external network.
Best Practice requires operations teams to periodically generate a profile of the average user. Note: There may be several profiles created, as there may be several types of users defined in different classes of service. Generating this type of information empowers the operator to make good decisions about operating factors in the system.
At a minimum, this profile takes into account the following factors:
- The number of messages the user stores in the system
- The frequency of access and the access type
- The percentage of checks made to an empty mailbox, and the length of time a user is active
- Outbound mail sent - internal and external
- The distribution of user activity over the day/week
- The peak and non-peak traffic periods
Knowing how much traffic an MBS or Directory Server carries is more useful than knowing only how many accounts or mailboxes are located on that server. The same process can and should be performed on the other machines in the system; however, the profiles for these machines are less important. This allows better capacity planning and more informed business decisions. The following information should be generated to create an access profile for each server:
- Status and number of accounts on a system
- Number of messages per day - inbound and outbound
- The message size – min-max-median-average
- The number of recipients per message
- The distribution of message traffic over the day/week
- Peak and non-peak loads – CPU, Disk IO etc.
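The size statistics above can be computed with standard tools. This sketch derives min/max/median/average from a sample list of message sizes; the sizes are fabricated, and in practice they would come from MTA logs or message store statistics:

```shell
#!/bin/sh
# Sketch: compute min/max/median/average message size from a size sample.
# Sizes (in KB) are fabricated for illustration.
printf '%s\n' 4 12 7 150 9 > /tmp/msg-sizes.txt

sort -n /tmp/msg-sizes.txt | awk '
    { v[NR] = $1; sum += $1 }
    END {
        n = NR
        median = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
        printf "min=%s max=%s median=%s average=%.1f\n", v[1], v[n], median, sum / n
    }'
```

Note how the median (9) differs sharply from the average (36.4) in this sample: a few very large messages skew the average, which is why both figures belong in the profile.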
Users may encounter minor issues at a site but not report them through normal channels. Monitoring (but not responding to) user groups and forums can often improve an operator’s understanding of the user experience.
Care must be taken with this medium, as feedback can be misleading or not truly representative of the degree of error users are experiencing. In addition, great care should be taken in responding to users via this medium. However, these forums may still provide some valuable information about the general feel and quality of service provided to users. Other sources of information, such as the postmaster account and customer satisfaction surveys for example, have similar issues but can also provide invaluable information about what matters to users.