Performance Tuning Guidelines for Single-Server 100-500 User Systems
This document is under construction, started by Mark Stone of Reliable Networks on 6/12/08. Please give us until 6/30/08 to complete version 1.0 of this document, after which we welcome editing improvements to this wiki page. Until then, please email comments to me at firstname.lastname@example.org. Once this page is complete I will remove this paragraph and post a notice in the forums.
Reliable Networks started using Zimbra NE with version 4.0.3 and since then we have built a number of Zimbra systems for clients. We also have built a number of Zimbra systems for ourselves in our capacity as a Zimbra hosting partner. When we migrated systems to version 5, we found that the majority of our Zimbra 4.x tweaks to improve performance and reliability were no longer valid. Indeed, in some cases the tweaks which worked so well on Zimbra 4.x caused performance degradations on Zimbra 5 systems.
We recently successfully closed a support ticket covering one such system. In working with Zimbra support, we were able to get a more intimate understanding of a number of the "under-the-hood" differences between 32-bit Zimbra 4.x and 64-bit Zimbra 5.
We thought it would be helpful to share our collective experience with the community by documenting our (now updated for Zimbra 5) performance tuning best practices for single-server systems serving 100-500 users, which represent the majority of the systems we build and maintain.
A few caveats before we get started:
- We use SuSE Linux Enterprise Server 10 ("SLES10") exclusively. It's not that we don't like other distros (we do), but we wanted to standardize on one to make administration of multiple systems easier. As a result, some of the information here will be SuSE-specific. We'll try to point that out when we can, and would ask that users of other distros edit this wiki page with their distro-specific information where appropriate.
- We use HP hardware exclusively, for the same reasons we use SLES10. Again, we would ask users of other hardware platforms to edit this wiki page with their hardware-brand-specific information where appropriate.
- We use 64-bit hardware and software exclusively. PAE works fine for using more than 4GB of RAM on a 32-bit system, but we find that when a system does have more than 4GB of RAM, 64-bit hardware and software simplify things.
- The way we suggest doing things may be neither the only appropriate way nor the optimal one. Our suggestions for partitioning, for example, are just one way of doing things. There are other ways of partitioning that will give comparable results.
- Making use of any of the suggestions in this document is solely at your own risk. By making use of any of the suggestions in this wiki article, you acknowledge and agree that Reliable Networks (and any subsequent editors of this wiki article) have no liability to you nor to your customers, agents or employees, for any losses or damages suffered as a result of your incorporating any of the suggestions in this wiki article within any Zimbra system you administer.
If you are OK with all of the above, then let's get started.
The sections below provide a general discussion of each one of the topic areas. Specific steps to implement are then grouped together in one section towards the end.
Strategies and Objectives
Most of the tweaks in this article are driven by the following strategies and objectives:
- "Right-Size" the server hardware. In our view, critical application servers should be life-cycled over five years and backed by a hardware support agreement. If the client is a fast-growing company, the hardware platform that is appropriate now will likely not be adequate two years from now, so the hardware platform needs to be chosen with expandability in mind. If the company is stable, more conservative hardware choices can be made.
- Manage by exception. Once configured, the system should not require a lot of attention and should be configured in a way such that attention is required only when changes are required or there is a system fault. Break/fix work on a mission-critical system should be avoided as much as practicable. In practice, this means very intimate monitoring of the system well beyond Zimbra's built-in monitoring.
- Build for Xen. Xen does not support NPTL right now, but Xen paravirtualization offers many advantages for backups and disaster recovery, so we do everything we can to ensure that when Xen is ready for Zimbra we will be able to P2V (physical-to-virtual conversion) a Zimbra server seamlessly.
- Maximize end-user perceived performance. Anything that can be done to make the web interface snappier will make users happier. For example, Zimbra is very disk intensive, so we do a number of things to reduce disk I/O. A less busy system means the web interface will be snappier.
- Keep costs low. It's tempting to just throw hardware at a performance bottleneck, but with mail servers in general it is often cheaper to spend a few cycles on configuration optimization to save four figures on hardware.
Pre-Deployment Hardware and Anti-Spam/Virus/Malware Optimizations
Like most mail servers, Zimbra is very disk intensive. The mail flowing through Postfix > Amavis > ClamAV > Spamassassin > Amavis > Postfix > Zimbra store is the primary daytime load on disks. Backups also hammer disks, as do Java garbage collection routines and MySQL database updates and reindexing. Investments in reducing disk I/O therefore will have a very high ROI (return on investment) in terms of increased performance.
To reduce disk I/O bottlenecks, we employ a three-pronged strategy:
- Offload initial Anti-Virus/Anti-Spam/Anti-Malware work to a separate, dedicated device. Since 80% or more of all email traffic is garbage, to the extent you can save your Zimbra server the trouble of filtering out most of this, you will reduce the load on your Zimbra server by 75% or more. Lower loads mean less powerful hardware is needed, and you can apply that savings to a pre-filtering device.
- SonicWall TZ and PRO series devices with the Enhanced OS and the Gateway IPS/AV/AS license do RBL checking in addition to Anti-Virus/Anti-Spam/Anti-Malware. The annual license cost for this is a few hundred dollars (depending on the model), and typically enables us to shave $1,500 or more off the cost of server hardware, thereby more than paying for itself over the life of the server. This calculation does not include the incremental benefits of doing Anti-Virus/Anti-Spam/Anti-Malware at the network gateway instead of just at each device, which if included would only increase the ROI.
- Spamassassin uses RBLs to add to a mail's spam score, whereas doing RBL checks at the gateway is binary and will result in "false positives" (legitimate email incorrectly identified as spam) if you use aggressive RBLs on the gateway. If you use one or more conservative RBLs on the gateway, you will virtually eliminate false positives at the gateway. As of this writing, zen.spamhaus.org is considered a comprehensive and conservative RBL. Note that a license from Spamhaus is required in certain circumstances.
- Put Amavis's temp directory on a ram disk. Zimbra used to do this by default, but if it's not done exactly right mail stops flowing, so Zimbra took this out of the installation scripts. Since Amavis's management of Clam and Spamassassin processing is the single biggest hog of daytime disk I/O, this step, combined with pre-filtering above, will result in the biggest performance-bang-for-buck gains possible.
- Is it safe? Dustin Hoffman didn't know, but in the case of Amavis the answer is "Yes." The reason is based on Postfix being so modular; the portion of Postfix which hands off emails to Amavis for processing is blissfully unaware of the other portion of Postfix which accepts reinjected messages from Amavis, after Amavis has finished shepherding the emails through Spamassassin and ClamAV. The actual event sequence is as follows:
- Postfix hands off an email to Amavis for processing on port 10024.
- Amavis accepts the message, but doesn't tell Postfix it has accepted the message just yet.
- Amavis processes the message through ClamAV and Spamassassin, then reinjects the message back into Postfix on port 10025.
- Once Postfix on port 10025 confirms to Amavis that it has accepted the (cleaned up) message, Amavis then tells Postfix on port 10024 that Amavis has successfully accepted the initial (potentially unclean) version of the message.
- As a result, email messages always live in Postfix until final delivery to the Zimbra store, and thus even if the ram disk blew up, no mail would be lost. The trick is to size the ram disk so that it never fills up.
- Since RAM is much, much faster than physical disks, employing a ram disk will reduce disk I/O. CPU usage will increase somewhat, since Amavis is now capable of processing emails much more quickly.
- Put different disk-intensive directories on separate, appropriately configured spindles. Under /opt/zimbra ("~"), the four directories getting the most disk I/O are ~/backup, ~/data, ~/db, and ~/store. Mounting one or more of these directories on separate sets of spindles, or even on separate sets of spindles on different controllers, will speed up disk I/O noticeably. Zimbra also enables multiple stores, so if you are deploying the Archiving and Discovery feature, you can set up separate stores on different spindle sets for the A&D mailboxes [confirm].
- Assuming the base disks in your server are (expensive) 15K U320 SCSI or 15K dual-port SAS drives, one strategy is to mount ~/backup on slower, less expensive disks. On servers with larger numbers of users (or just users with multi-GB mailboxes), this will lower server costs. On larger servers for example, we put in a second disk controller in the server, connected to a dual-bus MSA30 disk shelf. One bus has slower 10K 146GB drives in a RAID6 (ADG) array with a hot spare for ~/backup, and the other bus has faster 15K drives in several RAID1 arrays for additional message stores and for ~/db, leaving the remainder of ~ on the on-board RAID1 array.
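The Postfix and Amavis hand-off sequence described above (ports 10024 and 10025) is easier to reason about with a toy model. The Python sketch below is not Postfix or Amavis code; the names PostfixQueue, amavis_filter, hand_off and the ramdisk_ok flag are hypothetical and exist only for illustration. It assumes the behavior stated above: the original message leaves the Postfix queue only after the filtered copy has been accepted on the reinjection side.

```python
from dataclasses import dataclass, field


class RamDiskFailure(Exception):
    """Stands in for the Amavis temp directory (ram disk) becoming unusable."""


@dataclass
class PostfixQueue:
    """Toy model of the two Postfix sides; not real Postfix behavior."""
    pending: list = field(default_factory=list)     # waiting for filtering (port 10024 side)
    reinjected: list = field(default_factory=list)  # accepted back from Amavis (port 10025 side)

    def accept_reinjected(self, message):
        # Postfix on the reinjection side confirms that it now holds the message.
        self.reinjected.append(message)
        return True


def amavis_filter(queue, message, ramdisk_ok=True):
    """Scan, reinject, and return True only once reinjection is confirmed."""
    if not ramdisk_ok:
        raise RamDiskFailure("temp directory unavailable")
    cleaned = message  # placeholder for the ClamAV and Spamassassin passes
    return queue.accept_reinjected(cleaned)


def hand_off(queue, ramdisk_ok=True):
    """Port 10024 side: drop the original only after Amavis confirms acceptance."""
    message = queue.pending[0]
    try:
        confirmed = amavis_filter(queue, message, ramdisk_ok)
    except RamDiskFailure:
        confirmed = False  # no confirmation, so the message stays queued
    if confirmed:
        queue.pending.pop(0)
    return confirmed
```

In this model, a call such as hand_off(queue, ramdisk_ok=False) leaves the message in queue.pending, and a later call with a working ram disk moves it to queue.reinjected. At no point is the only copy of a message held by the filter.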
Sample Hardware Configuration
Specific Configuration Steps
One advantage of running SLES10 on HP hardware is that HP provides its Insight Agents software for free. Combined with the locally installed HP System Management Homepage, these proprietary rpms enable intimate hardware health monitoring and alerting. Once configured, you will get notices of failed fans, array accelerator battery charging issues, impending disk drive failures (e.g. disks that are members of an array), etc.
Although all of this monitoring can be done in Nagios, we have found it easier to do the low-level hardware health monitoring with HP's Insight Agents and general system health with Nagios.
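As one illustration of the general system health side, here is a minimal Python sketch of a Nagios-style check for how full a filesystem is, such as the ram disk holding Amavis's temp directory. It follows the usual Nagios plugin convention of status codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). The function names and the 50% and 80% thresholds are assumptions made for illustration, not values we have tested or recommend.

```python
import shutil

# Conventional Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_usage(used_fraction, warn=0.50, crit=0.80):
    """Map a used fraction (0.0 to 1.0) to a status code; thresholds are illustrative."""
    if used_fraction >= crit:
        return CRITICAL
    if used_fraction >= warn:
        return WARNING
    return OK


def check_path(path, warn=0.50, crit=0.80):
    """Return (status_code, message) for the filesystem that contains path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"UNKNOWN - {path} reports zero capacity"
    fraction = usage.used / usage.total
    status = classify_usage(fraction, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    return status, f"{label} - {path} is {fraction:.0%} full"
```

A small wrapper script would call check_path on the mount point in question, print the message, and exit with the returned status code so that Nagios can alert on it. Warning well before the filesystem is full fits the goal of sizing the ram disk so that it never fills up.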
Configuring Nagios properly is beyond the scope of this document, but here is a typical Nagios screen shot for one Zimbra server hosting nearly 300 mailboxes with on average 150 or so active sessions: