Performance Tuning Guidelines for Single-Server 100-500 User Systems


   KB 2458        Last updated on 2015-07-12  





This document is under construction, started by Mark Stone of Reliable Networks on 6/12/08. Please give us until 6/30/08 to complete version 1.0 of this document, after which we welcome editing improvements to this wiki page. Until then, please email comments to me at mark.stone@reliablenetworks.com. Once this page is complete I will remove this paragraph and post a notice in the forums.

Preamble

Reliable Networks started using Zimbra NE with version 4.0.3 and since then we have built a number of Zimbra systems for clients. We also have built a number of Zimbra systems for ourselves in our capacity as a Zimbra hosting partner. When we migrated systems to version 5, we found that the majority of our Zimbra 4.x tweaks to improve performance and reliability were no longer valid. Indeed, in some cases the tweaks which worked so well on Zimbra 4.x caused performance degradations on Zimbra 5 systems.

We recently successfully closed a support ticket covering one such system. In working with Zimbra support, we were able to get a more intimate understanding of a number of the "under-the-hood" differences between 32-bit Zimbra 4.x and 64-bit Zimbra 5.

We thought it would be helpful to share our collective experience with the community by documenting our (now updated for Zimbra 5) performance tuning best practices for single-server systems serving 100-500 users, which represent the majority of the systems we build and maintain.

A few caveats before we get started:

  1. We use SuSE Linux Enterprise Server 10 ("SLES10") exclusively. It's not that we don't like other distros (we do), but we wanted to standardize on one to make administration of multiple systems easier. As a result, some of the information here will be SuSE-specific. We'll try to point that out when we can, and would ask that users of other distros edit this wiki page with their distro-specific information where appropriate.
  2. We use HP hardware exclusively, for the same reasons we use SLES10. Again, we would ask users of other hardware platforms to edit this wiki page with their hardware-brand-specific information where appropriate.
  3. We use 64-bit hardware and software exclusively. PAE works fine for using more than 4GB of RAM on a 32-bit system, but we find that when a system has more than 4GB of RAM, using 64-bit hardware and software simplifies things.
  4. The way we suggest doing things may be neither the only appropriate way nor the optimal one. Our suggestions for partitioning, for example, are just one way of doing things. There are other ways of partitioning that will give comparable results.
  5. Making use of any of the suggestions in this document is solely at your own risk. By making use of any of the suggestions in this wiki article, you acknowledge and agree that Reliable Networks (and any subsequent editors of this wiki article) have no liability to you nor to your customers, agents or employees, for any losses or damages suffered as a result of your incorporating any of the suggestions in this wiki article within any Zimbra system you administer.

If you are OK with all of the above, then let's get started.

The sections below provide a general discussion of each one of the topic areas. Specific steps to implement are then grouped together in one section towards the end.


Strategies and Objectives

Most of the tweaks in this article are driven by the following strategies and objectives:

  1. "Right-Size" the server hardware. In our view, critical application servers should be life-cycled over five years and backed by a hardware support agreement. If the client is a fast-growing company, the hardware platform that is appropriate now will likely not be adequate two years from now, so the hardware platform needs to be chosen with expandability in mind. If the company is stable, more conservative hardware choices can be made.
  2. Manage by exception. Once configured, the system should not require a lot of attention and should be configured in a way such that attention is required only when changes are required or there is a system fault. Break/fix work on a mission-critical system should be avoided as much as practicable. In practice, this means very intimate monitoring of the system well beyond Zimbra's built-in monitoring.
  3. Build for Xen. Xen does not support NPTL right now, but Xen paravirtualization offers many advantages for backups and disaster recovery, so we do everything we can to ensure that when Xen is ready for Zimbra we will be able to P2V (physical-to-virtual conversion) a Zimbra server seamlessly.
  4. Maximize end-user perceived performance. Anything that can be done to make the web interface snappier will make users happier. For example, Zimbra is very disk intensive, so we do a number of things to reduce disk I/O. A less busy system means the web interface will be snappier.
  5. Keep costs low. It's tempting to just throw hardware at a performance bottleneck, but with mail servers in general it is often cheaper to spend a few cycles on configuration optimization to save four figures on hardware.

Pre-Deployment Hardware and Anti-Spam/Virus/Malware Optimizations

Like most mail servers, Zimbra is very disk intensive. The mail flowing through Postfix > Amavis > ClamAV > Spamassassin > Amavis > Postfix > Zimbra store is the primary daytime load on disks. Backups also hammer disks, as do Java garbage collection routines and MySQL database updates and reindexing. Investments in reducing disk I/O will therefore have a very high ROI (return on investment) in terms of increased performance.

To reduce disk I/O bottlenecks, we employ a three-pronged strategy:

  • Offload initial Anti-Virus/Anti-Spam/Anti-Malware work to a separate, dedicated device. Since 80% or more of all email traffic is garbage, to the extent you can save your Zimbra server the trouble of filtering out most of this, you will reduce the load on your Zimbra server by 75% or more. Lower loads mean less powerful hardware is needed, and you can apply that savings to a pre-filtering device.
SonicWall TZ and PRO series devices with the Enhanced OS and the Gateway IPS/AV/AS license do RBL checking in addition to Anti-Virus/Anti-Spam/Anti-Malware. The annual license cost for this is a few hundred dollars (depending on the model), and typically enables us to shave $1,500 or more off the cost of server hardware, thereby more than paying for itself over the life of the server. This calculation does not include the incremental benefits of doing Anti-Virus/Anti-Spam/Anti-Malware at the network gateway instead of just at each device, which if included would only increase the ROI.
Spamassassin uses RBLs to add to a mail's spam score, whereas doing RBL checks at the gateway is binary and will result in "false positives" (legitimate email incorrectly identified as spam) if you use aggressive RBLs on the gateway. If you use one or more conservative RBLs on the gateway, you will virtually eliminate false positives at the gateway. As of this writing, zen.spamhaus.org is considered a comprehensive and conservative RBL. Note that a license from Spamhaus is required in certain circumstances.
  • Put Amavis's temp directory on a ram disk. Zimbra used to do this by default, but if it's not done exactly right mail stops flowing, so Zimbra took this out of the installation scripts. Since Amavis's management of Clam and Spamassassin processing is the single biggest hog of daytime disk I/O, this step, combined with pre-filtering above, will result in the biggest performance-bang-for-buck gains possible.
Is it safe? Dustin Hoffman didn't know, but in the case of Amavis the answer is "Yes." The reason is based on Postfix being so modular; the portion of Postfix which hands off emails to Amavis for processing is blissfully unaware of the other portion of Postfix which accepts reinjected messages from Amavis, after Amavis has finished shepherding the emails through Spamassassin and ClamAV. The actual event sequence is as follows:
      1. Postfix hands off an email to Amavis for processing on port 10024.
      2. Amavis accepts the message, but doesn't tell Postfix it has accepted the message just yet.
      3. Amavis processes the message through ClamAV and Spamassassin, then reinjects the message back into Postfix on port 10025.
      4. Once Postfix on port 10025 confirms to Amavis that it has accepted the (cleaned up) message, Amavis then tells Postfix on port 10024 that Amavis has successfully accepted the initial (potentially unclean) version of the message.
As a result, email messages always live in Postfix until final delivery to the Zimbra store, so if the ram disk blows up, no mail will be lost. The trick is to size the ram disk so that it never fills up.
Because RAM is much, much faster than physical disks, employing a ram disk will reduce disk I/O. Expect CPU usage to increase somewhat, since Amavis is now able to process emails much more quickly.
  • Put different disk-intensive directories on separate, appropriately configured spindles. Under /opt/zimbra ("~"), the four directories getting the most disk I/O are ~/backup, ~/data, ~/db, and ~/store. Mounting one or more of these directories on separate sets of spindles, or even on separate sets of spindles on different controllers, will speed up disk I/O noticeably. Zimbra also enables multiple stores, so if you are deploying the Archiving and Discovery feature, you can set up separate stores on different spindle sets for the A&D mailboxes [confirm].
Assuming the base disks in your server are (expensive) 15K U320 SCSI or 15K dual-port SAS drives, one strategy is to mount ~/backup on slower, less expensive disks. On servers with larger numbers of users (or just users with multi-GB mailboxes), this will lower server costs. On larger servers for example, we put in a second disk controller in the server, connected to a dual-bus MSA30 disk shelf. One bus has slower 10K 146GB drives in a RAID6 (ADG) array with a hot spare for ~/backup, and the other bus has faster 15K drives in several RAID1 arrays for additional message stores and for ~/db, leaving the remainder of ~ on the on-board RAID1 array.
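
The hand-off sequence described under the ram disk bullet above is what makes the ram disk safe, and it may be easier to see in a toy model. The sketch below is not Postfix or Amavis code; the class and function names (Queue, amavis_filter, relay_through_amavis) are hypothetical, and the ram disk failure is simulated with a simple flag. It only illustrates the ordering: a message leaves the incoming queue after the reinjected copy has been accepted, never before.

```python
class RamDiskFull(Exception):
    """Raised when the (simulated) Amavis temp directory cannot hold a message."""


class Queue:
    """Stand-in for a Postfix queue; messages stay here until explicitly removed."""

    def __init__(self):
        self.messages = []


def amavis_filter(message, ramdisk_ok=True):
    """Pretend to scan a message inside Amavis's temp directory."""
    if not ramdisk_ok:
        raise RamDiskFull("temp directory unavailable")
    return message + " [scanned]"


def relay_through_amavis(incoming, outgoing, ramdisk_ok=True):
    """Hand each queued message to the filter (port 10024 in the text) and
    reinject it (port 10025 in the text). A message is removed from
    `incoming` only after `outgoing` has accepted the scanned copy."""
    for message in list(incoming.messages):
        try:
            scanned = amavis_filter(message, ramdisk_ok)
        except RamDiskFull:
            continue  # no acknowledgement is sent, so the original stays queued
        outgoing.messages.append(scanned)  # reinjection is accepted first
        incoming.messages.remove(message)  # only then is the hand-off acknowledged


incoming, outgoing = Queue(), Queue()
incoming.messages = ["msg-1", "msg-2"]

# Ram disk "blows up": nothing is delivered, but nothing is lost either.
relay_through_amavis(incoming, outgoing, ramdisk_ok=False)
print(incoming.messages, outgoing.messages)

# Ram disk healthy again: the same messages go through on the retry.
relay_through_amavis(incoming, outgoing, ramdisk_ok=True)
print(incoming.messages, outgoing.messages)
```

In this model a full or failed ram disk only delays mail; the messages wait in the incoming queue and go through on a later attempt.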


Post-Deployment Optimizations

Sample Hardware Configuration

Specific Configuration Steps

Monitoring Setup

One advantage of running SLES10 on HP hardware is that HP provide their Insight Agents software for free. Combined with the locally installed HP Systems Management homepage, these proprietary rpms enable intimate hardware health monitoring and alerting. Once configured, you will get notices of failed fans, array accelerator battery charging issues, impending disk drive failures (e.g. disks that are members of an array), etc.

Although all of this monitoring can be done in Nagios, we have found it easier to do the low-level hardware health monitoring with HP's Insight Agents and general system health with Nagios.

Configuring Nagios properly is beyond the scope of this document, but here is a typical Nagios screen shot for one Zimbra server hosting nearly 300 mailboxes with on average 150 or so active sessions:

Simple Checks to Confirm Proper Operation

Many of these checks will be old hat to experienced system administrators.


Disk Storage Usage Check (df)

The df command tells us how much free space exists on each partition. The -h switch means "human readable", so you get output in kilobytes, megabytes and gigabytes. If your Amavis ram disk is too small, you will likely run out of space there before Nagios notices, so running "df -h" repeatedly when you first migrate to a ram disk can help.

Here is a sample output:

 lmstone@ourserver:~> df -h
 Filesystem            Size  Used Avail Use% Mounted on
 /dev/cciss/c1d0p1      20G   12G  8.1G  60% /
 udev                  2.9G  120K  2.9G   1% /dev
 /dev/dm-0             197G   49G  139G  27% /opt
 /dev/mapper/system-lvm_zimbra_temp
                        51G   37G   12G  77% /zimbra_temp
 /dev/cciss/c0d0p1     135G   71G   58G  56% /opt/zimbra/backup
 /dev/cciss/c0d1p1     101G  129M   95G   1% /opt/zimbra/archive1
 /dev/shm              256M   60K  256M   1% /opt/zimbra/data/amavisd/tmp
 lmstone@ourserver:~>  
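
If you would rather not eyeball repeated df runs, the check can be scripted. The following is a minimal sketch, assuming df output shaped like the sample above (including the entry whose long device name wraps onto a second line) and mount points without spaces. The function names and the 75% threshold are our own illustrative choices, not anything Zimbra or Nagios ships.

```python
def parse_df(output):
    """Parse `df -h` text into (mount_point, use_percent) pairs.

    A long device name pushes the remaining columns onto the next line,
    so a short line is held back and joined with the line that follows.
    """
    rows = []
    pending = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = pending + line.split()
        if len(fields) < 6:
            pending = fields  # wrapped entry: wait for the continuation line
            continue
        pending = []
        rows.append((fields[5], int(fields[4].rstrip("%"))))
    return rows


def over_threshold(rows, limit):
    """Return the mount points whose usage is at or above `limit` percent."""
    return [mount for mount, percent in rows if percent >= limit]


SAMPLE = """\
Filesystem            Size  Used Avail Use% Mounted on
/dev/cciss/c1d0p1      20G   12G  8.1G  60% /
/dev/mapper/system-lvm_zimbra_temp
                       51G   37G   12G  77% /zimbra_temp
/dev/shm              256M   60K  256M   1% /opt/zimbra/data/amavisd/tmp
"""

rows = parse_df(SAMPLE)
print(rows)
print(over_threshold(rows, 75))  # 75 is an arbitrary example threshold
```

To watch a live system, the text passed to parse_df could come from subprocess.run(["df", "-h"], capture_output=True, text=True).stdout, called in a loop while the ram disk is being sized.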


Server Memory and Workload Check (top)

In a typical Zimbra system with sufficient RAM, disk I/O is the bottleneck, and top can confirm this for you (as well as confirm the benefits from converting to an Amavis ram disk).

When the CPU has to wait on the disk to process items, the percentage of CPU allocated to wait states (%wa) will be significant, and system load averages will be higher as well. Since the CPU is doing "work" waiting for disk, the percentage CPU utilization will also be high.

If the memory percentages allocated to MySQL and Java are also too high relative to the amount of physical memory in the machine, top will show that swap space is being used. Note that as of this writing Zimbra version 5.0.6 uses a lot of memory doing full backups, so a little swap usage after a full backup is not unusual and will not impact performance adversely.

Here is a sample output:

 top - 15:09:30 up 11 days,  9:44,  1 user,  load average: 0.52, 0.50, 0.53
 Tasks: 177 total,   1 running, 176 sleeping,   0 stopped,   0 zombie
 Cpu(s):  1.8%us,  1.1%sy,  0.0%ni, 94.8%id,  2.2%wa,  0.0%hi,  0.0%si,  0.0%st
 Mem:   6050552k total,  5797600k used,   252952k free,   215176k buffers
 Swap:  8395952k total,      120k used,  8395832k free,  2065956k cached
 
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  6775 zimbra    16   0 2038m 1.1g 5604 S    1 19.0 110:45.31 mysqld
  8264 zimbra    16   0 2228m 769m  47m S    3 13.0 438:18.31 java
  8746 zimbra    16   0  238m 194m 1264 S    0  3.3  35:22.68 clamd
 19113 zimbra    18   0  617m 143m  15m S    0  2.4   7:27.59 java
 29471 zimbra    15   0  167m 107m 3608 S    0  1.8   0:02.91 amavisd
 18497 zimbra    16   0  163m 102m 3620 S    0  1.7   0:12.94 amavisd
 16515 zimbra    16   0  162m 102m 3616 S    0  1.7   0:10.48 amavisd
 25724 zimbra    16   0  162m 102m 3624 S    0  1.7   0:04.54 amavisd
 16545 zimbra    16   0  162m 101m 3608 S    0  1.7   0:06.01 amavisd
 19660 zimbra    16   0  161m 101m 3600 S    0  1.7   0:03.60 amavisd
 27652 zimbra    16   0  161m 100m 3616 S    0  1.7   0:03.99 amavisd
 20056 zimbra    16   0  161m 100m 3604 S    0  1.7   0:05.47 amavisd
 22221 zimbra    16   0  161m 100m 3616 S    0  1.7   0:04.62 amavisd
  6931 zimbra    16   0  159m  98m 3592 S    0  1.7   0:00.90 amavisd
 32673 zimbra    16   0  156m  96m 2468 S    0  1.6   0:24.05 amavisd
  4168 zimbra    18   0  380m  64m 9500 S    0  1.1  62:18.69 slapd
  5824 zimbra    16   0  128m  26m 4864 S    0  0.5  76:43.73 mysqld
 10505 zimbra    30  15 23164 9980 1776 S    0  0.2   0:19.27 perl
 10476 zimbra    30  15 23172 9944 1776 S    0  0.2   0:21.96 perl
 10735 zimbra    32  15 22252 7648 1252 S    0  0.1   1:32.36 zmmtaconfig
 10511 zimbra    30  15 24824 6356 2692 S    0  0.1   0:35.86 zmlogger
 21462 zimbra    16   0 35460 5856 2884 S    0  0.1   0:00.02 httpd
 10489 zimbra    39  15 15692 5660 1708 S    0  0.1   0:00.09 swatch
 10430 zimbra    37  15 15700 5648 1708 S    0  0.1   0:00.08 logswatch

After launching top above, the command "u zimbra" was entered to show only processes owned by the zimbra user account. Next, [Shift] - [>] was pressed to order the output by percentage of memory consumed.

At the time this snapshot was taken, this system had approximately 120 sessions open.
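
The three numbers discussed above (I/O wait, load average and swap usage) can also be pulled out of top's header programmatically. This is only a sketch: the function name is hypothetical, and the regular expressions assume the header layout shown in the sample above. Other versions of top print these lines differently, so the patterns would need adjusting.

```python
import re


def parse_top_header(text):
    """Extract I/O wait %, load averages and swap used (kB) from top's header."""
    wait = float(re.search(r"([\d.]+)%wa", text).group(1))
    loads = [
        float(value)
        for value in re.search(
            r"load average: ([\d.]+), ([\d.]+), ([\d.]+)", text
        ).groups()
    ]
    swap_used_kb = int(
        re.search(r"Swap:\s+\d+k total,\s+(\d+)k used", text).group(1)
    )
    return {"iowait_pct": wait, "load": loads, "swap_used_kb": swap_used_kb}


HEADER = """\
top - 15:09:30 up 11 days,  9:44,  1 user,  load average: 0.52, 0.50, 0.53
Tasks: 177 total,   1 running, 176 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.8%us,  1.1%sy,  0.0%ni, 94.8%id,  2.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   6050552k total,  5797600k used,   252952k free,   215176k buffers
Swap:  8395952k total,      120k used,  8395832k free,  2065956k cached
"""

stats = parse_top_header(HEADER)
print(stats)
```

On a live system the header could be captured with top's batch mode (for example "top -b -n 1") and fed to the same function.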

During peak loads on this server, top output looks like the following:


Verified Against: unknown Date Created: 6/12/2008
Article ID: https://wiki.zimbra.com/index.php?title=Performance_Tuning_Guidelines_for_Single-Server_100-500_User_Systems Date Modified: 2015-07-12


