
Ajcody Notes Server Planning

KB 2480 - Last updated on 2016-06-21




This article is NOT official Zimbra documentation. It is a user contribution and may include unsupported customizations, references, suggestions, or information.

This is Zeta Alliance Certified Documentation. The content has been tested by the Community.


Server Performance Planning & Infrastructure - The Big Picture

Actual Server Planning Homepage

Please see Ajcody-Notes-ServerPlanning

Initial Comments

These are just my random thoughts. They have not been peer-reviewed yet.

Actual DR and Server restore issues are listed in my Ajcody-Server-Topics page.

Items For Your Review


Redundancy - High Availability - And Other Related Questions

One might ask, "Is there any other way to have redundant zimbra email servers?"

The short answer is No, but if you have millions of dollars you can get really close.

Remember, redundancy always comes with a dollar amount cost. And this dollar amount will determine your range of options to implement redundant technologies or processes.

Redundancy isn't magical: every choice has a "time" for the failover operation to complete, and the application being used can restrict what can be done or increase that "time". The duration of that operation is the window during which the service/server is "unavailable" from the client's point of view.

So break the server and services down into components and put them in a graph or spreadsheet. This will help make sure you're not "over-engineering" the system.

For example, disk and data.

Disks can use different raid systems to give redundancy. Channels to the disk (remote) can also have redundant paths. The disk chassis can have redundant power supplies which go to different UPSes on different circuit breakers. This data can also be sent to tape, rsync'd/copied to another disk subsystem on another server, or "flashed" to another location if your filesystem or SAN unit supports it. A lot has to fail before you completely lose the data. When exploring these items, you want multiple channel paths so that the copies, rsyncs, and flashing travel a different path than the "production" one. Have your network backup occur on a different ethernet device.

Two big picture objectives for "redundancy" are "data redundancy" for DR situations and "application" redundancy for service availability. Our "cluster" situation is an active-passive "redundancy" for the purposes of the mailstore application layer. The raid level of the disks on the SAN serves the "redundancy" in regards to data recovery for DR.

When you introduce the objective of having off-site redundancy, the costs and issues become huge. Remote locations introduce speed issues for data transfers, which will also impact performance for most applications as they try to stay in sync between the two machines for write and roll-back purposes.

My intent wasn't so much to give specific answers to this question but rather to demonstrate that to answer these questions you have to get down to the real specifics - it's simply impossible to answer the broad question of "how do I make redundant servers". It would take a book to fully explore all the issues and possibilities, but it still comes down to how much money you have to spend on the project. I used to work with High Performance Clusters and HA Linux+Oracle+SAP global deployments with TBs of data - this issue would arise daily for us in those environments.

HA Clustering Software

From the Single Node Cluster Installation Guide - Rev 1 June 2008:

This applies to ZCS 5.0.7 through ZCS 7. With ZCS 8, only VMware HA is supported by Zimbra. RHCS can still work, but only as hardware-level monitoring, and support for its specifics should be directed to Red Hat.
For cluster integration to provide high availability, Zimbra Collaboration Suite (ZCS) can integrate with either of the following:
  • Red Hat® Enterprise Linux® Cluster Suite version 4, Update 5 or later update release. In the single-node cluster implementation, all Zimbra servers are part of a cluster under the control of the Red Hat Cluster Manager.
Note: Red Hat Cluster Suite consists of Red Hat Cluster Manager and Linux Virtual Server Cluster. For ZCS, only Red Hat Cluster Manager is used. In this guide, Red Hat Cluster Suite refers only to Cluster Manager.
  • Veritas™ Cluster Server by Symantec (VCS) version 5.0 with maintenance pack 1 or later.

References:

CPU And Motherboards

Other references: Performance_Tuning_Guidelines_for_Large_Deployments#RAM_and_CPU

This is your mail server; there is most likely no higher-profile application in your environment than this box. You'll want to think 3 to 5 years down the road when you spec out the box. Make sure to confirm:

  • It will scale or offer:
    • Hotpluggable technology
      • You'll need to confirm the OS can handle this
    • Redundant and Hot-swap power supplies
      • You might need to upgrade your power supplies depending on the "accessories" you put in it.
    • CPUs (There is NO reason to use a 32-bit chip - the memory limitations will kill you)
      • By default, a 32-bit Linux kernel only allows each process to address about 3GB of space. Through PAE (Physical Address Extension), a feature available in some CPUs, and a special 32-bit kernel that supports a large address space for processes, it is possible to get a 32-bit mode kernel that really uses > 4GB of RAM, and get a per-process 3-4GB address range
    • Memory
      • What are your onboard cache options?
      • How many slots are available? Does it force you to buy the large memory sticks to max out total memory - this increases cost?
      • Can you mix & match different memory size sticks? (This will increase costs when you go to scale)
      • Understand how memory interacts with multiple CPUs if your server has them
      • Understand the front side bus (FSB)
      • Does the motherboard support offlining bad memory (bad memory detection)?
    • Have growth for expansion cards and the right kind
      • What is PCI Express?
      • How PCI Express Works?
      • Know what slots you need for your network card, raid card, san card, etc.
        • Then do a sanity check against the motherboard and understand if the CPU and channels can allow the full throughput from all channels.
      • Is there room for redundancy with your expansion cards?

Memory

Other references: Performance_Tuning_Guidelines_for_Large_Deployments#RAM_and_CPU

When I was working with HPC, I found there was never a good reason to NOT at least start with 32GB of RAM. Now, a mail server isn't an HPC compute node - I understand that. But I would still try to spec out a system that has at least 8 memory slots (usually a ratio of x slots per CPU on the system) but allows me to use 4 4GB DIMMs, giving me 16GB of memory, and allows an upgrade path of 4 more 4GB or 8GB DIMMs. Get the higher speed DIMMs; you can't mix memory of different speeds.

Chart Of Memory Speeds
Memory type (interconnect bus) - Bit rate - Byte rate
PC2100 DDR-SDRAM (single channel) 16.8 Gbit/s 2.1 GB/s
PC1200 RDRAM (single-channel) 19.2 Gbit/s 2.4 GB/s
PC2700 DDR-SDRAM (single channel) 21.6 Gbit/s 2.7 GB/s
PC800 RDRAM (dual-channel) 25.6 Gbit/s 3.2 GB/s
PC1600 DDR-SDRAM (dual channel) 25.6 Gbit/s 3.2 GB/s
PC3200 DDR-SDRAM (single channel) 25.6 Gbit/s 3.2 GB/s
PC2-3200 DDR2-SDRAM (single channel) 25.6 Gbit/s 3.2 GB/s
PC1066 RDRAM (dual-channel) 33.6 Gbit/s 4.2 GB/s
PC2100 DDR-SDRAM (dual channel) 33.6 Gbit/s 4.2 GB/s
PC2-4200 DDR2-SDRAM (single channel) 34.136 Gbit/s 4.267 GB/s
PC4000 DDR-SDRAM (single channel) 34.3 Gbit/s 4.287 GB/s
PC1200 RDRAM (dual-channel) 38.4 Gbit/s 4.8 GB/s
PC2-5300 DDR2-SDRAM (single channel) 42.4 Gbit/s 5.3 GB/s
PC2-5400 DDR2-SDRAM (single channel) 42.664 Gbit/s 5.333 GB/s
PC2700 DDR-SDRAM (dual channel) 43.2 Gbit/s 5.4 GB/s
PC3200 DDR-SDRAM (dual channel) 51.2 Gbit/s 6.4 GB/s
PC2-3200 DDR2-SDRAM (dual channel) 51.2 Gbit/s 6.4 GB/s
PC2-6400 DDR2-SDRAM (single channel) 51.2 Gbit/s 6.4 GB/s
PC4000 DDR-SDRAM (dual channel) 67.2 Gbit/s 8.4 GB/s
PC2-4200 DDR2-SDRAM (dual channel) 67.2 Gbit/s 8.4 GB/s
PC2-5300 DDR2-SDRAM (dual channel) 84.8 Gbit/s 10.6 GB/s
PC2-5400 DDR2-SDRAM (dual channel) 85.328 Gbit/s 10.666 GB/s
PC2-6400 DDR2-SDRAM (dual channel) 102.4 Gbit/s 12.8 GB/s
PC2-8000 DDR2-SDRAM (dual channel) 128.0 Gbit/s 16.0 GB/s
PC2-8500 DDR2-SDRAM (dual channel) 136.0 Gbit/s 17 GB/s
PC3-8500 DDR3-SDRAM (dual channel) 136.0 Gbit/s 17 GB/s
PC3-10600 DDR3-SDRAM (dual channel) 165.6 Gbit/s 21.2 GB/s
PC3-12800 DDR3-SDRAM (dual channel) 204.8 Gbit/s 25.6 GB/s

Bus For Expansion Cards [Peripheral buses]

Typical uses are for network, san, raid cards. I'll deal with them separately under each section below.

Chart Of Bus Speeds
Interconnect - Max speed - Comments
PCI 2.0 132.0 MB/s
PCI 2.1 264.0 MB/s
PCI 2.2 528 MB/s
PCI-X 1.0 1 GB/s
PCI-X 2.0 4 GB/s
PCI-E (Express) Ver. 1.1 250 MB/s x2 - bi-directional These speeds are bi-directional per "lane". Meaning that they are the same going both ways and not shared.
Ver. 1.1 @ 1x 256 MB/s x2 - bi-directional PCI-E Ver. 1.1 notes.
Ver. 1.1 @ 2x 512 MB/s x2 - bi-directional PCI-E Ver. 1.1 notes.
Ver. 1.1 @ 4x 1 GB/s x2 - bi-directional PCI-E Ver. 1.1 notes.
Ver. 1.1 @ 8x 2 GB/s x2 - bi-directional PCI-E Ver. 1.1 notes.
Ver. 1.1 @ 16x 4 GB/s x2 - bi-directional PCI-E Ver. 1.1 notes.
PCI-E (Express) Ver. 2.0 400 MB/s x2 - bi-directional 500 MBs but there's a 20% overhead hit. These speeds are bi-directional per "lane". Meaning that they are the same going both ways and not shared.
Ver. 2 @ 1x 400 MB/s x2 - bi-directional PCI-E Ver. 2 notes.
Ver. 2 @ 4x 1600 MB/s x2 - bi-directional PCI-E Ver. 2 notes.
Ver. 2 @ 8x 3200 MB/s x2 - bi-directional PCI-E Ver. 2 notes.
Ver. 2 @ 16x 6400 MB/s x2 - bi-directional PCI-E Ver. 2 notes.
PCI-E (Express) Ver. 3.0 1 GB/s x2 - bi-directional The final spec is due in 2009. These speeds are bi-directional per "lane". Meaning that they are the same going both ways and not shared.
Front-side Bus replacements below
HyperTransport 1.x 6.4 GB/s x2 - bi-directional Bidirectional per 32 bit link at 800MHz. Front-side bus replacement, see HyperTransport for more details .
HyperTransport 2.0 11.2 GB/s x2 - bi-directional Bidirectional per 32 bit link at 1.4GHz. Front-side bus replacement, see HyperTransport for more details .
HyperTransport 3.0 20.8 GB/s x2 - bi-directional Bidirectional per 32 bit link at 2.6GHz/ Front-side bus replacement, see HyperTransport for more details .
HyperTransport 3.1 25.6 GB/s x2 - bi-directional Bidirectional per 32 bit link at 3.2GHz. Front-side bus replacement, see HyperTransport for more details .
QuickPath Interconnect 12.8 GB/s x2 - bi-directional Intel competitor to HyperTransport. Everything you wanted to know about QuickPath (QPI)

Network Infrastructure

Network Cards

Most motherboards will have integrated ethernet ports. If they are the same chipset then you might want to just channel bond these. What I have done in the past is use them for management ports and add in network cards for production activity.

Ideally, I would buy two cards that each had two Gb ports. I would then channel bond 2 of the ports across the cards, for 2 separate bondings. I would use one of those bonds for "front facing" traffic and the other for "backup" traffic. Remember to consider the bus infrastructure and the other cards when deciding which ones to get.

Channel Bonding

This will require you to confirm your switches can do this. There are different choices when it comes to channel bonding and some of them require the switch to support it - it usually involves "bonding" on the switch side by configuring the ports in question.
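As a rough illustration, here is a minimal RHEL-style bonding configuration sketch. The interface names (eth0/eth2), the address, and the bonding mode are placeholder assumptions; mode=802.3ad (LACP) requires matching switch-side configuration, while mode=active-backup does not.

# /etc/sysconfig/network-scripts/ifcfg-bond0  (hypothetical example)
DEVICE=bond0
IPADDR=192.168.10.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=802.3ad miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0  (repeat for eth2, the second card's port)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

After bringing the bond up, /proc/net/bonding/bond0 shows the active mode and per-slave link state, which is a quick way to confirm the switch side matches.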

The Production Channel Bond

The "production" bond is more in regards to failover. Make sure this happens as expected through the application layer. That proxies, firewall, and so forth allow on of the ports in the channel bond to go down without any end-user impact.

The Backup Channel Bond

The backup port is for throughput. You'll want to map out the network path and switches to the end host that data is being moved to. You'll also need to confirm the network path actually gives you a gain in performance by doing the channel bonding.

This will give you an excellent way to offload your tape backups, rsyncs (as shown at Ajcody-Notes-ServerMove), and maybe NFS mounts if you're using them.

Disk And Data Infrastructure

Other references: Performance_Tuning_Guidelines_for_Large_Deployments#Disk

Disk Types


References


Description of Items Used Below For Specific HDD's:

  • Notable Items
    • Benefits
    • Downsides
  • Bus Interface
    • Types:
    • Maximum Devices:
    • Bus Speeds:
  • Performance
    • I/Os per second & Sustained Transfer Rate:
    • Spindle Speed: This is the speed the drive's disk actually spins at. This ranges from 5400rpm to 15,000rpm. The higher the speed, the more often the data on the disk will be in the right position to be read by the drive heads, and the faster data can be transferred.
    • Average Access Time: This is the average time it takes to position the heads so that data can be read. The faster the better.
    • Cache Size: This is the size of the cache on board the disk drive itself. This does reach a point where doubling the size generates a very small boost in performance and is not worth the cost, but generally the bigger the better. It also assists in small bursts of data where you may actually achieve near maximum performance as data is read from cache rather than the drive.
    • Internal Transfer Rate: This is the speed that data can be transferred within the drive. This speed will be higher than the actual transfer rate of the drive as there is some overhead for protocol handling as data is transferred to the SCSI or IDE bus.
    • Latency: The time it takes for the selected sector to be positioned under the read/write head. Latency is directly related to the spindle speed of the drive and as such is influenced solely by the drive's spindle characteristics.
  • Reliability (MTBF or unrecoverable read error rate)
  • Capacity
  • Price
Ultra DMA ATA HDD's
  1. Notable Items
    • Benefits
    • Downsides
  2. Bus Interface
    • Types : ATA
  3. Performance
    • I/Os per second & Sustained transfer rate:
      1. Ultra DMA ATA 33 - 264 Mbit/s 33 MB/s
      2. Ultra DMA ATA 66 - 528 Mbit/s 66 MB/s
      3. Ultra DMA ATA 100 - 800 Mbit/s 100 MB/s
      4. Ultra DMA ATA 133 - 1,064 Mbit/s 133 MB/s
    • Spindle Speed:
    • Average Access Time:
    • Cache Size:
    • Internal Transfer Rate:
    • Latency:
  4. Reliability (MTBF or unrecoverable read error rate)
  5. Capacity
  6. Price
SATA HDD's
  1. Notable Items
    1. Benefits
      • SATA drives typically draw less power than traditional SAS HDDs due to slower RPM speeds.
      • SATA drives have the best dollar per gigabyte compared to SAS drives.
      • SATA HDDs can work on a SAS interface.
    2. Downsides
      • SATA HDDs are single port and not capable of being utilized in dual port environments without the addition of an interposer designed for this purpose.
  2. Bus Interface Type
    • Types: SATA , SAS
  3. Performance
    • I/Os per second & Sustained transfer rate:
      1. Serial ATA (SATA-150) - 1,500 Mbit/s 187.5 MB/s
        • Real speed: 150 MB/s
      2. Serial ATA (SATA-300) - 3,000 Mbit/s 375 MB/s
        • Real speed: 300 MB/s
        • (alternate names: SATA II or SATA2)
      3. Serial ATA (SATA-600) - 4,800 Mbit/s 600 MB/s
        • I've also seen this listed as SATA 6.0 Gbit/s (SATA 6Gb/s).
        • Standard is expected to be available before the end of 2008.
    • Spindle Speed: 7200 RPM , 5400 RPM
    • Average Access Time:
    • Cache Size:
    • Internal Transfer Rate:
    • Latency:
  4. Reliability (MTBF or unrecoverable read error rate)
  5. Capacity
  6. Price
SCSI (Parallel SCSI) HDD's
  1. Notable Items
    • Benefits
    • Downsides
  2. Bus Interface Type
    • Types: SCSI
    • Maximum Devices (On Single Channel):
      • Ultra Wide SCSI - 16
      • Ultra2 SCSI - 8
      • Ultra2 Wide SCSI - 16
      • Ultra3 SCSI - 16
        • (alternate names: Ultra-160, Fast-80 wide)
      • Ultra-320 SCSI - 16
        • (alternate name: Ultra4 SCSI)
      • Ultra-640 SCSI - 16
    • Bus Speeds:
      • The total amount of data that can be transferred throughout the whole channel.
      • Data transfers will step down to the rated speed of the drive. If you have an Ultra160 HDD on an Ultra320 controller, it will operate at 160 MB/s.
      • SCSI-3 : Also known as Ultra SCSI and fast-20 SCSI. Bus speeds at 20 MB/s for narrow (8 bit) systems and 40 MB/s for wide (16-bit).
      • Ultra-2 : Also known as LVD SCSI. Data transfer to 80 MB/s.
      • Ultra-3 : Also known as Ultra-160 SCSI. Data transfer to 160 MB/s.
      • Ultra-320 : Data transfer to 320 MB/s.
      • Ultra-640 : Also known as Fast-320. Data transfer to 640 MB/s.
  3. Performance
    • I/Os per second & Sustained Transfer Rate:
      • Ultra Wide SCSI 40 (16 bits/20MHz) - 320 Mbit/s 40 MB/s
        • Real speed: 40 MB/s
      • Ultra2 SCSI
        • Real speed: 40 MB/s
      • Ultra2 Wide SCSI 80 (16 bits/40 MHz) - 640 Mbit/s 80 MB/s
        • Real speed: 80 MB/s
      • Ultra3 SCSI 160 (16 bits/80 MHz DDR) - 1,280 Mbit/s 160 MB/s
        • (alternate names: Ultra-160, Fast-80 wide)
        • Real speed: 160 MB/s
      • Ultra-320 SCSI (16 bits/80 MHz DDR) - 2,560 Mbit/s 320 MB/s
        • (alternate name: Ultra4 SCSI)
        • Real speed: 320 MB/s
      • Ultra-640 SCSI (16 bits/160 MHz DDR) - 5,120 Mbit/s 640 MB/s
        • Real speed: 640 MB/s
    • Spindle Speed:
    • Average Access Time:
    • Cache Size:
    • Internal Transfer Rate:
    • Latency:
  4. Reliability (MTBF or unrecoverable read error rate)
  5. Capacity
  6. Price
SAS (Serial Attached SCSI) HDD's
  1. Notable Items
    1. Benefits
      • SAS HDDs are true dual port, full duplex devices. This means SAS HDDs can simultaneously process commands on both ports.
      • All SAS HDDs are hot-swap capable. Users can add or remove an HDD without disrupting the enterprise environment.
      • SAS HDDs can support online firmware update (check vendor). This allows users to update firmware on the SAS HDD without having to schedule downtime.
    2. Downsides
      • SAS HDDs cannot be used on the older architecture SCSI backplanes or cables.
      • SAS HDDs typically draw more power than the equivalent SATA counterparts.
  2. Bus Interface Type
    • Types: SAS
    • Maximum Devices:
      • SAS 16,256
        • 128 devices per port expanders
    • Bus Speeds:
  3. Performance
    • I/O per second & Sustained transfer rate:
      • Serial Attached SCSI (SAS) - 3,000 Mbit/s 375 MB/s
        • Real Speed: 300 MB/s (full duplex , per direction)
      • Serial Attached SCSI 2 - 6,000 Mbit/s 750 MB/s
        • Planned
    • Spindle Speed: 15000 RPM , 10000 RPM , 7200 RPM
    • Average Access Time:
    • Cache Size:
    • Internal Transfer Rate:
    • Latency:
  4. Reliability (MTBF or unrecoverable read error rate)
    • 10K & 15K SAS HDDs has been rated at 1.6 million hours MTBF
  5. Capacity
  6. Price
Fibre Channel - Arbitrated Loop (FC-AL) HDD's
  1. Notable Items
    1. Benefits
      • FC-AL HDDs are dual ported, providing two simultaneous input/output sessions
      • FC-AL HDDs are hot-swap capable so users can add and remove hard drives without interrupting system operation.
    2. Downsides
      • FC-AL HDDs are typically utilized in unique environments and are not compatible with SAS or SATA interfaces
      • Long term, SAS is projected to replace FC-AL HDDs within the IT industry
  2. Bus Interface Type
    • Types: FC-AL
    • Maximum Devices:
      • FC-AL in a private loop with 8-bit ID's - 127
      • FC-AL in a public loop with 24-bit ID's - +16 million
    • Bus Speeds:
  3. Performance
    • I/Os per second & Sustained transfer rate:
      • FC-AL 1Gb 100 MB/s (full duplex , per direction)
      • FC-AL 2Gb 200 MB/s (full duplex , per direction)
      • FC-AL 4Gb 400 MB/s (full duplex , per direction)
    • Spindle Speed:
    • Average Access Time:
    • Cache Size:
    • Internal Transfer Rate:
    • Latency:
  4. Reliability (MTBF or unrecoverable read error rate)
  5. Capacity
  6. Price

Raid Cards (Not Raid Level)


  • If the raid chip on your motherboard goes out, what is your expected downtime to resolve it?
    • Isn't it easier to buy a hot spare raid card and swap it in if a raid card fails?

SAN Topologies - Interfaces


General Questions
San Cards
  • Have you planned on multipathing?
    • You do have more than one port/card, right?
      • If not (because of cost), do you at least have an open slot on the motherboard to add another one later?
      • Remember to consider this with your SAN switch purchase
    • Confirm that your HBAs, HBA drivers, and SAN switches allow for the options you want
    • multipathing can be for failover
    • multipathing can increase throughput for same partition mount
iSCSI Interface

So far, physical devices have not featured native iSCSI interfaces on a component level. Instead, devices with SCSI Parallel Interface or Fibre Channel interfaces are bridged by using iSCSI target software, external bridges, or controllers internal to the device enclosure. Source

  • iSCSI over Fast Ethernet 100 Mbit/s 12.5 MB/s
  • iSCSI over Gigabit Ethernet 1,000 Mbit/s 125 MB/s
  • iSCSI over 10G Ethernet (Very few products exist) 10,000 Mbit/s 1,250 MB/s
  • iSCSI over 100G Ethernet (Planned) 100,000 Mbit/s 12,500 MB/s
iSCSI Performance Topics

One will find the following recommendations regarding iSCSI (a quick MTU sanity check follows the list):

  • Private network for iscsi traffic
  • Jumbo frames enabled with MTU size of 9000
    • All equipment must support and be enabled for this from point to point.
      • Remember, point to point means from the "server" to the host serving the iSCSI disk array.
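A quick way to sanity-check jumbo frames end to end; the interface name and target IP are placeholders:

# Set a 9000-byte MTU on the dedicated iSCSI interface
ip link set dev eth3 mtu 9000

# Verify the path passes 9000-byte frames without fragmenting:
# 8972 = 9000 minus 28 bytes of IP/ICMP header; -M do forbids fragmentation
ping -M do -s 8972 <iscsi-target-ip>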
iSCSI References
Serial Attached SCSI (SAS) Interface
  1. Notable Items
    1. Benefits
      • SAS interface protocol supports both SAS and SATA Hard Disk Drives allowing tiered storage management.
      • SAS supports seamless scalability through port expansion enabling customers to daisy chain multiple storage enclosures.
      • SAS supports port aggregation via a x4 wide-link for a full external connection bandwidth of up to 12.0 Gbps (1200MBps) on a single cable and a single connection.
      • SAS is a point-to-point interface allowing each device on a connection to have the entire bandwidth available. The current bandwidth of each SAS port is 3Gb/sec. with future generations aimed at 6Gb/sec and beyond.
    2. Downsides
      • SAS is not backwards compatible with U320 SCSI or previous SCSI generations
Fibre Channel - Arbitrated Loop (FC-AL)
  1. Notable Items
    1. Benefits
      • FC-AL devices can be dual ported, providing two simultaneous input/output sessions that doubles maximum throughput
      • FC-AL enables "hot swapping," so you can add and remove hard drives without interrupting system operation, an important option in server environments.
    2. Downsides
      • FC-AL adapters tend to cost more than SAS adapters
      • FC-AL is currently the fastest interface at 4Gb but is expected to be passed in maximum bandwidth by the next generation of SAS interface at 6Gb

Multipathing


General References
Multipathing And SAN Persistent Binding
Multipathing And LVM
Multipathing - Redundant Disk Array Controller [RDAC] (LSI mpp driver)

References:

Multipathing - Fibreutils - sg3_utils (QLogic)

References:

Multipathing - lpfc (Emulex)

References:

Multipathing - device-mapper [dm] & multipath-tools (Linux OSS driver & tools)

DM is the open source multipath driver.
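A minimal sketch of the dm-multipath pieces, assuming a stock multipath-tools install; real deployments usually add the vendor-specific device sections supplied by the storage vendor:

# /etc/multipath.conf - bare-bones example
defaults {
    user_friendly_names yes
}

# After starting multipathd, show the multipath topology and per-path states
multipath -ll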

References:

Multipathing - Veritas (Symantec)

References:

Multipathing - HP & Linux

References:

Multipathing - NetApp

References:

Multipathing - EMC (PowerPath)

References:

Raid Levels For Disks


General References

References:

Raid 10

RAID 10 (or 1+0) uses both striping and mirroring.

Reference - http://bugzilla.zimbra.com/show_bug.cgi?id=10700 - I fixed the statement based upon a private comment in this RFE.

Zimbra recommends Raid 10 for the:
Mailstore and Logger [MySQL] databases - /opt/zimbra/db & /opt/zimbra/logger
Indexing Volume [Lucene] database - /opt/zimbra/index
Raid 10 is NOT Raid 0+1
RAID 0+1 takes 2 RAID0 volumes and mirrors them. If one drive in one of the underlying RAID0 groups dies, that RAID0 group is unreadable. If a single drive simultaneously dies in the other RAID0 group, the whole RAID 0+1 volume is unreadable.
RAID 1+0 takes "mirrored slices" and builds a RAID0 on top of them. A RAID10 volume can lose a maximum of half of its underlying disks at once without data loss (as long as only one disk from each "slice pair" is lost.)
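For software raid, mdadm can build RAID 10 directly; a minimal sketch, assuming four spare disks with placeholder device names:

# Create a 4-disk RAID 10 (mirrored pairs, striped across the pairs)
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Watch the initial resync progress
cat /proc/mdstat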
Raid 5

RAID 5 (striped disks with parity) combines three or more disks in a way that protects data against loss of any one disk; the storage capacity of the array is reduced by one disk.

Reference - http://bugzilla.zimbra.com/show_bug.cgi?id=10700

Zimbra does NOT recommend RAID 5 for the MySQL database and Lucene volumes (RAID 5 has poor write performance, and as such is generally not recommended by MySQL, Oracle, and other database vendors for anything but read-only datastores. We have seen order of magnitude performance degradation for the embedded Zimbra database running on RAID 5!).
Mailstore and Logger [MySQL] databases - /opt/zimbra/db & /opt/zimbra/logger
Indexing Volume [Lucene] database - /opt/zimbra/index
Raid 6

RAID 6 (less common) can recover from the loss of two disks. Basically an extension of Raid 5.

Filesystem Choices


Other references:

Bugs/RFE's that reference filesystems:

Most configurations go with ext3 because it's the implied default. My own experience has me always using XFS, except for the root ( / ) partition.
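As an illustration only (the volume and mount point names are placeholders), creating and later growing an XFS filesystem for a dedicated Zimbra partition might look like this:

# Create the filesystem on a dedicated LVM volume
mkfs.xfs /dev/VolGroupZimbra/LvZimbra

# /etc/fstab entry - noatime is a common tuning choice for mail stores
/dev/VolGroupZimbra/LvZimbra  /opt/zimbra  xfs  defaults,noatime  0 0

# XFS grows online after the underlying LV has been extended
xfs_growfs /opt/zimbra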

Filesystem Feature And Option Descriptions
Journaling
Inodes

To see existing inode use:

# df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/VolGroup00-LogVol00
                     19005440  158715 18846725    1% /
/dev/sda1              26104      39   26065    1% /boot
Block Size
Volume Managers

LVM / LVM2 (Linux)
EVMS (IBM's Enterprise Volume Management System)
Veritas Volume Manager
Regular File Systems

XFS

Supported

  • 1. Bugs and RFE's
  • 2. Journaling
    • A. "XFS uses what is called a metadata journal. Basically, this means that every disc transaction is written in a journal before it is written to the disc and then marked as "done" in the journal when it finishes. If the system crashes during the writing of the journal entry, that incomplete entry can be ignored since the data on the disc has not been touched yet and if the journal entry is not marked done, then that operation can be rolled back to preserve disc integrity. Its a very nice system. As stated above, XFS practices a type of journaling called "metadata journaling." This means only the inodes are journaled, not the actual data. This will preserve the integrity of the file system, but does not preserve the integrity of the data." reference: Filesystem Design Part 1 : XFS
    • B.
  • 3. Inodes
    • A. "FS considers dynamic allocation of inodes and keeps track of such inodes using B+Trees. Each allocation group uses a B+Tree to index the locations of inodes in it. This allows to create millions of inodes in each allocation group and thus supporting large number of files" reference: PDF Warning - Failure Analysis of SGI XFS File System
    • B.
  • Resources:
EXT3 - EXT4

Supported

Reiser
vxfs - Veritas
  • 1.
    • A.
    • B.
  • 2.
    • A.
    • B.
  • 3. Dynamic inode allocation
    • A. A description of sorts of what this means, "Vxfs is dynamic, so the inodes are basically created on the fly. However, it's a bit inaccurate to actually state that inodes are dynamic in VxFS and leave it at that (somewhat confusing I know). Vxfs really doesn't use them per se. It creates them to be compatible with UFS. VxFS uses extent-based allocation rather than the block-based UFS controlled by inodes. So, the question of how many inodes for vxfs, is as many as it needs."
    • B. http://www.docs.hp.com/en/B3929-90011/ch02s04.html
  • Resources:
BTRFS - Oracle's Better File System for Linux (with Redhat, Intel, HP)


ZFS
  • 1.
    • A.
    • B.
  • 2.
    • A.
    • B.
  • Resources:
Network File Systems

NFS
  • 1. Only supported for the Zimbra backup directory or when used internally to Vmware vSphere at this time. Please see Bug/RFE and comments below.
  • 2.
    • A.
    • B.
  • Resources:
  • Bugs/RFE's:
    • "Need clarity on supporting nfs mounted zimbra directories - report error/msg if nfs mount is present"
    • "Zimbra on NFS Storage through VMware ESX"
      • http://bugzilla.zimbra.com/show_bug.cgi?id=50635
      • Note - I asked for this RFE to be made private since it is really an internal request to do testing/QA of NFS with vSphere. I asked for another, publicly viewable RFE to be created that will let customers know when they can deploy with it under production use.

Proposed inclusion to the release notes from the Bug/RFE above:

ZCS & NFS:
Zimbra will support customers that store backups (e.g. /opt/zimbra/backup) on an NFS-mounted partition. Please note that this does not relieve the customer from the responsibility of providing a storage system with a performance level appropriate to their desired backup and restore times. In our experience, network-based storage access is more likely to encounter latency or disconnects than is equivalent storage attached by fiber channel or direct SCSI.
Zimbra continues to view NFS storage of other parts of the system as unsupported. Our testing has shown poor read and write performance for small files over NFS implementations, and as such we view it unlikely that this policy will change for the index and database stores. We will continue to evaluate support for NFS for the message store as customer demand warrants.
When working with Zimbra Support on related issues, the customer must please disclose that the backup storage used is NFS.
Samba - SMB/CIFS
  • References:
    • The Unofficial Samba How-To - http://www.oregontechsupport.com/samba/
Global , Distributed , Cloud, Cluster-Type Filesystems - Unsupported

Currently, Zimbra does not support or recommend the use of the various filesystems listed under this section. Please see the specific section to see if I've identified any existing bugs/RFE's against them. One general RFE for this topic is:


GFS
  • References:
    • Red Hat Global File System Homepage - http://www.redhat.com/gfs/
Lustre [Acquired by Sun]
  • References:
    • Lustre Wiki - http://wiki.lustre.org/index.php?title=Main_Page
Hadoop Distributed File System (HDFS) - Cluster Filesystem Project From Apache & Yahoo!
  • References:
    • Hadoop General Project Page - http://hadoop.apache.org/core/
SGI CXFS
  • References:
    • SGI CXFS Homepage - http://www.sgi.com/products/storage/cxfs.html
IBM GPFS
  • References:
    • IBM General Parallel File System Homepage - http://www-03.ibm.com/systems/clusters/software/gpfs/index.html
Veritas Storage Foundation Cluster File System
  • Resources:
    • Veritas Storage Foundation Cluster File System Product Page - http://www.symantec.com/business/storage-foundation-cluster-file-system
    • Veritas Storage Foundation Product Page - http://www.symantec.com/business/storage-foundation
Google Computing

Unsupported at this time by Zimbra Support.

References:
  • Google Cloud Platform Homepage - https://cloud.google.com/compute/
  • Support, QA, testing for Google Compute Engine - https://bugzilla.zimbra.com/show_bug.cgi?id=96125

Amazon S3 - And Amazon EC2 Information

Unsupported at this time by Zimbra Support.

References:
  • Ajcody-Backup-Restore-Issues#Amazon_S3_.2C_Amazon_EC2_.2C_SecoBackup_And.2FOr_Tar
  • http://www.zimbra.com/forums/installation/25306-remote-backup-cloud-storage-amazon-s3.html

Still going to try this even though it's not supported? See the following if you run into trouble and if you're doing a new setup:
  • http://www.zimbra.com/forums/installation/47123-installing-zimbra-ubuntu-10-04-amazon-ec2.html
    • Use this URL instead; the one in the forum post is bad:
      • http://elijahpaul.co.uk/installing-zimbra-7-0-zcs-on-ubuntu-10-04-lts-using-amazon-aws/

Other Products - Applications (DRBD, Veritas Volume Replicator, etc.)

Unsupported at this time by Zimbra Support.


Background Items To Review Before Drawing Up A Plan


When you start putting all this together you'll find there are a lot of "exceptions" to work out based upon your needs, equipment, kernel version, distro choice, SAN equipment, HBAs, OSS drivers vs. vendor ones, and so forth. You must TEST all your assumptions when you're planning on using methods that will provide online filesystem growth. Don't deploy based just on an assumption. Some situations will not allow a true online filesystem growth.

References to review:

Online Resizing Of LUN Example 1 - Multipathing Required
Warning, this process reminds me of what I was doing over a year ago [before zimbra]. I don't have the necessary hardware or my old notes to go through the steps and confirm. A consolidated command sketch follows this list.
  1. Resize the SAN volume.
  2. Reinitialize the HBA, e.g. using sg_reset or some module specific method.
  3. Rescan the SCSI device:
    • echo 1 > /sys/block/[sdx]/device/rescan.
  4. Now confirm that /proc/partitions should contain updated values. The fdisk and sfdisk commands may still (most likely) see the old values.
  5. Remove and readd the SCSI Devices.
    • Warning: Without multipathing you are going to loose access to your disk!! Multipathing gives you several paths to the volume, so you will not loose access. Make sure multipathd is up and running.
      • echo "scsi remove-single-device <Host> <Channel> <SCSI ID> <LUN>" > /proc/scsi/scsi
      • echo "scsi add-single-device <Host> <Channel> <SCSI ID> <LUN>" > /proc/scsi/scsi
    • This will also issue new device names!
      • Multipathing updated it's device size automatically after I reloaded the last path-device.
  6. Run:
      • If you have LVM: vgresize, lvresize if you have a lvm setup.
        • Need to double check for the need of pvresize & vgextend use first.
      • Your filesystem online grow commands if you have just a filesystem.
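Pulling the steps above together, a consolidated sketch; the device, VG, and LV names are placeholders, this assumes multipathd is running, and it should be tested on non-production storage first:

echo 1 > /sys/block/sdX/device/rescan      # repeat for every path device to the LUN
cat /proc/partitions                       # confirm the kernel sees the new size
pvresize -v /dev/mapper/mpathX             # grow the physical volume on the multipath device
lvextend -l +100%FREE /dev/VgName/LvName   # grow the logical volume into the new space
resize2fs /dev/VgName/LvName               # ext3 online grow (use xfs_growfs for XFS)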
Q&A
  • Q. With multipathing failover, the block device is being changed in size, potentially during a write (this was, after all, online). What happens if data is being written to the device while you're doing the disconnect/reconnect operations? Do you end up with two conflicting pieces of information about the device?
    • A. The write is being done at the "filesystem" level, which is unaware of the block device size change.

Example Filesystem Layouts


Local (Server) Disk Configuration

I would have 3 disk subsystems at a minimum. Meaning, three distinct groups of disks, each on its own disk I/O backplane, if possible. For example:

  1. Two disks using the motherboard provided SATA/ATA ports. (OS)
  2. Multiple disks accessed through a dedicated SAN or SCSI card. Maybe these are the hot-swap disk array available on the front of your server. (Backup)
  3. Multiple disks accessed through another dedicated SAN or SCSI card. Maybe these are the disks available from an external disk array on your server, or external disks via SAN/iSCSI. (Zimbra)
OS - /boot , swap , /tmp , /
Your first disk is referred to as disk1 or sda. Your second disk is referred to as disk2 or sdb. Partitions will start at 1 , rather than 0, throughout my examples below.
For the OS partitions, there's no reason to use anything but "default" for the mounting options in the /etc/fstab. If you reviewed the Performance_Tuning_Guidelines_for_Large_Deployments#File_System wiki page, those settings would apply to a dedicated partition(s) used for zimbra - not the root OS level partitions.
  1. For OS ( /boot , swap, /tmp , / )
    • A. These could be whatever disks you have (SATA, SCSI, etc.).
      • 1. Two disk minimum for mirroring. Four disks would allow raid 10.
        • A. I prefer to use software raid, because then the "raid" will move with the disks with fewer complications if my server dies. I can simply move the disks into another server chassis.
        • B. Most motherboards will have at least two SATA ports and drive slots. Make sure you put your drives on different channels if you have the option, rather than on the same one.
    • B. /boot partition
      • 1. I would setup a /boot partition (128MB or 256MB) on each drive. This would be my first partition on each drive.
        • A. Disk 1 > partition type linux > partition 1 for 128MB or 256MB > filesystem ext3 > no "raiding" > NO LVM > mount point /boot
        • B. Disk 2 > partition type linux > partition 1 for 128MB or 256MB > filesystem ext3 > no "raiding" > NO LVM > mount point /boot2
        • 2. After OS install, I would then "rsync/copy" the contents of /boot into /boot2
        • 3. Configure grub or your bootloader to have the option to boot from /boot2 (disk2/partition1)
          • A. This will give you a failover in case something goes wrong with /boot (disk1/partition1). A bad kernel upgrade or a whacked partition, maybe.
        • 4. After any changes to /boot (disk1/partition1), confirm everything works right (confirm it reboots fine), then do another manual rsync/copy of /boot to /boot2 (disk2/partition1).
    • C. swap
      • 1. Determine how much swap you "need". I will use an example of 2GB's below. Notice both "swap" partitions will get 2 GB's.
      • 2. Setup a swap partition on disk1/partition2 and disk2/partition2 for 2 GB's.
        • A. Disk 1 > partition type - swap > partition 2 for 2GB > filesystem swap > no "raiding" > NO LVM > mount point swap
        • B. Disk 2 > partition type - swap > partition 2 for 2GB > filesystem swap > no "raiding" > NO LVM > not set to mount
      • 3. *Default Suggestion* Configure the OS to only use/mount swap on disk1/partition2. You can configure /etc/fstab for the other partition but just comment out the line for now.
      • 4. Reasoning for this.
        • A. swap drives that are "mirror" can cause undue complications.
        • B. I configured a "same" sized swap on the other drive really for the third partition - that for /. This way the blocking/sizes are as close as possible for the / mirror.
        • C. It isn't really a "loss" of space, but rather room for adjustments you might need later.
          • 1. It can serve as a "failover" swap partition in case things go bad on disk1/partition2.
          • 2. It can serve as more "production" swap if you find you need it.
          • 3. It allows for a complete disk failover or simplicity in case you need to move the "one" disk to another box.
        • 5. You could also configure the two swap partitions into a raid0 if you would like. Swap partitions can be turned on/off (swapon , swapoff). It's easy enough to reformat them as well, if something goes wrong, after a swapoff.
    • D. /tmp
      • 1. I generally never setup a partition for /tmp , but if you did decide to do this make sure it's following the swap partitions. ( IMO )
        • A. If you do setup /tmp , I would go with ext3 or xfs and put it within LVM. See notes below about LVM use.
    • E. / - Introduction of LVM
      • 1. You'll now place the rest of the "free" disks under a "software" raid1 / mirror partition.
        • A. You should then see this "mirror" as a new disk/partition to be used within the OS installer.
      • 2. Place the "mirror" under LVM - this will be a "partition" type. Let's assume this new mirror is now /dev/md0.
      • 3. Now configure the LVM partition for LVM. I'm assuming that a partition wasn't made for /tmp
        • A. General concepts of what happens with the LVM setup and recommendations on the naming scheme: sda3 is disk1/partition3 and sdb3 is disk2/partition3. (A consolidated command sketch follows this list.)
          • 1. pvcreate md0
          • 2. vgcreate -s 128M rootvg md0
            • A. vgchange -a y rootvg
          • 3. lvcreate -L 150GB -n rootlv rootvg
            • A. I put in 150GB above as an example. I would normally put in about 90% of the available space left, leaving me some room to create a new partition/mount point if it becomes necessary. For example, let's say / kept filling up because of lazy admins not cleaning up in /root. I could create a new vg group called roothomevg and mount it as /root, restricting the messy admins from affecting / .
          • 4. Now you would make the filesystem.
            • A. Example of setting up ext3 filesystem using defaults.
              • 1. mkfs.ext3 /dev/rootvg/rootlv
          • 5. Now you need to set up the new filesystem in /etc/fstab to mount. I'll use /root as an example, considering the / would have been done through the installer and you would most likely have used the GUI tools to do all of this.
            • A. mkdir /root (if it didn't exist - this example would have, of course)
            • B. vi /etc/fstab
              • 1. /dev/rootvg/rootlv /root ext3 defaults 1 2
                • A. Adjust for your system defaults in regards to the device naming convention that's being used.
      • 4. So why did I bother with LVM if I used all the disk space anyways?
        • 1. If you setup the / with LVM at the start, even if you use all the disk space, it allows you in the future to add more disk to the underlying LVM setup to grow the / filesystem - online.
          • A. For example, let's say I have 2 open drive bays that weren't initially used when I setup my server. And two years later I find my / becoming very close to 100%. I can throw in two new drives into those bays (assuming hot-swap drives). Setup a mirror (mdadm) between the two drives. Set the new mirror partition type to LVM. Then run through the pvcreate , vgextend , lvresize , and then online grow the filesystem (ext2online/resize2fs , xfs_growfs , etc.)
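Pulling the / setup above into one place, a consolidated command sketch; device names and sizes are placeholders, and most installers do the equivalent through their GUI:

# Mirror the two remaining partitions, then layer LVM on top
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
pvcreate /dev/md0
vgcreate -s 128M rootvg /dev/md0
lvcreate -L 150G -n rootlv rootvg
mkfs.ext3 /dev/rootvg/rootlv

# /etc/fstab entry (adjust the device naming to your distro's convention)
/dev/rootvg/rootlv  /root  ext3  defaults  1 2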
Backup
  1. /opt/zimbra/backup
    • A. I would make sure the disk I/O is separate from /opt/zimbra. This way you minimize performance hits to your end-users. Review that the disk I/O bus path is as clean as possible to the CPU/memory. Motherboard specs should tell you which "slots" are on shared buses. Make sure you're maximizing your raid/san card's performance along the bus path to the CPU.
    • B. I would make this some multiple of the /opt/zimbra space that you're spec'ing. The default backup schedule is to purge sessions older than one month. This means you'll have 4 fulls and 24 incremental sessions.
      • 1. Make sure you investigate the auto-group backup method as well and the zip option. This could have a huge effect on your disk space requirements for the backup partition.
    • C. If disks are local
      • 1. Raid Choices
        • A. If you'll be sending this to tape as well, you could go with a raid0 for performance.
        • B. If you don't plan on sending to tape or another remote storage system, maybe raid5.
      • 2. LVM
        • A. Please encapsulate the backup partition under LVM. This will give you some flexibility later in case you need more space.
      • 3. Other Topics
    • D. If disks are on SAN
      • 1. Raid Choices
        • A. If you'll be sending this to tape as well, you could go with a raid0 for performance.
        • B. If you don't plan on sending to tape or another remote storage system, maybe raid5.
        • C. Note, if you have a lot of disks to construct your raid with, you can still achieve good performance with raid5. This way you don't have to lose all the "disks" when doing raid10. It would be worth benchmarking your SAN using x number of disks configured as raid10 vs. the same x number of disks as raid5. Remember to consider the i/o backplanes involved in which "disks" you choose to use throughout your different disk chassis - going up and down your disk rack vs. left to right.
      • 2. LVM
        • A. Please encapsulate the backup partition under LVM. This will give you some flexibility later in case you need more space.
        • B. If your NAS/SAN system is going to do block level snap-shots, the choice to use LVM or not becomes more complicated. A block level snap-shot across multiple LUN's will generally not work when the top level filesystem is using LVM. If you plan on only using one LUN and growing that same LUN as needed, then LVM will still prove useful if you're also using the SAN/NAS block level snap-shots.
      • 3. Other Topics
Zimbra
  1. /opt/zimbra
    • A. Remember you have logging data in here as well. If this partition becomes full, Zimbra will hang and it could cause database corruption as well.
    • C. If disks are local
      • 1. Raid Choices
        • A. Raid 0 or Raid 10
        • B.
      • 2. LVM
        • A. Please encapsulate the partition under LVM. This will give you some flexibility later in case you need more space.
        • B.
    • D. If disks are on SAN / NAS
      • 1. Raid Choices
        • A. Raid 0 or Raid 10
        • B. Note, if you have a lot of disks to construct your raid with, you can still achieve good performance with raid5. This way you don't have to lose all the "disks" when doing raid10. It would be worth benchmarking your SAN using x number of disks configured as raid10 vs. the same x number of disks as raid5. Remember to consider the i/o backplanes involved in which "disks" you choose to use throughout your different disk chassis - going up and down your disk rack vs. left to right.
      • 2. LVM
        • A. Please encapsulate the partitions under LVM. This will give you some flexibility later in case you need more space.
        • B. If your NAS/SAN system is going to do block level snap-shots, the choice to use LVM or not becomes more complicated. A block level snap-shot across multiple LUN's will generally not work when the top level filesystem is using LVM. If you plan on only using one LUN and growing that same LUN as needed, then LVM will still prove useful if you're also using the SAN/NAS block level snap-shots.
      • 3. Other Topics
More Details About LVM Use

The notes below were gathered from the "Zimbra Admins in Universities" mailing list.

Contents of Post by Matt on Date: Mon, 27 Oct 2008 14:24:50 -0500

We've found however that we are able to grow storage on the fly with LVM.  
It basically works like this for us...

    *  Grow the LUN on the SAN
          o wait for the LUN to finish growing by checking the 'Jobs / Current Jobs' 
            display until the "Volume Initialization" job is finished. 
    * reboot the host to see the new partition size.
          o (partprobe -s is supposed to do this, but it doesn't) 
    * find the device name:
          o pvs | grep VolumeName? 
    * grow the Volume Group:
          o pvresize -v /dev/XXX 
    * Verify the new size:
          o pvs | grep VolumeName? 
    * grow the logical volume:
          o Grow by a specific size: lvextend --size +NNg /dev/VgName/LvName
          o Grow to use all free space: lvextend -l +100%FREE /dev/VgName/LvName 
    * grow the file system:
          o Online method (dangerous?)
                + ext2online /dev/VgName/LvName 
          o Offline method (safer?)
                + umount /mountpoint
                + e2fsck -f /dev/Vgname/Lvname
                + resize2fs /dev/Vgname/LvName
                + mount /dev/VgName/LvName /mountpoint 
    * Verify new filesystem size:
          o df -h /mountpoint 

I've always used the online method (marked "dangerous?" by one of my cohorts) and 
never had a problem.  One other thing we've been able to do with LVM that has been 
a benefit is migrating data to a new LUN...

   1.  Find the new physical volume that is associated with the correct LUN#. On 
       the Zimbra servers you can use this MPP (linuxrdac) tool.

      # /opt/mpp/lsvdev

   2. Prepare the physical volume with PVCREATE.

      # pvcreate /dev/sdX

   3. Extend the logical volume to the new physical volume with VGEXTEND.

      # vgextend /dev/VolGroupName /dev/sdX

   4. Use LVDISPLAY to make sure you are moving from the right physical volume.

      # lvdisplay /dev/VolGroupName/LogVolName -m

      Example Results
      ===========
        --- Logical volume ---
        LV Name                /dev/VgMbs03Backup/LvMbs03Backup
        VG Name                VgMbs03Backup
        LV UUID                0vZQx3-5A22-a4ZO-4VmV-2naM-jwoi-yc6r6k
        LV Write Access        read/write
        LV Status              available
        # open                 0
        LV Size                580.00 GB
        Current LE             148479
        Segments               1
        Allocation             inherit
        Read ahead sectors     0
        Block device           253:6
         
        --- Segments ---
        Logical extent 0 to 148478:
          Type                linear
          Physical volume     /dev/sdab
          Physical extents    0 to 148478

   5. Move the Volume Group to the new physical volume using PVMOVE.

      # pvmove -i 60 -v /dev/sdZ /dev/sdX

      -i 60     : Show progress every 60 seconds
      -v         : Verbose
      /dev/sdZ     : Physical volume we are moving from
      /dev/sdX     : Physical volume we are moving to

   6. When the move is completed use VGREDUCE to reduce the volume group down to 
      the new physical volume.

      # vgreduce /dev/VolGroupName /dev/sdZ

A reply was given to the post above, noting the issue of the reboot in the steps above. Rich, Mon, 27 Oct 2008 15:24:06 -0500, wrote:

Any process including the term "reboot" isn't "on the fly." :-)

Current proprietary OSes can rescan and use expanded LUNs on the fly while 
filesystems are mounted. Apparently, so can the latest development Linux kernel,
but stacked device-mapper and LVM layers will need major changes, so don't 
expect to see this capability in enterprise linices for 2 years.

You can save some time, though, by replacing "reboot" with the minimum steps 
required to clear all holders of the device:

umount /file/system
vgchange -an VG
service multipathd stop
multipath -ll # note the physical devices involved, here assuming sd[fg]
multipath -f mpathVG
echo 1 > /sys/block/sdf/device/rescan # partprobe -s might also do this
echo 1 > /sys/block/sdg/device/rescan
multipath -v2
service multipathd start

...and continue with the pvresize. But simply adding a new LUN and marking it 
active with the admin console (or zmvolume) can be done with zero downtime, 
so that's my new model.

SAN Layout As Recommend For Clusters

This is from Multi Node Cluster Installation Guide - PDF

Preparing the SAN

You can place all service data on a single volume or choose to place the service data in multiple volumes. Configure the SAN device and create the partitions for the volumes.

  • Single Volume SAN Mount Point - /opt or /opt/zimbra
    • If you select to configure the SAN in one volume with subdirectories, all service data goes under a single SAN volume.
  • Multiple Volumes For SAN Mount Points
    • If you select to partition the SAN into multiple volumes, the SAN device is partitioned to provide the multiple volumes for each Zimbra mailbox server in the cluster. Example of the type of volumes that can be created follows.
      • /opt Volume for ZCS software (or really, /opt/zimbra/ )
        • Directories under /opt/zimbra/
          • conf Volume for the service-specific configuration files
          • log Volume for the local logs for Zimbra mailbox server
          • redolog Volume for the redo logs for the Zimbra mailbox server
          • db/data Volume for the MySQL data files for the data store
          • store Volume for the message files
          • index Volume for the search index files
          • backup Volume for the backup files
          • logger/db/data Volume for the MySQL data files for logger service’s MySQL instance
          • openldap-data Volume for OpenLDAP data
  • Note, for a multi-volume SAN Cluster, you'll actually create the directory path differently. [ /opt/zimbra-cluster/mountpoints <clusterservicename.com> ] Please see the Cluster Installation Guides for the full planning recommendations and steps if this is what you're going to do.

Zimbra Directory Layout & FHS (Filesystem Hierarchy Standard)

FHS

References:

Bugs/RFE's

Community Feedback
  • Feedback Reference One
We are following FHS[1] standard for our deployments (or at least trying 
our best to follow it). It would be nice to reflect on the possibilities 
of mostly FHS-compliant Zimbra deploy. Here's what we've came up with so 
far:

/etc/opt/zimbra for configs
/opt/zimbra - binaries
/var/opt/zimbra - message store, OpenLDAP db, MySQL db's etc.
/var/log/zimbra - logs

Going by the FHS standards (in our case) means deploying a well-documented
system and that its layout is consistent across the board. 

Benefits:
* A paranoid type setup like mounting /opt read-only, and /var as no-exec. 
* For the uber-paranoia, including /etc as read-only. 
* You could tune each FS for specific needs which are consistent across 
  the board. 
** Different FS / or differently tuned FS/ used for each generic case.
* Migrations would be fairly simple, as it's easy to rip out configs (/etc) 
  or data (/var) or logs (/var/log) and copy/move it someplace else. 
* It opens the door to possibility of mounting volume with binaries on 
  multiple machines that only have local configs and data (not that we plan
  on it at the moment).

Disk Or Disk Layout Performance Testing

hdparm - Read or set the hard drive parameters
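A quick, crude read test; the device name is a placeholder:

# -T measures cached reads (memory/bus), -t measures buffered reads from the device itself
hdparm -tT /dev/sda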

References:

Using DD

1GB file test

sync ; time bash -c "(dd if=/dev/zero of=largefile bs=1024 count=1000000; sync)"

Now time the removal of the large file.

sync ; time bash -c "(rm largefile; sync)"
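The page cache can make the numbers above look better than the disk really is; a variant that includes the final flush in the timing (both flags are standard GNU dd options):

# Write a 1GB file and include the final flush in the timing
time dd if=/dev/zero of=largefile bs=1M count=1024 conv=fdatasync

# Or bypass the page cache entirely, where the filesystem supports O_DIRECT
time dd if=/dev/zero of=largefile bs=1M count=1024 oflag=direct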
Bonnie++

References:

dbench - generate load patterns

References:

IOzone

References:

Stress

References:

Postmark (Netapp)

References:

LTP - Linux Test Project

This suite of tools has filesystem performance tests.

References:

Clustering Software

Please see Ajcody-Clustering

Virtualization

Please see Ajcody-Virtualization

What About Backups? I Need A Plan

Please Review Other Backup-Restore As Well

Please review the Ajcody-Backup-Restore-Issues section as well.

What Might Be Wrong?

First thing to check is the log file and see if anything notable stands out.

grep -i backup /opt/zimbra/log/mailbox.log
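On Network Edition you can also ask the backup system directly which sessions exist and whether they completed; a minimal sketch (run as the zimbra user, and check the command's help output on your version for the exact flags):

su - zimbra
zmbackupquery        # lists backup sessions with their start/end times and status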

Use Auto-Group Backups Rather Than Default Style Backup

Having trouble completing that entire full backup during off-hours? Enter the hybrid auto-grouped mode, which combines the concept of full and incremental backup functions - you’re completely backing up a target number of accounts daily rather than running incrementals.

Auto-grouped mode automatically pulls in the redologs since the last run so you get incremental backups of the remaining accounts; although the incremental accounts captured via the redologs are not listed specifically in the backup account list. This still allows you to do a point in time restore for any account.
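As a hedged sketch of what enabling it can look like (attribute names as documented for ZCS NE auto-grouped backups; verify against the admin guide for your version before changing anything):

su - zimbra
# Switch from the default Standard mode to Auto-Grouped,
# spreading full backups of all accounts across 7 daily groups
zmprov mcf zimbraBackupMode Auto-Grouped
zmprov mcf zimbraBackupAutoGroupedNumGroups 7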

Please see the following for more detailed information:

Need To Write Fewer Files - Add The Zip Option To Your Backup Commands

Using the zip option will compress all those thousands of single files that exist under a user's backup, decreasing the performance issues that arise from writing out thousands of small files as compared to large ones. This is often seen when one is:

  • Using NFS for the backup directory
  • Copying/rsyncing backups to a remote server
  • Using some third-party backup software (to tape) to archive/backup the Zimbra backup sessions.
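A hedged example of what that looks like on the command line (verify the flag against zmbackup's help output on your ZCS version):

su - zimbra
# Full backup of all accounts, with each account's files bundled into zip archives (-z / --zip)
zmbackup -f -a all -z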

Please see the following for more information about using the Zip option:

SAN Snapshots For Backups

Please see:

Cloud Backups

Please see:

Tape Backups

I would then use the non-rsync network ports for your traditional network backup software to run over to dump the data to tape. This way that activity doesn't affect prod performance at all. A full DR would use the backup/ data anyway (offsite DR). I've created another section that will deal with this in more detail - specifically handling the hard links that are used by Zimbra.

Please see:

Test Environments And Managing Customizations

I have some suggestions on this in the RFE below. The first comment has a recommended layout for your test/qa/dev environments:

Using Vmware ESX For A DEV/QA/Prod Test Environment

Please see Ajcody-Virtualization#Using_VMWare_ESX_For_ZCS_Test_Servers_-_How-To

Creating A Hot-spare Server

Setup along the same lines... though you could cut out some of the HA/performance items if you only see this box being used for "short-term" use. Rsyncs will occur over the backup network port.

Need to do a sanity check in regards to ldap data. With a normal DR restore, one would do a zmrestoreldap. zmrestoreldap restores from a backup session; there is no real option in regards to a "redolog" directory. Things that are "ldap"-only are COSes, DLs, etc.

  1. Setup the hot-spare according to http://wiki.zimbra.com/index.php?title=Ajcody-Notes-ServerMove
    • Basically install the zimbra packages. You'll want to do this with upgrades to production as the system evolves.
  2. I would do an initial rsync of /opt/zimbra (remember to use the nice flag so you don't affect prod too much)
  3. I would then setup 2 daily rsync jobs (following the same wiki instructions)
    1. rsync /opt/zimbra/backup
      • This could be integrated within the backup cron job so it kicks off after the backup is done. You'll need to monitor the times of backups and also the time for syncs so you can make sure you stay within the window of rotation - backup, rsync, next backup. Times will be different for incremental and full backups. (See the sketch after this list.)
      • rsync other necessary files:
        • /opt/zimbra/conf
        • /opt/zimbra/redolog
        • /opt/zimbra/log
      • This will give some "sanity" in case issues arise with the full restore. This part could use some better feedback from more experienced Zimbra staff. I can think of some circumstances where this effort would prove useful.
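A sketch of the daily sync jobs described above; the hostname and paths are placeholders, and -H matters because Zimbra backup sessions rely heavily on hard links:

# Low-priority sync of the backup sessions to the hot spare, over the backup bond
nice -n 19 rsync -avH --delete /opt/zimbra/backup/ hotspare:/opt/zimbra/backup/

# Same pattern for the smaller supporting directories
nice -n 19 rsync -avH --delete /opt/zimbra/conf/ hotspare:/opt/zimbra/conf/
nice -n 19 rsync -avH --delete /opt/zimbra/redolog/ hotspare:/opt/zimbra/redolog/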

A Real Hot-spare DR RFE

Note: Please add your votes to this RFE!

If It All Blows Up, Now What?

References:

http://wiki.zimbra.com/index.php?title=Ajcody-Notes-ServerMove

http://wiki.zimbra.com/index.php?title=Network_Edition_Disaster_Recovery


