Zimbra DR Strategy
Zimbra Backup and Disaster Recovery ("DR") Strategies
There are three basic types of backups in the Zimbra ZCS platform. The best DR strategy is one that uses all three, as each is best suited for particular recovery situations:
- zmbackup - This ZCS user-level backup strategy is best suited for individual, user-level backup and restore. It provides point-in-time recovery using the Zimbra redologs, which allows an administrator to restore to a particular point in the past when needed. The zmbackup data can be used for a full system restore using "zmrestoreoffline", but be aware that a full DR recovery with zmrestoreoffline can take a very long time, depending on the size of the backup set. If this is your primary DR strategy, be sure to test and time a zmrestoreoffline in a lab environment.
- Snapshots (Array, Hypervisor or Filesystem level snapshots) - Using snapshots is the best DR strategy for getting a ZCS platform back online quickly. However, depending on your platform's snapshot capabilities or the integrity of the data being restored, the snapshot may or may not allow a full restore to the point immediately before the data loss. Snapshots can be used in combination with zmbackup redologs to restore data that changed after the point of the snapshot.
- Database backups - Database backups of the MySQL/MariaDB and OpenLDAP data can be very useful in the isolated scenario of a damaged database. Doing nightly database backups can provide a very quick method of recovering from database corruption.
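To make the advice about timing a zmrestoreoffline in a lab more concrete, the sketch below extrapolates a full restore time from one timed lab restore. It assumes restore time scales roughly linearly with backup size, which may not hold on every platform. The function name and all numbers are hypothetical and only illustrate the arithmetic.

```python
def estimate_restore_hours(backup_size_gb: float,
                           lab_gb_restored: float,
                           lab_hours: float) -> float:
    """Extrapolate a full restore time from a timed lab restore.

    Assumes (hypothetically) that restore throughput stays constant
    as the backup set grows.
    """
    if lab_gb_restored <= 0 or lab_hours <= 0:
        raise ValueError("lab measurements must be positive")
    throughput_gb_per_hour = lab_gb_restored / lab_hours
    return backup_size_gb / throughput_gb_per_hour


# Hypothetical figures: a 50 GB lab restore took 2 hours,
# and the production backup set is 2000 GB.
hours = estimate_restore_hours(2000, 50, 2)
print(f"Estimated full restore: {hours:.0f} hours")  # prints "Estimated full restore: 80 hours"
```

If the estimate is longer than the outage your site can tolerate, that is a sign zmrestoreoffline should not be the primary DR method.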
Regarding LDAP and MySQL backups, it is critically important that some level of backup is in place for these databases. Databases can be corrupted by many factors, such as array problems, network disconnects (if connecting via iSCSI), filesystem problems, or system crashes. These databases must be backed up regularly so that they can be recovered from a backup if necessary. Typically, the zmbackup utility is used to provide backups of all ZCS data on ZCS systems.
There is usually more than one way to do any DR backup, and each has advantages and disadvantages:
- Snapshots (array-level or hypervisor-level) - These give the fastest time to restore, although they need to be taken as close as possible to a single point in time (across each mailstore volume per mailstore, and across all mailstores). Databases have a slight risk of corruption with snapshots when run hot, but we haven't seen many problems with this. Snapshots can be taken with the system cold if you want to be 100% sure of data consistency. If using snapshots, it is very important that the redologs also be enabled, because the most recent redolog may need to be replayed to ensure that the most recent transactions are consistent between MySQL and the blob store. Redologs can be replayed using zmplayredo.
- DB backups - slapcat and mysqldump write DB data to flat text files, and so can be good to have in case of serious DB corruption. However, they capture a single point in time and generally require restoring the full database set to the point in time of the backup. These backups can be run safely while the system is hot, without stopping any services. It is absolutely critical that the databases are backed up, so if zmbackup is not in use, backups may be performed using these techniques. Using MySQL binary logging can also provide point-in-time recovery for MySQL transactions.
- zmbackup/zmrestore - These are user-level backups and restores, designed for cases where a user accidentally deletes all their email or wants to roll back to a point in time. These backups are full data sets (blobs + DB data) and are slow, both for backup and restore. A full DR from zmbackup data is not something you generally want to do, as it can take quite a long time in a DR situation, but it can be used if all other methods of restore fail. However, zmbackup also backs up LDAP and MySQL data and deletes old redolog/archive/redo*log files, so if zmbackup is disabled, other methods must be found to perform these critical tasks.
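As an illustration of the slapcat and mysqldump approach described above, here is a minimal Python sketch that only builds the two command lines for a nightly dump and does not run them. The backup directory, socket path, LDAP config directory, and output file names are hypothetical and will differ per installation. The sketch also assumes MySQL credentials come from an option file rather than the command line, and that the tables are transactional so that `--single-transaction` gives a consistent dump while the system is hot.

```python
import shlex
from datetime import date
from pathlib import PurePosixPath


def build_dump_commands(backup_dir: str, day: date,
                        mysql_socket: str,
                        ldap_config_dir: str) -> dict[str, list[str]]:
    """Return argv lists for a flat-file LDAP and MySQL dump (not executed)."""
    stamp = day.strftime("%Y%m%d")
    out = PurePosixPath(backup_dir)
    ldif_file = out / f"ldap-{stamp}.ldif"   # hypothetical naming scheme
    sql_file = out / f"mysql-{stamp}.sql"    # hypothetical naming scheme
    return {
        # slapcat: -F selects the config directory, -l the LDIF output file
        "ldap": ["slapcat", "-F", ldap_config_dir, "-l", str(ldif_file)],
        # mysqldump: dump every database in one consistent transaction
        "mysql": ["mysqldump", f"--socket={mysql_socket}",
                  "--all-databases", "--single-transaction",
                  f"--result-file={sql_file}"],
    }


# All paths below are placeholders, not Zimbra defaults.
commands = build_dump_commands("/backup/db", date.today(),
                               "/path/to/mysql.sock", "/path/to/ldap/config")
for name, argv in commands.items():
    print(name, "->", shlex.join(argv))
```

Building the commands separately from running them makes it easy to review exactly what a cron job would execute before putting it in place.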
Prior to upgrades or major system changes, we would generally recommend taking array-level snapshots of all volumes, in case a rollback is needed. Doing periodic DB dumps is a good idea for having flat-file data backups (the upgrade process will also write out a backup of the LDAP data automatically).
Again though, it is critical that you get database backups in place immediately. If any database corruption happens in this environment, the entire data set of any mailstore or LDAP master could be permanently lost. It is not often possible to recover a corrupted database, and without any backups in place, it would be very difficult (if even possible) to recover data from the corrupted database. All data on a mailstore or the LDAP master could be lost at any time without a solid, tested backup/DR strategy.
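Because snapshots of all volumes should be taken as close as possible to a single point in time, it can help to check how far apart a set of snapshots actually is. The sketch below is one hypothetical way to do that. The volume names, timestamps, and 60-second tolerance are made up for illustration, and how you obtain snapshot times depends entirely on your array or hypervisor.

```python
from datetime import datetime, timedelta


def snapshot_spread(snapshot_times: dict[str, datetime]) -> timedelta:
    """Return the gap between the earliest and latest snapshot in the set."""
    if not snapshot_times:
        raise ValueError("no snapshots given")
    times = list(snapshot_times.values())
    return max(times) - min(times)


def is_consistent_set(snapshot_times: dict[str, datetime],
                      tolerance: timedelta) -> bool:
    """True if all snapshots were taken within `tolerance` of each other."""
    return snapshot_spread(snapshot_times) <= tolerance


# Hypothetical volumes and snapshot times.
snapshots = {
    "mailstore1-store": datetime(2024, 1, 2, 2, 0, 0),
    "mailstore1-db": datetime(2024, 1, 2, 2, 0, 20),
    "ldap-master": datetime(2024, 1, 2, 2, 0, 5),
}
print(snapshot_spread(snapshots))                               # prints "0:00:20"
print(is_consistent_set(snapshots, timedelta(seconds=60)))      # prints "True"
```

A set that fails this check is still usable, but more redolog replay may be needed to bring MySQL and the blob store back into agreement.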
For a complete DR strategy, the optimal approach is to use all three of the above methods in unison. For site-level DR, you'll need to use array-level replication to copy the data to the alternative sites. The Recovery Time Objective (RTO) and Recovery Point Objective (RPO) need to be considered in order to determine whether synchronous or asynchronous methods should be used, and how much data is allowed to be lost in case of a full site-level DR failover. Once designed and in place, the DR strategy must be tested. Additional considerations include planning for how to switch back to the primary site once a DR failover occurs, as failback can often be more difficult than the first failover.
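The relationship between RPO and the choice of replication method can be sketched as a simple rule. This is a simplification under stated assumptions: it treats the worst-case asynchronous replication lag as the amount of data that could be lost, and it ignores RTO, cost, and inter-site latency, all of which matter in a real design. The function name and example values are hypothetical.

```python
def choose_replication_mode(rpo_seconds: float,
                            worst_async_lag_seconds: float) -> str:
    """Pick a replication mode from the RPO and the measured async lag.

    Asynchronous replication can lose up to its lag in data, so it is
    only acceptable when that lag fits inside the RPO.
    """
    if rpo_seconds < 0 or worst_async_lag_seconds < 0:
        raise ValueError("values must be non-negative")
    if rpo_seconds == 0:
        return "synchronous"            # no data loss allowed at all
    if worst_async_lag_seconds <= rpo_seconds:
        return "asynchronous"           # lag fits within the allowed loss
    return "synchronous"                # async lag would exceed the RPO


# Hypothetical figures: 5 minutes of allowed data loss, 90 s worst-case lag.
print(choose_replication_mode(rpo_seconds=300, worst_async_lag_seconds=90))
# prints "asynchronous"
```

When the rule returns "synchronous" only because the asynchronous lag is too high, reducing that lag (for example with more bandwidth between sites) is an alternative worth weighing against the cost of synchronous replication.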
Alternate strategies can include running active nodes in two sites simultaneously, with the ability to fail over in either direction. Latency between sites must be considered in this setup, as well as cost due to the additional storage, servers, and software required to maintain this level of redundancy. For some sites, these costs may very well be worth the time and effort, depending on the level of HA required by the site, as well as the cost of being without email for a period of time.
Note too that even with multi-site HA, a bad storage or filesystem-level corruption problem can be accidentally replicated to the failover site. In this case, both sites may have corrupted or damaged data, and a good DR strategy may still need to be executed.