Server Live sync

Revision as of 10:12, 4 October 2011 by Simonb (Zimbra version and platform: Experience of Ubuntu and Network edition gained.)

Zimbra version and platform

This script was developed and tested on Release 7.0.1_GA_3105.RHEL5_64_20110304210645 CentOS5_64 FOSS edition, and now also on the Ubuntu platform and the Network edition.

Introduction

This is an experimental approach to providing near-live synchronisation between two Zimbra servers, so that one of them is live and the other is kept in a warm or very warm standby state.

The system is symmetrical. The sync can work in reverse when the mirror server becomes the active server. This allows easy fall-back to the original server once the failover condition is resolved.

Features:

  • LDAP, message store, indexes and metadata kept in sync
  • Mirror server can be brought online in a few minutes
  • Live sync of redolog
  • Minimal bandwidth: only changes to LDAP and the redolog are transferred
  • All communication and data over SSH with unique key
  • Operates as "zimbra" user
  • Works on both the Open Source and Network editions
  • Sync can work in either direction but only one way at a time

Preparation

Exactly the same version of Zimbra must be installed on both the live and mirror server. To start with we work on the live server. There is no need to stop Zimbra for most of the install. Only a short amount of down-time will need to be scheduled later to perform a final rsync operation between the two servers.

Install inotify-tools

For Redhat/Centos this is...

As user root:

yum install -y inotify-tools

For Ubuntu this is...

As user root:

apt-get install inotify-tools

Create log rotation

The script will create a log file which can be handled by logrotate.

As user root:

echo "/opt/zimbra/live_sync/log/live_sync.log {
    daily
    missingok
    copytruncate
    rotate 7
    notifempty
    compress
}">/etc/logrotate.d/zimbra_live_sync

Create application directory

The script will live under the /opt/zimbra directory.

As user root:

mkdir /opt/zimbra/live_sync
chown zimbra:zimbra /opt/zimbra/live_sync

SSH key

Create the SSH key and just press return every time you are prompted for a passphrase.

As user zimbra:

cd /opt/zimbra/.ssh
ssh-keygen -b 4096 -f live_sync
echo "command=\"/opt/zimbra/live_sync/sync_commands\" $( cat live_sync.pub )">>authorized_keys
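
The command="..." prefix makes sshd run sync_commands for every login with this key, whatever the client asks for; the command the client requested is passed to the script in SSH_ORIGINAL_COMMAND. As a minimal sketch, the entry can be built and checked with two small helpers. The names forced_command_entry and has_forced_entry are hypothetical and are not part of the scripts on this page.

```shell
#!/bin/bash
# Hypothetical helper: print an authorized_keys entry that forces the
# command given as $1 for the public key text given as $2.
forced_command_entry () {
  local forced_cmd="$1" pub_key="$2"
  printf 'command="%s" %s\n' "$forced_cmd" "$pub_key"
}

# Hypothetical helper: succeed if the authorized_keys file given as $1
# already contains a forced-command entry for the sync_commands script.
has_forced_entry () {
  grep -q '^command="/opt/zimbra/live_sync/sync_commands" ' "$1"
}
```

For example, forced_command_entry /opt/zimbra/live_sync/sync_commands "$( cat live_sync.pub )" prints the same line that the echo command above appends, and has_forced_entry authorized_keys can be used to avoid appending it twice.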

Main script

The following script should be saved as live_syncd in the /opt/zimbra/live_sync directory. This should be owned by user zimbra and made executable.

#!/bin/bash
#    This program is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.

#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.

#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see <http://www.gnu.org/licenses/>

##########################################################################
# Title      :  live_syncd
# Author     :  Simon Blandford <simon -at- onepointltd -dt- com>
# Date       :  2011-04-09
# Requires   :  zimbra sync_commands inotify-tools
# Category   :  Administration
# Version    :  1.0.2
# Copyright  :  Simon Blandford, Onepoint Consulting Limited
# License    :  GPLv3 (see above)
##########################################################################
# Description
# Keep two Zimbra servers synchronised in near-realtime
##########################################################################


#******************************************************************************
#********************** Globals ***********************************************
#******************************************************************************

base_dir="/opt/zimbra/live_sync"
locking_dir="$base_dir""/lock"
pid_dir="$base_dir""/pid"
log_dir="$base_dir""/log"
ldap_dir="$base_dir""/ldap"
status_dir="$base_dir""/status"


SSH="ssh -i /opt/zimbra/.ssh/live_sync -o StrictHostKeyChecking=no -o CheckHostIP=no"\
" -o PreferredAuthentications=hostbased,publickey"
lock_dir="$locking_dir""/live_sync.lock"
stop_file="$status_dir""/live_sync.stop"
watches_file="$status_dir""/watches"
log_file="$log_dir""/live_sync.log"
pid_file_ldap="$pid_dir""/ldap_live_sync.pid"
pid_file_redo="$pid_dir""/redo_log_live_sync.pid"
conf_file="$base_dir""/live_sync.conf"

#******************************************************************************
#********************** Functions *********************************************
#******************************************************************************

#Ensure ldap and mysql servers are running and then replay redo logs
replay_redo_logs () {
  ldap status &>/dev/null || ldap start &>/dev/null
  mysql.server status &>/dev/null || mysql.server start &>/dev/null
  if ! ldap status &>/dev/null || ! mysql.server status &>/dev/null; then
    echo "Start of local ldap/mysql servers failed" >&2
    ldap status >&2
    mysql.server status >&2
    #Return error to trigger a "break" in while loop
    return 1
  fi
  echo -n "$( date ) :"
  echo "Replaying redologs..."
  if ! zmplayredo >/dev/null; then
    echo "Replay of redolog failed" >&2
    #No error returned here since "break" is not necessary
  fi
  echo -n "$( date ) :"
  echo "Replaying redologs done"
  return 0
}

#The redo log sync daemon
redo_log_live_sync () {
  local stream_pid archived_file i

  echo -n "$( date ) :"
  echo "Starting redo log live sync process"
  #Mailbox process must not be running now
  if zmmailboxdctl status &>/dev/null; then
    zmmailboxdctl stop &>/dev/null
  fi
  if zmmailboxdctl status &>/dev/null; then
    echo "Unable to stop local Zimbra mailbox service" >&2
    return 1
  fi
  echo "Incremental backups enabled : $incremental_backups"
  
  while [ ! -f "$stop_file" ]; do
    while [ ! -f "$stop_file" ]; do
      #Wait for lock directory to be successfully created
      while ! mkdir "$lock_dir" &>/dev/null; do
        sleep 2
      done
      [ -f "$stop_file" ] && break
      #Replay redo logs also at this point if incremental backups are happening in
      #case redo log archives have now suddenly disappeared due to incremental backup
      if [ "x""$incremental_backups" == "xtrue" ]; then
        replay_redo_logs || break
      fi
      echo -n "$( date ) :"
      echo "Syncing redologs..."
      if ! rsync -e "$SSH" -aHz --force --delete \
        "$remote_address"":/opt/zimbra/redolog/" "/opt/zimbra/redolog"; then
        echo "Rsync of redolog failed" >&2
        break
      fi
      echo -n "$( date ) :"
      echo "Syncing redologs done"
      replay_redo_logs || break
      #If there are no incremental backups then remote archive directory will need purging
      if [ "x""$incremental_backups" != "xtrue" ]; then
        echo purge | \
          $SSH "$remote_address" "/opt/zimbra/live_sync/sync_commands"
      fi
      #Establish copy-and-live-stream of current redo.log file
      echo stream | \
        $SSH "$remote_address" \
        "/opt/zimbra/live_sync/sync_commands" >"/opt/zimbra/redolog/redo.log" &
      stream_pid=$!
      disown $stream_pid
      #Delay as PID was sometimes not being found if checked immediately
      sleep 5
      #If successfully established stream then sit and wait for move to archive
      if ps $stream_pid | grep "/opt/zimbra/live_sync/sync_commands" &>/dev/null; then
        #Remove lock file, this is resting point
        rmdir "$lock_dir" &>/dev/null
        #Wait for name to be passed of new archive file after redo.log is moved on remote server
        #This is normal resting point of this process
        archived_file=$( echo wait_redo | \
          $SSH "$remote_address" "/opt/zimbra/live_sync/sync_commands" | \
          tail -n 1 | egrep -o "redo-.*log" )
        #Kill stream
        kill -KILL $( ps aux | grep "/opt/zimbra/live_sync/sync_commands" | \
          grep -v grep | awk '{print $2}' ) &>/dev/null
        #Mirror move operation on local server
        if echo "$archived_file" | egrep "redo-.*log" &>/dev/null; then
          echo "Moving redo.log to $archived_file"
          mv -f "/opt/zimbra/redolog/redo.log" "/opt/zimbra/redolog/archive/""$archived_file"
        else
          echo "Archive file name not found" >&2
        fi
        [ -f "$stop_file" ] && break
      else
        echo "Failed to start redolog streaming, PID=$stream_pid" >&2
        break
      fi
    done
    rmdir "$lock_dir" &>/dev/null
    #Wait 10 minutes for error to clear
    i=0
    while [ $(( i++ )) -lt 60 ] && [ ! -f "$stop_file" ]; do
      sleep 10
    done
  done
  echo -n "$( date ) :"
  echo "Ending redo log live sync process"
}

#The ldap sync daemon
ldap_live_sync () {
  local ldap_wait_pid i

  echo -n "$( date ) :"
  echo "Starting ldap live sync process"
  while [ ! -f "$stop_file" ]; do
    while [ ! -f "$stop_file" ]; do
      #Wait for lock directory to be successfully created
      while ! mkdir "$lock_dir" &>/dev/null; do
        sleep 3
      done
      echo -n "$( date ) :"
      echo "Syncing ldap"
      while [ 1 ]; do
        #Check for changes during ldap sync operation
        echo wait_ldap | \
          $SSH "$remote_address" "/opt/zimbra/live_sync/sync_commands" &>"$watches_file" &
        ldap_wait_pid=$!
        disown $ldap_wait_pid
        if ! ps "$ldap_wait_pid" &>/dev/null; then
          echo "Unable to establish watch on remote LDAP directory, no ldap sync performed"
          break
        fi
        #Wait for watches to be established
        while ! grep "established" "$watches_file" &>/dev/null && \
            ps "$ldap_wait_pid" &>/dev/null; do
          sleep 1
        done
        #Echo out status
        cat "$watches_file"
        rm -f "$watches_file"
        #Rsync remote server to temporary local ldap directory
        if ! rsync -e "$SSH" -aHz --force --delete \
          "$remote_address"":/opt/zimbra/data/ldap/" "$ldap_dir""/"; then
          echo "Rsync of ldap failed" >&2
          break
        fi
        ps $ldap_wait_pid &>/dev/null && break
        echo "Ldap changed during rsync. Re-syncing."
        sleep 10
      done
      kill -KILL $ldap_wait_pid &>/dev/null
      #Stop ldap
      ldap status &>/dev/null && ldap stop &>/dev/null
      if ldap status &>/dev/null; then
        echo "Unable to stop local ldap server" >&2
        break
      fi
      #rsync temporary local ldap directory to real local ldap directory
      rsync -aH "$ldap_dir""/" "/opt/zimbra/data/ldap/"
      #Restart ldap
      ldap status &>/dev/null || ldap start &>/dev/null
      if ! ldap status &>/dev/null; then
        echo "Unable to restart local ldap server" >&2
      fi
      echo -n "$( date ) :"
      echo "Syncing LDAP done"
      rmdir "$lock_dir" &>/dev/null
      [ -f "$stop_file" ] && break
      #Wait for change in remote ldap over 10 minute intervals
      echo wait_ldap | \
        $SSH "$remote_address" "/opt/zimbra/live_sync/sync_commands" &
      ldap_wait_pid=$!
      disown $ldap_wait_pid
      while [ ! -f "$stop_file" ]; do
        #Restart wait for ldap change if required
        if ! ps $ldap_wait_pid &>/dev/null; then
          echo wait_ldap | \
            $SSH "$remote_address" "/opt/zimbra/live_sync/sync_commands" &
          ldap_wait_pid=$!
          disown $ldap_wait_pid
        fi
        #Wait 10 minutes
        i=0
        while [ $(( i++ )) -lt 60 ] && [ ! -f "$stop_file" ]; do
          sleep 10
        done
        #If wait process is not still running then there was a change
        ps $ldap_wait_pid &>/dev/null || break
      done
    done
    rmdir "$lock_dir" &>/dev/null
    #Wait 10 minutes for error to clear
    i=0
    while [ $(( i++ )) -lt 60 ] && [ ! -f "$stop_file" ]; do
      sleep 10
    done
  done
  echo -n "$( date ) :"
  echo "Ending ldap live sync process"
}

kill_everything () {
  touch "$stop_file"
  kill -KILL $( head -n 1 "$pid_file_ldap" 2>/dev/null ) &>/dev/null
  kill -KILL $( head -n 1 "$pid_file_redo" 2>/dev/null ) &>/dev/null
  kill -KILL $( ps aux | grep "live_syncd start" | grep -v grep | awk '{print $2}' ) &>/dev/null
  kill -KILL $( ps aux | grep "redo_log_live_sync" | grep -v grep | awk '{print $2}' ) &>/dev/null
  kill -KILL $( ps aux | grep "ldap_live_sync" | grep -v grep | awk '{print $2}' ) &>/dev/null
  kill -KILL $( ps aux | \
    grep "/opt/zimbra/live_sync/sync_commands" | grep -v grep | awk '{print $2}' ) &>/dev/null
  rm -f "$stop_file"
  rm -f "$pid_file_ldap"
  rm -f "$pid_file_redo"
  rmdir "$lock_dir" &>/dev/null
}

quitting () {
  echo "Quitting"
  #Kill any hanging processes
  kill_everything
  trap - INT TERM SIGINT SIGTERM
  echo 'kill -KILL $( ps aux | grep live_syncd | grep -v grep | awk '"'"'{print $2}'"'"' ) &>/dev/null' | \
    at now && sleep 1 && rmdir "$lock_dir" &>/dev/null
  exit
}


#******************************************************************************
#********************** Main Program ******************************************
#******************************************************************************

if [ "$( whoami )" != "zimbra" ]; then
  echo "Must run as zimbra user" >&2
  exit 1
fi

mkdir -p "$locking_dir"
mkdir -p "$pid_dir"
mkdir -p "$log_dir"
mkdir -p "$ldap_dir"
mkdir -p "$status_dir"

if [ ! -f "$conf_file" ]; then
  echo "Configuration file, $conf_file, not found" >&2
  exit 1
fi

source "$conf_file"

#Find all local addresses
server_addresses=$( /sbin/ifconfig | grep inet | \
  egrep -io "addr:[[:space:]]*(([0-9]+\.){3}[0-9]+|[0-9a-f]+(:[0-9a-f]*){5})" | \
  sed "s/addr://" | tr -d " \t" )

#Check configured server addresses are valid
if ! echo "$server1" | \
    egrep -i "([0-9]+\.){3}[0-9]+|[0-9a-f]+(:[0-9a-f]*){5}" &>/dev/null; then
  echo "No valid IP address found for server1 in configuration file" >&2
  exit 1
fi
if ! echo "$server2" | \
    egrep -i "([0-9]+\.){3}[0-9]+|[0-9a-f]+(:[0-9a-f]*){5}" &>/dev/null; then
  echo "No valid IP address found for server2 in configuration file" >&2
  exit 1
fi

#Deduce local address and assume other address is remote machine
if echo "$server_addresses" | grep "$server1" &>/dev/null; then
  local_address="$server1"
  remote_address="$server2"
else
  if echo "$server_addresses" | grep "$server2" &>/dev/null; then
    local_address="$server2"
    remote_address="$server1"
  else
    echo "Unable to identify local server address and assume remote address" >&2
    exit 1
  fi
fi

#Check remote server is OK
remote_server_status=$( echo "test" |
  $SSH "$remote_address" "/opt/zimbra/live_sync/sync_commands" )

if [ "x""$remote_server_status" == "xbusy" ]; then
  echo "Remote server appears to have live_syncd process running" >&2
  echo "This can not run on both servers" >&2
  exit 1
fi

if [ "x""$remote_server_status" != "xOK" ]; then
  echo "Unable to run commands on remote server" >&2
  exit 1
fi

incremental_backups=$( echo "query_incremental" |
  $SSH "$remote_address" "/opt/zimbra/live_sync/sync_commands" )

case $1 in
  start)
    if [ -f  $pid_file_redo ] || [ -f  $pid_file_ldap ]; then
      echo "Process already running"
    else
      echo -n "Starting processes..."
      ldap_live_sync >>"$log_file" 2>&1 &
      echo $! >"$pid_file_ldap"
      redo_log_live_sync >>"$log_file" 2>&1 &
      echo $! >"$pid_file_redo"
      echo "done"
    fi
    ;;
  stop)
    touch "$stop_file"
    [ -d "$lock_dir" ] && echo "Waiting for sync operations to complete..."
    while [ -d "$lock_dir" ]; do
      sleep 5
    done
    rm -f "$stop_file"
    replay_redo_logs
    kill_everything
    echo "done"
    ;;
  status)
    if [ -f  $pid_file_redo ] && ps $( head -n 1 $pid_file_redo 2>/dev/null ) &>/dev/null; then
      echo "redo log sync process OK"
      redo_stat=0
    else
      echo "redolog sync process stopped"
      redo_stat=3
    fi
    if [ -f  $pid_file_ldap ] && ps $( head -n 1 $pid_file_ldap 2>/dev/null ) &>/dev/null; then
      echo "ldap sync process OK"
      ldap_stat=0
    else
      echo "ldap sync process stopped"
      ldap_stat=3
    fi
    [ $ldap_stat == 3 ] && [ $redo_stat == 3 ] && exit 3
    [ $ldap_stat == 0 ] && [ $redo_stat == 0 ] && exit 0
    exit 1
    ;;
  kill)
    kill_everything
    ;;
  *)
    trap quitting INT TERM SIGINT SIGTERM
    if ps aux | grep "redo_log_live_sync" | grep -v grep  &>/dev/null || \
        ps aux | grep "ldap_live_sync" | grep -v grep  &>/dev/null; then
      echo "Process already running"
    else
      echo "Starting processes in realtime"
      ldap_live_sync &
      echo $! >"$pid_file_ldap"
      redo_log_live_sync &
      echo $! >"$pid_file_redo"
      while [ 1 ]; do sleep 10; done
    fi
esac

Remote command script

The following script should be saved as sync_commands in the /opt/zimbra/live_sync directory. This should be owned by user zimbra and made executable.

#!/bin/bash
#    This program is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.

#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.

#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see <http://www.gnu.org/licenses/>

##########################################################################
# Title      :  sync_commands
# Author     :  Simon Blandford <simon -at- onepointltd -dt- com>
# Date       :  2011-03-30
# Requires   :  zimbra live_syncd inotify-tools
# Category   :  Administration
# Version    :  1.0.0
# Copyright  :  Simon Blandford, Onepoint Consulting Limited
# License    :  GPLv3 (see above)
##########################################################################
# Description
# Keep two Zimbra servers synchronised in near-realtime, local agent
##########################################################################

if [ "$( whoami )" != "zimbra" ]; then
  echo "Must run as zimbra user" >&2
  exit 1
fi

#Check for rsync of redolog or ldap
if echo "$SSH_ORIGINAL_COMMAND" | \
  grep "rsync" | \
    egrep "/opt/zimbra/redolog/|/opt/zimbra/data/ldap/" &>/dev/null; then
  case "$SSH_ORIGINAL_COMMAND" in
  *\&*)
    echo "Rejected"
    ;;
  *\(*)
    echo "Rejected"
    ;;
  *\{*)
    echo "Rejected"
    ;;
  *\;*)
    echo "Rejected"
    ;;
  *\<*)
    echo "Rejected"
    ;;
  *\`*)
    echo "Rejected"
    ;;
  rsync\ --server*)
    $SSH_ORIGINAL_COMMAND
    ;;
  *)
    echo "Rejected"
    ;;
  esac
else
  #Not rsync
  case "$#" in
  0) read command
    ;;
  *) command=$1
    ;;
  esac

  check_inotify () {
    if ! which inotifywait &>/dev/null; then
      echo "inotifywait not found" >&2
      echo "Please install inotify-tools" >&2
      exit 1
    fi
  }

  case $command in
    test)
      if ps aux | grep "live_syncd" | grep -v grep &>/dev/null; then
        echo "busy"
      else
        echo "OK"
      fi
      ;;
    wait_redo)
      #Wait for redo log roll-over
      check_inotify
      inotifywait -r /opt/zimbra/redolog -e moved_to
      ;;
    wait_ldap)
      #Wait for ldap changes
      check_inotify
      inotifywait -r /opt/zimbra/data/ldap -e modify \
        -e attrib -e close_write -e moved_to -e moved_from \
        -e move -e delete -e delete_self
      ;;
    stream)
      #Live-stream redolog
      tail -c +0 -f /opt/zimbra/redolog/redo.log
      ;;
    purge)
      #Remove old archives
      find /opt/zimbra/redolog/archive -mtime +1 -exec rm {} \;
      ;;
    query_incremental)
      #Query whether incremental backups are scheduled
      if which zmschedulebackup &>/dev/null && \
          zmschedulebackup -q | \
          egrep -o "i([[:space:]]+[0-9\*\-]+){5}" &>/dev/null; then
        echo "true"
      else
        echo "false"
      fi
      ;;
    *)
      rsync
      ;;
  esac
fi

Configuration File

The configuration file simply contains the IP addresses of the live and mirror servers. The order is not important, since the script works out which is which by checking which IP address is assigned to the local machine. Save the file as live_sync.conf in the /opt/zimbra/live_sync directory and make it readable by user zimbra. The following is an example; use the real IP addresses of your own live and mirror servers.

server1="192.168.108.10"
server2="192.168.108.11"
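
A pre-flight check of the configuration file can catch typing mistakes before live_syncd is started. The following is a minimal sketch only: the function name check_live_sync_conf is hypothetical, and it only accepts dotted IPv4 addresses, whereas live_syncd itself also accepts an IPv6 form.

```shell
#!/bin/bash
# Hypothetical pre-flight check for live_sync.conf (path given as $1).
# Reads the file in a subshell so the caller's variables are untouched,
# then verifies that server1 and server2 are two different IPv4 addresses.
check_live_sync_conf () {
  local conf="$1"
  if [ ! -f "$conf" ]; then
    echo "Configuration file, $conf, not found" >&2
    return 1
  fi
  (
    . "$conf"
    ipv4='^([0-9]{1,3}\.){3}[0-9]{1,3}$'
    for value in "$server1" "$server2"; do
      if ! echo "$value" | grep -Eq "$ipv4"; then
        echo "Not a valid IPv4 address: '$value'" >&2
        exit 1
      fi
    done
    if [ "$server1" = "$server2" ]; then
      echo "server1 and server2 must be different" >&2
      exit 1
    fi
  )
}
```

For example: check_live_sync_conf /opt/zimbra/live_sync/live_sync.conf && echo "configuration looks sane"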

Enabling redo.log

For the Network edition, redo logs are already being created and are periodically moved to create incremental backups. For the open source edition, redo log archiving must be enabled.

To see the current redo log related settings, type the following as user zimbra:

zmprov gacf | grep "RedoLog"

To enable redo log rollover on the open source version, type...

zmprov mcf zimbraRedoLogDeleteOnRollover FALSE
zmprov mcf zimbraRedoLogEnabled TRUE

You may also want to make the redo log rotation more frequent to guarantee a file-system consistent redo log on the mirror server at least up to the last, say, thirty minutes. The live-streamed redo.log may not be consistent although it is unlikely this will ever be a problem except with the very last record in the log.

For example, to force rollover every half an hour, type...

zmprov mcf zimbraRedoLogRolloverFileSizeKB 1
zmprov mcf zimbraRedoLogRolloverMinFileAge 30

This will roll the log over if the redo log is larger than 1KB after 30 minutes, which is very likely unless the mail server sends and receives no mail at all during that time.

You may want to reduce zimbraRedoLogRolloverMinFileAge even further while setting up and testing this script, just so you don't have to wait too long to see activity between the servers.
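
After changing these settings it may be worth confirming that they took effect. The sketch below assumes that zmprov gacf prints one "attribute: value" pair per line; the function name redolog_archiving_enabled is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: reads "attribute: value" lines (the format assumed
# for `zmprov gacf` output) on stdin and succeeds only if redo logging is
# enabled and rolled-over logs are kept rather than deleted.
redolog_archiving_enabled () {
  local settings
  settings=$( cat )
  echo "$settings" | \
    grep -qi '^zimbraRedoLogEnabled:[[:space:]]*TRUE[[:space:]]*$' || return 1
  echo "$settings" | \
    grep -qi '^zimbraRedoLogDeleteOnRollover:[[:space:]]*FALSE[[:space:]]*$' || return 1
  return 0
}
```

As user zimbra, for example: zmprov gacf | redolog_archiving_enabled && echo "redo log archiving is on"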

Mirror Server

The mirror server should ideally have the same operating system as the live server and must have exactly the same version of Zimbra installed.

Install inotify-tools

For Redhat/Centos this is...

As user root:

yum install -y inotify-tools

For Ubuntu this is...

As user root:

apt-get install inotify-tools

Create log rotation

The logrotate configuration also needs to be done on the mirror server.

As user root:

echo "/opt/zimbra/live_sync/log/live_sync.log {
    daily
    missingok
    copytruncate
    rotate 7
    notifempty
    compress
}">/etc/logrotate.d/zimbra_live_sync

First rsync between servers

We now perform the first copy of the zimbra directory between the live and mirror server. On the mirror server we must stop Zimbra. We leave Zimbra running on the live server for now to reduce downtime.

The following rsync command is run on the mirror server. Substitute the hostname or IP address of the live server as required in the command below.

As user root:

service zimbra stop
rsync -aHz --force --delete live_server:/opt/zimbra/ /opt/zimbra/

Second rsync between servers

This is where we need to stop Zimbra on the live server so that we can copy a consistent /opt/zimbra directory from the live to the mirror server. This is the only downtime required.

On the live server as user root:

service zimbra stop

On the mirror server as user root:

rsync -aHz --force --delete live_server:/opt/zimbra/ /opt/zimbra/

On the live server as user root:

service zimbra start

On the mirror server as user root (just to make sure we have a viable copy of Zimbra):

service zimbra start
service zimbra status
service zimbra stop

Running the script

Not only have we copied all the Zimbra data from the live to mirror server, we have also copied the script and SSH keys. We should now be able to try running the script.

On the mirror server as user zimbra:

cd /opt/zimbra/live_sync
./live_syncd start

All being well the script has started without any complaints and we can now tail the log file to see that it is syncing as expected.

tail -f log/live_sync.log

(CTRL-C to exit tail command)
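
Besides tailing the log, the status sub-command reports whether the two sync processes are alive. Reading the script above, it exits with 0 when both are running, 3 when both are stopped, and 1 otherwise (which includes the case where one of the script's earlier checks fails). A small sketch that turns the exit status into a message; the helper name describe_sync_status is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: translate the exit status of "./live_syncd status"
# (passed as $1) into a human-readable description.
describe_sync_status () {
  case "$1" in
    0) echo "both sync processes running" ;;
    3) echo "both sync processes stopped" ;;
    1) echo "one sync process stopped, or a pre-flight check failed" ;;
    *) echo "unexpected status $1" ;;
  esac
}
```

For example: ./live_syncd status >/dev/null; describe_sync_status $?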

Failover

If the live server fails then the procedure on the mirror server is simply to stop the live_sync script then start Zimbra.

su - zimbra
cd live_sync
./live_syncd stop
zmcontrol start

Fallback

Simply run the script on the server you want to fail back to, i.e. the live and mirror roles are now reversed.

As user zimbra (on ex-live server to be restored back to live):

cd /opt/zimbra/live_sync
./live_syncd start

Once the script has caught up and synced the two servers, stop Zimbra on the other server.

As user zimbra on mirror (failover server)

zmcontrol stop

As user zimbra on live (restored server)

cd /opt/zimbra/live_sync
./live_syncd stop
zmcontrol start

Warm or very warm standby

Replaying redo logs only requires that the mailbox process is stopped, and the script does this automatically. The script will work whether or not Zimbra has been started on the mirror, as it will start or stop services as and when it needs to. Keeping the rest of Zimbra running will drastically reduce the time it takes to fail over. This is only an advantage when access to the server domain can be quickly flipped or has a failover mechanism.

How and why it all works

Introduction

Zimbra employs several different databases to store messages, message indexes, metadata, account information and configuration. Although it is possible to synchronise two Zimbra servers at the disk level using DRBD or vSphere, replicating the disk operations of all these databases would probably take up a lot of bandwidth, which may be debilitating and/or expensive if the two servers are in remote locations.

Fortunately, Zimbra keeps a log of almost all its transactions in the redolog. The only things not logged here are changes to the LDAP database. An incremental backup is made up of an LDAP dump and a collection of redologs. An incremental backup can be used to bring a backup server up to date if the last full backup of the backup server was more recent than the oldest log in the redolog.

Redolog

If the redolog can be piped to a mirror server in real time then all the mirror server has to do is keep replaying the logs every so often and it will keep the same state as the live server. The only other thing to keep up to date is the LDAP database. Fortunately, the LDAP database doesn't change that often so it is quite easy to keep it synced on a directory level.

The easiest way to transfer the redologs is to use rsync. The only problem with that is that rsync does not run continuously. It also won't handle the archiving of redo.log very efficiently. When redo.log is renamed and moved to the archive rsync will delete it at the remote location then transfer it all over again to its new location in the archive. If we can catch this move taking place then we can move and rename the file on the mirror server before running rsync. Then rsync has very little to do, in theory, nothing except delete any files that have been purged. Another issue with rsync is that the file may be in the process of being written to when it is copied. This results in an incomplete file at the mirror. However, redologs are only ever appended to and so only the last record will be corrupted. Zimbra is designed to be tolerant of redolog corruption otherwise it would be of limited use as a disaster recovery tool.
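
The move-before-rsync idea can be sketched with two small helpers. This is an illustration only: the function names are hypothetical, the event line format is the default inotifywait output as consumed by live_syncd, and the archive file name used in the comment is made up.

```shell
#!/bin/bash
# Hypothetical helper: extract the archive file name from an inotifywait
# event line such as
#   /opt/zimbra/redolog/archive/ MOVED_TO redo-20110409.101500.000-seq12.log
# Fails (non-zero exit) if the line contains no redo-*.log name.
archive_name_from_event () {
  echo "$1" | grep -Eo 'redo-[^/[:space:]]*\.log$'
}

# Hypothetical helper: repeat the remote move locally, so that redo.log in
# the directory given as $1 becomes archive/<name given as $2>.
mirror_archive_move () {
  local redolog_dir="$1" name="$2"
  [ -n "$name" ] && [ -f "$redolog_dir/redo.log" ] || return 1
  mkdir -p "$redolog_dir/archive"
  mv -f "$redolog_dir/redo.log" "$redolog_dir/archive/$name"
}
```

Once the local move has been made, a following rsync finds the archived file already in place and has little or nothing to transfer for it.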

To keep the redolog live, the "tail -f" command is used over ssh to pipe the file to the mirror. By calling "tail -f -c +0" it tails right back to the zeroth byte of the file, effectively a copy-then-stream command.
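
The copy-then-stream behaviour can be tried locally on any scratch file, without SSH or Zimbra. In this sketch the function name start_stream and the variable stream_pid are hypothetical; the tail options are the ones used by sync_commands.

```shell
#!/bin/bash
# Hypothetical demo: copy the whole of file $1 into file $2, then keep
# appending whatever is later written to $1. Runs in the background and
# leaves the PID of the tail process in the variable stream_pid.
start_stream () {
  tail -c +0 -f "$1" >"$2" &
  stream_pid=$!
}
```

The stream never ends by itself. It has to be killed (kill "$stream_pid"), which is what live_syncd does once it learns that redo.log has been moved to the archive.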

Redolog purging

If a Network edition is detected and incremental backups are enabled then the redologs are replayed before any rsync is performed as well as after. This ensures that everything is replayed before the files all disappear to the backup directory.

For the open source edition, or Network edition with no incremental backups scheduled, the redologs are purged if they are more than a day old and have been replayed. If the mirror server is down then the redologs will just accumulate on the live server to be replayed when the sync process is restarted before being purged.
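
Before relying on the purge it can be useful to see what it would remove. The following dry run is a sketch with a hypothetical function name. It uses the same -mtime +1 test as the purge command in sync_commands, but adds -type f (an addition not in the original) so that only regular files are listed. Note that find counts age in whole days, so -mtime +1 only matches files that are at least two full days old.

```shell
#!/bin/bash
# Hypothetical dry run: list the files under directory $1 that a
# "find ... -mtime +1" based purge would match, without deleting anything.
list_purgeable () {
  find "$1" -type f -mtime +1 -print
}
```

For example: list_purgeable /opt/zimbra/redolog/archive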

LDAP

LDAP stores its data in /opt/zimbra/data/ldap. This can be copied using rsync to the mirror as long as no changes take place during the copy. The directory is monitored for this during the rsync operation, and the copy is repeated if there was any change during that time. I am slightly concerned as to how this may pan out on really busy servers, but it seems viable with the low hundreds of users that I have tested.
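
The copy-and-recheck loop can be outlined independently of rsync and inotifywait. In the sketch below the copy step and the change check are passed in as command names; in live_syncd these correspond to the rsync call and to testing whether the background inotifywait process has exited. The function name copy_until_stable and the attempt limit are assumptions added for illustration, since the original loop retries without a limit.

```shell
#!/bin/bash
# Hypothetical outline of the "copy, then check nothing changed" loop.
# $1 = command that performs the copy
# $2 = command that succeeds if the source changed while the copy ran
# $3 = maximum number of attempts (an assumption, not in live_syncd)
copy_until_stable () {
  local copy_cmd="$1" changed_cmd="$2" max_tries="$3" try=0
  while [ "$try" -lt "$max_tries" ]; do
    try=$(( try + 1 ))
    "$copy_cmd" || return 1
    if ! "$changed_cmd"; then
      echo "Stable copy after $try attempt(s)"
      return 0
    fi
    echo "Source changed during copy. Re-syncing."
  done
  echo "Gave up after $max_tries attempts" >&2
  return 1
}
```

A limit like this would make a permanently busy LDAP directory show up as an error in the log instead of an endless retry.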

Known issues

  • If the connection breaks at the very moment the live stream of redo.log starts, before the tail command has finished copying the existing contents and begun following the file, then some of redo.log will not reach the mirror, resulting in the loss of some transactions. Fortunately, this is only likely just after the log has rolled over, when the file is still small, so the worst-case loss should be minimal.
  • LDAP is only checked every ten minutes so some losses are possible if the connection breaks in that time. However, LDAP isn't expected to change very often unless something major like a batch account migration is taking place.