Cluster troubleshooting
(Guide to troubleshooting problems with ZCS and Red Hat Cluster Suite)
Revision as of 16:14, 21 July 2009
This document is intended to provide solutions to some common problems encountered on ZCS systems using Red Hat Cluster Suite. It is not intended as a substitute for RHCS's documentation or the assistance of Red Hat Technical Support in cases related to direct failure of the RHCS software itself.
This section describes some common problems encountered by administrators of ZCS clustered with RHCS, and how to resolve them.
ZCS Software Fails to Start
In this situation, running 'clustat' (as root) shows all services and cluster nodes present, but the state of the service is 'disabled' or 'failed', and attempts to start the service using 'clusvcadm -e' do not succeed. This type of problem is usually not related to the RHCS software itself; it is caused by a misconfiguration or other error that prevents ZCS from starting. To correct it, the cluster software must be taken out of the picture. The admin will need to disable the clustered service, manually bring up the virtual IP address and mount the disk volumes associated with it, and then repair the ZCS installation directly.
(as root):
  clusvcadm -d <service_name>
  ip addr (confirm that the virtual IP is not enabled)
  mount (confirm that the cluster mountpoints are not mounted)
  ps -ef | grep zimbra (confirm that ZCS services are not running)
  ip addr add <cluster_service_virtual_ip> dev <device>
  mount <physical_disk_location> <mountpoint>
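The three confirmation steps above rely on reading command output by eye. As a minimal sketch, the same checks could be written as small shell helpers. The function names are hypothetical and not part of ZCS or RHCS, and the sketch assumes Linux-style 'ip addr', 'mount' and 'ps -ef' output. Each helper reads the command output on standard input and returns success when it finds a match.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of ZCS or RHCS.
# Each helper reads command output on stdin and returns 0 on a match.

# True if address $1 is listed in 'ip addr' output (assumes "inet ADDR/PREFIX").
vip_is_up() {
    grep -Fq "inet $1/"
}

# True if $1 is listed in 'mount' output (assumes "DEV on DIR type FS").
is_mounted() {
    grep -Fq " on $1 type "
}

# True if any process in 'ps -ef' output is owned by user zimbra.
# Matching on the UID column avoids counting the grep process itself.
zimbra_is_running() {
    awk '$1 == "zimbra" { found = 1 } END { exit !found }'
}

# Example (as root), using placeholder values:
#   ip addr | vip_is_up 192.0.2.10   && echo "virtual IP still enabled"
#   mount   | is_mounted /opt/zimbra && echo "mountpoint still mounted"
#   ps -ef  | zimbra_is_running      && echo "ZCS processes still running"
```

If any of the three checks succeeds after 'clusvcadm -d', the service has not been fully released, and the virtual IP and volumes should not be brought up by hand yet.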
At this point, the ZCS system is ready to be started. An admin can run a standard 'zmcontrol start' (as the zimbra user) to attempt to bring services up. This will likely fail, but the error message will indicate which service is experiencing the problem, so that it can be repaired. Once ZCS is able to start cleanly, it should be stopped, the virtual IP and disk mountpoints removed, and the service started again under cluster control:
(as zimbra):
  zmcontrol stop
(as root):
  ip addr delete <cluster_service_virtual_ip> dev <device>
  umount <mountpoint>
  clusvcadm -e <service_name> -m <host>
Adding the -m option to clusvcadm is not essential, but it is good practice because it ensures that the service comes up on the expected server. If ZCS has been repaired, startup should complete correctly.
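To confirm that startup completed under cluster control, the state reported by 'clustat' can be checked after 'clusvcadm -e' returns. The following is a sketch under stated assumptions: 'service_state' is a hypothetical helper, and it assumes that 'clustat' lists each service on one row with the state in the last column. The exact layout may differ between RHCS versions, so the helper accepts the service name with or without a 'service:' prefix.

```shell
#!/bin/sh
# Sketch only: service_state is a hypothetical helper, and the clustat
# layout it parses (one "NAME OWNER STATE" row per service) is an assumption.

# Print the state (last column) of service $1 from 'clustat' output on stdin.
# Returns 1 if no row for that service is found.
service_state() {
    awk -v name="$1" '
        $1 == name || $1 == "service:" name { print $NF; found = 1 }
        END { exit !found }
    '
}

# Example (as root), with a placeholder service name:
#   state=$(clustat | service_state mail1) && echo "mail1 is $state"
```

A state other than 'started', such as the 'disabled' or 'failed' states described earlier, suggests that the repair is not yet complete.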
Clustered Service Repeatedly Fails and Restarts
In this situation, the 'clusvcadm -e' command works correctly and the clustered service starts up, but shortly after startup completes, the service fails and the cluster software attempts to recover it, either by failing it over to another node or by restarting it in place. This happens because ZCS startup completed normally, but at some point afterwards an essential ZCS component crashed. The repair procedure is the same as for cases in which ZCS does not start at all. The clustered service must be disabled, and the virtual IPs and all mountpoints brought online manually. ZCS can then be started by hand and the failing component identified and repaired.
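Because the crash happens some time after a clean startup, the failing component may not show up in the output of 'zmcontrol start'. One possible approach is to run 'zmcontrol status' (as zimbra) periodically after the manual start and note which component stops. The sketch below assumes that 'zmcontrol status' prints one row per component with the name followed by 'Running' or 'Stopped'. 'stopped_services' is a hypothetical helper, not a ZCS command.

```shell
#!/bin/sh
# Sketch only: stopped_services is a hypothetical helper. It assumes
# 'zmcontrol status' prints one "NAME STATE" row per ZCS component.

# Print the name of every component reported as Stopped on stdin.
# Returns 0 if at least one is stopped, 1 otherwise.
stopped_services() {
    awk 'NF == 2 && $2 == "Stopped" { print $1; found = 1 } END { exit !found }'
}

# Example (as zimbra):
#   zmcontrol status | stopped_services
```

The logs for whichever component is reported are then a reasonable place to start looking for the cause.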