| author | Alan Conway <aconway@apache.org> | 2014-04-24 17:54:05 +0000 |
|---|---|---|
| committer | Alan Conway <aconway@apache.org> | 2014-04-24 17:54:05 +0000 |
| commit | 1d3b4560f8a7f212976b536376a976b3b41f489b (patch) | |
| tree | 82c4baadc8f4159bea4fa8ad872f9858061c727e /qpid/doc/book | |
| parent | 67f29e0685b4bfaa0721a25ae901c3b5e18c0db3 (diff) | |
| download | qpid-python-1d3b4560f8a7f212976b536376a976b3b41f489b.tar.gz | |
QPID-5719: HA becomes unresponsive once any of the brokers are SIGSTOPed
- Added a timeout to qpid-ha.
- The qpidd init script now pings the broker to verify it is not hung.
- Updated the documentation in qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml.
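The bounded-wait idea behind the init-script change can be sketched in shell. Here the coreutils `timeout` wrapper stands in for the timeout added to qpid-ha, and `sleep 5` plays the part of a SIGSTOPed broker that never answers; the real script would probe the broker (e.g. via qpid-ha) instead:

```shell
# Bound the status probe so a hung broker cannot block the init script.
# `sleep 5` simulates a stopped broker that never responds; after the
# 1-second bound, timeout kills it and exits non-zero.
if timeout 1 sleep 5; then
    echo "broker responded"
else
    echo "broker hung or unreachable"   # this branch runs here
fi
```

Without the bound, a probe against a SIGSTOPed broker would block indefinitely, which is exactly the hang described in the bug.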
The new results for the cases mentioned in the bug:
a] stopped ALL brokers: rgmanager restarts the entire cluster but data is lost.
Equivalent to killing all the brokers at once. This does not affect quorum because
only qpidd services are affected, not other services managed by cman.
b] stopped the primary: rgmanager restarts the primary after a timeout and promotes one of the backups.
c] stopped a backup: rgmanager restarts the backup after a timeout.
Clients that are actively sending messages may see a delay while the backup is restarted.
Note that you need to set link-heartbeat-interval in qpidd.conf. The default is very
high (120 seconds); set it lower to see recovery from SIGSTOP in a
reasonable time.
See the updated documentation in qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml.
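As a minimal qpidd.conf sketch (values illustrative: 10 seconds is simply the example value the updated documentation suggests, and `ha-cluster` is shown only for context):

```
# /etc/qpid/qpidd.conf -- illustrative HA settings
ha-cluster=yes
# Default is 120 (seconds); a lower value detects a hung broker sooner,
# but too low a value may cause a merely slow broker to be killed.
link-heartbeat-interval=10
```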
git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1589807 13f79535-47bb-0310-9956-ffa450edef68
Diffstat (limited to 'qpid/doc/book')
| -rw-r--r-- | qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml | 29 |
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml b/qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml
index 4a4b8d9a5c..0a1cbc5e3d 100644
--- a/qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml
+++ b/qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml
@@ -335,9 +335,9 @@ ssl_addr = "ssl:" host [":" port]'
       </entry>
       <entry>
        <para>
-        Interval for the broker to check link health and re-connect links if need
-        be. If you want brokers to fail over quickly you can set this to a
-        fraction of a second, for example: 0.1.
+        Interval for backup brokers to check the link to the primary and re-connect if need be.
+        Default 2 seconds. Can be set lower for faster failover, e.g. 0.1 seconds.
+        Setting it too low will result in excessive link-checking activity on the brokers.
        </para>
       </entry>
      </row>
@@ -348,8 +348,12 @@ ssl_addr = "ssl:" host [":" port]'
       </entry>
       <entry>
        <para>
-        Heartbeat interval for replication links. The link will be assumed broken
-        if there is no heartbeat for twice the interval.
+        Heartbeat interval for replication links and timeout for broker status checks.
+        It may take up to this interval for rgmanager to detect a hung or partitioned broker.
+        The primary may take up to twice this interval to detect a hung or partitioned backup.
+        Clients sending messages may be held up during this time.
+        Default 120 seconds: you will probably want to set this to a lower value e.g. 10.
+        If set too low, a slow broker may be considered as failed and killed.
        </para>
       </entry>
      </row>
@@ -430,8 +434,13 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
     <clusternode name="node2.example.com" nodeid="2"/>
     <clusternode name="node3.example.com" nodeid="3"/>
   </clusternodes>
-  <rm>
+  <!-- Resouce Manager configuration.
+
+       status_poll_interval is the interval in seconds that the resource manager checks the status
+       of managed services. This affects how quickly the manager will detect failed services.
+  -->
+  <rm status_poll_interval="1">
     <!--
       There is a failoverdomain for each node containing just that node.
       This lets us stipulate that the qpidd service should always run on
       each node.
@@ -455,8 +464,12 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
       <!-- This script promotes the qpidd broker on this node to primary. -->
       <script file="/etc/init.d/qpidd-primary" name="qpidd-primary"/>
-      <!-- This is a virtual IP address for client traffic. -->
-      <ip address="20.0.20.200" monitor_link="1"/>
+      <!--
+        This is a virtual IP address for client traffic.
+        monitor_link="yes" means monitor the health of the NIC used for the VIP.
+        sleeptime="0" means don't delay when failing over the VIP to a new address.
+      -->
+      <ip address="20.0.20.200" monitor_link="yes" sleeptime="0"/>
     </resources>
     <!-- There is a qpidd service on each node, it should be restarted if it fails. -->
