| author | Alan Conway <aconway@apache.org> | 2014-05-05 14:20:53 +0000 |
|---|---|---|
| committer | Alan Conway <aconway@apache.org> | 2014-05-05 14:20:53 +0000 |
| commit | 304b0dbebc28597538b79472e97af47d4b13a7f4 (patch) | |
| tree | 5023c63c2410a42ae100db1fdbd84599e5c8ab2e | |
| parent | e5e20395b5948a055ab33455eb1a1fc25e81c210 (diff) | |
| download | qpid-python-304b0dbebc28597538b79472e97af47d4b13a7f4.tar.gz | |
NO-JIRA: HA Added troubleshooting section to the user documentation.
git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1592540 13f79535-47bb-0310-9956-ffa450edef68
| -rw-r--r-- | qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml | 245 |
1 file changed, 223 insertions(+), 22 deletions(-)
diff --git a/qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml b/qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml
index bd225cbd25..6e0225a2af 100644
--- a/qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml
+++ b/qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml
@@ -54,7 +54,7 @@ under the License.
     <title>Avoiding message loss</title>
     <para>
       In order to avoid message loss, the primary broker <emphasis>delays
-      acknowledgment</emphasis> of messages received from clients until the
+      acknowledgement</emphasis> of messages received from clients until the
       message has been replicated and acknowledged by all of the back-up
       brokers, or has been consumed from the primary queue.
     </para>
@@ -414,9 +414,9 @@ ssl_addr = "ssl:" host [":" port]'
     <para>
       Once all components are installed it is important to take the following step:
       <programlisting>
-        chkconfig rgmanager on
-        chkconfig cman on
-        chkconfig qpidd <emphasis>off</emphasis>
+chkconfig rgmanager on
+chkconfig cman on
+chkconfig qpidd <emphasis>off</emphasis>
       </programlisting>
     </para>
     <para>
@@ -429,7 +429,7 @@ ssl_addr = "ssl:" host [":" port]'
       be stopped when in fact there is a <literal>qpidd</literal> process running. The
       <literal>qpidd</literal> log will show errors like this:
       <programlisting>
-        critical Unexpected error: Daemon startup failed: Cannot lock /var/lib/qpidd/lock: Resource temporarily unavailable
+critical Unexpected error: Daemon startup failed: Cannot lock /var/lib/qpidd/lock: Resource temporarily unavailable
       </programlisting>
     </para>
   </note>
@@ -537,8 +537,8 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
     <filename>qpidd.conf</filename> should contain these lines:
   </para>
   <programlisting>
-    ha-cluster=yes
-    ha-brokers-url=20.0.20.1,20.0.20.2,20.0.20.3
+ha-cluster=yes
+ha-brokers-url=20.0.20.1,20.0.20.2,20.0.20.3
   </programlisting>
   <para>
     The brokers connect to each other directly via the addresses
@@ -587,7 +587,7 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
     <title>Controlling replication of queues and exchanges</title>
     <para>
       By default, queues and exchanges are not replicated automatically. You can change
-      the default behavior by setting the <literal>ha-replicate</literal> configuration
+      the default behaviour by setting the <literal>ha-replicate</literal> configuration
       option. It has one of the following values:
       <itemizedlist>
         <listitem>
@@ -624,14 +624,14 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
       <command>qpid-config</command> management tool like this:
     </para>
     <programlisting>
-      qpid-config add queue myqueue --replicate all
+qpid-config add queue myqueue --replicate all
     </programlisting>
     <para>
       To create replicated queues and exchanges via the client API, add a
       <literal>node</literal> entry to the address like this:
     </para>
     <programlisting>
-      "myqueue;{create:always,node:{x-declare:{arguments:{'qpid.replicate':all}}}}"
+"myqueue;{create:always,node:{x-declare:{arguments:{'qpid.replicate':all}}}}"
     </programlisting>
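For illustration only (this sketch is not part of the patch): the address string above can be used directly from the Python client API described further down in this file. A minimal sketch, assuming brokers named node1, node2 and node3 and the qpid.messaging API:

```python
from qpid.messaging import Connection, Message

# Connect with failover enabled (see the client sections later in this file).
conn = Connection.establish("node1", reconnect=True,
                            reconnect_urls=["node1", "node2", "node3"])
try:
    session = conn.session()
    # The node entry asks the broker to create "myqueue" with
    # qpid.replicate=all, so it is replicated to the backup brokers.
    sender = session.sender(
        "myqueue;{create:always,node:{x-declare:{arguments:{'qpid.replicate':all}}}}")
    sender.send(Message("hello"))
finally:
    conn.close()
```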
     <para>
       There are some built-in exchanges created automatically by the broker, these
@@ -714,18 +714,18 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
       The full grammar for the URL is:
     </para>
     <programlisting>
-      url = ["amqp:"][ user ["/" password] "@" ] addr ("," addr)*
-      addr = tcp_addr / rmda_addr / ssl_addr / ...
-      tcp_addr = ["tcp:"] host [":" port]
-      rdma_addr = "rdma:" host [":" port]
-      ssl_addr = "ssl:" host [":" port]'
+url = ["amqp:"][ user ["/" password] "@" ] addr ("," addr)*
+addr = tcp_addr / rdma_addr / ssl_addr / ...
+tcp_addr = ["tcp:"] host [":" port]
+rdma_addr = "rdma:" host [":" port]
+ssl_addr = "ssl:" host [":" port]'
     </programlisting>
   </footnote>
       You also need to specify the connection option
       <literal>reconnect</literal> to be true. For example:
     </para>
     <programlisting>
-      qpid::messaging::Connection c("node1,node2,node3","{reconnect:true}");
+qpid::messaging::Connection c("node1,node2,node3","{reconnect:true}");
     </programlisting>
     <para>
       Heartbeats are disabled by default. You can enable them by specifying a
@@ -733,7 +733,7 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
       <literal>heartbeat</literal> option. For example:
     </para>
     <programlisting>
-      qpid::messaging::Connection c("node1,node2,node3","{reconnect:true,heartbeat:10}");
+qpid::messaging::Connection c("node1,node2,node3","{reconnect:true,heartbeat:10}");
     </programlisting>
   </section>
   <section id="ha-python-client">
@@ -746,7 +746,7 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
       <literal>Connection.open</literal>
     </para>
     <programlisting>
-      connection = qpid.messaging.Connection.establish("node1", reconnect=True, reconnect_urls=["node1", "node2", "node3"])
+connection = qpid.messaging.Connection.establish("node1", reconnect=True, reconnect_urls=["node1", "node2", "node3"])
     </programlisting>
     <para>
       Heartbeats are disabled by default. You can
@@ -754,7 +754,7 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
       connection via the 'heartbeat' option. For example:
     </para>
     <programlisting>
-      connection = qpid.messaging.Connection.establish("node1", reconnect=True, reconnect_urls=["node1", "node2", "node3"], heartbeat=10)
+connection = qpid.messaging.Connection.establish("node1", reconnect=True, reconnect_urls=["node1", "node2", "node3"], heartbeat=10)
     </programlisting>
   </section>
   <section id="ha-jms-client">
@@ -864,7 +864,7 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
       <literal>ha-username</literal>=<replaceable>USER</replaceable>
     </para>
     <programlisting>
-      acl allow <replaceable>USER</replaceable>@QPID all all
+acl allow <replaceable>USER</replaceable>@QPID all all
     </programlisting>
   </section>
@@ -886,7 +886,7 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
     <para>
       To test if a broker is the primary:
       <programlisting>
-        qpid-ha -b <replaceable>broker-address</replaceable> status --expect=primary
+qpid-ha -b <replaceable>broker-address</replaceable> status --expect=primary
       </programlisting>
       This command will return 0 if the broker at <replaceable>broker-address</replaceable>
      is the primary, non-0 otherwise.
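For illustration only (not part of the patch): since the command above exits with status 0 only on the primary, a script can use it to locate the current primary. A minimal Python sketch, assuming qpid-ha is on the PATH and reusing the example broker addresses from this document:

```python
import subprocess

def find_primary(brokers):
    """Return the address of the current primary broker, or None."""
    for addr in brokers:
        # "qpid-ha ... status --expect=primary" exits 0 only on the primary.
        rc = subprocess.call(["qpid-ha", "-b", addr, "status", "--expect=primary"])
        if rc == 0:
            return addr
    return None

print(find_primary(["20.0.20.1", "20.0.20.2", "20.0.20.3"]))
```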
@@ -894,7 +894,7 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
     <para>
       To promote a broker to primary:
       <programlisting>
-        qpid-ha -b <replaceable>broker-address</replaceable> promote
+qpid-ha -b <replaceable>broker-address</replaceable> promote
       </programlisting>
     </para>
     <para>
@@ -916,4 +916,205 @@ NOTE: fencing is not shown, you must configure fencing appropriately for your cl
     </para>
   </section>
+  <section id="ha-troubleshoot">
+    <title>Troubleshooting a cluster</title>
+    <para>
+      This section applies to clusters that are using rgmanager as the
+      cluster manager.
+    </para>
+    <section id="authentication-failures">
+      <title>Authentication failures</title>
+      <para>
+        If a broker is unable to establish a connection to another broker
+        in the cluster due to authentication problems, the log will
+        contain SASL errors, for example:
+        <programlisting>
+2012-aug-04 10:17:37 info SASL: Authentication failed: SASL(-13): user not found: Password verification failed
+        </programlisting>
+      </para>
+      <para>
+        Set the SASL user name and password used to connect to other
+        brokers with the ha-username and ha-password options when you
+        start the broker. Set the SASL mechanism with ha-mechanism. Any
+        mechanism you enable for broker-to-broker communication can also
+        be used by a client, so do not enable ha-mechanism=ANONYMOUS in a
+        secure environment. Once the cluster is running, run qpid-ha to
+        make sure that the brokers are running as one cluster.
+      </para>
+    </section>
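For illustration only (not part of the patch): the options named above would typically be set in qpidd.conf on every node. A sketch with placeholder values; the user name, password and PLAIN mechanism are assumptions, and the chosen user also needs the ACL rule shown earlier in this diff:

```
# Illustrative values only; choose your own user, password and mechanism.
ha-username=ha-admin
ha-password=secret
ha-mechanism=PLAIN
```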
+    <section id="slow-recovery-times">
+      <title>Slow recovery times</title>
+      <para>
+        The following configuration settings affect recovery time. The
+        values shown are examples that give fast recovery on a lightly
+        loaded system. You should run tests to determine whether the
+        values are appropriate for your system and load conditions.
+      </para>
+      <section id="cluster.conf">
+        <title>cluster.conf</title>
+        <programlisting>
+<rm status_poll_interval=1>
+        </programlisting>
+        <para>
+          status_poll_interval is the interval in seconds at which the
+          resource manager checks the status of managed services. This
+          affects how quickly the manager will detect failed services.
+        </para>
+        <programlisting>
+<ip address="20.0.20.200" monitor_link="yes" sleeptime="0"/>
+        </programlisting>
+        <para>
+          This is a virtual IP address for client traffic.
+          monitor_link="yes" means monitor the health of the network
+          interface used for the VIP. sleeptime="0" means don't delay when
+          failing over the VIP to a new address.
+        </para>
+      </section>
+      <section id="qpidd.conf">
+        <title>qpidd.conf</title>
+        <programlisting>
+link-maintenance-interval=0.1
+        </programlisting>
+        <para>
+          Interval at which a backup broker checks the link to the primary
+          and re-connects if need be. The default is 2 seconds. It can be
+          set lower for faster fail-over, but setting it too low will
+          result in excessive link-checking activity on the broker.
+        </para>
+        <programlisting>
+link-heartbeat-interval=5
+        </programlisting>
+        <para>
+          Heartbeat interval for federation links. The HA cluster uses
+          federation links between the primary and each backup. The
+          primary can take up to twice the heartbeat interval to detect a
+          failed backup. When a sender sends a message the primary waits
+          for all backups to acknowledge before acknowledging to the
+          sender, so a disconnected backup may cause the primary to block
+          senders until it is detected via heartbeat.
+        </para>
+        <para>
+          This interval is also used as the timeout for broker status
+          checks by rgmanager, so it may take up to this interval for
+          rgmanager to detect a hung broker.
+        </para>
+        <para>
+          The default of 120 seconds is very high; you will probably want
+          to set this to a lower value. If set too low, a slow-to-respond
+          broker may be re-started by rgmanager under network congestion
+          or heavy load.
+        </para>
+      </section>
+    </section>
+    <section id="total-cluster-failure">
+      <title>Total cluster failure</title>
+      <para>
+        The cluster can only guarantee availability as long as there is at
+        least one active primary broker or ready backup broker left alive.
+        If all the brokers fail simultaneously, the cluster will fail and
+        non-persistent data will be lost.
+      </para>
+      <para>
+        To explain this better, note that a broker is always in one of the
+        following states:
+      </para>
+      <itemizedlist>
+        <listitem><para>
+          standalone: not part of a HA cluster.
+        </para></listitem>
+        <listitem><para>
+          joining: a newly started backup, not yet joined to the cluster.
+        </para></listitem>
+        <listitem><para>
+          catch-up: a backup that has connected to the primary and is
+          downloading queues, messages etc.
+        </para></listitem>
+        <listitem><para>
+          ready: a backup that is connected and actively replicating from
+          the primary; it is ready to take over.
+        </para></listitem>
+        <listitem><para>
+          recovering: newly promoted to primary, waiting for backups to
+          catch up before serving clients. Only a single broker can be
+          recovering at a time.
+        </para></listitem>
+        <listitem><para>
+          active: serving clients. Only a single broker can be active at a
+          time.
+        </para></listitem>
+      </itemizedlist>
+      <para>
+        While there is an active primary broker, clients can get service.
+        If the active primary fails, one of the "ready" backup
+        brokers will take over, recover and become active. Note that a
+        backup can only be promoted to primary if it is in the
+        "ready" state (with the exception of the first primary in
+        a new cluster, where all brokers are in the "joining"
+        state).
+      </para>
+      <para>
+        Given a stable cluster of N brokers with one active primary and
+        N-1 ready backups, the system can sustain up to N-1 failures in
+        rapid succession. The surviving broker will be promoted to active
+        and continue to give service.
+      </para>
+      <para>
+        However, at this point the system <emphasis>cannot</emphasis>
+        sustain a failure of the surviving broker until at least one of
+        the other brokers recovers, catches up and becomes a ready backup.
+        If the surviving broker fails before then, the cluster will fail
+        in one of two modes, depending on the exact timing of the failures.
+      </para>
+      <section id="the-cluster-hangs">
+        <title>1. The cluster hangs</title>
+        <para>
+          All brokers are in joining or catch-up mode. rgmanager tries to
+          promote a new primary but cannot find any candidates and so
+          gives up. clustat will show that the qpidd services are running
+          but the qpidd-primary service has stopped, something like this:
+        </para>
+        <programlisting>
+Service Name                         Owner (Last)      State
+------- ----                         ----- ------      -----
+service:mrg33-qpidd-service          20.0.10.33        started
+service:mrg34-qpidd-service          20.0.10.34        started
+service:mrg35-qpidd-service          20.0.10.35        started
+service:qpidd-primary-service        (20.0.10.33)      stopped
+        </programlisting>
+        <para>
+          Eventually all brokers become stuck in "joining" mode,
+          as shown by qpid-ha status --all.
+        </para>
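For illustration only (not part of the patch): the stuck state can be detected from a script by polling each broker. This sketch assumes that plain "qpid-ha -b ADDR status" prints the broker's HA state (such as joining, ready or active) and reuses the node addresses from the clustat listing above:

```python
import subprocess

BROKERS = ["20.0.10.33", "20.0.10.34", "20.0.10.35"]

def broker_state(addr):
    # Assumed to print the HA state of the broker at addr, e.g. "joining".
    out = subprocess.check_output(["qpid-ha", "-b", addr, "status"])
    return out.decode().strip()

states = dict((b, broker_state(b)) for b in BROKERS)
print(states)
if all(s == "joining" for s in states.values()):
    print("All brokers stuck in 'joining': restart the cluster as described below.")
```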
+        <para>
+          At this point you need to restart the cluster in one of the
+          following ways.
+        </para>
+        <para>
+          Restart the entire cluster:
+          <itemizedlist>
+            <listitem><para>
+              In luci:<replaceable>your-cluster</replaceable>:Nodes, click
+              reboot to restart the entire cluster, OR
+            </para></listitem>
+            <listitem><para>
+              stop and restart the cluster with ccs --stopall followed by
+              ccs --startall.
+            </para></listitem>
+          </itemizedlist>
+        </para>
+        <para>
+          Restart just the Qpid services:
+          <itemizedlist>
+            <listitem><para>
+              In luci:<replaceable>your-cluster</replaceable>:Service Groups,
+              select all the qpidd (not qpidd-primary) services and click
+              restart, then select the qpidd-primary service and click
+              restart, OR
+            </para></listitem>
+            <listitem><para>
+              stop the qpidd-primary and qpidd services with clusvcadm,
+              then restart them (primary last).
+            </para></listitem>
+          </itemizedlist>
+        </para>
+      </section>
+      <section id="the-cluster-reboots">
+        <title>2. The cluster reboots</title>
+        <para>
+          A new primary is promoted and the cluster is functional, but all
+          non-persistent data from before the failure is lost.
+        </para>
+      </section>
+    </section>
+    <section id="fencing-and-network-partitions">
+      <title>Fencing and network partitions</title>
+      <para>
+        A network partition is a network failure that divides the
+        cluster into two or more sub-clusters, where each broker can
+        communicate with brokers in its own sub-cluster but not with
+        brokers in other sub-clusters. This condition is also referred to
+        as a "split brain".
+      </para>
+      <para>
+        Nodes in one sub-cluster can't tell whether nodes in other
+        sub-clusters are dead or are still running but disconnected. We
+        cannot allow each sub-cluster to independently declare its own
+        qpidd primary and start serving clients, as the cluster would
+        become inconsistent. We must ensure that only one sub-cluster
+        continues to provide service.
+      </para>
+      <para>
+        A <emphasis>quorum</emphasis> determines which sub-cluster
+        continues to operate, and <emphasis>power fencing</emphasis>
+        ensures that nodes in non-quorate sub-clusters cannot attempt to
+        provide service inconsistently. For more information see
+        https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/High_Availability_Add-On_Overview/index.html,
+        chapters 2 (Quorum) and 4 (Fencing).
+      </para>
+    </section>
+  </section>
 </section>
