author Amitay Isaacs <amitay@gmail.com> 2016-06-02 18:27:29 +1000
committer Martin Schwenke <martins@samba.org> 2016-06-06 08:49:15 +0200
commit 93dcca2a5f7af9698c9ba1024dbce1d1a66d4efb
tree 46fa7b6c30f594f6bb24c8b633994f3bd417fe55 /ctdb/server/ctdb_recovery_helper.c
parent 82a10942d4e88b474b2e87f53cf2d82977e596e0
ctdb-recovery: Update timeout and number of retries during recovery
The timeout RecoverTimeout (default 120 seconds) is used for control messages sent during recovery. If any node does not respond to a recovery control message within RecoverTimeout seconds, recovery of that database fails. The recovery helper retries the recovery of a database 5 times. In the worst case, if a database cannot be recovered within 5 attempts, a total of 600 seconds will have passed. During this period other timeouts are triggered, causing unnecessary failures as follows:

1. During the recovery, even though recoverd is processing events, it does not send a ping message to the ctdb daemon. If a ping message is not received for RecdPingTimeout (default 60) seconds, then ctdb counts the recovery daemon as unresponsive. If the recovery daemon fails RecdFailCount (default 10) times, then the ctdb daemon restarts the recovery daemon. So after 600 seconds, the ctdb daemon will restart the recovery daemon.

2. If the ctdb daemon stays in recovery for RecoveryDropAllIPs (default 120) seconds, then it drops all the public addresses. This causes all SMB clients to be disconnected unnecessarily. The released public addresses will not be taken over until the recovery is complete.

To avoid dropping IPs and restarting the recovery daemon during a delayed recovery, adjust RecoverTimeout to 30 seconds and limit the number of retries for recovering a database to 3.

If we don't hear from a node for more than 25 seconds, then the node is considered disconnected. So 30 seconds is a sufficient timeout for controls during recovery.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Reviewed-by: Martin Schwenke <martin@meltin.net>

Autobuild-User(master): Martin Schwenke <martins@samba.org>
Autobuild-Date(master): Mon Jun 6 08:49:15 CEST 2016 on sn-devel-144
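
The arithmetic in the commit message can be checked with a small standalone sketch. The function names below are hypothetical and are not part of CTDB; the numbers are the defaults quoted in the message, and the model assumes the worst case in which every attempt runs into the full control timeout.

```c
#include <assert.h>

/*
 * Hypothetical helper, not CTDB code: worst-case time (seconds) spent
 * recovering one database when every attempt hits the control timeout.
 */
static int worst_case_recovery_secs(int recover_timeout, int num_retries)
{
	assert(recover_timeout >= 0 && num_retries >= 0);
	return recover_timeout * num_retries;
}

/*
 * Hypothetical helper, not CTDB code: time (seconds) without pings after
 * which the ctdb daemon restarts the recovery daemon, as described in
 * the commit message (RecdPingTimeout * RecdFailCount).
 */
static int recd_restart_secs(int recd_ping_timeout, int recd_fail_count)
{
	assert(recd_ping_timeout >= 0 && recd_fail_count >= 0);
	return recd_ping_timeout * recd_fail_count;
}

/*
 * Returns 1 if a worst-case recovery finishes before the
 * RecoveryDropAllIPs threshold (seconds) is reached, 0 otherwise.
 */
static int recovery_keeps_ips(int recover_timeout, int num_retries,
			      int recovery_drop_all_ips)
{
	return worst_case_recovery_secs(recover_timeout, num_retries) <
	       recovery_drop_all_ips;
}
```

With the old values, `worst_case_recovery_secs(120, 5)` gives 600 seconds, which matches `recd_restart_secs(60, 10)` and is well past the 120 second RecoveryDropAllIPs default. With the new values, `worst_case_recovery_secs(30, 3)` gives 90 seconds, which stays below both thresholds.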
Diffstat (limited to 'ctdb/server/ctdb_recovery_helper.c')
-rw-r--r-- ctdb/server/ctdb_recovery_helper.c 4
1 files changed, 2 insertions, 2 deletions
diff --git a/ctdb/server/ctdb_recovery_helper.c b/ctdb/server/ctdb_recovery_helper.c
index 0720d0e2ca6..d54f32db04e 100644
--- a/ctdb/server/ctdb_recovery_helper.c
+++ b/ctdb/server/ctdb_recovery_helper.c
@@ -34,9 +34,9 @@
 #include "protocol/protocol_api.h"
 #include "client/client.h"
 
-static int recover_timeout = 120;
+static int recover_timeout = 30;
 
-#define NUM_RETRIES 5
+#define NUM_RETRIES 3
 
 #define TIMEOUT() timeval_current_ofs(recover_timeout, 0)
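
As a rough illustration of how a retry limit such as NUM_RETRIES bounds the number of recovery attempts, here is a simplified synchronous sketch. It is not the recovery helper's actual code, which is event driven; `recover_attempt_fn`, `recover_with_retries` and `fail_n_times` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_RETRIES 3

/* Hypothetical callback type: one recovery attempt, true on success. */
typedef bool (*recover_attempt_fn)(void *private_data);

/*
 * Hypothetical sketch: run up to NUM_RETRIES attempts. Returns the
 * number of attempts used on success, or -1 if every attempt failed.
 */
static int recover_with_retries(recover_attempt_fn attempt, void *private_data)
{
	assert(attempt != NULL);

	for (int i = 0; i < NUM_RETRIES; i++) {
		if (attempt(private_data)) {
			return i + 1;
		}
	}
	return -1;
}

/*
 * Demo callback: fails while the counter that private_data points to is
 * positive, decrementing it on each failure, then succeeds.
 */
static bool fail_n_times(void *private_data)
{
	int *remaining_failures = private_data;

	if (*remaining_failures > 0) {
		(*remaining_failures)--;
		return false;
	}
	return true;
}
```

If each failed attempt is assumed to cost at most recover_timeout seconds, a database that never recovers gives up after NUM_RETRIES * recover_timeout seconds, which is 90 seconds with the values in this patch.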