author     Ronnie Sahlberg <ronniesahlberg@gmail.com>  2008-12-04 14:35:00 +1100
committer  Ronnie Sahlberg <ronniesahlberg@gmail.com>  2008-12-04 14:35:00 +1100
commit     51ace76ea6f161ac7e435b45f43413ad49ebf9e2
tree       eb90229fec7ce3f59ae2f9c48ce1cf3b08d6797a
parent     539f044aa3abe37645ec2424935e5ba616778250
add a description of the recovery-process
(This used to be ctdb commit 194abb41e1a0036956a9401efcae8b14ed66c532)
Diffstat (limited to 'ctdb/doc/recovery-process.txt')
-rw-r--r--  ctdb/doc/recovery-process.txt  484
1 file changed, 484 insertions(+), 0 deletions(-)
diff --git a/ctdb/doc/recovery-process.txt b/ctdb/doc/recovery-process.txt
new file mode 100644
index 00000000000..d0294a30fb7
--- /dev/null
+++ b/ctdb/doc/recovery-process.txt
@@ -0,0 +1,484 @@
+Valid as of ctdb 1.0.66; this may change in future versions.
+
+
+RECMASTER
+=========
+The recovery master is the node in the cluster that has been designated to coordinate
+recovery.
+The recovery master is responsible for performing full checks of cluster and cluster node consistency and is also responsible for performing the actual database recovery procedure.
+
+Only one node at a time can be the recovery master.
+This is ensured by CTDB using a lock on a single file in the shared GPFS filesystem:
+ /etc/sysconfig/ctdb :
+ ...
+ # Options to ctdbd. This is read by /etc/init.d/ctdb
+ # you must specify the location of a shared lock file across all the
+ # nodes. This must be on shared storage
+ # there is no default here
+ CTDB_RECOVERY_LOCK=/gpfs/.ctdb/shared
+ ...
+
+To prevent two nodes from becoming recovery master at the same time (== split brain),
+CTDB relies on GPFS to guarantee coherent locking across the cluster.
+That is, CTDB relies on GPFS to ensure that only one ctdb process on one node can take
+out and hold this lock at any given time, as sketched below.
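+
+A minimal sketch of the locking mechanism, assuming an exclusive fcntl() byte-range
+lock on the shared file (this illustrates the idea only and is not the exact ctdb
+code; the function name is hypothetical) :
+
+    /* Sketch: try to take an exclusive lock on the shared recovery lock
+     * file. If another node already holds the lock, fcntl() fails and
+     * this node must not become recovery master. */
+    #include <fcntl.h>
+    #include <unistd.h>
+
+    static int take_recovery_lock(const char *path)
+    {
+        struct flock lock;
+        int fd = open(path, O_RDWR | O_CREAT, 0600);
+
+        if (fd == -1) {
+            return -1;
+        }
+        lock.l_type   = F_WRLCK;    /* exclusive write lock */
+        lock.l_whence = SEEK_SET;
+        lock.l_start  = 0;
+        lock.l_len    = 1;          /* lock the first byte */
+        lock.l_pid    = 0;
+        if (fcntl(fd, F_SETLK, &lock) != 0) {
+            close(fd);              /* some other node holds the lock */
+            return -1;
+        }
+        return fd;                  /* keep fd open to hold the lock */
+    }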
+
+The recovery master is designated through an election process.
+
+
+VNNMAP
+======
+The VNNMAP is a list of all the nodes that are currently part of the cluster
+and participate in hosting the cluster databases.
+All nodes that are CONNECTED but not BANNED will be present in the VNNMAP.
+
+The VNNMAP is the list of LMASTERs for the cluster, as reported by 'ctdb status' :
+ ...
+ Size:3
+ hash:0 lmaster:0
+ hash:1 lmaster:1
+ hash:2 lmaster:2
+ ...
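+
+Each database record hashes to one of the slots in this map, and the node listed as
+lmaster for that slot is the "location master" that always knows which node holds the
+most recent copy of the record. Conceptually (a sketch; the struct shown here is a
+simplified stand-in for the real vnnmap structure) :
+
+    /* Sketch: map a record-key hash to its LMASTER via the vnnmap. */
+    #include <stdint.h>
+
+    struct vnnmap {
+        uint32_t generation;    /* database generation instance */
+        uint32_t size;          /* number of entries in map[] */
+        uint32_t *map;          /* pnn of the lmaster for each slot */
+    };
+
+    static uint32_t lmaster_for_hash(const struct vnnmap *vnnmap, uint32_t hash)
+    {
+        return vnnmap->map[hash % vnnmap->size];
+    }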
+
+
+CLUSTER MONITORING
+==================
+Every node in the cluster monitors its own health and its own consistency with respect
+to the recovery master. What the nodes monitor differs between the node which is the
+recovery master and normal nodes.
+This monitoring is performed to ensure that the cluster is healthy and consistent.
+It is not related to the monitoring of individual node health, a.k.a. eventscript monitoring.
+
+At the end of each step in the process are listed some of the most common and important
+error messages that can be generated during that step.
+
+
+NORMAL NODE CLUSTER MONITORING
+------------------------------
+Monitoring is performed in the dedicated recovery daemon process.
+The implementation can be found in server/ctdb_recoverd.c:monitor_cluster()
+This is an overview of the more important tasks during monitoring.
+These tests are to verify that the local node is consistent with the recovery master.
+
+Once every second the following monitoring loop is performed :
+
+1, Verify that the parent ctdb daemon on the local node is still running.
+ If it is not, the recovery daemon logs an error and terminates.
+ "CTDB daemon is no longer available. Shutting down recovery daemon"
+
+2, Check if any of the nodes has been recorded as having misbehaved too many times.
+   If so, we ban the node and log a message :
+ "Node %u has caused %u failures in %.0f seconds - banning it for %u seconds"
+
+3, Check that there is a recovery master.
+   If there is none, we initiate a clusterwide election process and log :
+ "Initial recovery master set - forcing election"
+ and we restart monitoring from 1.
+
+4, Verify that the recovery daemon and the local ctdb daemon agree on all the
+   node BANNING flags.
+   If the recovery daemon and the local ctdb daemon disagree on these flags, we update
+   the local ctdb daemon, log one of two messages and restart monitoring from 1 again.
+ "Local ctdb daemon on recmaster does not think this node is BANNED but the recovery master disagrees. Unbanning the node"
+ "Local ctdb daemon on non-recmaster does not think this node is BANNED but the recovery master disagrees. Re-banning the node"
+
+5, Verify that the node designated to be recovery master exists in the local list of all nodes.
+ If the recovery master is not in the list of all cluster nodes a new recovery master
+ election is triggered and monitoring restarts from 1.
+ "Recmaster node %u not in list. Force reelection"
+
+6, Check if the recovery master has become disconnected.
+   If it has, log an error message, force a new election and restart monitoring from 1.
+ "Recmaster node %u is disconnected. Force reelection"
+
+7, Read the node flags off the recovery master and verify that it has not become banned.
+   If it has, log an error message, force a new election and restart monitoring from 1.
+ "Recmaster node %u no longer available. Force reelection"
+
+8, Verify that the recmaster and the local node agree on the flags (BANNED/DISABLED/...)
+ for the local node.
+ If there is an inconsistency, push the flags for the local node out to all other nodes.
+ "Recmaster disagrees on our flags flags:0x%x recmaster_flags:0x%x Broadcasting out flags."
+
+9, Verify that the local node hosts all public ip addresses it should host and that it does
+ NOT host any public addresses it should not host.
+   If there is an inconsistency, we log an error, trigger a recovery and restart
+   monitoring from 1 again.
+ "Public address '%s' is missing and we should serve this ip"
+ "We are still serving a public address '%s' that we should not be serving."
+
+These are all the checks we perform during monitoring for a normal node.
+These tests are performed on every node in the cluster, which is why the loop is
+optimized to make as few network calls to other nodes as possible :
+each node makes only one call to the recovery master per loop, and none to any other
+node. The overall shape of the loop is sketched below.
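+
+A rough skeleton of this once-per-second loop (a conceptual sketch only; the real
+implementation in server/ctdb_recoverd.c is event driven, and the check functions
+below are hypothetical placeholders for the numbered checks above) :
+
+    /* Sketch: normal-node monitoring loop, covering checks 1-9. */
+    for (;;) {
+        sleep(1);
+        if (!parent_ctdb_daemon_is_running()) {
+            exit(1);                        /* check 1 */
+        }
+        ban_misbehaving_nodes();            /* check 2 */
+        if (!recovery_master_is_known()) {
+            force_election();               /* check 3 */
+            continue;                       /* restart from 1 */
+        }
+        if (!consistent_with_recovery_master()) {
+            continue;                       /* checks 4-8: restart from 1 */
+        }
+        if (!local_public_ips_are_correct()) {
+            trigger_recovery();             /* check 9 */
+            continue;
+        }
+    }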
+
+RECOVERY MASTER CLUSTER MONITORING
+----------------------------------
+The recovery master performs much more extensive tests. In addition to tests 1-9 above,
+the recovery master also performs the following tests:
+
+10, Read the list of nodes and flags from all other CONNECTED nodes in the cluster.
+ If there is a failure to read this list from one of the nodes, then log an
+ error, mark this node as a candidate to become BANNED and restart monitoring from 1.
+ "Unable to get nodemap from remote node %u"
+
+11, Verify that the local recovery master and the remote node agree on the flags
+    for the remote node. If there is an inconsistency for the BANNING flag,
+    log an error, trigger a new recmaster election and restart monitoring from 1.
+ "Remote node %u had different BANNED flags 0x%x, local had 0x%x - trigger a re-election"
+ "Remote node %u had flags 0x%x, local had 0x%x - updating local"
+
+12, Verify that the local recovery master and the remote node agree on the flags
+    for the remote node. If any flag other than the BANNING flag is inconsistent,
+    just update the set of flags for the local recovery daemon, log an informational
+    message and continue monitoring.
+ "Remote node %u had flags 0x%x, local had 0x%x - updating local"
+
+13, Read the list of public ip addresses from all of the CONNECTED nodes and merge them
+    into a single clusterwide list.
+ If we fail to read the list of ips from a node, log an error and restart monitoring from 1.
+ "Failed to read public ips from node : %u"
+
+14, Verify that all other nodes agree that this node is the recovery master.
+    If one of the other nodes disagrees that this node is the recovery master,
+    log an error, force a new election and restart monitoring from 1.
+ "Node %d does not agree we are the recmaster. Need a new recmaster election"
+
+15, Check if the previous attempt to run a recovery failed, and if it did, try a new recovery.
+ After the recovery, restart monitoring from 1.
+ "Starting do_recovery"
+
+16, Verify that all CONNECTED nodes in the cluster are in recovery mode NORMAL.
+    If any of the nodes is in recovery mode ACTIVE, force a new recovery and restart
+    monitoring from 1.
+ "Node:%u was in recovery mode. Restart recovery process"
+
+17, Verify that the filehandle to the recovery lock file is valid.
+ If it is not, this may mean a split brain and is a critical error.
+ Try a new recovery and restart monitoring from 1.
+ "recovery master doesn't have the recovery lock"
+
+18, Verify that GPFS allows us to read from the recovery lock file.
+    If not, there is a critical GPFS issue and we may have a split brain.
+ Try forcing a new recovery and restart monitoring from 1.
+ "failed read from recovery_lock_fd - %s"
+
+19, Read the list of all nodes and flags from all CONNECTED nodes in the cluster.
+    If we fail to read the nodemap from one of the remote nodes, log an error and
+    restart monitoring from 1.
+ "Unable to get nodemap from remote node %u"
+
+20, If the nodemap differs between the local node and the remote node, log an error
+ and force a recovery.
+ This would happen if the /etc/ctdb/nodes file differs across nodes in the cluster.
+ It is unlikely that the recovery will rectify the situation.
+    This is a critical error; most likely the entire cluster will be unavailable
+    until the files are fixed or the offending nodes have become banned.
+ "Remote node:%u has different node count. %u vs %u of the local node"
+
+21, If a remote node disagrees on the content of the nodes list, try a recovery and restart
+ monitoring from 1.
+ It is unlikely that the recovery will rectify the situation.
+    This is a critical error; most likely the entire cluster will be unavailable
+    until the files are fixed or the offending nodes have become banned.
+ "Remote node:%u has different nodemap pnn for %d (%u vs %u)."
+
+22, If a remote node disagrees on the node flags in the list, try a recovery to re-sync
+ the flags and restart monitoring from 1.
+ "Remote node:%u has different nodemap flag for %d (0x%x vs 0x%x)"
+
+23, Verify that all active nodes are part of the VNNMAP.
+ If not, this would be a new node that has become CONNECTED but does not yet participate
+ in the cluster.
+ Perform a recovery to merge the new node to the cluster and restart monitoring from 1.
+ "The vnnmap count is different from the number of active nodes. %u vs %u"
+ or
+ "Node %u is active in the nodemap but did not exist in the vnnmap"
+
+24, Read the VNNMAP from all CONNECTED nodes.
+ Verify that all nodes have the same VNNMAP content and that all nodes are in the same
+ generation instance of the databases.
+    If not, force a recovery to re-synchronize the vnnmap and the databases across the
+    cluster and restart monitoring from 1. A sketch of this comparison is shown below.
+ "Remote node %u has different generation of vnnmap. %u vs %u (ours)"
+ "Remote node %u has different size of vnnmap. %u vs %u (ours)"
+ "Remote node %u has different vnnmap."
+
+25, If there have been changes to the cluster that require a reallocation of public ip
+    addresses : run the "startrecovery" event on all nodes, then run the "releaseip" and
+    "takeip" events to reassign the ips across the cluster, and finally run the
+    "recovered" event.
+
+Finished monitoring, continue monitoring from 1.
+
+
+CLUSTER RECOVERY
+================
+Recoveries are driven by the recovery daemon on the node that is currently the recovery
+master.
+Most of the logging that is performed during recovery is only logged on the node that
+is the recovery master.
+Make sure to find which node is the recovery master and check the log for that node.
+
+Example log entries that start in column 1 are expected to be present in the
+log. Example log entries that are indented 3 columns are optional and might
+only be present if an error occurred.
+
+
+1, Log that recovery has been initiated.
+"Starting do_recovery"
+
+ It might log an informational message :
+"New recovery culprit %u".
+ This is only semi-accurate and may not mean that there is any problem
+ at all with the node indicated.
+
+
+2, Check if a node has caused too many failed recoveries and, if so, ban it from
+   the cluster, giving the other nodes in the cluster a chance to complete the
+   recovery.
+ "Node %u has caused %u recoveries in %.0f seconds - banning it for %u seconds"
+
+
+3, Verify that the recovery daemon can lock the recovery lock file.
+   At this stage this node should be the recovery master.
+   If this operation fails it means we have a split brain and must abort recovery.
+   "ctdb_recovery_lock: Unable to open %s - (%s)"
+ "ctdb_recovery_lock: Failed to get recovery lock on '%s'"
+ "Unable to get recovery lock - aborting recovery"
+"ctdb_recovery_lock: Got recovery lock on '%s'"
+
+
+4, Log which node caused the recovery to be initiated.
+ This is a semi-accurate information message only.
+ This line does NOT mean that there has to be something wrong with the node listed.
+"Recovery initiated due to problem with node %u"
+
+
+5, Pull the names of all databases from all nodes and verify that these databases also
+   exist locally.
+   If a database is missing locally, just create it.
+   It is not an error if a database is missing locally. Databases are created on demand,
+   so this can happen when samba has never tried to access a particular database on the
+   local node.
+
+
+6, Check the list of databases on each remote node and create any databases that may be missing
+ on the remote node.
+"Recovery - created remote databases"
+
+
+7, Set recovery mode to ACTIVE on all remote nodes.
+
+
+8, Run the "startrecovery" eventscript on all nodes.
+
+ At this stage you will also get a few additional log entries, these are not
+ from the recovery daemon but from the main ctdb daemon due to running
+ the eventscript :
+"startrecovery eventscript has been invoked"
+"Monitoring has been disabled"
+"Executing event script ...
+...
+
+
+9, Create a new generation id and update the generation id and the VNNMAP on the local node
+ only.
+   This guarantees that the generation id will now be inconsistent across the cluster,
+   so that if recovery fails a new recovery will be attempted in the next iteration of
+   the monitoring loop. A sketch of such an id is shown below.
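+
+   A sketch of how such a generation id can be chosen (the reserved value below is an
+   assumption; the only requirement is that the id changes on every recovery) :
+
+        /* Sketch: pick a new random generation id for the databases. */
+        #include <stdint.h>
+        #include <stdlib.h>
+
+        static uint32_t new_generation(void)
+        {
+            uint32_t generation;
+
+            do {
+                generation = (uint32_t)random();
+            } while (generation == 0);      /* assume 0 means "invalid" */
+            return generation;
+        }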
+
+
+10, Start a TDB TRANSACTION on all nodes for all databases.
+    This is to ensure that if recovery is aborted or fails, we do not end up having
+    modified the databases on only some of the nodes. The transaction pattern is
+    sketched below.
+"started transactions on all nodes"
+
+
+11, For each database, pull the content from all CONNECTED nodes and merge it into
+ the TDB databases on the local node.
+    This merges the records from the remote nodes based on their serial numbers, so that
+    we only keep the most recent copy of each record. The merge rule is sketched below.
+"Recovery - pulled remote database 0x%x"
+
+
+12, For each database, perform a fast TDB WIPE operation to delete the entire TDB under the
+ transaction started above.
+
+
+13, For each database, drop all empty records.
+ Force the DMASTER field of all records to point to the recovery master.
+ Push the database out to all other nodes.
+
+ The PUSH process lists some additional log entries for each database of the
+ form :
+"Recovery - pulled remote database 0x..."
+"Recovery - pushed remote database 0x... of size ..."
+
+
+14, Commit all changes to all TDB databases.
+"Recovery - starting database commits"
+"Recovery - committed databases"
+
+
+15, Create a new VNNMAP of all CONNECTED nodes, create a new generation number
+    and push this new VNNMAP out to all nodes.
+"Recovery - updated vnnmap"
+
+
+16, Update all nodes that the local node is the recovery master.
+"Recovery - updated recmaster"
+
+
+17, Synchronize node flags across the cluster.
+"Recovery - updated flags"
+
+18, Change recovery mode back to NORMAL.
+"Recovery - disabled recovery mode"
+
+
+19, Re-allocate all public ip addresses across the cluster. With "deterministic IPs"
+    enabled, each address has a preferred home node, as sketched below.
+"Deterministic IPs enabled. Resetting all ip allocations"
+
+ If the IP address allocation on the local node changes you might get
+ "Takeover of IP 10.0.0.201/24 on interface eth0"
+ "Release of IP 10.0.0.204/24 on interface eth0"
+
+"Recovery - takeip finished"
+
+
+20, Run the "recovered" eventscript on all nodes.
+"Recovery - finished the recovered event"
+
+ You will also get an entry from the local ctdb daemon itself that it has
+ switched back to recovery mode NORMAL.
+"Recovery has finished"
+
+
+21, Broadcast a message to all samba daemons in the cluster that the databases have been
+ recovered. Samba will now do some additional checking/cleanup of the content in the stored
+ records.
+
+"Recovery complete"
+
+
+22, Finished. At this stage a 10 second timeout (ctdb listvars : rerecoverytimeout) is
+    initiated. The cluster will not allow a new recovery to be performed until this
+    timeout has expired.
+
+"New recoveries supressed for the rerecovery timeout"
+"Rerecovery timeout elapsed. Recovery reactivated."
+
+
+
+
+
+
+
+Example : RECOVERY LOG ON RECMASTER
+====================================
+2008/12/01 09:57:28.110732 [ 4933]: 10.0.0.21:4379: node 10.0.0.24:4379 is dead: 2 connected
+2008/12/01 09:57:28.110838 [ 4933]: Tearing down connection to dead node :3
+2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:2682 The vnnmap count is different from the number of active nodes. 4 vs 3
+2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1327 Starting do_recovery
+2008/12/01 09:57:28.967297 [ 4935]: ctdb_recovery_lock: Got recovery lock on '/gpfs/.ctdb/shared'
+2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1355 Recovery initiated due to problem with node 0
+2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1381 Recovery - created remote databases
+2008/12/01 09:57:28.973543 [ 4933]: server/ctdb_recover.c:589 Recovery mode set to ACTIVE
+2008/12/01 09:57:28.974823 [ 4933]: server/ctdb_recover.c:904 startrecovery eventscript has been invoked
+2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1431 started transactions on all nodes
+2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x42fe72c5
+2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x42fe72c5 of size 0
+2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x1421fb78
+2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x1421fb78 of size 0
+2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xc0bdde6a
+2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xc0bdde6a of size 0
+2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x17055d90
+2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x17055d90 of size 8
+2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x7bbbd26c
+2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x7bbbd26c of size 1
+2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xf2a58948
+2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xf2a58948 of size 51
+2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x92380e87
+2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x92380e87 of size 17
+2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x63501287
+2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x63501287 of size 1
+2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xe98e08b6
+2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xe98e08b6 of size 4
+2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x2672a57f
+2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x2672a57f of size 28
+2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xb775fff6
+2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xb775fff6 of size 6
+2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1440 Recovery - starting database commits
+2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1452 Recovery - committed databases
+2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1502 Recovery - updated vnnmap
+2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1511 Recovery - updated recmaster
+2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1522 Recovery - updated flags
+2008/12/01 09:57:29.305235 [ 4933]: server/ctdb_recover.c:589 Recovery mode set to NORMAL
+2008/12/01 09:57:29.307245 [ 4935]: server/ctdb_recoverd.c:1531 Recovery - disabled recovery mode
+2008/12/01 09:57:29.307245 [ 4935]: Deterministic IPs enabled. Resetting all ip allocations
+2008/12/01 09:57:29.311071 [ 4933]: takeoverip called for an ip '10.0.0.201' that is not a public address
+2008/12/01 09:57:29.311186 [ 4933]: takeoverip called for an ip '10.0.0.202' that is not a public address
+2008/12/01 09:57:29.311204 [ 4933]: takeoverip called for an ip '10.0.0.203' that is not a public address
+2008/12/01 09:57:29.311299 [ 4933]: takeoverip called for an ip '10.0.0.204' that is not a public address
+2008/12/01 09:57:29.537210 [ 4935]: server/ctdb_recoverd.c:1542 Recovery - takeip finished
+2008/12/01 09:57:29.545404 [ 4933]: Recovery has finished
+2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1551 Recovery - finished the recovered event
+2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1557 Recovery complete
+2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1565 New recoveries supressed for the rerecovery timeout
+2008/12/01 09:57:39.815648 [ 4935]: server/ctdb_recoverd.c:1567 Rerecovery timeout elapsed. Recovery reactivated.
+
+
+
+
+
+
+
+