Diffstat (limited to 'ctdb/doc/recovery-process.txt')
-rw-r--r--    ctdb/doc/recovery-process.txt    484
1 files changed, 484 insertions, 0 deletions
diff --git a/ctdb/doc/recovery-process.txt b/ctdb/doc/recovery-process.txt
new file mode 100644
index 00000000000..d0294a30fb7
--- /dev/null
+++ b/ctdb/doc/recovery-process.txt

Valid as of 1.0.66, may/will change in the future


RECMASTER
=========
Recovery Master. This is the node in the cluster that has been designated to
be the "recovery master".
The recovery master is responsible for performing full checks of cluster and
cluster node consistency and is also responsible for performing the actual
database recovery procedure.

Only one node at a time can be the recovery master.
This is ensured by CTDB using a lock on a single file in the shared GPFS
filesystem:

  /etc/sysconfig/ctdb :
  ...
  # Options to ctdbd. This is read by /etc/init.d/ctdb
  # you must specify the location of a shared lock file across all the
  # nodes. This must be on shared storage
  # there is no default here
  CTDB_RECOVERY_LOCK=/gpfs/.ctdb/shared
  ...

To prevent two nodes from becoming recovery master at the same time (== split
brain), CTDB relies on GPFS providing coherent locking across the cluster.
That is, CTDB relies on GPFS allowing only one ctdb process on one node to
take out and hold this lock.

The recovery master is designated through an election process.
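CTDB takes the recovery lock by locking the shared file; the cluster
filesystem must arbitrate that lock coherently across all nodes. The
following is a minimal, self-contained sketch of that locking pattern using
fcntl() byte-range locking. It is an illustration of the idea, not the actual
ctdb code (see ctdb_recovery_lock() in the sources for the real
implementation); the path handling and error reporting here are simplified.

/* Sketch: take an exclusive, cluster-wide lock on the shared recovery
 * lock file via fcntl() byte-range locking.  The cluster filesystem
 * (GPFS) must guarantee that only one node can hold this lock.
 * Illustrative only - not the actual ctdb code.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int take_recovery_lock(const char *path)
{
    struct flock lock = {
        .l_type   = F_WRLCK,    /* exclusive write lock     */
        .l_whence = SEEK_SET,
        .l_start  = 0,          /* lock the first byte only */
        .l_len    = 1,
    };
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd == -1) {
        perror("open recovery lock file");
        return -1;
    }
    if (fcntl(fd, F_SETLK, &lock) != 0) {
        /* another node (or process) already holds the lock */
        close(fd);
        return -1;
    }
    /* keep fd open for as long as this node is recovery master */
    return fd;
}

The recovery daemon keeps the returned file descriptor open for as long as it
is recovery master; recovery master monitoring steps 17 and 18 below re-verify
that the descriptor and the lock are still usable.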
VNNMAP
======
The VNNMAP is a list of all nodes that are currently part of the cluster
and participate in hosting the cluster databases.
All nodes that are CONNECTED but not BANNED will be present in the VNNMAP.

The VNNMAP is the list of LMASTERs for the cluster, as reported by 'ctdb status':
  ...
  Size:3
  hash:0 lmaster:0
  hash:1 lmaster:1
  hash:2 lmaster:2
  ...
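Each database record hashes to one of the slots in the VNNMAP, and the node
listed for that slot is the record's LMASTER (location master). A minimal
sketch of that lookup is shown below; the structure and hash function names
are illustrative placeholders, not ctdb's actual definitions.

/* Sketch: map a record key to its LMASTER through the VNNMAP.
 * The names here are placeholders; ctdb has its own equivalents.
 */
#include <stdint.h>
#include <stddef.h>

struct vnn_map {
    uint32_t generation;   /* bumped on every recovery           */
    uint32_t size;         /* number of lmaster slots ("Size:3") */
    uint32_t *map;         /* map[hash % size] == lmaster node   */
};

static uint32_t key_hash(const uint8_t *key, size_t len)
{
    uint32_t h = 0;        /* stand-in hash, not ctdb's own */

    while (len--) {
        h = h * 31 + *key++;
    }
    return h;
}

static uint32_t lmaster_for_key(const struct vnn_map *vnnmap,
                                const uint8_t *key, size_t len)
{
    return vnnmap->map[key_hash(key, len) % vnnmap->size];
}

With the 'ctdb status' output above, a key whose hash is 7 would map to slot
7 % 3 == 1, i.e. node 1 is the LMASTER for that record.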
+ "Remote node %u had flags 0x%x, local had 0x%x - updating local" + +13, Read the list of public ip addresses from all of the CONNECTED nodes and merge into a + single clusterwide list. + If we fail to read the list of ips from a node, log an error and restart monitoring from 1. + "Failed to read public ips from node : %u" + +14, Verify that all other nodes agree that this node is the recovery master. + If one of the other nodes discgrees this is the recovery master, log an error, + force a new election and restart monitoring from 1. + "Node %d does not agree we are the recmaster. Need a new recmaster election" + +15, Check if the previous attempt to run a recovery failed, and if it did, try a new recovery. + After the recovery, restart monitoring from 1. + "Starting do_recovery" + +16, Verify that all CONNECTED nodes in the cluster are in recovery mode NORMAL. + If one of the nodes were in recovery mode ACTIVE, force a new recovery and restart + monitoring from 1. + "Node:%u was in recovery mode. Restart recovery process" + +17, Verify that the filehandle to the recovery lock file is valid. + If it is not, this may mean a split brain and is a critical error. + Try a new recovery and restart monitoring from 1. + "recovery master doesn't have the recovery lock" + +18, Verify that GPFS allows us to read from the recovery lock file. + If not there is a critical GPFS issue and we may have a split brain. + Try forcing a new recovery and restart monitoring from 1. + "failed read from recovery_lock_fd - %s" + +19, Read the list of all nodes and flags from all CONNECTED nodes in the cluster. + If fail to read the nodemap from one of the remote nodes, log an error and restart + monitoring from 1. + "Unable to get nodemap from remote node %u" + +20, If the nodemap differs between the local node and the remote node, log an error + and force a recovery. + This would happen if the /etc/ctdb/nodes file differs across nodes in the cluster. + It is unlikely that the recovery will rectify the situation. + This is a critical error, it is most likely the entire cluster will be unavailable + until the files are fixed or have became banned. + "Remote node:%u has different node count. %u vs %u of the local node" + +21, If a remote node disagrees on the content of the nodes list, try a recovery and restart + monitoring from 1. + It is unlikely that the recovery will rectify the situation. + This is a critical error, it is most likely the entire cluster will be unavailable + until the files are fixed or have became banned. + "Remote node:%u has different nodemap pnn for %d (%u vs %u)." + +22, If a remote node disgrees on the node flags in the list, try a recovery to re-sync + the flags and restart monitoring from 1. + "Remote node:%u has different nodemap flag for %d (0x%x vs 0x%x)" + +23, Verify that all active nodes are part of the VNNMAP. + If not, this would be a new node that has become CONNECTED but does not yet participate + in the cluster. + Perform a recovery to merge the new node to the cluster and restart monitoring from 1. + "The vnnmap count is different from the number of active nodes. %u vs %u" + or + "Node %u is active in the nodemap but did not exist in the vnnmap" + +24, Read the VNNMAP from all CONNECTED nodes. + Verify that all nodes have the same VNNMAP content and that all nodes are in the same + generation instance of the databases. + If not, force a recovery to re-synchronize the vnnmap and the databases across the cluster + and restart monitoring from 1. 
+ "Remote node %u has different generation of vnnmap. %u vs %u (ours)" + "Remote node %u has different size of vnnmap. %u vs %u (ours)" + "Remote node %u has different vnnmap." + +25, If there has been changes to the cluster that requires a reallocation of public ip + addresses. On all nodes run the "startrecovery" event. Run "releaseip" and "takeip" + events to reassign the ips across the cluster and finally run the "recovered" event. + +Finished monitoring, continue monitoring from 1. + + +CLUSTER RECOVERY +================ +Recoveries are driven by the recovery daemon on the node that is currently the recovery +master. +Most of the logging that is performed during recovery is only logged on the node that +is the recovery master. +Make sure to find which node is the recovery master and check the log for that node. + +Example log entries that start in column 1 are expected to be present in the +log. Example log entries that are indented 3 columns are optional and might +only be present if an error occured. + + +1, Log that recovery has been initiated. +"Starting do_recovery" + + It might log an informational message : +"New recovery culprit %u". + This is only semi-accurate and might may not mean that there is any problem + at all with the node indicated. + + +2, Check if a node has caused too many failed recoveries and if so ban it from + the cluster, giving the other nodes in the cluster a chance to recovery + operation. + "Node %u has caused %u recoveries in %.0f seconds - banning it for %u seconds" + + +3, Verify that the recovery daemon can lock the recovery lock file. + At this stage this should be recovery master. + If this operation fails it means we have a split brain and have to abort recovery. + "("ctdb_recovery_lock: Unable to open %s - (%s)" + "ctdb_recovery_lock: Failed to get recovery lock on '%s'" + "Unable to get recovery lock - aborting recovery" +"ctdb_recovery_lock: Got recovery lock on '%s'" + + +4, Log which node caused the recovery to be initiated. + This is a semi-accurate information message only. + This line does NOT mean that there has to be something wrong with the node listed. +"Recovery initiated due to problem with node %u" + + +5, Pull the names of all databases from all nodes and verify that these databases also + exists locally. + If a database is missing locally, just create it. + It is not an error if a database is missing locally. Databases are created on demand and + this could happen if it was one database which samba has never tried to access on the + local node. + + +6, Check the list of databases on each remote node and create any databases that may be missing + on the remote node. +"Recovery - created remote databases" + + +7, Set recovery mode to ACTIVE on all remote nodes. + + +8, Run the "startrecovery" eventscript on all nodes. + + At this stage you will also get a few additional log entries, these are not + from the recovery daemon but from the main ctdb daemon due to running + the eventscript : +"startrecovery eventscript has been invoked" +"Monitoring has been disabled" +"Executing event script ... +... + + +9, Create a new generation id and update the generation id and the VNNMAP on the local node + only. + This guarantees that the generation id will now be inconsistent across the cluster and + that if recovery fails a new recovery is attempted in the next iteration of the monitoring + loop. + + +10, Start a TDB TRANSACTION on all nodes for all databases. 
CLUSTER RECOVERY
================
Recoveries are driven by the recovery daemon on the node that is currently the
recovery master.
Most of the logging that is performed during recovery is only logged on the
node that is the recovery master.
Make sure to find which node is the recovery master and check the log for that
node.

Example log entries that start in column 1 are expected to be present in the
log. Example log entries that are indented 3 columns are optional and might
only be present if an error occurred.


1, Log that recovery has been initiated.
"Starting do_recovery"

   It might log an informational message:
   "New recovery culprit %u"
   This is only semi-accurate and may not mean that there is any problem at all
   with the node indicated.


2, Check if a node has caused too many failed recoveries and if so ban it from
   the cluster, giving the other nodes in the cluster a chance to restore
   operation.
   "Node %u has caused %u recoveries in %.0f seconds - banning it for %u seconds"


3, Verify that the recovery daemon can lock the recovery lock file.
   At this stage this node should be the recovery master.
   If this operation fails it means we have a split brain and have to abort
   recovery.
   "ctdb_recovery_lock: Unable to open %s - (%s)"
   "ctdb_recovery_lock: Failed to get recovery lock on '%s'"
   "Unable to get recovery lock - aborting recovery"
"ctdb_recovery_lock: Got recovery lock on '%s'"


4, Log which node caused the recovery to be initiated.
   This is a semi-accurate informational message only.
   This line does NOT mean that there has to be something wrong with the node
   listed.
"Recovery initiated due to problem with node %u"


5, Pull the names of all databases from all nodes and verify that these
   databases also exist locally.
   If a database is missing locally, just create it.
   It is not an error if a database is missing locally. Databases are created
   on demand, and this could happen if it was a database which samba has never
   tried to access on the local node.


6, Check the list of databases on each remote node and create any databases
   that may be missing on the remote node.
"Recovery - created remote databases"


7, Set recovery mode to ACTIVE on all remote nodes.


8, Run the "startrecovery" eventscript on all nodes.

   At this stage you will also get a few additional log entries; these are not
   from the recovery daemon but from the main ctdb daemon due to running the
   eventscript:
"startrecovery eventscript has been invoked"
"Monitoring has been disabled"
"Executing event script ..."
...


9, Create a new generation id and update the generation id and the VNNMAP on
   the local node only.
   This guarantees that the generation id will now be inconsistent across the
   cluster, so that if recovery fails a new recovery is attempted in the next
   iteration of the monitoring loop.


10, Start a TDB TRANSACTION on all nodes for all databases.
    This is to ensure that if recovery is aborted or fails we do not modify
    any databases on only some of the nodes.
    (Steps 10-14 are sketched in code after this list.)
"started transactions on all nodes"


11, For each database, pull the content from all CONNECTED nodes and merge it
    into the TDB databases on the local node.
    This merges the records from the remote nodes based on their serial
    numbers, so we only keep the most recent record found.
"Recovery - pulled remote database 0x%x"


12, For each database, perform a fast TDB WIPE operation to delete the entire
    TDB under the transaction started above.


13, For each database, drop all empty records.
    Force the DMASTER field of all records to point to the recovery master.
    Push the database out to all other nodes.

    The PUSH process lists some additional log entries for each database of
    the form:
"Recovery - pulled remote database 0x..."
"Recovery - pushed remote database 0x... of size ..."


14, Commit all changes to all TDB databases.
"Recovery - starting database commits"
"Recovery - committed databases"


15, Create a new VNNMAP of all CONNECTED nodes, create a new generation number
    and push this new VNNMAP out to all nodes.
"Recovery - updated vnnmap"


16, Update all nodes that the local node is the recovery master.
"Recovery - updated recmaster"


17, Synchronize node flags across the cluster.
"Recovery - updated flags"


18, Change recovery mode back to NORMAL.
"Recovery - disabled recovery mode"


19, Re-allocate all public ip addresses across the cluster.
"Deterministic IPs enabled. Resetting all ip allocations"

   If the IP address allocation on the local node changes you might get
   "Takeover of IP 10.0.0.201/24 on interface eth0"
   "Release of IP 10.0.0.204/24 on interface eth0"

"Recovery - takeip finished"


20, Run the "recovered" eventscript on all nodes.
"Recovery - finished the recovered event"

   You will also get an entry from the local ctdb daemon itself that it has
   switched back to recovery mode NORMAL.
"Recovery has finished"


21, Broadcast a message to all samba daemons in the cluster that the databases
    have been recovered. Samba will now do some additional checking/cleanup of
    the content in the stored records.
"Recovery complete"


22, Finished. At this stage a 10 second timeout (ctdb listvars :
    rerecoverytimeout) is initiated. The cluster will not allow a new recovery
    to be performed until this timeout has expired.
"New recoveries supressed for the rerecovery timeout"
"Rerecovery timeout elapsed. Recovery reactivated."
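The following sketch shows the general shape of steps 10-14 for a single local
database: wrap the rebuild in a TDB transaction, wipe the local copy, merge the
records pulled from the remote nodes keeping only the copy with the highest
record sequence number (RSN), then commit. The pulled_record list, the
RSN-in-header layout and the exact ordering are simplified placeholders; only
the tdb_* calls are the public tdb API. The real logic is spread across
server/ctdb_recoverd.c and server/ctdb_recover.c.

/* Sketch of recovery steps 10-14 for one database.  Illustrative only. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <tdb.h>

struct pulled_record {
    TDB_DATA key;
    TDB_DATA data;               /* header + value from a remote node */
    uint64_t rsn;                /* record sequence number            */
    struct pulled_record *next;
};

static int rebuild_database(struct tdb_context *tdb,
                            struct pulled_record *pulled)
{
    struct pulled_record *r;

    if (tdb_transaction_start(tdb) != 0) {          /* step 10 */
        return -1;
    }
    if (tdb_wipe_all(tdb) != 0) {                   /* step 12 */
        goto fail;
    }

    /* steps 11 and 13: keep only the newest copy of each record */
    for (r = pulled; r != NULL; r = r->next) {
        TDB_DATA old = tdb_fetch(tdb, r->key);
        int newer = 1;

        if (old.dptr != NULL && old.dsize >= sizeof(uint64_t)) {
            uint64_t old_rsn;

            /* placeholder: assume the RSN leads the record header */
            memcpy(&old_rsn, old.dptr, sizeof(old_rsn));
            newer = (r->rsn > old_rsn);
        }
        free(old.dptr);

        if (newer && tdb_store(tdb, r->key, r->data, TDB_REPLACE) != 0) {
            goto fail;
        }
    }

    return tdb_transaction_commit(tdb);             /* step 14 */

fail:
    tdb_transaction_cancel(tdb);
    return -1;
}

If the commit fails, or the recovery is aborted before it, the transaction is
cancelled and the database content on this node is left untouched, which is
exactly the property step 10 is after.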
Example: RECOVERY LOG ON RECMASTER
==================================
2008/12/01 09:57:28.110732 [ 4933]: 10.0.0.21:4379: node 10.0.0.24:4379 is dead: 2 connected
2008/12/01 09:57:28.110838 [ 4933]: Tearing down connection to dead node :3
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:2682 The vnnmap count is different from the number of active nodes. 4 vs 3
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1327 Starting do_recovery
2008/12/01 09:57:28.967297 [ 4935]: ctdb_recovery_lock: Got recovery lock on '/gpfs/.ctdb/shared'
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1355 Recovery initiated due to problem with node 0
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1381 Recovery - created remote databases
2008/12/01 09:57:28.973543 [ 4933]: server/ctdb_recover.c:589 Recovery mode set to ACTIVE
2008/12/01 09:57:28.974823 [ 4933]: server/ctdb_recover.c:904 startrecovery eventscript has been invoked
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1431 started transactions on all nodes
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x42fe72c5
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x42fe72c5 of size 0
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x1421fb78
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x1421fb78 of size 0
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xc0bdde6a
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xc0bdde6a of size 0
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x17055d90
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x17055d90 of size 8
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x7bbbd26c
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x7bbbd26c of size 1
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xf2a58948
2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xf2a58948 of size 51
2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x92380e87
2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x92380e87 of size 17
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x63501287
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x63501287 of size 1
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xe98e08b6
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xe98e08b6 of size 4
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x2672a57f
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x2672a57f of size 28
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xb775fff6
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xb775fff6 of size 6
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1440 Recovery - starting database commits
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1452 Recovery - committed databases
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1502 Recovery - updated vnnmap
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1511 Recovery - updated recmaster
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1522 Recovery - updated flags
2008/12/01 09:57:29.305235 [ 4933]: server/ctdb_recover.c:589 Recovery mode set to NORMAL
2008/12/01 09:57:29.307245 [ 4935]: server/ctdb_recoverd.c:1531 Recovery - disabled recovery mode
2008/12/01 09:57:29.307245 [ 4935]: Deterministic IPs enabled. Resetting all ip allocations
2008/12/01 09:57:29.311071 [ 4933]: takeoverip called for an ip '10.0.0.201' that is not a public address
2008/12/01 09:57:29.311186 [ 4933]: takeoverip called for an ip '10.0.0.202' that is not a public address
2008/12/01 09:57:29.311204 [ 4933]: takeoverip called for an ip '10.0.0.203' that is not a public address
2008/12/01 09:57:29.311299 [ 4933]: takeoverip called for an ip '10.0.0.204' that is not a public address
2008/12/01 09:57:29.537210 [ 4935]: server/ctdb_recoverd.c:1542 Recovery - takeip finished
2008/12/01 09:57:29.545404 [ 4933]: Recovery has finished
2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1551 Recovery - finished the recovered event
2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1557 Recovery complete
2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1565 New recoveries supressed for the rerecovery timeout
2008/12/01 09:57:39.815648 [ 4935]: server/ctdb_recoverd.c:1567 Rerecovery timeout elapsed. Recovery reactivated.