author Michael Paquier <michael@paquier.xyz> 2018-07-09 10:27:10 +0900
committer Michael Paquier <michael@paquier.xyz> 2018-07-09 10:27:10 +0900
commit 62203e6084adcebd277a138e2e0ed21f2b8ed305
tree a2bd92d260906841f5a7c5c731ecf30a19611b8f
parent 378f78da86285aba98196201bf5db4eb6c5c0fed
Rework order of end-of-recovery actions to delay timeline history write

A critical failure in one of the end-of-recovery actions, happening before the end-of-recovery record is written, can leave PostgreSQL inconsistent with the rest of the cluster. Two examples of such failures are an error while processing two-phase state files and an error while operating on recovery.conf.

With this commit, these failures are still considered FATAL, but the write of the timeline history file is delayed as much as possible, so that the window between the moment the file is written and the moment the end-of-recovery record is generated is minimized. This way, in the event of a crash or a failure, the new timeline decided at promotion is less likely to be seen as taken by other nodes in the cluster. It is not really possible to reduce this window to zero, so failures can still be seen if a crash happens between the history file write and the end-of-recovery record. Any future code should therefore be careful when adding new end-of-recovery actions.

The original report from Magnus Hagander mentioned a renamed recovery.conf as the original end-of-recovery failure. The timeline was already seen as taken, but the subsequent processing of the now-missing recovery.conf caused the startup process to stop with a FATAL error. At the following startup the system was inconsistent because of the on-disk changes which had already happened.

Processing of two-phase state files still needs some work, as corrupted entries are simply ignored for now. This is left as a future item; this commit fixes the original complaint.

Reported-by: Magnus Hagander
Author: Heikki Linnakangas
Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
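
To illustrate the effect of the reordering, here is a minimal toy model in C. It is not PostgreSQL code: the `Step` enum, the `RecoveryState` struct, and the `run()` and `timeline_dangling()` helpers are all hypothetical names made up for this sketch. It only assumes the step order visible in the diff below (old: history write, exit archive recovery, two-phase prescan; new: prescan, exit archive recovery, history write) and treats a FATAL error as stopping before the failing step has any effect.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for end-of-recovery actions; not PostgreSQL identifiers. */
typedef enum {
    STEP_NONE,                    /* sentinel: no step fails */
    STEP_PRESCAN_2PC,             /* stands in for PrescanPreparedTransactions() */
    STEP_EXIT_ARCHIVE_RECOVERY,   /* stands in for exitArchiveRecovery() */
    STEP_WRITE_HISTORY,           /* stands in for writeTimeLineHistory() */
    STEP_END_OF_RECOVERY_RECORD   /* stands in for the end-of-recovery record */
} Step;

typedef struct {
    bool history_written;     /* new timeline looks "taken" to other nodes */
    bool end_record_written;  /* the switch to the new timeline is in WAL */
} RecoveryState;

/* Order before this commit, as suggested by the removed diff lines. */
static const Step OLD_ORDER[] = {
    STEP_WRITE_HISTORY, STEP_EXIT_ARCHIVE_RECOVERY,
    STEP_PRESCAN_2PC, STEP_END_OF_RECOVERY_RECORD
};

/* Order after this commit, as suggested by the added diff lines. */
static const Step NEW_ORDER[] = {
    STEP_PRESCAN_2PC, STEP_EXIT_ARCHIVE_RECOVERY,
    STEP_WRITE_HISTORY, STEP_END_OF_RECOVERY_RECORD
};

/* Run the steps in order; stop (as on FATAL) just before failing_step. */
static RecoveryState run(const Step *order, int n, Step failing_step)
{
    RecoveryState s = { false, false };

    assert(n > 0);
    for (int i = 0; i < n; i++) {
        if (order[i] == failing_step)
            break;                      /* simulated FATAL: nothing after runs */
        if (order[i] == STEP_WRITE_HISTORY)
            s.history_written = true;
        if (order[i] == STEP_END_OF_RECOVERY_RECORD)
            s.end_record_written = true;
    }
    return s;
}

/* True if the timeline appears taken but was never actually switched to. */
static bool timeline_dangling(RecoveryState s)
{
    return s.history_written && !s.end_record_written;
}
```

In this model, a failure at `STEP_EXIT_ARCHIVE_RECOVERY` or `STEP_PRESCAN_2PC` leaves a dangling timeline with `OLD_ORDER` but not with `NEW_ORDER`. A failure at `STEP_END_OF_RECOVERY_RECORD` still leaves one in both orders, which matches the remark that the window cannot be reduced to zero.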
-rw-r--r-- src/backend/access/transam/xlog.c 37
1 file changed, 25 insertions, 12 deletions
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7061269871..0b00a45a90 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7113,6 +7113,13 @@ StartupXLOG(void)
}
/*
+ * Pre-scan prepared transactions to find out the range of XIDs present.
+ * This information is not quite needed yet, but it is positioned here so
+ * as potential problems are detected before any on-disk change is done.
+ */
+ oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
+ /*
* Consider whether we need to assign a new timeline ID.
*
* If we are doing an archive recovery, we always assign a new ID. This
@@ -7160,6 +7167,24 @@ StartupXLOG(void)
else
snprintf(reason, sizeof(reason), "no recovery target specified");
+ /*
+ * We are now done reading the old WAL. Turn off archive fetching if
+ * it was active, and make a writable copy of the last WAL segment.
+ * (Note that we also have a copy of the last block of the old WAL in
+ * readBuf; we will use that below.)
+ */
+ exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+ /*
+ * Write the timeline history file, and have it archived. After this
+ * point (or rather, as soon as the file is archived), the timeline
+ * will appear as "taken" in the WAL archive and to any standby
+ * servers. If we crash before actually switching to the new
+ * timeline, standby servers will nevertheless think that we switched
+ * to the new timeline, and will try to connect to the new timeline.
+ * To minimize the window for that, try to do as little as possible
+ * between here and writing the end-of-recovery record.
+ */
writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
EndRecPtr, reason);
}
@@ -7169,15 +7194,6 @@ StartupXLOG(void)
XLogCtl->PrevTimeLineID = PrevTimeLineID;
/*
- * We are now done reading the old WAL. Turn off archive fetching if it
- * was active, and make a writable copy of the last WAL segment. (Note
- * that we also have a copy of the last block of the old WAL in readBuf;
- * we will use that below.)
- */
- if (ArchiveRecoveryRequested)
- exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
- /*
* Prepare to write WAL starting at EndOfLog position, and init xlog
* buffer cache using the block containing the last record from the
* previous incarnation.
@@ -7229,9 +7245,6 @@ StartupXLOG(void)
XLogCtl->LogwrtRqst.Write = EndOfLog;
XLogCtl->LogwrtRqst.Flush = EndOfLog;
- /* Pre-scan prepared transactions to find out the range of XIDs present */
- oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
/*
* Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
* record before resource manager writes cleanup WAL records or checkpoint