From efada2b8e920adfdf7418862e939925d2acd1b89 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 14 Mar 2014 15:43:58 +0200 Subject: Fix race condition in B-tree page deletion. In short, we don't allow a page to be deleted if it's the rightmost child of its parent, but that situation can change after we check for it. Problem ------- We check that the page to be deleted is not the rightmost child of its parent, and then lock its left sibling, the page itself, its right sibling, and the parent, in that order. However, if the parent page is split after the check but before acquiring the locks, the target page might become the rightmost child, if the split happens at the right place. That leads to an error in vacuum (I reproduced this by setting a breakpoint in debugger): ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey" We currently re-check that the page is still the rightmost child, and throw the above error if it's not. We could easily just give up rather than throw an error, but that approach doesn't scale to half-dead pages. To recap, although we don't normally allow deleting the rightmost child, if the page is the *only* child of its parent, we delete the child page and mark the parent page as half-dead in one atomic operation. But before we do that, we check that the parent can later be deleted, by checking that it in turn is not the rightmost child of the grandparent (potentially recursing all the way up to the root). But the same situation can arise there - the grandparent can be split while we're not holding the locks. We end up with a half-dead page that we cannot delete. To make things worse, the keyspace of the deleted page has already been transferred to its right sibling. As the README points out, the keyspace at the grandparent level is "out-of-whack" until the half-dead page is deleted, and if enough tuples with keys in the transferred keyspace are inserted, the page might get split and a downlink might be inserted into the grandparent that is out-of-order. That might not cause any serious problem if it's transient (as the README ponders), but is surely bad if it stays that way. Solution -------- This patch changes the page deletion algorithm to avoid that problem. After checking that the topmost page in the chain of to-be-deleted pages is not the rightmost child of its parent, and then deleting the pages from bottom up, unlink the pages from top to bottom. This way, the intermediate stages are similar to the intermediate stages in page splitting, and there is no transient stage where the keyspace is "out-of-whack". The topmost page in the to-be-deleted chain doesn't have a downlink pointing to it, like a page split before the downlink has been inserted. This also allows us to get rid of the cleanup step after WAL recovery, if we crash during page deletion. The deletion will be continued at next VACUUM, but the tree is consistent for searches and insertions at every step. This bug is old, all supported versions are affected, but this patch is too big to back-patch (and changes the WAL record formats of related records). We have not heard any reports of the bug from users, so clearly it's not easy to bump into. Maybe backpatch later, after this has had some field testing. Reviewed by Kevin Grittner and Peter Geoghegan. --- src/include/access/nbtree.h | 59 ++++++++++++++++++++++++++++++++------------- 1 file changed, 42 insertions(+), 17 deletions(-) (limited to 'src/include/access/nbtree.h') diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 662888627d..7b26f982b2 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -215,11 +215,10 @@ typedef struct BTMetaPageData #define XLOG_BTREE_SPLIT_L_ROOT 0x50 /* add tuple with split of root */ #define XLOG_BTREE_SPLIT_R_ROOT 0x60 /* as above, new item on right */ #define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */ -#define XLOG_BTREE_DELETE_PAGE 0x80 /* delete an entire page */ -#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, and update metapage */ +#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */ +#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */ #define XLOG_BTREE_NEWROOT 0xA0 /* new root page */ -#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0 /* page deletion that makes - * parent half-dead */ +#define XLOG_BTREE_MARK_PAGE_HALFDEAD 0xB0 /* mark a leaf as half-dead */ #define XLOG_BTREE_VACUUM 0xC0 /* delete entries on a page during * vacuum */ #define XLOG_BTREE_REUSE_PAGE 0xD0 /* old page is about to be reused from @@ -370,23 +369,49 @@ typedef struct xl_btree_vacuum #define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber)) /* - * This is what we need to know about deletion of a btree page. The target - * identifies the tuple removed from the parent page (note that we remove - * this tuple's downlink and the *following* tuple's key). Note we do not - * store any content for the deleted page --- it is just rewritten as empty - * during recovery, apart from resetting the btpo.xact. + * This is what we need to know about marking an empty branch for deletion. + * The target identifies the tuple removed from the parent page (note that we + * remove this tuple's downlink and the *following* tuple's key). Note that + * the leaf page is empty, so we don't need to store its content --- it is + * just reinitialized during recovery using the rest of the fields. */ -typedef struct xl_btree_delete_page +typedef struct xl_btree_mark_page_halfdead { xl_btreetid target; /* deleted tuple id in parent page */ - BlockNumber deadblk; /* child block being deleted */ - BlockNumber leftblk; /* child block's left sibling, if any */ - BlockNumber rightblk; /* child block's right sibling */ + BlockNumber leafblk; /* leaf block ultimately being deleted */ + BlockNumber leftblk; /* leaf block's left sibling, if any */ + BlockNumber rightblk; /* leaf block's right sibling */ + BlockNumber downlink; /* next child down in the branch */ +} xl_btree_mark_page_halfdead; + +#define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, downlink) + sizeof(BlockNumber)) + +/* + * This is what we need to know about deletion of a btree page. Note we do + * not store any content for the deleted page --- it is just rewritten as empty + * during recovery, apart from resetting the btpo.xact. + */ +typedef struct xl_btree_unlink_page +{ + RelFileNode node; + BlockNumber deadblk; /* target block being deleted */ + BlockNumber leftsib; /* taregt block's left sibling, if any */ + BlockNumber rightsib; /* target block's right sibling */ + + /* + * Information needed to recreate the leaf page, when target is an internal + * page. + */ + BlockNumber leafblk; + BlockNumber leafleftsib; + BlockNumber leafrightsib; + BlockNumber downlink; /* next child down in the branch */ + TransactionId btpo_xact; /* value of btpo.xact for use in recovery */ - /* xl_btree_metadata FOLLOWS IF XLOG_BTREE_DELETE_PAGE_META */ -} xl_btree_delete_page; + /* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */ +} xl_btree_unlink_page; -#define SizeOfBtreeDeletePage (offsetof(xl_btree_delete_page, btpo_xact) + sizeof(TransactionId)) +#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId)) /* * New root log record. There are zero tuples if this is to establish an @@ -639,7 +664,7 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf, extern void _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *itemnos, int nitems, BlockNumber lastBlockVacuumed); -extern int _bt_pagedel(Relation rel, Buffer buf, BTStack stack); +extern int _bt_pagedel(Relation rel, Buffer buf); /* * prototypes for functions in nbtsearch.c -- cgit v1.2.1