public inbox for [email protected]  
help / color / mirror / Atom feed
Re: Odd code around ginScanToDelete
13+ messages / 4 participants
[nested] [flat]

* Re: Odd code around ginScanToDelete
@ 2026-02-21 09:54 Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Alexander Korotkov @ 2026-02-21 09:54 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers

On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <[email protected]> wrote:
>
> On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <[email protected]> wrote:
> >
> > While looking at converting more places to UnlockReleaseBuffer(), in the
> > course of making UnlockReleaseBuffer() faster than the two separate
> > operations, I found this code:
> >
> > static bool
> > ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
> >                                 DataPageDeleteStack *parent, OffsetNumber myoff)
> > ...
> >
> >         if (!meDelete)
> >         {
> >                 if (BufferIsValid(me->leftBuffer))
> >                         UnlockReleaseBuffer(me->leftBuffer);
> >                 me->leftBuffer = buffer;
> >         }
> >         else
> >         {
> >                 if (!isRoot)
> >                         LockBuffer(buffer, GIN_UNLOCK);
> >
> >                 ReleaseBuffer(buffer);
> >         }
> >
> >         if (isRoot)
> >                 ReleaseBuffer(buffer);
> >
> >
> > Which sure looks like it'd release buffer twice if isRoot is set?  I guess
> > that's not reachable, because presumably the root page will always go down the
> > !meDelete path. But it sure made me wonder if there's a hard to reach bug.
>
> Yes, it's not possible to have meDelete set for root, because
> me->leftBuffer is always InvalidBuffer for the root.  So the branch
> handling meDelete case should better do Assert(!isRoot).
>
> > This code was introduced in
> >   commit e14641197a5
> >   Author: Alexander Korotkov <[email protected]>
> >   Date:   2019-11-19 23:07:36 +0300
> >
> >       Fix deadlock between ginDeletePage() and ginStepRight()
> >
> > I didn't trace it further to see if it existed before that in some fashion.
>
> Yes.  I think generally this area needs to be reworked to become more
> clear, and have vast more comments.  It was wrong from my side trying
> to fix bugs there without reworking it into something more
> appropriate.  I'm planning to put work on this during this week.
>
> > There's another oddity here: ginScanToDelete() requires that the root page has
> > been locked by the caller already, but will afaict re-read the root page? But
> > then have code to avoid locking it again, because that would not have worked?
> > Seems odd.
>
>
> It seems a bit odd for me that caller already have locked buffer, but
> passes BlockNumber making us re-read the buffer.  But I'm not sure
> that's the same as your point.  Could you, please, elaborate more on
> this?

Here is the refactoring patch.  Sorry for the delay.

------
Regards,
Alexander Korotkov
Supabase


Attachments:

  [application/octet-stream] v1-0001-Rework-ginScanToDelete-to-pass-Buffers-instead-of.patch (10.8K, 2-v1-0001-Rework-ginScanToDelete-to-pass-Buffers-instead-of.patch)
  download | inline diff:
From 0c4d098c981e297857f527683e8aed4d07098689 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <[email protected]>
Date: Sat, 21 Feb 2026 11:08:08 +0200
Subject: [PATCH v1] Rework ginScanToDelete() to pass Buffers instead of
 BlockNumbers.

Previously, ginScanToDelete() and ginDeletePage() passed BlockNumbers and
re-read pages that were already pinned and locked during the tree walk.  The
caller ginVacuumPostingTree()) held a cleanup-locked root buffer, yet
ginScanToDelete() re-read it by block number with special-case code to skip
re-locking.

Rework both functions to pass Buffers directly.  DataPageDeleteStack now
carries buffer, myoff (downlink offset in parent), and isRoot per level,
so ginScanToDelete() takes only GinVacuumState and DataPageDeleteStack
pointers.  Also, ginDeletePage() receives the three Buffers directly, and
no longer reads or releases them itself.  The caller reads and locks child
pages before recursing, and manages buffer lifecycle afterward.

This eliminates the confusing isRoot special cases in buffer management,
including the apparent (but unreachable) double release of the root
buffer identified by Andres Freund.

Add comments explaining the locking protocol and the DataPageDeleteStack
structure.

Reported-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/utrlxij43fbguzw4kldte2spc4btoldizutcqyrfakqnbrp3ir@ph3sphpj4asz
---
 src/backend/access/gin/ginvacuum.c | 188 +++++++++++++++++------------
 1 file changed, 111 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index c9f143f6c31..28002ac97f6 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -110,31 +110,52 @@ xlogVacuumPage(Relation index, Buffer buffer)
 }
 
 
+/*
+ * Stack entry used during posting tree empty-page deletion scan.
+ *
+ * One DataPageDeleteStack entry is allocated per tree level.  As
+ * ginScanToDelete() recurses down the tree, each entry tracks the buffer
+ * of the page currently being visited at that level ('buffer'), and the
+ * rightmost non-deleted page to its left ('leftBuffer').  The left page
+ * is kept pinned and exclusively locked because ginDeletePage() needs it
+ * to update the sibling chain; acquiring it later could deadlock with
+ * ginStepRight(), which locks pages left-to-right (see commit e14641197a5).
+ */
 typedef struct DataPageDeleteStack
 {
 	struct DataPageDeleteStack *child;
 	struct DataPageDeleteStack *parent;
 
-	BlockNumber blkno;			/* current block number */
-	Buffer		leftBuffer;		/* pinned and locked rightest non-deleted page
-								 * on left */
+	Buffer		buffer;			/* buffer for the page being visited at this
+								 * tree level; valid only while recursing
+								 * into children */
+	Buffer		leftBuffer;		/* pinned and locked rightmost non-deleted
+								 * sibling to the left of the current page */
+	OffsetNumber myoff;			/* offset of this page's downlink in the
+								 * parent */
 	bool		isRoot;
 } DataPageDeleteStack;
 
 
 /*
  * Delete a posting tree page.
+ *
+ * Removes the page identified by dBuffer from the posting tree by updating
+ * the left sibling's rightlink (in lBuffer) to skip over the deleted page,
+ * and removing the downlink from the parent page (in pBuffer).  All three
+ * buffers must already be pinned and exclusively locked by the caller.
+ *
+ * The buffers are NOT released nor unlocked here; the caller is responsible
+ * for this.
  */
 static void
-ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkno,
-			  BlockNumber parentBlkno, OffsetNumber myoff, bool isParentRoot)
+ginDeletePage(GinVacuumState *gvs, Buffer dBuffer, Buffer lBuffer,
+			  Buffer pBuffer, OffsetNumber myoff, bool isParentRoot)
 {
-	Buffer		dBuffer;
-	Buffer		lBuffer;
-	Buffer		pBuffer;
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	BlockNumber deleteBlkno = BufferGetBlockNumber(dBuffer);
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -142,12 +163,6 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 * happen in this subtree. Caller also acquires Exclusive locks on
 	 * deletable, parent and left pages.
 	 */
-	lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
-								 RBM_NORMAL, gvs->strategy);
-	dBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, deleteBlkno,
-								 RBM_NORMAL, gvs->strategy);
-	pBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, parentBlkno,
-								 RBM_NORMAL, gvs->strategy);
 
 	page = BufferGetPage(dBuffer);
 	rightlink = GinPageGetOpaque(page)->rightlink;
@@ -226,55 +241,37 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 
 	END_CRIT_SECTION();
 
-	ReleaseBuffer(pBuffer);
-	ReleaseBuffer(lBuffer);
-	ReleaseBuffer(dBuffer);
-
 	gvs->result->pages_newly_deleted++;
 	gvs->result->pages_deleted++;
 }
 
 
 /*
- * Scans posting tree and deletes empty pages.  Caller must lock root page for
- * cleanup.  During scan path from root to current page is kept exclusively
- * locked.  Also keep left page exclusively locked, because ginDeletePage()
- * needs it.  If we try to relock left page later, it could deadlock with
- * ginStepRight().
+ * Scans a posting tree and deletes empty pages.
+ *
+ * The caller must hold a cleanup lock on the root page to prevent concurrent
+ * inserts.  The entire path from the root down to the current page is kept
+ * exclusively locked throughout the scan.  The left sibling at each level is
+ * also kept locked, because ginDeletePage() needs it to update the rightlink
+ * of the left sibling; re-acquiring the left sibling lock later could
+ * deadlock with ginStepRight(), which acquires page locks left-to-right.
+ *
+ * All per-level state is carried in 'myStackItem': the buffer to process
+ * (must already be pinned and exclusively locked), the left sibling buffer,
+ * and this page's offset in the parent's downlink array.  The root entry is
+ * set up by ginVacuumPostingTree(); child entries are populated here before
+ * recursing.
+ *
+ * Returns true if the page was deleted, false otherwise.
  */
 static bool
-ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
-				DataPageDeleteStack *parent, OffsetNumber myoff)
+ginScanToDelete(GinVacuumState *gvs, DataPageDeleteStack *myStackItem)
 {
-	DataPageDeleteStack *me;
-	Buffer		buffer;
+	Buffer		buffer = myStackItem->buffer;
 	Page		page;
-	bool		meDelete = false;
+	bool		pageWasDeleted = false;
 	bool		isempty;
 
-	if (isRoot)
-	{
-		me = parent;
-	}
-	else
-	{
-		if (!parent->child)
-		{
-			me = palloc0_object(DataPageDeleteStack);
-			me->parent = parent;
-			parent->child = me;
-			me->leftBuffer = InvalidBuffer;
-		}
-		else
-			me = parent->child;
-	}
-
-	buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
-								RBM_NORMAL, gvs->strategy);
-
-	if (!isRoot)
-		LockBuffer(buffer, GIN_EXCLUSIVE);
-
 	page = BufferGetPage(buffer);
 
 	Assert(GinPageIsData(page));
@@ -283,19 +280,48 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 	{
 		OffsetNumber i;
 
-		me->blkno = blkno;
-		for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff; i++)
+		for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff;)
 		{
 			PostingItem *pitem = GinDataPageGetPostingItem(page, i);
+			Buffer		childBuffer;
+
+			childBuffer = ReadBufferExtended(gvs->index,
+											 MAIN_FORKNUM,
+											 PostingItemGetBlockNumber(pitem),
+											 RBM_NORMAL, gvs->strategy);
+			LockBuffer(childBuffer, GIN_EXCLUSIVE);
+
+			/* Allocate a child stack entry on first use; reuse thereafter */
+			if (!myStackItem->child)
+			{
+				myStackItem->child = palloc0_object(DataPageDeleteStack);
+				myStackItem->child->parent = myStackItem;
+				myStackItem->child->leftBuffer = InvalidBuffer;
+			}
 
-			if (ginScanToDelete(gvs, PostingItemGetBlockNumber(pitem), false, me, i))
-				i--;
+			myStackItem->child->buffer = childBuffer;
+			myStackItem->child->isRoot = false;
+			myStackItem->child->myoff = i;
+
+			/*
+			 * Recurse into child.  If the child page was deleted, its
+			 * downlink was removed from our page, so re-examine the same
+			 * offset; otherwise advance to the next downlink.
+			 */
+			if (!ginScanToDelete(gvs, myStackItem->child))
+				i++;
 		}
+		myStackItem->buffer = InvalidBuffer;
 
-		if (GinPageRightMost(page) && BufferIsValid(me->child->leftBuffer))
+		/*
+		 * After processing all children at this level, release the child
+		 * level's leftBuffer if we're at the rightmost page.  There is no
+		 * right sibling that could need it for deletion.
+		 */
+		if (GinPageRightMost(page) && BufferIsValid(myStackItem->child->leftBuffer))
 		{
-			UnlockReleaseBuffer(me->child->leftBuffer);
-			me->child->leftBuffer = InvalidBuffer;
+			UnlockReleaseBuffer(myStackItem->child->leftBuffer);
+			myStackItem->child->leftBuffer = InvalidBuffer;
 		}
 	}
 
@@ -306,34 +332,40 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 
 	if (isempty)
 	{
-		/* we never delete the left- or rightmost branch */
-		if (BufferIsValid(me->leftBuffer) && !GinPageRightMost(page))
+		/*
+		 * Proceed to the ginDeletePage() if that's not the leftmost or
+		 * the rightmost page.
+		 */
+		if (BufferIsValid(myStackItem->leftBuffer) && !GinPageRightMost(page))
 		{
-			Assert(!isRoot);
-			ginDeletePage(gvs, blkno, BufferGetBlockNumber(me->leftBuffer),
-						  me->parent->blkno, myoff, me->parent->isRoot);
-			meDelete = true;
+			Assert(!myStackItem->isRoot);
+			ginDeletePage(gvs, buffer, myStackItem->leftBuffer,
+						  myStackItem->parent->buffer, myStackItem->myoff,
+						  myStackItem->parent->isRoot);
+			pageWasDeleted = true;
 		}
 	}
 
-	if (!meDelete)
+	if (!pageWasDeleted)
 	{
-		if (BufferIsValid(me->leftBuffer))
-			UnlockReleaseBuffer(me->leftBuffer);
-		me->leftBuffer = buffer;
+		/*
+		 * Keep this page as the new leftBuffer for this level: the next
+		 * sibling to the right might need it for deletion.  Release any
+		 * previously held left page first.
+		 */
+		if (BufferIsValid(myStackItem->leftBuffer))
+			UnlockReleaseBuffer(myStackItem->leftBuffer);
+		myStackItem->leftBuffer = buffer;
 	}
 	else
 	{
-		if (!isRoot)
-			LockBuffer(buffer, GIN_UNLOCK);
-
-		ReleaseBuffer(buffer);
+		/*
+		 * Page was deleted; release the buffer.  leftBuffer remains the same.
+		 */
+		UnlockReleaseBuffer(buffer);
 	}
 
-	if (isRoot)
-		ReleaseBuffer(buffer);
-
-	return meDelete;
+	return pageWasDeleted;
 }
 
 
@@ -428,10 +460,12 @@ ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
 		LockBufferForCleanup(buffer);
 
 		memset(&root, 0, sizeof(DataPageDeleteStack));
+		root.buffer = buffer;
 		root.leftBuffer = InvalidBuffer;
+		root.myoff = InvalidOffsetNumber;
 		root.isRoot = true;
 
-		ginScanToDelete(gvs, rootBlkno, true, &root, InvalidOffsetNumber);
+		ginScanToDelete(gvs, &root);
 
 		ptr = root.child;
 
-- 
2.39.5 (Apple Git-154)



^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
@ 2026-03-06 14:45 ` Pavel Borisov <[email protected]>
  2026-03-10 09:19   ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
  2026-03-11 22:41   ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  0 siblings, 2 replies; 13+ messages in thread

From: Pavel Borisov @ 2026-03-06 14:45 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; +Cc: Andres Freund <[email protected]>; pgsql-hackers

Hi, Andres and Alexander!

On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <[email protected]> wrote:
>
> On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <[email protected]> wrote:
> >
> > On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <[email protected]> wrote:
> > >
> > > While looking at converting more places to UnlockReleaseBuffer(), in the
> > > course of making UnlockReleaseBuffer() faster than the two separate
> > > operations, I found this code:
> > >
> > > static bool
> > > ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
> > >                                 DataPageDeleteStack *parent, OffsetNumber myoff)
> > > ...
> > >
> > >         if (!meDelete)
> > >         {
> > >                 if (BufferIsValid(me->leftBuffer))
> > >                         UnlockReleaseBuffer(me->leftBuffer);
> > >                 me->leftBuffer = buffer;
> > >         }
> > >         else
> > >         {
> > >                 if (!isRoot)
> > >                         LockBuffer(buffer, GIN_UNLOCK);
> > >
> > >                 ReleaseBuffer(buffer);
> > >         }
> > >
> > >         if (isRoot)
> > >                 ReleaseBuffer(buffer);
> > >
> > >
> > > Which sure looks like it'd release buffer twice if isRoot is set?  I guess
> > > that's not reachable, because presumably the root page will always go down the
> > > !meDelete path. But it sure made me wonder if there's a hard to reach bug.
> >
> > Yes, it's not possible to have meDelete set for root, because
> > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > handling meDelete case should better do Assert(!isRoot).
> > > This code was introduced in
> > >   commit e14641197a5
> > >   Author: Alexander Korotkov <[email protected]>
> > >   Date:   2019-11-19 23:07:36 +0300
> > >
> > >       Fix deadlock between ginDeletePage() and ginStepRight()
> > >
> > > I didn't trace it further to see if it existed before that in some fashion.
> >
> > Yes.  I think generally this area needs to be reworked to become more
> > clear, and have vast more comments.  It was wrong from my side trying
> > to fix bugs there without reworking it into something more
> > appropriate.  I'm planning to put work on this during this week.
> >
> > > There's another oddity here: ginScanToDelete() requires that the root page has
> > > been locked by the caller already, but will afaict re-read the root page? But
> > > then have code to avoid locking it again, because that would not have worked?
> > > Seems odd.
> >
> >
> > It seems a bit odd for me that caller already have locked buffer, but
> > passes BlockNumber making us re-read the buffer.  But I'm not sure
> > that's the same as your point.  Could you, please, elaborate more on
> > this?
>
> Here is the refactoring patch.  Sorry for the delay.

Hi, Andres and Alexander!

I've looked into the patch v1.
Overall, it looks good to me.

Some thoughts:

Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.

Could limiting the maximum recursion level be useful?

In the comment to myStackItem before ginScanToDelete(), it might be
worth adding that after processing all pages on the current level,
myStackItem is not needed anymore.

> > Yes, it's not possible to have meDelete set for root, because
> > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > handling meDelete case should better do Assert(!isRoot).
Looks like this additional Assert is not in patch v1.

In the root call of ginScanToDelete(gvs, &root); we can add Assert
checking that its return result is false:
-               ginScanToDelete(gvs, &root);
+              deleted = ginScanToDelete(gvs, &root);
+.             Assert(!deleted); /* Root page is never deleted */

Additionally, it could be good to rename all vacuum functions related
to posting tree pages only, to include "Posting" (e.g., ginDeletePage
-> ginDeletePostingPage). The same is for the functions only for the
entry tree. It is already named this way in many places (e.g.
ginVacuumPostingTreeLeaves). It could be good to extend this to all
relevant functions.

Several small proposals on wording:
"rightmost non-deleted page to its left" -> "closest non-deleted
sibling page to its left"
"each entry tracks the buffer of the page" -> "each entry tracks the
buffers of the page" (as two buffers are mentioned)
"must already be pinned" -> "must already have been pinned"

Best regards,
Pavel Borisov
Supabase





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
@ 2026-03-10 09:19   ` Xuneng Zhou <[email protected]>
  2026-03-10 09:28     ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-11 22:51     ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  1 sibling, 2 replies; 13+ messages in thread

From: Xuneng Zhou @ 2026-03-10 09:19 UTC (permalink / raw)
  To: Pavel Borisov <[email protected]>; Alexander Korotkov <[email protected]>; +Cc: Andres Freund <[email protected]>; pgsql-hackers

Hi Pavel, Alexander,

On Fri, Mar 6, 2026 at 10:45 PM Pavel Borisov <[email protected]> wrote:
>
> Hi, Andres and Alexander!
>
> On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <[email protected]> wrote:
> >
> > On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <[email protected]> wrote:
> > >
> > > On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <[email protected]> wrote:
> > > >
> > > > While looking at converting more places to UnlockReleaseBuffer(), in the
> > > > course of making UnlockReleaseBuffer() faster than the two separate
> > > > operations, I found this code:
> > > >
> > > > static bool
> > > > ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
> > > >                                 DataPageDeleteStack *parent, OffsetNumber myoff)
> > > > ...
> > > >
> > > >         if (!meDelete)
> > > >         {
> > > >                 if (BufferIsValid(me->leftBuffer))
> > > >                         UnlockReleaseBuffer(me->leftBuffer);
> > > >                 me->leftBuffer = buffer;
> > > >         }
> > > >         else
> > > >         {
> > > >                 if (!isRoot)
> > > >                         LockBuffer(buffer, GIN_UNLOCK);
> > > >
> > > >                 ReleaseBuffer(buffer);
> > > >         }
> > > >
> > > >         if (isRoot)
> > > >                 ReleaseBuffer(buffer);
> > > >
> > > >
> > > > Which sure looks like it'd release buffer twice if isRoot is set?  I guess
> > > > that's not reachable, because presumably the root page will always go down the
> > > > !meDelete path. But it sure made me wonder if there's a hard to reach bug.
> > >
> > > Yes, it's not possible to have meDelete set for root, because
> > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > > handling meDelete case should better do Assert(!isRoot).
> > > > This code was introduced in
> > > >   commit e14641197a5
> > > >   Author: Alexander Korotkov <[email protected]>
> > > >   Date:   2019-11-19 23:07:36 +0300
> > > >
> > > >       Fix deadlock between ginDeletePage() and ginStepRight()
> > > >
> > > > I didn't trace it further to see if it existed before that in some fashion.
> > >
> > > Yes.  I think generally this area needs to be reworked to become more
> > > clear, and have vast more comments.  It was wrong from my side trying
> > > to fix bugs there without reworking it into something more
> > > appropriate.  I'm planning to put work on this during this week.
> > >
> > > > There's another oddity here: ginScanToDelete() requires that the root page has
> > > > been locked by the caller already, but will afaict re-read the root page? But
> > > > then have code to avoid locking it again, because that would not have worked?
> > > > Seems odd.
> > >
> > >
> > > It seems a bit odd for me that caller already have locked buffer, but
> > > passes BlockNumber making us re-read the buffer.  But I'm not sure
> > > that's the same as your point.  Could you, please, elaborate more on
> > > this?
> >
> > Here is the refactoring patch.  Sorry for the delay.
>
> Hi, Andres and Alexander!
>
> I've looked into the patch v1.
> Overall, it looks good to me.

The refactor LGTM in general. The buffer-ownership rewrite looks
cleaner and safer overall.

> Some thoughts:
>
> Is it worth/possible in recursive calls of ginScanToDelete() to free
> allocated myStackItem->child after processing all children of the
> current level, when they are not needed anymore?
> Previously to this patch, palloc-ed "me" variable also was't freed at
> recursion levels.

I think freeing myStackItem->child inside recursive calls might not be
worthwhile here. That node is intentionally reused for subsequent
siblings at the same depth, and it carries state (leftBuffer) that can
still be needed until the level is fully processed.
Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundary

> Could limiting the maximum recursion level be useful?

Posting-tree depth is naturally small; a hard cap seems to add failure
risk with little practical benefit.

> In the comment to myStackItem before ginScanToDelete(), it might be
> worth adding that after processing all pages on the current level,
> myStackItem is not needed anymore.
>
> > > Yes, it's not possible to have meDelete set for root, because
> > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > > handling meDelete case should better do Assert(!isRoot).
> Looks like this additional Assert is not in patch v1.
>
> In the root call of ginScanToDelete(gvs, &root); we can add Assert
> checking that its return result is false:
> -               ginScanToDelete(gvs, &root);
> +              deleted = ginScanToDelete(gvs, &root);
> +.             Assert(!deleted); /* Root page is never deleted */

+ 1,  this is a good invariant check and improves readability

One minor nit for the comment:

The DataPageDeleteStack.buffer field comment says "valid only while
recursing into children"
this is true for internal pages, but for leaf pages the buffer is
valid until the pageWasDeleted / leftBuffer decision. The validity
window is actually "from when the caller sets it until the
post-recursion cleanup."


--
Best,
Xuneng





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-10 09:19   ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
@ 2026-03-10 09:28     ` Pavel Borisov <[email protected]>
  2026-03-11 22:22       ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  1 sibling, 1 reply; 13+ messages in thread

From: Pavel Borisov @ 2026-03-10 09:28 UTC (permalink / raw)
  To: Xuneng Zhou <[email protected]>; +Cc: Alexander Korotkov <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

Hi, Xuneng

> > Is it worth/possible in recursive calls of ginScanToDelete() to free
> > allocated myStackItem->child after processing all children of the
> > current level, when they are not needed anymore?
> > Previously to this patch, palloc-ed "me" variable also was't freed at
> > recursion levels.
>
> Freeing/reallocating it per subtree would add churn and make the
> lifetime rules harder to reason about without meaningful memory
> savings (the number of nodes is bounded by tree depth, not number of
> pages). We currently free the chain once after ginScanToDelete()
> returns in ginVacuumPostingTree(), which matches the natural lifetime
> boundary
I proposed not freeing child when child iteration is complete. They
indeed can be reused. I proposed cleaning children when "my" iteration
is complete. At that time all the children iterations are completed
and not needed when we return level up.

Regards,
Pavel Borisov
Supabase





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-10 09:19   ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
  2026-03-10 09:28     ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
@ 2026-03-11 22:22       ` Alexander Korotkov <[email protected]>
  2026-03-11 22:35         ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Alexander Korotkov @ 2026-03-11 22:22 UTC (permalink / raw)
  To: Pavel Borisov <[email protected]>; +Cc: Xuneng Zhou <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

On Tue, Mar 10, 2026 at 11:29 AM Pavel Borisov <[email protected]> wrote:
> Hi, Xuneng
>
> > > Is it worth/possible in recursive calls of ginScanToDelete() to free
> > > allocated myStackItem->child after processing all children of the
> > > current level, when they are not needed anymore?
> > > Previously to this patch, palloc-ed "me" variable also was't freed at
> > > recursion levels.
> >
> > Freeing/reallocating it per subtree would add churn and make the
> > lifetime rules harder to reason about without meaningful memory
> > savings (the number of nodes is bounded by tree depth, not number of
> > pages). We currently free the chain once after ginScanToDelete()
> > returns in ginVacuumPostingTree(), which matches the natural lifetime
> > boundary
> I proposed not freeing child when child iteration is complete. They
> indeed can be reused. I proposed cleaning children when "my" iteration
> is complete. At that time all the children iterations are completed
> and not needed when we return level up.

This is not clear for me.  We need stack items to keep track of left
pages until we scan the whole posting tree.  After scanning the whole
posting tree we can free stack items as we do now.

------
Regards,
Alexander Korotkov
Supabase





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-10 09:19   ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
  2026-03-10 09:28     ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-11 22:22       ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
@ 2026-03-11 22:35         ` Pavel Borisov <[email protected]>
  2026-03-11 22:37           ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Pavel Borisov @ 2026-03-11 22:35 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; +Cc: Xuneng Zhou <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

On Thu, 12 Mar 2026 at 02:22, Alexander Korotkov <[email protected]> wrote:
>
> On Tue, Mar 10, 2026 at 11:29 AM Pavel Borisov <[email protected]> wrote:
> > Hi, Xuneng
> >
> > > > Is it worth/possible in recursive calls of ginScanToDelete() to free
> > > > allocated myStackItem->child after processing all children of the
> > > > current level, when they are not needed anymore?
> > > > Previously to this patch, palloc-ed "me" variable also was't freed at
> > > > recursion levels.
> > >
> > > Freeing/reallocating it per subtree would add churn and make the
> > > lifetime rules harder to reason about without meaningful memory
> > > savings (the number of nodes is bounded by tree depth, not number of
> > > pages). We currently free the chain once after ginScanToDelete()
> > > returns in ginVacuumPostingTree(), which matches the natural lifetime
> > > boundary
> > I proposed not freeing child when child iteration is complete. They
> > indeed can be reused. I proposed cleaning children when "my" iteration
> > is complete. At that time all the children iterations are completed
> > and not needed when we return level up.
> This is not clear for me.  We need stack items to keep track of left
> pages until we scan the whole posting tree.  After scanning the whole
> posting tree we can free stack items as we do now.

Hi, Alexander!
You are right, that we can free all posting tree stack items after the
whole tree, as we do now. But I think we can also do it earlier. It
looks like all "children" items are needed and could be reused only
until iteration on "my" level ends. When function returns up the
recursion "my" level becomes "child" for a caller, and previous
"child" is not used anymore.

It's optional, maybe we can do freeing at the end of posting tree iteration.

Regards,
Pavel Borisov





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-10 09:19   ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
  2026-03-10 09:28     ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-11 22:22       ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-11 22:35         ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
@ 2026-03-11 22:37           ` Alexander Korotkov <[email protected]>
  2026-03-16 08:31             ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Alexander Korotkov @ 2026-03-11 22:37 UTC (permalink / raw)
  To: Pavel Borisov <[email protected]>; +Cc: Xuneng Zhou <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

On Thu, Mar 12, 2026 at 12:35 AM Pavel Borisov <[email protected]> wrote:
>
> On Thu, 12 Mar 2026 at 02:22, Alexander Korotkov <[email protected]> wrote:
> >
> > On Tue, Mar 10, 2026 at 11:29 AM Pavel Borisov <[email protected]> wrote:
> > > Hi, Xuneng
> > >
> > > > > Is it worth/possible in recursive calls of ginScanToDelete() to free
> > > > > allocated myStackItem->child after processing all children of the
> > > > > current level, when they are not needed anymore?
> > > > > Previously to this patch, palloc-ed "me" variable also was't freed at
> > > > > recursion levels.
> > > >
> > > > Freeing/reallocating it per subtree would add churn and make the
> > > > lifetime rules harder to reason about without meaningful memory
> > > > savings (the number of nodes is bounded by tree depth, not number of
> > > > pages). We currently free the chain once after ginScanToDelete()
> > > > returns in ginVacuumPostingTree(), which matches the natural lifetime
> > > > boundary
> > > I proposed not freeing child when child iteration is complete. They
> > > indeed can be reused. I proposed cleaning children when "my" iteration
> > > is complete. At that time all the children iterations are completed
> > > and not needed when we return level up.
> > This is not clear for me.  We need stack items to keep track of left
> > pages until we scan the whole posting tree.  After scanning the whole
> > posting tree we can free stack items as we do now.
>
> You are right, that we can free all posting tree stack items after the
> whole tree, as we do now. But I think we can also do it earlier. It
> looks like all "children" items are needed and could be reused only
> until iteration on "my" level ends. When function returns up the
> recursion "my" level becomes "child" for a caller, and previous
> "child" is not used anymore.

No matter how many levels we can go up, we can still descend and need
the leftBuffer stored at any stack level.

------
Regards,
Alexander Korotkov
Supabase





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-10 09:19   ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
  2026-03-10 09:28     ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-11 22:22       ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-11 22:35         ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-11 22:37           ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
@ 2026-03-16 08:31             ` Xuneng Zhou <[email protected]>
  0 siblings, 0 replies; 13+ messages in thread

From: Xuneng Zhou @ 2026-03-16 08:31 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; +Cc: Pavel Borisov <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

On Thu, Mar 12, 2026 at 6:37 AM Alexander Korotkov <[email protected]> wrote:
>
> On Thu, Mar 12, 2026 at 12:35 AM Pavel Borisov <[email protected]> wrote:
> >
> > On Thu, 12 Mar 2026 at 02:22, Alexander Korotkov <[email protected]> wrote:
> > >
> > > On Tue, Mar 10, 2026 at 11:29 AM Pavel Borisov <[email protected]> wrote:
> > > > Hi, Xuneng
> > > >
> > > > > > Is it worth/possible in recursive calls of ginScanToDelete() to free
> > > > > > allocated myStackItem->child after processing all children of the
> > > > > > current level, when they are not needed anymore?
> > > > > > Previously to this patch, palloc-ed "me" variable also was't freed at
> > > > > > recursion levels.
> > > > >
> > > > > Freeing/reallocating it per subtree would add churn and make the
> > > > > lifetime rules harder to reason about without meaningful memory
> > > > > savings (the number of nodes is bounded by tree depth, not number of
> > > > > pages). We currently free the chain once after ginScanToDelete()
> > > > > returns in ginVacuumPostingTree(), which matches the natural lifetime
> > > > > boundary
> > > > I proposed not freeing child when child iteration is complete. They
> > > > indeed can be reused. I proposed cleaning children when "my" iteration
> > > > is complete. At that time all the children iterations are completed
> > > > and not needed when we return level up.
> > > This is not clear for me.  We need stack items to keep track of left
> > > pages until we scan the whole posting tree.  After scanning the whole
> > > posting tree we can free stack items as we do now.
> >
> > You are right, that we can free all posting tree stack items after the
> > whole tree, as we do now. But I think we can also do it earlier. It
> > looks like all "children" items are needed and could be reused only
> > until iteration on "my" level ends. When function returns up the
> > recursion "my" level becomes "child" for a caller, and previous
> > "child" is not used anymore.
>
> No matter how many levels we can go up, we can still descend and need
> the leftBuffer stored at any stack level.
>

Yeah, the important point is that a stack item here represents
per-depth scan state, not just one recursive invocation. Returning
from one subtree does not necessarily mean that depth is finished
globally: the caller may move to a sibling and descend again, and that
later descent can still need the child level's saved leftBuffer from
the subtree we just finished. The stronger condition is that no more
pages remain to be scanned to the right at that depth; the code
already uses GinPageRightMost(page) for that when releasing the child
level's leftBuffer.

-- 
Best,
Xuneng





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-10 09:19   ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
@ 2026-03-11 22:51     ` Alexander Korotkov <[email protected]>
  2026-03-12 08:41       ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  1 sibling, 1 reply; 13+ messages in thread

From: Alexander Korotkov @ 2026-03-11 22:51 UTC (permalink / raw)
  To: Xuneng Zhou <[email protected]>; +Cc: Pavel Borisov <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

On Tue, Mar 10, 2026 at 11:19 AM Xuneng Zhou <[email protected]> wrote:
> On Fri, Mar 6, 2026 at 10:45 PM Pavel Borisov <[email protected]> wrote:
> >
> > Hi, Andres and Alexander!
> >
> > On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <[email protected]> wrote:
> > >
> > > On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <[email protected]> wrote:
> > > >
> > > > On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <[email protected]> wrote:
> > > > >
> > > > > While looking at converting more places to UnlockReleaseBuffer(), in the
> > > > > course of making UnlockReleaseBuffer() faster than the two separate
> > > > > operations, I found this code:
> > > > >
> > > > > static bool
> > > > > ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
> > > > >                                 DataPageDeleteStack *parent, OffsetNumber myoff)
> > > > > ...
> > > > >
> > > > >         if (!meDelete)
> > > > >         {
> > > > >                 if (BufferIsValid(me->leftBuffer))
> > > > >                         UnlockReleaseBuffer(me->leftBuffer);
> > > > >                 me->leftBuffer = buffer;
> > > > >         }
> > > > >         else
> > > > >         {
> > > > >                 if (!isRoot)
> > > > >                         LockBuffer(buffer, GIN_UNLOCK);
> > > > >
> > > > >                 ReleaseBuffer(buffer);
> > > > >         }
> > > > >
> > > > >         if (isRoot)
> > > > >                 ReleaseBuffer(buffer);
> > > > >
> > > > >
> > > > > Which sure looks like it'd release buffer twice if isRoot is set?  I guess
> > > > > that's not reachable, because presumably the root page will always go down the
> > > > > !meDelete path. But it sure made me wonder if there's a hard to reach bug.
> > > >
> > > > Yes, it's not possible to have meDelete set for root, because
> > > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > > > handling meDelete case should better do Assert(!isRoot).
> > > > > This code was introduced in
> > > > >   commit e14641197a5
> > > > >   Author: Alexander Korotkov <[email protected]>
> > > > >   Date:   2019-11-19 23:07:36 +0300
> > > > >
> > > > >       Fix deadlock between ginDeletePage() and ginStepRight()
> > > > >
> > > > > I didn't trace it further to see if it existed before that in some fashion.
> > > >
> > > > Yes.  I think generally this area needs to be reworked to become more
> > > > clear, and have vast more comments.  It was wrong from my side trying
> > > > to fix bugs there without reworking it into something more
> > > > appropriate.  I'm planning to put work on this during this week.
> > > >
> > > > > There's another oddity here: ginScanToDelete() requires that the root page has
> > > > > been locked by the caller already, but will afaict re-read the root page? But
> > > > > then have code to avoid locking it again, because that would not have worked?
> > > > > Seems odd.
> > > >
> > > >
> > > > It seems a bit odd for me that caller already have locked buffer, but
> > > > passes BlockNumber making us re-read the buffer.  But I'm not sure
> > > > that's the same as your point.  Could you, please, elaborate more on
> > > > this?
> > >
> > > Here is the refactoring patch.  Sorry for the delay.
> >
> > Hi, Andres and Alexander!
> >
> > I've looked into the patch v1.
> > Overall, it looks good to me.
>
> The refactor LGTM in general. The buffer-ownership rewrite looks
> cleaner and safer overall.
>
> > Some thoughts:
> >
> > Is it worth/possible in recursive calls of ginScanToDelete() to free
> > allocated myStackItem->child after processing all children of the
> > current level, when they are not needed anymore?
> > Previously to this patch, palloc-ed "me" variable also was't freed at
> > recursion levels.
>
> I think freeing myStackItem->child inside recursive calls might not be
> worthwhile here. That node is intentionally reused for subsequent
> siblings at the same depth, and it carries state (leftBuffer) that can
> still be needed until the level is fully processed.
> Freeing/reallocating it per subtree would add churn and make the
> lifetime rules harder to reason about without meaningful memory
> savings (the number of nodes is bounded by tree depth, not number of
> pages). We currently free the chain once after ginScanToDelete()
> returns in ginVacuumPostingTree(), which matches the natural lifetime
> boundary
>
> > Could limiting the maximum recursion level be useful?
>
> Posting-tree depth is naturally small; a hard cap seems to add failure
> risk with little practical benefit.
>
> > In the comment to myStackItem before ginScanToDelete(), it might be
> > worth adding that after processing all pages on the current level,
> > myStackItem is not needed anymore.
> >
> > > > Yes, it's not possible to have meDelete set for root, because
> > > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > > > handling meDelete case should better do Assert(!isRoot).
> > Looks like this additional Assert is not in patch v1.
> >
> > In the root call of ginScanToDelete(gvs, &root); we can add Assert
> > checking that its return result is false:
> > -               ginScanToDelete(gvs, &root);
> > +              deleted = ginScanToDelete(gvs, &root);
> > +.             Assert(!deleted); /* Root page is never deleted */
>
> + 1,  this is a good invariant check and improves readability
>
> One minor nit for the comment:
>
> The DataPageDeleteStack.buffer field comment says "valid only while
> recursing into children"
> this is true for internal pages, but for leaf pages the buffer is
> valid until the pageWasDeleted / leftBuffer decision. The validity
> window is actually "from when the caller sets it until the
> post-recursion cleanup."

Thank you for catching this.  I decided to remove this statement from
v2.  It's hard to explain the life cycle of the buffer clearly in one
sentence.  On the other hand, it's explained in the comments of
ginScanPostingTreeToDelete().

------
Regards,
Alexander Korotkov
Supabase





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-10 09:19   ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
  2026-03-11 22:51     ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
@ 2026-03-12 08:41       ` Pavel Borisov <[email protected]>
  2026-03-12 09:05         ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Pavel Borisov @ 2026-03-12 08:41 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; +Cc: Xuneng Zhou <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

On Thu, 12 Mar 2026 at 02:52, Alexander Korotkov <[email protected]> wrote:
>
> On Tue, Mar 10, 2026 at 11:19 AM Xuneng Zhou <[email protected]> wrote:
> > On Fri, Mar 6, 2026 at 10:45 PM Pavel Borisov <[email protected]> wrote:
> > >
> > > Hi, Andres and Alexander!
> > >
> > > On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <[email protected]> wrote:
> > > >
> > > > On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <[email protected]> wrote:
> > > > >
> > > > > On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <[email protected]> wrote:
> > > > > >
> > > > > > While looking at converting more places to UnlockReleaseBuffer(), in the
> > > > > > course of making UnlockReleaseBuffer() faster than the two separate
> > > > > > operations, I found this code:
> > > > > >
> > > > > > static bool
> > > > > > ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
> > > > > >                                 DataPageDeleteStack *parent, OffsetNumber myoff)
> > > > > > ...
> > > > > >
> > > > > >         if (!meDelete)
> > > > > >         {
> > > > > >                 if (BufferIsValid(me->leftBuffer))
> > > > > >                         UnlockReleaseBuffer(me->leftBuffer);
> > > > > >                 me->leftBuffer = buffer;
> > > > > >         }
> > > > > >         else
> > > > > >         {
> > > > > >                 if (!isRoot)
> > > > > >                         LockBuffer(buffer, GIN_UNLOCK);
> > > > > >
> > > > > >                 ReleaseBuffer(buffer);
> > > > > >         }
> > > > > >
> > > > > >         if (isRoot)
> > > > > >                 ReleaseBuffer(buffer);
> > > > > >
> > > > > >
> > > > > > Which sure looks like it'd release buffer twice if isRoot is set?  I guess
> > > > > > that's not reachable, because presumably the root page will always go down the
> > > > > > !meDelete path. But it sure made me wonder if there's a hard to reach bug.
> > > > >
> > > > > Yes, it's not possible to have meDelete set for root, because
> > > > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > > > > handling meDelete case should better do Assert(!isRoot).
> > > > > > This code was introduced in
> > > > > >   commit e14641197a5
> > > > > >   Author: Alexander Korotkov <[email protected]>
> > > > > >   Date:   2019-11-19 23:07:36 +0300
> > > > > >
> > > > > >       Fix deadlock between ginDeletePage() and ginStepRight()
> > > > > >
> > > > > > I didn't trace it further to see if it existed before that in some fashion.
> > > > >
> > > > > Yes.  I think generally this area needs to be reworked to become more
> > > > > clear, and have vast more comments.  It was wrong from my side trying
> > > > > to fix bugs there without reworking it into something more
> > > > > appropriate.  I'm planning to put work on this during this week.
> > > > >
> > > > > > There's another oddity here: ginScanToDelete() requires that the root page has
> > > > > > been locked by the caller already, but will afaict re-read the root page? But
> > > > > > then have code to avoid locking it again, because that would not have worked?
> > > > > > Seems odd.
> > > > >
> > > > >
> > > > > It seems a bit odd for me that caller already have locked buffer, but
> > > > > passes BlockNumber making us re-read the buffer.  But I'm not sure
> > > > > that's the same as your point.  Could you, please, elaborate more on
> > > > > this?
> > > >
> > > > Here is the refactoring patch.  Sorry for the delay.
> > >
> > > Hi, Andres and Alexander!
> > >
> > > I've looked into the patch v1.
> > > Overall, it looks good to me.
> >
> > The refactor LGTM in general. The buffer-ownership rewrite looks
> > cleaner and safer overall.
> >
> > > Some thoughts:
> > >
> > > Is it worth/possible in recursive calls of ginScanToDelete() to free
> > > allocated myStackItem->child after processing all children of the
> > > current level, when they are not needed anymore?
> > > Previously to this patch, palloc-ed "me" variable also was't freed at
> > > recursion levels.
> >
> > I think freeing myStackItem->child inside recursive calls might not be
> > worthwhile here. That node is intentionally reused for subsequent
> > siblings at the same depth, and it carries state (leftBuffer) that can
> > still be needed until the level is fully processed.
> > Freeing/reallocating it per subtree would add churn and make the
> > lifetime rules harder to reason about without meaningful memory
> > savings (the number of nodes is bounded by tree depth, not number of
> > pages). We currently free the chain once after ginScanToDelete()
> > returns in ginVacuumPostingTree(), which matches the natural lifetime
> > boundary
> >
> > > Could limiting the maximum recursion level be useful?
> >
> > Posting-tree depth is naturally small; a hard cap seems to add failure
> > risk with little practical benefit.
> >
> > > In the comment to myStackItem before ginScanToDelete(), it might be
> > > worth adding that after processing all pages on the current level,
> > > myStackItem is not needed anymore.
> > >
> > > > > Yes, it's not possible to have meDelete set for root, because
> > > > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > > > > handling meDelete case should better do Assert(!isRoot).
> > > Looks like this additional Assert is not in patch v1.
> > >
> > > In the root call of ginScanToDelete(gvs, &root); we can add Assert
> > > checking that its return result is false:
> > > -               ginScanToDelete(gvs, &root);
> > > +              deleted = ginScanToDelete(gvs, &root);
> > > +.             Assert(!deleted); /* Root page is never deleted */
> >
> > + 1,  this is a good invariant check and improves readability
> >
> > One minor nit for the comment:
> >
> > The DataPageDeleteStack.buffer field comment says "valid only while
> > recursing into children"
> > this is true for internal pages, but for leaf pages the buffer is
> > valid until the pageWasDeleted / leftBuffer decision. The validity
> > window is actually "from when the caller sets it until the
> > post-recursion cleanup."
>
> Thank you for catching this.  I decided to remove this statement from
> v2.  It's hard to explain the life cycle of the buffer clearly in one
> sentence.  On the other hand, it's explained in the comments of
> ginScanPostingTreeToDelete().

Hi, Alexander!
Patch v2 looks good to me. I also agree with Xuneng that this
refactoring improved the logic to make it look clearer.
Thank you for the explanation of buffers lifetime!

Regards,
Pavel Borisov





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-10 09:19   ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
  2026-03-11 22:51     ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-12 08:41       ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
@ 2026-03-12 09:05         ` Alexander Korotkov <[email protected]>
  2026-03-12 10:12           ` Re:Re: Odd code around ginScanToDelete jinbinge <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Alexander Korotkov @ 2026-03-12 09:05 UTC (permalink / raw)
  To: Pavel Borisov <[email protected]>; +Cc: Xuneng Zhou <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

Hi, Pavel!

On Thu, Mar 12, 2026 at 10:41 AM Pavel Borisov <[email protected]> wrote:
>
> On Thu, 12 Mar 2026 at 02:52, Alexander Korotkov <[email protected]> wrote:
> >
> > On Tue, Mar 10, 2026 at 11:19 AM Xuneng Zhou <[email protected]> wrote:
> > > On Fri, Mar 6, 2026 at 10:45 PM Pavel Borisov <[email protected]> wrote:
> > > >
> > > > Hi, Andres and Alexander!
> > > >
> > > > On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <[email protected]> wrote:
> > > > >
> > > > > On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <[email protected]> wrote:
> > > > > > >
> > > > > > > While looking at converting more places to UnlockReleaseBuffer(), in the
> > > > > > > course of making UnlockReleaseBuffer() faster than the two separate
> > > > > > > operations, I found this code:
> > > > > > >
> > > > > > > static bool
> > > > > > > ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
> > > > > > >                                 DataPageDeleteStack *parent, OffsetNumber myoff)
> > > > > > > ...
> > > > > > >
> > > > > > >         if (!meDelete)
> > > > > > >         {
> > > > > > >                 if (BufferIsValid(me->leftBuffer))
> > > > > > >                         UnlockReleaseBuffer(me->leftBuffer);
> > > > > > >                 me->leftBuffer = buffer;
> > > > > > >         }
> > > > > > >         else
> > > > > > >         {
> > > > > > >                 if (!isRoot)
> > > > > > >                         LockBuffer(buffer, GIN_UNLOCK);
> > > > > > >
> > > > > > >                 ReleaseBuffer(buffer);
> > > > > > >         }
> > > > > > >
> > > > > > >         if (isRoot)
> > > > > > >                 ReleaseBuffer(buffer);
> > > > > > >
> > > > > > >
> > > > > > > Which sure looks like it'd release buffer twice if isRoot is set?  I guess
> > > > > > > that's not reachable, because presumably the root page will always go down the
> > > > > > > !meDelete path. But it sure made me wonder if there's a hard to reach bug.
> > > > > >
> > > > > > Yes, it's not possible to have meDelete set for root, because
> > > > > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > > > > > handling meDelete case should better do Assert(!isRoot).
> > > > > > > This code was introduced in
> > > > > > >   commit e14641197a5
> > > > > > >   Author: Alexander Korotkov <[email protected]>
> > > > > > >   Date:   2019-11-19 23:07:36 +0300
> > > > > > >
> > > > > > >       Fix deadlock between ginDeletePage() and ginStepRight()
> > > > > > >
> > > > > > > I didn't trace it further to see if it existed before that in some fashion.
> > > > > >
> > > > > > Yes.  I think generally this area needs to be reworked to become more
> > > > > > clear, and have vast more comments.  It was wrong from my side trying
> > > > > > to fix bugs there without reworking it into something more
> > > > > > appropriate.  I'm planning to put work on this during this week.
> > > > > >
> > > > > > > There's another oddity here: ginScanToDelete() requires that the root page has
> > > > > > > been locked by the caller already, but will afaict re-read the root page? But
> > > > > > > then have code to avoid locking it again, because that would not have worked?
> > > > > > > Seems odd.
> > > > > >
> > > > > >
> > > > > > It seems a bit odd for me that caller already have locked buffer, but
> > > > > > passes BlockNumber making us re-read the buffer.  But I'm not sure
> > > > > > that's the same as your point.  Could you, please, elaborate more on
> > > > > > this?
> > > > >
> > > > > Here is the refactoring patch.  Sorry for the delay.
> > > >
> > > > Hi, Andres and Alexander!
> > > >
> > > > I've looked into the patch v1.
> > > > Overall, it looks good to me.
> > >
> > > The refactor LGTM in general. The buffer-ownership rewrite looks
> > > cleaner and safer overall.
> > >
> > > > Some thoughts:
> > > >
> > > > Is it worth/possible in recursive calls of ginScanToDelete() to free
> > > > allocated myStackItem->child after processing all children of the
> > > > current level, when they are not needed anymore?
> > > > Previously to this patch, palloc-ed "me" variable also was't freed at
> > > > recursion levels.
> > >
> > > I think freeing myStackItem->child inside recursive calls might not be
> > > worthwhile here. That node is intentionally reused for subsequent
> > > siblings at the same depth, and it carries state (leftBuffer) that can
> > > still be needed until the level is fully processed.
> > > Freeing/reallocating it per subtree would add churn and make the
> > > lifetime rules harder to reason about without meaningful memory
> > > savings (the number of nodes is bounded by tree depth, not number of
> > > pages). We currently free the chain once after ginScanToDelete()
> > > returns in ginVacuumPostingTree(), which matches the natural lifetime
> > > boundary
> > >
> > > > Could limiting the maximum recursion level be useful?
> > >
> > > Posting-tree depth is naturally small; a hard cap seems to add failure
> > > risk with little practical benefit.
> > >
> > > > In the comment to myStackItem before ginScanToDelete(), it might be
> > > > worth adding that after processing all pages on the current level,
> > > > myStackItem is not needed anymore.
> > > >
> > > > > > Yes, it's not possible to have meDelete set for root, because
> > > > > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > > > > > handling meDelete case should better do Assert(!isRoot).
> > > > Looks like this additional Assert is not in patch v1.
> > > >
> > > > In the root call of ginScanToDelete(gvs, &root); we can add Assert
> > > > checking that its return result is false:
> > > > -               ginScanToDelete(gvs, &root);
> > > > +              deleted = ginScanToDelete(gvs, &root);
> > > > +.             Assert(!deleted); /* Root page is never deleted */
> > >
> > > + 1,  this is a good invariant check and improves readability
> > >
> > > One minor nit for the comment:
> > >
> > > The DataPageDeleteStack.buffer field comment says "valid only while
> > > recursing into children"
> > > this is true for internal pages, but for leaf pages the buffer is
> > > valid until the pageWasDeleted / leftBuffer decision. The validity
> > > window is actually "from when the caller sets it until the
> > > post-recursion cleanup."
> >
> > Thank you for catching this.  I decided to remove this statement from
> > v2.  It's hard to explain the life cycle of the buffer clearly in one
> > sentence.  On the other hand, it's explained in the comments of
> > ginScanPostingTreeToDelete().
>
> Patch v2 looks good to me. I also agree with Xuneng that this
> refactoring improved the logic to make it look clearer.
> Thank you for the explanation of buffers lifetime!

Thank you for your feedback. I'll push the patch if no objections.

------
Regards,
Alexander Korotkov
Supabase





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re:Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-10 09:19   ` Re: Odd code around ginScanToDelete Xuneng Zhou <[email protected]>
  2026-03-11 22:51     ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-12 08:41       ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
  2026-03-12 09:05         ` Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
@ 2026-03-12 10:12           ` jinbinge <[email protected]>
  0 siblings, 0 replies; 13+ messages in thread

From: jinbinge @ 2026-03-12 10:12 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; [email protected]; +Cc: Xuneng Zhou <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

At 2026-03-12 17:05:51, "Alexander Korotkov" <[email protected]> wrote:
>Hi, Pavel!
>
>On Thu, Mar 12, 2026 at 10:41 AM Pavel Borisov <[email protected]> wrote:
>>
>> On Thu, 12 Mar 2026 at 02:52, Alexander Korotkov <[email protected]> wrote:
>> >
>> > On Tue, Mar 10, 2026 at 11:19 AM Xuneng Zhou <[email protected]> wrote:
>> > > On Fri, Mar 6, 2026 at 10:45 PM Pavel Borisov <[email protected]> wrote:
>> > > >
>> > > > Hi, Andres and Alexander!
>> > > >
>> > > > On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <[email protected]> wrote:
>> > > > >
>> > > > > On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <[email protected]> wrote:
>> > > > > >
>> > > > > > On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <[email protected]> wrote:
>> > > > > > >
>> > > > > > > While looking at converting more places to UnlockReleaseBuffer(), in the
>> > > > > > > course of making UnlockReleaseBuffer() faster than the two separate
>> > > > > > > operations, I found this code:
>> > > > > > >
>> > > > > > > static bool
>> > > > > > > ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
>> > > > > > >                                 DataPageDeleteStack *parent, OffsetNumber myoff)
>> > > > > > > ...
>> > > > > > >
>> > > > > > >         if (!meDelete)
>> > > > > > >         {
>> > > > > > >                 if (BufferIsValid(me->leftBuffer))
>> > > > > > >                         UnlockReleaseBuffer(me->leftBuffer);
>> > > > > > >                 me->leftBuffer = buffer;
>> > > > > > >         }
>> > > > > > >         else
>> > > > > > >         {
>> > > > > > >                 if (!isRoot)
>> > > > > > >                         LockBuffer(buffer, GIN_UNLOCK);
>> > > > > > >
>> > > > > > >                 ReleaseBuffer(buffer);
>> > > > > > >         }
>> > > > > > >
>> > > > > > >         if (isRoot)
>> > > > > > >                 ReleaseBuffer(buffer);
>> > > > > > >
>> > > > > > >
>> > > > > > > Which sure looks like it'd release buffer twice if isRoot is set?  I guess
>> > > > > > > that's not reachable, because presumably the root page will always go down the
>> > > > > > > !meDelete path. But it sure made me wonder if there's a hard to reach bug.
>> > > > > >
>> > > > > > Yes, it's not possible to have meDelete set for root, because
>> > > > > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
>> > > > > > handling meDelete case should better do Assert(!isRoot).
>> > > > > > > This code was introduced in
>> > > > > > >   commit e14641197a5
>> > > > > > >   Author: Alexander Korotkov <[email protected]>
>> > > > > > >   Date:   2019-11-19 23:07:36 +0300
>> > > > > > >
>> > > > > > >       Fix deadlock between ginDeletePage() and ginStepRight()
>> > > > > > >
>> > > > > > > I didn't trace it further to see if it existed before that in some fashion.
>> > > > > >
>> > > > > > Yes.  I think generally this area needs to be reworked to become more
>> > > > > > clear, and have vast more comments.  It was wrong from my side trying
>> > > > > > to fix bugs there without reworking it into something more
>> > > > > > appropriate.  I'm planning to put work on this during this week.
>> > > > > >
>> > > > > > > There's another oddity here: ginScanToDelete() requires that the root page has
>> > > > > > > been locked by the caller already, but will afaict re-read the root page? But
>> > > > > > > then have code to avoid locking it again, because that would not have worked?
>> > > > > > > Seems odd.
>> > > > > >
>> > > > > >
>> > > > > > It seems a bit odd for me that caller already have locked buffer, but
>> > > > > > passes BlockNumber making us re-read the buffer.  But I'm not sure
>> > > > > > that's the same as your point.  Could you, please, elaborate more on
>> > > > > > this?
>> > > > >
>> > > > > Here is the refactoring patch.  Sorry for the delay.
>> > > >
>> > > > Hi, Andres and Alexander!
>> > > >
>> > > > I've looked into the patch v1.
>> > > > Overall, it looks good to me.
>> > >
>> > > The refactor LGTM in general. The buffer-ownership rewrite looks
>> > > cleaner and safer overall.
>> > >
>> > > > Some thoughts:
>> > > >
>> > > > Is it worth/possible in recursive calls of ginScanToDelete() to free
>> > > > allocated myStackItem->child after processing all children of the
>> > > > current level, when they are not needed anymore?
>> > > > Previously to this patch, palloc-ed "me" variable also was't freed at
>> > > > recursion levels.
>> > >
>> > > I think freeing myStackItem->child inside recursive calls might not be
>> > > worthwhile here. That node is intentionally reused for subsequent
>> > > siblings at the same depth, and it carries state (leftBuffer) that can
>> > > still be needed until the level is fully processed.
>> > > Freeing/reallocating it per subtree would add churn and make the
>> > > lifetime rules harder to reason about without meaningful memory
>> > > savings (the number of nodes is bounded by tree depth, not number of
>> > > pages). We currently free the chain once after ginScanToDelete()
>> > > returns in ginVacuumPostingTree(), which matches the natural lifetime
>> > > boundary
>> > >
>> > > > Could limiting the maximum recursion level be useful?
>> > >
>> > > Posting-tree depth is naturally small; a hard cap seems to add failure
>> > > risk with little practical benefit.
>> > >
>> > > > In the comment to myStackItem before ginScanToDelete(), it might be
>> > > > worth adding that after processing all pages on the current level,
>> > > > myStackItem is not needed anymore.
>> > > >
>> > > > > > Yes, it's not possible to have meDelete set for root, because
>> > > > > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
>> > > > > > handling meDelete case should better do Assert(!isRoot).
>> > > > Looks like this additional Assert is not in patch v1.
>> > > >
>> > > > In the root call of ginScanToDelete(gvs, &root); we can add Assert
>> > > > checking that its return result is false:
>> > > > -               ginScanToDelete(gvs, &root);
>> > > > +              deleted = ginScanToDelete(gvs, &root);
>> > > > +.             Assert(!deleted); /* Root page is never deleted */
>> > >
>> > > + 1,  this is a good invariant check and improves readability
>> > >
>> > > One minor nit for the comment:
>> > >
>> > > The DataPageDeleteStack.buffer field comment says "valid only while
>> > > recursing into children"
>> > > this is true for internal pages, but for leaf pages the buffer is
>> > > valid until the pageWasDeleted / leftBuffer decision. The validity
>> > > window is actually "from when the caller sets it until the
>> > > post-recursion cleanup."
>> >
>> > Thank you for catching this.  I decided to remove this statement from
>> > v2.  It's hard to explain the life cycle of the buffer clearly in one
>> > sentence.  On the other hand, it's explained in the comments of
>> > ginScanPostingTreeToDelete().
>>
>> Patch v2 looks good to me. I also agree with Xuneng that this
>> refactoring improved the logic to make it look clearer.
>> Thank you for the explanation of buffers lifetime!
>
>Thank you for your feedback. I'll push the patch if no objections.
>
>------
>Regards,
>Alexander Korotkov
>Supabase

>




Hi,


Overall solid to me. I got a nitpick:
in the code comments, ginDeletePage should be ginDeletePostingPag(ginDeletePage -> ginDeletePostingPage).
There are 3 places that need to be revised.
------
Regards,
jinbinge

^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Odd code around ginScanToDelete
  2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
  2026-03-06 14:45 ` Re: Odd code around ginScanToDelete Pavel Borisov <[email protected]>
@ 2026-03-11 22:41   ` Alexander Korotkov <[email protected]>
  1 sibling, 0 replies; 13+ messages in thread

From: Alexander Korotkov @ 2026-03-11 22:41 UTC (permalink / raw)
  To: Pavel Borisov <[email protected]>; +Cc: Andres Freund <[email protected]>; pgsql-hackers

Hi, Pavel!

Thank you for your review!

On Fri, Mar 6, 2026 at 4:45 PM Pavel Borisov <[email protected]> wrote:
> Some thoughts:
>
> Is it worth/possible in recursive calls of ginScanToDelete() to free
> allocated myStackItem->child after processing all children of the
> current level, when they are not needed anymore?
> Previously to this patch, palloc-ed "me" variable also was't freed at
> recursion levels.
>
> Could limiting the maximum recursion level be useful?
>
> In the comment to myStackItem before ginScanToDelete(), it might be
> worth adding that after processing all pages on the current level,
> myStackItem is not needed anymore.

Already answered in this thread.

> > > Yes, it's not possible to have meDelete set for root, because
> > > me->leftBuffer is always InvalidBuffer for the root.  So the branch
> > > handling meDelete case should better do Assert(!isRoot).
> Looks like this additional Assert is not in patch v1.
>
> In the root call of ginScanToDelete(gvs, &root); we can add Assert
> checking that its return result is false:
> -               ginScanToDelete(gvs, &root);
> +              deleted = ginScanToDelete(gvs, &root);
> +.             Assert(!deleted); /* Root page is never deleted */

Done.

> Additionally, it could be good to rename all vacuum functions related
> to posting tree pages only, to include "Posting" (e.g., ginDeletePage
> -> ginDeletePostingPage). The same is for the functions only for the
> entry tree. It is already named this way in many places (e.g.
> ginVacuumPostingTreeLeaves). It could be good to extend this to all
> relevant functions.

Renamed as you proposed.

> Several small proposals on wording:
> "rightmost non-deleted page to its left" -> "closest non-deleted
> sibling page to its left"

I renamed that to just "left sibling".  Deleted pages are not in the
tree already.  And "the rightmost page to its left" is just left
sibling.

> "each entry tracks the buffer of the page" -> "each entry tracks the
> buffers of the page" (as two buffers are mentioned)

I prefer to just use word "buffer" twice to make it more explicit.

> "must already be pinned" -> "must already have been pinned"

Done.

------
Regards,
Alexander Korotkov
Supabase


Attachments:

  [application/octet-stream] v2-0001-Rework-ginScanToDelete-to-pass-Buffers-instead-of.patch (11.1K, 2-v2-0001-Rework-ginScanToDelete-to-pass-Buffers-instead-of.patch)
  download | inline diff:
From 801286e1596c59f7267a05fbf79507739c1fd0f0 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <[email protected]>
Date: Sat, 21 Feb 2026 11:08:08 +0200
Subject: [PATCH v2] Rework ginScanToDelete() to pass Buffers instead of
 BlockNumbers.

Previously, ginScanToDelete() and ginDeletePage() passed BlockNumbers and
re-read pages that were already pinned and locked during the tree walk.  The
caller ginVacuumPostingTree()) held a cleanup-locked root buffer, yet
ginScanToDelete() re-read it by block number with special-case code to skip
re-locking.

Rework both functions to pass Buffers directly.  DataPageDeleteStack now
carries buffer, myoff (downlink offset in parent), and isRoot per level,
so ginScanToDelete() takes only GinVacuumState and DataPageDeleteStack
pointers.  Also, ginDeletePage() receives the three Buffers directly, and
no longer reads or releases them itself.  The caller reads and locks child
pages before recursing, and manages buffer lifecycle afterward.

This eliminates the confusing isRoot special cases in buffer management,
including the apparent (but unreachable) double release of the root
buffer identified by Andres Freund.

Add comments explaining the locking protocol and the DataPageDeleteStack
structure.

Reported-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/utrlxij43fbguzw4kldte2spc4btoldizutcqyrfakqnbrp3ir@ph3sphpj4asz
---
 src/backend/access/gin/ginvacuum.c | 190 +++++++++++++++++------------
 1 file changed, 113 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index c9f143f6c31..17674d7e03d 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -110,31 +110,51 @@ xlogVacuumPage(Relation index, Buffer buffer)
 }
 
 
+/*
+ * Stack entry used during posting tree empty-page deletion scan.
+ *
+ * One DataPageDeleteStack entry is allocated per tree level.  As
+ * ginScanToDelete() recurses down the tree, each entry tracks the buffer
+ * of the page currently being visited at that level ('buffer'), and the
+ * buffer of its left sibling ('leftBuffer').  The left page is kept pinned
+ * and exclusively locked because ginDeletePage() needs it to update the
+ * sibling chain; acquiring it later could deadlock with ginStepRight(),
+ * which locks pages left-to-right.
+ */
 typedef struct DataPageDeleteStack
 {
 	struct DataPageDeleteStack *child;
 	struct DataPageDeleteStack *parent;
 
-	BlockNumber blkno;			/* current block number */
-	Buffer		leftBuffer;		/* pinned and locked rightest non-deleted page
-								 * on left */
+	Buffer		buffer;			/* buffer for the page being visited at this
+								 * tree level */
+	Buffer		leftBuffer;		/* pinned and locked rightmost non-deleted
+								 * sibling to the left of the current page */
+	OffsetNumber myoff;			/* offset of this page's downlink in the
+								 * parent */
 	bool		isRoot;
 } DataPageDeleteStack;
 
 
 /*
  * Delete a posting tree page.
+ *
+ * Removes the page identified by dBuffer from the posting tree by updating
+ * the left sibling's rightlink (in lBuffer) to skip over the deleted page,
+ * and removing the downlink from the parent page (in pBuffer).  All three
+ * buffers must already have been pinned and exclusively locked by the caller.
+ *
+ * The buffers are NOT released nor unlocked here; the caller is responsible
+ * for this.
  */
 static void
-ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkno,
-			  BlockNumber parentBlkno, OffsetNumber myoff, bool isParentRoot)
+ginDeletePostingPage(GinVacuumState *gvs, Buffer dBuffer, Buffer lBuffer,
+					 Buffer pBuffer, OffsetNumber myoff, bool isParentRoot)
 {
-	Buffer		dBuffer;
-	Buffer		lBuffer;
-	Buffer		pBuffer;
 	Page		page,
 				parentPage;
 	BlockNumber rightlink;
+	BlockNumber deleteBlkno = BufferGetBlockNumber(dBuffer);
 
 	/*
 	 * This function MUST be called only if someone of parent pages hold
@@ -142,12 +162,6 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	 * happen in this subtree. Caller also acquires Exclusive locks on
 	 * deletable, parent and left pages.
 	 */
-	lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
-								 RBM_NORMAL, gvs->strategy);
-	dBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, deleteBlkno,
-								 RBM_NORMAL, gvs->strategy);
-	pBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, parentBlkno,
-								 RBM_NORMAL, gvs->strategy);
 
 	page = BufferGetPage(dBuffer);
 	rightlink = GinPageGetOpaque(page)->rightlink;
@@ -226,55 +240,37 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 
 	END_CRIT_SECTION();
 
-	ReleaseBuffer(pBuffer);
-	ReleaseBuffer(lBuffer);
-	ReleaseBuffer(dBuffer);
-
 	gvs->result->pages_newly_deleted++;
 	gvs->result->pages_deleted++;
 }
 
 
 /*
- * Scans posting tree and deletes empty pages.  Caller must lock root page for
- * cleanup.  During scan path from root to current page is kept exclusively
- * locked.  Also keep left page exclusively locked, because ginDeletePage()
- * needs it.  If we try to relock left page later, it could deadlock with
- * ginStepRight().
+ * Scans a posting tree and deletes empty pages.
+ *
+ * The caller must hold a cleanup lock on the root page to prevent concurrent
+ * inserts.  The entire path from the root down to the current page is kept
+ * exclusively locked throughout the scan.  The left sibling at each level is
+ * also kept locked, because ginDeletePage() needs it to update the rightlink
+ * of the left sibling; re-acquiring the left sibling lock later could
+ * deadlock with ginStepRight(), which acquires page locks left-to-right.
+ *
+ * All per-level state is carried in 'myStackItem': the buffer to process
+ * (must already be pinned and exclusively locked), the left sibling buffer,
+ * and this page's offset in the parent's downlink array.  The root entry is
+ * set up by ginVacuumPostingTree(); child entries are populated here before
+ * recursing.
+ *
+ * Returns true if the page was deleted, false otherwise.
  */
 static bool
-ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
-				DataPageDeleteStack *parent, OffsetNumber myoff)
+ginScanPostingTreeToDelete(GinVacuumState *gvs, DataPageDeleteStack *myStackItem)
 {
-	DataPageDeleteStack *me;
-	Buffer		buffer;
+	Buffer		buffer = myStackItem->buffer;
 	Page		page;
-	bool		meDelete = false;
+	bool		pageWasDeleted = false;
 	bool		isempty;
 
-	if (isRoot)
-	{
-		me = parent;
-	}
-	else
-	{
-		if (!parent->child)
-		{
-			me = palloc0_object(DataPageDeleteStack);
-			me->parent = parent;
-			parent->child = me;
-			me->leftBuffer = InvalidBuffer;
-		}
-		else
-			me = parent->child;
-	}
-
-	buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
-								RBM_NORMAL, gvs->strategy);
-
-	if (!isRoot)
-		LockBuffer(buffer, GIN_EXCLUSIVE);
-
 	page = BufferGetPage(buffer);
 
 	Assert(GinPageIsData(page));
@@ -283,19 +279,48 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 	{
 		OffsetNumber i;
 
-		me->blkno = blkno;
-		for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff; i++)
+		for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff;)
 		{
 			PostingItem *pitem = GinDataPageGetPostingItem(page, i);
+			Buffer		childBuffer;
+
+			childBuffer = ReadBufferExtended(gvs->index,
+											 MAIN_FORKNUM,
+											 PostingItemGetBlockNumber(pitem),
+											 RBM_NORMAL, gvs->strategy);
+			LockBuffer(childBuffer, GIN_EXCLUSIVE);
+
+			/* Allocate a child stack entry on first use; reuse thereafter */
+			if (!myStackItem->child)
+			{
+				myStackItem->child = palloc0_object(DataPageDeleteStack);
+				myStackItem->child->parent = myStackItem;
+				myStackItem->child->leftBuffer = InvalidBuffer;
+			}
 
-			if (ginScanToDelete(gvs, PostingItemGetBlockNumber(pitem), false, me, i))
-				i--;
+			myStackItem->child->buffer = childBuffer;
+			myStackItem->child->isRoot = false;
+			myStackItem->child->myoff = i;
+
+			/*
+			 * Recurse into child.  If the child page was deleted, its
+			 * downlink was removed from our page, so re-examine the same
+			 * offset; otherwise advance to the next downlink.
+			 */
+			if (!ginScanPostingTreeToDelete(gvs, myStackItem->child))
+				i++;
 		}
+		myStackItem->buffer = InvalidBuffer;
 
-		if (GinPageRightMost(page) && BufferIsValid(me->child->leftBuffer))
+		/*
+		 * After processing all children at this level, release the child
+		 * level's leftBuffer if we're at the rightmost page.  There is no
+		 * right sibling that could need it for deletion.
+		 */
+		if (GinPageRightMost(page) && BufferIsValid(myStackItem->child->leftBuffer))
 		{
-			UnlockReleaseBuffer(me->child->leftBuffer);
-			me->child->leftBuffer = InvalidBuffer;
+			UnlockReleaseBuffer(myStackItem->child->leftBuffer);
+			myStackItem->child->leftBuffer = InvalidBuffer;
 		}
 	}
 
@@ -306,34 +331,41 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 
 	if (isempty)
 	{
-		/* we never delete the left- or rightmost branch */
-		if (BufferIsValid(me->leftBuffer) && !GinPageRightMost(page))
+		/*
+		 * Proceed to the ginDeletePage() if that's not the leftmost or the
+		 * rightmost page.
+		 */
+		if (BufferIsValid(myStackItem->leftBuffer) && !GinPageRightMost(page))
 		{
-			Assert(!isRoot);
-			ginDeletePage(gvs, blkno, BufferGetBlockNumber(me->leftBuffer),
-						  me->parent->blkno, myoff, me->parent->isRoot);
-			meDelete = true;
+			Assert(!myStackItem->isRoot);
+			ginDeletePostingPage(gvs, buffer, myStackItem->leftBuffer,
+								 myStackItem->parent->buffer,
+								 myStackItem->myoff,
+								 myStackItem->parent->isRoot);
+			pageWasDeleted = true;
 		}
 	}
 
-	if (!meDelete)
+	if (!pageWasDeleted)
 	{
-		if (BufferIsValid(me->leftBuffer))
-			UnlockReleaseBuffer(me->leftBuffer);
-		me->leftBuffer = buffer;
+		/*
+		 * Keep this page as the new leftBuffer for this level: the next
+		 * sibling to the right might need it for deletion.  Release any
+		 * previously held left page first.
+		 */
+		if (BufferIsValid(myStackItem->leftBuffer))
+			UnlockReleaseBuffer(myStackItem->leftBuffer);
+		myStackItem->leftBuffer = buffer;
 	}
 	else
 	{
-		if (!isRoot)
-			LockBuffer(buffer, GIN_UNLOCK);
-
-		ReleaseBuffer(buffer);
+		/*
+		 * Page was deleted; release the buffer.  leftBuffer remains the same.
+		 */
+		UnlockReleaseBuffer(buffer);
 	}
 
-	if (isRoot)
-		ReleaseBuffer(buffer);
-
-	return meDelete;
+	return pageWasDeleted;
 }
 
 
@@ -417,6 +449,7 @@ ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
 		DataPageDeleteStack root,
 				   *ptr,
 				   *tmp;
+		bool		deleted PG_USED_FOR_ASSERTS_ONLY;
 
 		buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, rootBlkno,
 									RBM_NORMAL, gvs->strategy);
@@ -428,10 +461,13 @@ ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
 		LockBufferForCleanup(buffer);
 
 		memset(&root, 0, sizeof(DataPageDeleteStack));
+		root.buffer = buffer;
 		root.leftBuffer = InvalidBuffer;
+		root.myoff = InvalidOffsetNumber;
 		root.isRoot = true;
 
-		ginScanToDelete(gvs, rootBlkno, true, &root, InvalidOffsetNumber);
+		deleted = ginScanPostingTreeToDelete(gvs, &root);
+		Assert(!deleted);
 
 		ptr = root.child;
 
-- 
2.39.5 (Apple Git-154)



^ permalink  raw  reply  [nested|flat] 13+ messages in thread


end of thread, other threads:[~2026-03-16 08:31 UTC | newest]

Thread overview: 13+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-02-21 09:54 Re: Odd code around ginScanToDelete Alexander Korotkov <[email protected]>
2026-03-06 14:45 ` Pavel Borisov <[email protected]>
2026-03-10 09:19   ` Xuneng Zhou <[email protected]>
2026-03-10 09:28     ` Pavel Borisov <[email protected]>
2026-03-11 22:22       ` Alexander Korotkov <[email protected]>
2026-03-11 22:35         ` Pavel Borisov <[email protected]>
2026-03-11 22:37           ` Alexander Korotkov <[email protected]>
2026-03-16 08:31             ` Xuneng Zhou <[email protected]>
2026-03-11 22:51     ` Alexander Korotkov <[email protected]>
2026-03-12 08:41       ` Pavel Borisov <[email protected]>
2026-03-12 09:05         ` Alexander Korotkov <[email protected]>
2026-03-12 10:12           ` jinbinge <[email protected]>
2026-03-11 22:41   ` Alexander Korotkov <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox