Feedback-ID: id4a34324:Fastmail
Date: Tue, 10 Mar 2026 18:29:48 -0400
From: Andres Freund <andres@anarazel.de>
To: Peter Geoghegan <pg@bowt.ie>
Cc: Tomas Vondra <tomas@vondra.me>,
	Alexandre Felipe <o.alexandre.felipe@gmail.com>,
 Thomas Munro <thomas.munro@gmail.com>,
	Nazir Bilal Yavuz <byavuz81@gmail.com>, Robert Haas <robertmhaas@gmail.com>,
	Melanie Plageman <melanieplageman@gmail.com>,
 PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
	Georgios <gkokolatos@protonmail.com>,
 Konstantin Knizhnik <knizhnik@garret.ru>,
	Dilip Kumar <dilipbalaut@gmail.com>
Subject: Re: index prefetching
Message-ID: <vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx@mof5q6li3xzv>
References: <il7jtfowpatrlg33qb5plj7v7pferes4ogerq5fdczszi4kokh@sbwvb2ukfgos>
 <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me>
 <ws47e3wly6skt36b23zy5qfvcxzueo6od3uicunuodsqnxl7os@7v2qi7qkxzbz>
 <CAH2-Wzk-89uCvdJ1Q6NsM6LvDvUEt6Qy66T6A60J=D_voWxZDg@mail.gmail.com>
 <64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep>
 <CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com>
 <d2d4qofb5ajg2ftvm6h56oi4utdwpzkqfjd7z2y4vod5qaub4h@ixyotvfut3mg>
 <CAH2-WznoD7vhjZNDj-5OrLp+1fjvW6ypEUwZ1=ieadefgWaTDQ@mail.gmail.com>
 <ayjpwpm5cn6ng2bgedhz3ckbjrxocbsbywhlghwxxz2p6a5tgr@jubomhsjkvcl>
 <CAH2-Wznxu+AFz-EBOG-XiRA_R3nXLp45NEiGSD3ebx3h=OKPAw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: 
 <CAH2-Wznxu+AFz-EBOG-XiRA_R3nXLp45NEiGSD3ebx3h=OKPAw@mail.gmail.com>
Archived-At: 
 <https://www.postgresql.org/message-id/vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx%40mof5q6li3xzv>
Precedence: bulk

Hi,

On 2026-03-01 15:39:55 -0500, Peter Geoghegan wrote:
> On Sat, Feb 28, 2026 at 5:08 PM Andres Freund <andres@anarazel.de> wrote:
> > > I think that it makes perfect sense that the AM table controls this.
> > > After all, holding on to buffer pins exists purely for heapam's
> > > benefit -- it has nothing to do with index AM considerations.
> >
> > I think it's right that the table AM controls *when* the pins are released,
> > what doesn't seem right that it knows *which* specific buffer being released
> > is the right thing and that's done.
> 
> The fact is that the "Index Locking Considerations" section from
> "Chapter 63. Index Access Method Interface Definition" in the docs
> describes things that only make sense when using heapam, or a table AM
> very much like it. Any design derived from the existing one will
> likely retain some of that.
> 
> For example, the entire concept of ambuldelete requiring a cleanup
> lock is unrelated to index AM considerations. If (say) nbtree only
> MVCC snapshot plain index scans and bitmap index scans, then there'd
> be no need for btbulkdelete to acquire cleanup locks, and no need for
> any scan to ever hold on to a buffer pin.

This seems pretty unrelated to my concern.  I have a problem with the fact
that heapam knows which specific resources need to be held (&released) to
prevent concurrency issues during an index only scan.  I am *NOT* concerned
with there needing to be a pin and heapam triggering the release of that
resource.

All I am saying is that instead of

	/*
	 * It's safe to drop the batch's buffer pin as soon as we've resolved the
	 * visibility status of all of its items
	 */
	if (allbatchitemvisible && scan->MVCCScan)
	{
...
		ReleaseBuffer(batch->buf);
		batch->buf = InvalidBuffer;
	}

it should be something like

	/*
	 * It's safe to drop the batch's buffer pin as soon as we've resolved the
	 * visibility status of all of its items
	 */
	if (allbatchitemvisible && scan->MVCCScan)
	{
...
                index_batchscan_release(scan, batch);
	}

where, unfortunately, index_batchscan_release() has to call back into the
indexam to release the buffer pin.


> To me it seems natural to make such buffer pins (those that specifically
> exist to form an interlock against unsafe concurrent TID recycling by
> VACUUM) the responsibility of the table AM -- that's the only place where it
> actually matters. That's the main reason the AM table releases the batch's
> buffer pin in the current design of index prefetching (though another
> important factor is nestloop anti-joins with inner index-only scans, as I
> detail below).

To me it doesn't seem acceptable to require that index AMs need to have
exactly one buffer that needs to be pinned to prevent dangerous reuse. An
index AM should be free to

a) Prevent reuse by using a resource other than a buffer pin

   That's not possible in the patch, because we unconditionally do
   ReleaseBuffer(batch->buf)

b) Require multiple buffers to be pinned to prevent reuse

   That's not possible in the patch, because exactly one buffer is released.


> Here are some points I imagine we agree on already:
> 
> * It is definitely a good idea to drop index page buffer pins during
> or right after amgetbatch returns. Doing so avoids unintended
> interactions with the read stream.

Sure.


> * As a practical matter it makes sense to be maximally strict, by
> requiring (at least during prefetching) that the index AM immediately
> releases any buffer pins it acquires (acquires from an `amgetbatch`
> call originating from the read stream callback). That way it simply
> isn't possible for the read stream to be affected by these buffer pins
> in the first place. IOW while a paranoid policy might not be strictly
> necessary, it ends up being simpler that way.

Sure.


> * It is necessary to avoid bulk-loading a batch's visibility
> information immediately (before prefetching begins) to prevent new
> regressions in things like nestloop anti-joins with an inner
> index-only scan. My microbenchmarking shows this clearly.

It's a very plausible way, at least.  I can think of other ways to prevent
regressions, but I don't really see a reason to go for them.


> Specifically for index-only scans, the only way to drop a batch/index
> page pin at the earliest possible opportunity like this is to bulk
> save visibility information for the batch's items before dropping the
> index page pin. This avoids concurrent TID recycling races with VACUUM
> that could otherwise lead to wrong answers. At the same time,
> index-only scans get the usual strong guarantee about not holding on
> to buffer pins that could adversely affect the read stream in
> unpredictable ways (just like plain index scans).

Right.


> Here's where you might disagree with me, or at least where I need
> greater clarity to actually understand your position:
> 
> I think it follows from what I said that some piece of code (either in
> the index AM or the table AM) must be responsible for saving
> visibility info for a batch using the visibility map, and only then
> releasing the associated buffer pin. I chose to do this in the table
> AM, partly because heapam needs the flexibility to *not* release the
> batch/index buffer pin right away. ISTM that heapam needs the
> authority to choose whether or not to bulk load all visibility info
> from each batch immediately. This is necessary for the first batch
> returned during an index-only scan with (say) a LIMIT 5, where loading
> all of the visibility information immediately would significantly hurt
> performance.

Agreed.


> How should I resolve the tension among all these goals? Is your
> concern that heapam knows about buffer pins specifically, as opposed
> to an abstract interlock concept?

Yes.


> > What if the index AM is one outside of
> > postgres' buffer pool?
> 
> Then it also cannot support WAL-logging, I think. Where does that
> leave the LSN trick?

I don't think WAL-logging requires using the buffer pool.  It might have at
some point, but these days you can just create an rmgr of your own and manage
everything that way.  Yes, dealing with FPIs will be a tad more work, but it's
entirely doable.


> > That helps with multiple pins for the same buffer, but not having to pin
> > multiple *different* buffers.  Sure, the index AM could just hold onto the
> > additional buffers until the scan ends or such, but that's not really an
> > option.
> 
> Why is it not an option?

If we care about releasing one pin soon, why should we not care about
the timely release of other pins?


> Is this purely because of the potential to
> disrupt the read stream's management of the backend's buffer pin
> limit?

I'm not particularly bothered by a small number of extra buffer pins,
particularly for AMs other than nbtree. They won't cause issues in any real
world setups.  If nbtree kept a slightly larger number of pins around, I'd
expect *some* problems, due to us running our tests with very small number of
shared buffers.  Imagine the fun of occasionally running out of pins,
depending on when cache invalidations arrive and trigger catalog accesses.


Greetings,

Andres Freund