MIME-Version: 1.0
References: <a67mvhyi2q45eg4eimhpwdg6l3s3dmpahti2svffvmvzwmss27@r4nohusvndbq>
 <c19a18fc-ef2e-4d91-b7ba-576d1891315d@vondra.me>
 <il7jtfowpatrlg33qb5plj7v7pferes4ogerq5fdczszi4kokh@sbwvb2ukfgos>
 <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me>
 <ws47e3wly6skt36b23zy5qfvcxzueo6od3uicunuodsqnxl7os@7v2qi7qkxzbz>
 <CAH2-Wzk-89uCvdJ1Q6NsM6LvDvUEt6Qy66T6A60J=D_voWxZDg@mail.gmail.com>
 <64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep>
 <CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com>
 <d2d4qofb5ajg2ftvm6h56oi4utdwpzkqfjd7z2y4vod5qaub4h@ixyotvfut3mg>
 <CAH2-WznoD7vhjZNDj-5OrLp+1fjvW6ypEUwZ1=ieadefgWaTDQ@mail.gmail.com>
 <ayjpwpm5cn6ng2bgedhz3ckbjrxocbsbywhlghwxxz2p6a5tgr@jubomhsjkvcl>
In-Reply-To: <ayjpwpm5cn6ng2bgedhz3ckbjrxocbsbywhlghwxxz2p6a5tgr@jubomhsjkvcl>
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 1 Mar 2026 15:39:55 -0500
Message-ID: 
 <CAH2-Wznxu+AFz-EBOG-XiRA_R3nXLp45NEiGSD3ebx3h=OKPAw@mail.gmail.com>
Subject: Re: index prefetching
To: Andres Freund <andres@anarazel.de>
Cc: Tomas Vondra <tomas@vondra.me>,
 Alexandre Felipe <o.alexandre.felipe@gmail.com>,
	Thomas Munro <thomas.munro@gmail.com>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>,
	Robert Haas <robertmhaas@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAH2-Wznxu%2BAFz-EBOG-XiRA_R3nXLp45NEiGSD3ebx3h%3DOKPAw%40mail.gmail.com>
Precedence: bulk

On Sat, Feb 28, 2026 at 5:08=E2=80=AFPM Andres Freund <andres@anarazel.de> =
wrote:
> The indexscan stats changes, if we want them, are also pretty verbose and
> pretty independent.

Then I'll remove them from the next revision. We can re-review that
decision at some point in the weeks ahead.

> It'd be great if the move of indexonly visibility check handling from
> nodeIndexonlyscan.c into the table AM could be split out, but it might be=
 too
> churn-y.  But getting it out of the way first would be nice.

That seems harder, but I'll see what I can do.

> This actually reminds me that it's not entirely clear to me what firstIte=
m,
> lastItem mean:
>
>         /*
>          * Matching items state for this batch.  Output by index AM for t=
able AM.
>          *
>          * The items array is always ordered in index order (ie, increasi=
ng
>          * indexoffset).  When scanning backwards it is convenient for in=
dex AMs
>          * to fill the array back-to-front, so we start at the last slot =
and fill
>          * downwards.  Hence they need a first-valid-entry and a last-val=
id-entry
>          * counter.
>          */
>         int                     firstItem;              /* first valid in=
dex in items[] */
>         int                     lastItem;               /* last valid ind=
ex in items[] */
>
> I think it should at very least state that -1 is used to indicate there a=
re no
> items.

It's a bit awkward because by definition it isn't possible for a batch
returned by amgetbatch to contain no items. amgetbatch always returns
a batch for the next index page that has at least one matching item
(or it returns NULL, indicating that the scan has completely run out
of matching items).

But I'll add a note about some index AMs needing these to be signed to
represent no-matching-items internally.

> I take it that if firstItem =3D=3D lastItem, there's exactly one item?

That's right.

> > Many months ago, this struct provided a way for index AMs to provide
> > their own state in each batch. But I ultimately concluded that the
> > overhead of that approach was a problem.
>
> With the approach that I hand waved at, namely to use one allocation for =
both
> the AM private state and the generic state, that should not be a real iss=
ue?
> There should be maybe 2-3 cheap instructions to translate?  I've expanded=
 on
> that further down.

I'll work towards that, then.

> > The names are based on the existing, similar moreLeft and moreRight
> > fields used by both nbtree and hash. I'm not attached to any of these
> > names. Feel free to suggest alternatives.
>
> knownEnd{Forward, Backward} or isEnd{Forward,Bckward} seems like it'd do =
the
> trick?

WFM.

>         /*
>          * If we are doing an index-only scan, these are the tuple storag=
e
>          * workspaces for the matching tuples (tuples referenced by items=
[]). Each
>          * is of size BLCKSZ, so it can hold as much as a full page's wor=
th of
>          * tuples.
>          */
>         char       *currTuples;         /* tuple storage for items[] */
>         BatchMatchingItem items[FLEXIBLE_ARRAY_MEMBER]; /* matching items=
 */
>
> I previously read the "these are" to reference currTuples and items, but =
I
> think the plural is actually just an anachronism, from when currTuples wa=
s
> followed by markTuples?
>
> It says that it's the tuple storage for items[], but I think that may jus=
t be
> trying to say that it's the tuple storage corresponding the items in ->it=
ems[].

You're right. Will fix.


> I agree it shouldn't be exclusively managed by heapam and that it needs t=
o be
> stored together with the batch.  But it could be a size determined at the=
 time
> of index_beginscan() as part of table_index_fetch_begin(heapRelation), e.=
g. a
> field in the returned IndexFetchTableData.
>
> I'm basically thinking that each batch would be stored like this:
>
>    [table am specific data]
>    [index am specific data]
>    struct IndexScanBatchData
>    [variable amount of items]
>
> With the offset from the *IndexScanBatchData to the index/table data stor=
ed
> inside IndexScanDesc.
>
> Accessing the heap specific portion of an IndexScanBatch could then be
> something like
>   ((char*) batch - iscan->batch_table_offset)
> (wrapped in a helper, of course)
>
> The index AM conversion could be made even cheaper with the above, becaus=
e for
> the index AM the offset would typically be known at compile time
> (i.e. something like MAXALIGN(sizeof(BTIndexBatchData))).

That seems reasonable. I'll at least work towards that for the next
revision (might make sense to produce a revision without those changes
to fix bit rot from committing earlier patches).

> Hm. It looks like the only remaining users (_bt_check_unique(),
> unique_key_recheck()) don't care about actually getting the tuple, they j=
ust
> want to check if it exists.
>
> For those the API is actually unnecessarily expensive, we don't need to r=
eturn
> the tuple, therefore don't need to create a slot.  So once we get the mai=
n
> stuff in, I think we should evolve the API to be a bit more catered
> specifically to this usecase.  I think it'll be a lot less confusing that=
 way.

Noted.

> To be clear, I was just suggesting to elide repeated VM lookups for index
> entries that all point to the same table page. Not to actually do any bat=
ching
> or such. It's just four more lines or so.

I prototyped this. You were right; it's a no-brainer. It'll definitely
be in the next version.

> > I think that it makes perfect sense that the AM table controls this.
> > After all, holding on to buffer pins exists purely for heapam's
> > benefit -- it has nothing to do with index AM considerations.
>
> I think it's right that the table AM controls *when* the pins are release=
d,
> what doesn't seem right that it knows *which* specific buffer being relea=
sed
> is the right thing and that's done.

The fact is that the "Index Locking Considerations" section from
"Chapter 63. Index Access Method Interface Definition" in the docs
describes things that only make sense when using heapam, or a table AM
very much like it. Any design derived from the existing one will
likely retain some of that.

For example, the entire concept of ambuldelete requiring a cleanup
lock is unrelated to index AM considerations. If (say) nbtree only
MVCC snapshot plain index scans and bitmap index scans, then there'd
be no need for btbulkdelete to acquire cleanup locks, and no need for
any scan to ever hold on to a buffer pin. To me it seems natural to
make such buffer pins (those that specifically exist to form an
interlock against unsafe concurrent TID recycling by VACUUM) the
responsibility of the table AM -- that's the only place where it
actually matters. That's the main reason the AM table releases the
batch's buffer pin in the current design of index prefetching (though
another important factor is nestloop anti-joins with inner index-only
scans, as I detail below).

Here are some points I imagine we agree on already:

* It is definitely a good idea to drop index page buffer pins during
or right after amgetbatch returns. Doing so avoids unintended
interactions with the read stream.

* As a practical matter it makes sense to be maximally strict, by
requiring (at least during prefetching) that the index AM immediately
releases any buffer pins it acquires (acquires from an `amgetbatch`
call originating from the read stream callback). That way it simply
isn't possible for the read stream to be affected by these buffer pins
in the first place. IOW while a paranoid policy might not be strictly
necessary, it ends up being simpler that way.

* It is necessary to avoid bulk-loading a batch's visibility
information immediately (before prefetching begins) to prevent new
regressions in things like nestloop anti-joins with an inner
index-only scan. My microbenchmarking shows this clearly.

Specifically for index-only scans, the only way to drop a batch/index
page pin at the earliest possible opportunity like this is to bulk
save visibility information for the batch's items before dropping the
index page pin. This avoids concurrent TID recycling races with VACUUM
that could otherwise lead to wrong answers. At the same time,
index-only scans get the usual strong guarantee about not holding on
to buffer pins that could adversely affect the read stream in
unpredictable ways (just like plain index scans).

Note that I'm only discussing pins held by a batch; I'm not discussing
other pin types that certain index AMs (not including nbtree) must
retain for their own reasons. To me, the "batch pin vs index AM owned
pin" distinction is important; I try to be clear about the pin type
involved when discussing these issues.

Here's where you might disagree with me, or at least where I need
greater clarity to actually understand your position:

I think it follows from what I said that some piece of code (either in
the index AM or the table AM) must be responsible for saving
visibility info for a batch using the visibility map, and only then
releasing the associated buffer pin. I chose to do this in the table
AM, partly because heapam needs the flexibility to *not* release the
batch/index buffer pin right away. ISTM that heapam needs the
authority to choose whether or not to bulk load all visibility info
from each batch immediately. This is necessary for the first batch
returned during an index-only scan with (say) a LIMIT 5, where loading
all of the visibility information immediately would significantly hurt
performance.

How should I resolve the tension among all these goals? Is your
concern that heapam knows about buffer pins specifically, as opposed
to an abstract interlock concept?

> What if the index AM is one outside of
> postgres' buffer pool?

Then it also cannot support WAL-logging, I think. Where does that
leave the LSN trick?

> That helps with multiple pins for the same buffer, but not having to pin
> multiple *different* buffers.  Sure, the index AM could just hold onto th=
e
> additional buffers until the scan ends or such, but that's not really an
> option.

Why is it not an option? Is this purely because of the potential to
disrupt the read stream's management of the backend's buffer pin
limit?

The problem with that position is that it just isn't compatible with
how some non-nbtree index AMs work and have always worked. Most
notably, the hash index AM feels entitled to hold on to up to 2 extra
buffer pins for its own reasons. If hash cannot continue doing that,
and has no reasonable alternative, doesn't that mean that hash is
fundamentally incompatible with prefetching? No matter how we design
the abstractions?

I believe hash indexes aren't generally important, so I certainly
wouldn't mind dropping hash index support for Postgres 19. Having such
strict rules about index AMs holding onto buffer pins won't affect
nbtree at all, but a maximally strict rule about it seems pretty
severe to me. Can't we just invent a way for ambeginscan to advise the
table AM that it might need (say) as many as 2 extra buffer pins?

My assumption up until now (which the docs do mention) is that it's
okay for index AMs to hold on to a small number of buffer pins for
their own purposes. Admittedly my understanding of this isn't rigorous
-- I guess I figured that some small amount of that had to be
permitted, since a number of existing index AMs fundamentally expect
it. I meant to ask you to help me put that on a more rigorous footing.

> > Again, I didn't see much success when I experimented with this
> > yesterday. But it could be a question of needing the right query to
> > see any benefit.
>
> I'd say that even if it were not to improve perf (pretty sure I saw some
> though), not introducing fields that have more complicated lifetimes is j=
ust
> good for human understanding too.

I agree that creating a new IndexScanDesc.xs_visible field for
transient state like this is ugly.

I'll incorporate changes along these lines.

Thanks again
--=20
Peter Geoghegan