MIME-Version: 1.0
References: <CAE8JnxN_EwnTLLMWGhvgwaomYZ0ysm7NeogA-BqBd=Rs3S7Oqw@mail.gmail.com>
 <64a2re223ajj4popowsyu4xekbuvvyfwkrihn5yzyrkwsmsuvp@2lls3tpww5dl>
 <a67mvhyi2q45eg4eimhpwdg6l3s3dmpahti2svffvmvzwmss27@r4nohusvndbq>
 <c19a18fc-ef2e-4d91-b7ba-576d1891315d@vondra.me> <il7jtfowpatrlg33qb5plj7v7pferes4ogerq5fdczszi4kokh@sbwvb2ukfgos>
 <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me> <ws47e3wly6skt36b23zy5qfvcxzueo6od3uicunuodsqnxl7os@7v2qi7qkxzbz>
 <CAH2-Wzk-89uCvdJ1Q6NsM6LvDvUEt6Qy66T6A60J=D_voWxZDg@mail.gmail.com>
 <64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep>
 <CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com> <d2d4qofb5ajg2ftvm6h56oi4utdwpzkqfjd7z2y4vod5qaub4h@ixyotvfut3mg>
In-Reply-To: <d2d4qofb5ajg2ftvm6h56oi4utdwpzkqfjd7z2y4vod5qaub4h@ixyotvfut3mg>
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 28 Feb 2026 14:16:22 -0500
Message-ID: <CAH2-WznoD7vhjZNDj-5OrLp+1fjvW6ypEUwZ1=ieadefgWaTDQ@mail.gmail.com>
Subject: Re: index prefetching
To: Andres Freund <andres@anarazel.de>
Cc: Tomas Vondra <tomas@vondra.me>, Alexandre Felipe <o.alexandre.felipe@gmail.com>, 
	Thomas Munro <thomas.munro@gmail.com>, Nazir Bilal Yavuz <byavuz81@gmail.com>, 
	Robert Haas <robertmhaas@gmail.com>, Melanie Plageman <melanieplageman@gmail.com>, 
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, Georgios <gkokolatos@protonmail.com>, 
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/CAH2-WznoD7vhjZNDj-5OrLp%2B1fjvW6ypEUwZ1%3DieadefgWaTDQ%40mail.gmail.com>
Precedence: bulk

On Fri, Feb 27, 2026 at 6:52=E2=80=AFPM Andres Freund <andres@anarazel.de> =
wrote:
> I know you just moved this code, but isn't it kinda bogus to use if
> (RelationIsPermanent()) here?  If the fake LSN is used on a durable page,=
 it
> wouldn't be safe to just use XLogAssignLSN(), as it'd not emit an FPI for=
 the
> page.
>
> All the gist callers use if (!RelationNeedsWAL()) to protect calls to thi=
s, so
> it's not a bug, but still seems somewhat odd.

Will fix.

> I'd probably not list the AMs here, because as of this commit, it'd not b=
e
> true yet.

I'll fix this too.

> Other than these points, I think it's probably worth pushing this soon?

Great. I think that it makes sense to push the first 2 patches
together. I'll work on doing that towards the end of the next week (so
around Friday March 6).

> BufferGetLSNAtomic()'s comment about not needing a lock if hint bits aren=
't
> used looks a bit bogus to me with this use-case - it makes it sound like =
we
> don't rely on the correctness of the return value if hint bits aren't
> enabled. But that's bogus, we do.  The relevant difference afaict is that
> without WAL logging of hint bits we will never set the LSN of a page that=
 is
> read locked, and therefore any caller that has at least a read lock doesn=
't
> need GetLSNAtomic() unless XLogHintBitIsNeeded().

Do you think this needs to be fixed in the first patch of this series?

> > diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtr=
ee/README
> > index 53d4a61dc..cb921ca2e 100644
> > --- a/src/backend/access/nbtree/README
> > +++ b/src/backend/access/nbtree/README
> > @@ -485,9 +485,8 @@ We handle this kill_prior_tuple race condition by h=
aving affected index
> >  scans conservatively assume that any change to the leaf page at all
> >  implies that it was reached by btbulkdelete in the interim period when=
 no
> >  buffer pin was held.  This is implemented by not setting any LP_DEAD b=
its
> > -on the leaf page at all when the page's LSN has changed.  (That won't =
work
> > -with an unlogged index, so for now we don't ever apply the "don't hold
> > -onto pin" optimization there.)
> > +on the leaf page at all when the page's LSN has changed.  (This is why=
 we
> > +implement "fake" LSNs for unlogged index relations.)
> >
>
> Does the reference to btbulkdelete actually cover all the relevant cases?
> Seems like a e.g. a concurrent _bt_delete_or_dedup_one_page() would also =
be a
> problem?

It's a bit complicated, though the exact details don't seem
particularly important to me.

Currently, it is barely possible for a query to call _bt_readpage to
read a page's items, and subsequently successfully set some LP_DEAD
bits on that same page even when a concurrent deduplication pass
occurred on the same page in the interim. In practice this can only
happen during a !dropPin nbtree index scan -- so it cannot ever happen
during most index scans. So, to answer your question: no, it isn't
strictly necessary to give up on setting LP_DEAD bits from within
_bt_killitems when there has been a concurrent
_bt_delete_or_dedup_one_page call. I think this might work unreliably,
its possibility is more an artefact of how things worked before the
dropPin behavior was added back in 2015.

The amgetbatch interface *requires* that all amgetbatch index AMs use
nbtree's dropPin behavior for all scans (though we no longer call it
dropPin; it's just how amgetbatch operates). I think that it's still
the case that we only really need to worry about concurrently TID
recycing hazards of the kind that can be avoided by requiring
btbulkdelete to acquire a cleanup lock.

> One thing that's somewhat sad about this logic is that the
>                 latestlsn =3D BufferGetLSNAtomic(buf);
>                 Assert(so->currPos.lsn <=3D latestlsn);
>                 if (so->currPos.lsn !=3D latestlsn)
>
> check will trigger even if all the changes were done due to hint bit
> FPIs. E.g. due to another backend hinting other index entries on the same=
 page
> as dirty... But anyway, that's not the "fault" of your change.

Yeah, that is suboptimal. But it's worked that way for most index
scans for over 10 years now. I complained about one aspect of how this
could be ineffective back in 2017:

https://www.postgresql.org/message-id/flat/CAH2-Wz%3DSfAKVMv1x9Jh19EJ8am8TZ=
n9f-yECipS9HrrRqSswnA%40mail.gmail.com#b20ead9675225f12b6a80e53e19eed9d

The problem is greatly ameliorated by the fact that the deletion
process opportunistically checks TIDs from index tuples not marked
LP_DEAD when it is cheap to do so in passing (when the TID points to a
heap block we need to access anyway). IIRC, when the regression tests
are run, the ntbree tuple deletion process deletes many more
non-LP_DEAD-marked index tuples than LP_DEAD marked tuples (obviously
this is workload dependent).

> This is somewhat orthogonal and just quick idea:
>
> I wonder if the worry about the overhead of BufferGetLSNAtomic() could be
> addressed by adding a ReleaseBuffer() version that returns the current LS=
N.

I'm open to anything that can help.

> Have you done any testing to see if the added BufferGetLSNAtomic() calls
> introduce a measurable overhead when using unlogged tables?

Yes. It definitely matters. That's why I've been treating that patch
of Andreas Karlsson's to make BufferGetLSNAtomic use an atomic op as a
dependency (though without posting it to this thread). My testing
seems to show that the patch almost eliminates the problem altogether.
This is particularly important for avoiding regressions in hash index
scans.

It would be good to get the BufferGetLSNAtomic atomic op patch out of
the way soon. Since it's clearly independently useful.

> If so, could the
> BufferGetLSNAtomic() be skipped if there currently the scan doesn't curre=
ntly
> have any killed items?

Not really, no. There's no way to know whether the page will have any
index tuples that can be LP_DEAD marked. We could perhaps invent
heuristics that try to predict it based on past trends, but that seems
rather unappealing. Especially considering that there are good ways of
making BufferGetLSNAtomic a lot cheaper.

> This is a huge change. Is there a chance we can break it up into more
> manageable chunks?

I'm open to any ideas about that (like the one you go on to suggest
about introducing IndexFetchHeapData.xs_blk in its own commit).

> A first note: Seems like this needs to update doc/src/sgml/indexam.sgml?

I arbitrarily decided to put all of the sgml doc updates into the
later prefetching commit. Purely because it was hard to avoid talking
about prefetching in the docs. I can put minimal sgml doc updates in
the amgetbatch commit.

> > +/*
> > + * Location of a BatchMatchingItem that appears in a IndexScanBatch re=
turned
> > + * by (and subsequently passed to) an amgetbatch routine
> > + */
> > +typedef struct BatchRingItemPos
> > +{
> > +     /* Position references a valid BatchRingBuffer.batches[] entry? *=
/
> > +     bool            valid;
> > +
> > +     /* BatchRingBuffer.batches[]-wise index to relevant IndexScanBatc=
h */
> > +     uint8           batch;
> > +
> > +     /* IndexScanBatch.items[]-wise index to relevant BatchMatchingIte=
m */
> > +     int                     item;
> > +} BatchRingItemPos;
>
> Does item actually need to be a 32bit int?  This way there's a hole and
> presumably IndexScanBatch.items[] has a limited size?

We don't allocate more than one of these BatchRingItemPos structs
anywhere, and always do so on the stack. I think that we rely on them
being signed within _bt_readpage, when it generates its return value,
here:

"""
return (newbatch->firstItem <=3D newbatch->lastItem);
"""

During a forwards scan _bt_readpage that returns no items, we'll
return false when "newbatch->lastItem =3D itemIndex - 1;" makes
newbatch->lastItem have the value -1 (note that newbatch->firstItem is
always 0 during a _bt_readpage in the forwards scan direction).

> > +/*
> > + * Matching item returned by amgetbatch (in returned IndexScanBatch) d=
uring an
> > + * index scan.  Used by table AM to locate relevant matching table tup=
le.
> > + */
> > +typedef struct BatchMatchingItem
> > +{
> > +     ItemPointerData heapTid;        /* TID of referenced heap item */
>
> If we're generalizing this, I'd be inclined to rename it away from heap t=
o
> table.

WFM.

> > +     BlockNumber currPage;           /* Index page with matching items=
 */
> > +     BlockNumber prevPage;           /* currPage's left link */
> > +     BlockNumber nextPage;           /* currPage's right link */
>
> Hm. Are these, and some of the later fields, generic enough for all index
> types? There IIRC are index types out there that don't use buffers or blo=
cks
> in that sense.
>
> I wonder if individual index AMs need a way to influence what is stored p=
er
> batch.

How strongly do you feel about it?

Many months ago, this struct provided a way for index AMs to provide
their own state in each batch. But I ultimately concluded that the
overhead of that approach was a problem.

It's not as if there's no way that the index AM can manage its own
state that describes the progress of the scan. For example, nbtree
still manages its own array key state (and must provide its own
amposreset function to provide indexbatch.c with a way to invalidate
that state, such as when the scan direction changes).

Index AMs aren't obligated to use fields like currPage and nextPage;
they are explicitly private state. So I think it's perfectly possible
for an index AM that doesn't use buffers or pages to use the
amgetbatch interface, while fully managing its own progress state. It
would be straightforward with an index type that didn't support
scanning backwards. Even if the scan did support scrolling in either
direction, it could map currPage to some kind of private state entry.

> > +     Buffer          buf;                    /* currPage buf (invalid =
means unpinned) */
> > +     XLogRecPtr      lsn;                    /* currPage's LSN */
> > +
> > +     /* scan direction when the index page was read */
> > +     ScanDirection dir;
> > +
> > +     /*
> > +      * knownEndLeft and knownEndRight are used by table AMs to track =
whether
> > +      * there may be matching index entries to the left and right of c=
urrPage,
> > +      * respectively.  This helps them to avoid repeatedly calling amg=
etbatch.
> > +      */
> > +     bool            knownEndLeft;
> > +     bool            knownEndRight;
>
> If I understand correctly, these mean that that we'd reach the end of the=
 scan
> in one direction?  If so, why does it use left/right instead of
> forward/backward?

The names are based on the existing, similar moreLeft and moreRight
fields used by both nbtree and hash. I'm not attached to any of these
names. Feel free to suggest alternatives.

> > +     /*
> > +      * Matching items state for this batch.  Output by index AM for t=
able AM.
> > +      *
> > +      * The items array is always ordered in index order (ie, increasi=
ng
> > +      * indexoffset).  When scanning backwards it is convenient for in=
dex AMs
> > +      * to fill the array back-to-front, so we start at the last slot =
and fill
> > +      * downwards.  Hence they need a first-valid-entry and a last-val=
id-entry
> > +      * counter.
> > +      */
> > +     int                     firstItem;              /* first valid in=
dex in items[] */
> > +     int                     lastItem;               /* last valid ind=
ex in items[] */
>
> I'd make these unsigned.

They must be the same type as BatchRingItemPos.item offsets. So I
could probably make them int16, but it wouldn't be easy to make them
unsigned due to the aforementioned _bt_readpage issue.

> Not sure what "indexoffset" really means in a generic abstraction.

It means an offset into the currTuples buffer at which the table AM
can find an IndexTuple to return, during an index-only scan. Can you
suggest an alternative terminology and/or presentation?

> > +     /* info about killed items if any (killedItems is NULL if never u=
sed) */
> > +     int                     numKilled;              /* number of curr=
ently stored items */
> > +     int                *killedItems;        /* indexes of killed item=
s */
>
> Signedness again.

Another case where the underlying type used must be the same type as
BatchRingItemPos.item offsets.

> I'd specify what is being indexed by killedItems.

Will fix.

> What size is killedItems?

In practice it is always IndexScanDescData.maxitemsbatch, when
actually allocated. IOW, for nbtree it is MaxTIDsPerBTreePage (1358
items), while for hash it is MaxIndexTuplesPerPage (408 items).

> I wonder if "killed" is really the right term in the generic code, it's
> unclear whether it refers to table or index items. And the past tense is =
a bit
> odd too, since nothing may yet been "killed".

Will come up with a better name for the next revision.

> > +     /*
> > +      * If we are doing an index-only scan, this array stores the per-=
item
> > +      * visibility flags BATCH_VIS_CHECKED and BATCH_VIS_ALL_VISIBLE
> > +      */
> > +     uint8      *visInfo;            /* per-item visibility flags, or =
NULL */
>
> Why just for index only?  Seems like it'd be rather useful to batch visib=
ility
> checks for plain index scans too?  Obviously that doesn't have to be
> implemented immediately, but I'd probably not preclude it in generic
> infrastructure.

Whether or not we allocate visInfo space from the generic indexbatch.c
routines is determined by IndexScanDesc.xs_want_itup, though.

> This seems pretty heap specific.  I guess you didn't put it in
> IndexFetchHeapData because it'd be somewhat complicated to per-batch stat=
e in
> there?

Yes. And because we really want to keep the visInfo space right next
to the corresponding BatchMatchingItem/items[] state. Allocating and
releasing all batch state together is important.

> Perhaps my earlier suggestion about having "private" per-batch state for =
the
> indexam is required for table AMs too.  I guess we could store an offset =
to
> the "tableam private" and "indexam private" data in each batch.

I think that would be painful. Especially if the goal is to allocate
visInfo in memory that is exclusively managed by heapam.

> > +     /*
> > +      * If we are doing an index-only scan, this is the tuple storage =
workspace
> > +      * for the matching tuples (tuples referenced by items[]).  It is=
 of size
> > +      * BLCKSZ, so it can hold as much as a full page's worth of tuple=
s.
> > +      * currTuples points into the trailing portion of this allocation=
, just
> > +      * past items[].  It is NULL for plain index scans.
> > +      */
>
> This was a bit hard to understand before, but in generic code it seems li=
ke it
> is too specific to the way it was used in nbtree.

How so?

> > +
> > +/*
> > + * Maximum number of batches (leaf pages) we can keep in memory.  We n=
eed a
> > + * minimum of two, since we'll only consider releasing one batch when =
another
> > + * is read.
> > + */
> > +#define INDEX_SCAN_MAX_BATCHES               2
> > +#define INDEX_SCAN_CACHE_BATCHES     2
>
> Hm. These limits are increased in a later patch adding prefetching to
> heapam. I guess you kept it low until then to avoid some overhead?  Perha=
ps
> that's indicative that hardcoding it isn't great?

The fact that I only increase INDEX_SCAN_MAX_BATCHES in the later
prefetching patch was a fairly arbitrary choice.

> Hm, it's somewhat confusing that we have this now, but still have
> ->index_fetch_tuple() etc.

I don't see a way around that. Not as long as there's a need to
support table_index_fetch_tuple_check, which is fundamentally about
passing a TID owned by the caller.

> Don't really have an opinion about this, but it's not clear to me why thi=
s
> patch changed this?

Really? You actually suggested it (albeit in a completely different
context). It just felt like a good idea to make the allocation
dynamic, now that I'm adding more stuff to the instrumentation struct.
I can change it back, if you prefer.

> If we want to change it, could we do that in a prereq patch?

Sure.

> This seems independent too. Let's just introduce xs_blk and do this chang=
e
> separately? The we can get it out of the way?

Will do. The next revision will put these changes into their own patch.

> > + * Note on Memory Ordering Effects
> > + * -------------------------------
> > + *
> > + * visibilitymap_get_status does not lock the visibility map buffer, a=
nd
> > + * therefore the result we read here could be slightly stale.  However=
, it
> > + * can't be stale enough to matter.
> > + *
> > + * We need to detect clearing a VM bit due to an insert right away, be=
cause
> > + * the tuple is present in the index page but not visible.  The readin=
g of the
> > + * TID by this scan (using a shared lock on the index buffer) is seria=
lized
> > + * with the insert of the TID into the index (using an exclusive lock =
on the
> > + * index buffer).  Because the VM bit is cleared before updating the i=
ndex,
> > + * and locking/unlocking of the index page acts as a full memory barri=
er, we
> > + * are sure to see the cleared bit if we see a recently-inserted TID.
>
> I don't think the index page locking on the inserter side relevant - that=
 side
> of the memory barrier paring is provided solely by visibilitymap_set(), w=
hich
> uses an exclusive lock and thus also provides a barrier (so does the heap
> buffer locking).

These comments were moved from nodeIndexScan.c.

> > + * Deletes do not update the index page (only VACUUM will clear out th=
e TID),
>
> Well, not just, right? We can only prune when inserting into the index?

Again, these are direct copies of the existing comments.

> > + * so the clearing of the VM bit by a delete is not serialized with th=
is test
> > + * below, and we may see a value that is significantly stale.  However=
, we
> > + * don't care about the delete right away, because the tuple is still =
visible
> > + * until the deleting transaction commits or the statement ends (if it=
's our
> > + * transaction).  In either case, the lock on the VM buffer will have =
been
> > + * released (acting as a write barrier) after clearing the bit.  And f=
or us to
> > + * have a snapshot that includes the deleting transaction (making the =
tuple
> > + * invisible), we must have acquired ProcArrayLock after that time, ac=
ting as
> > + * a read barrier.
> > + *
> > + * It's worth going through this complexity to avoid needing to lock t=
he VM
> > + * buffer, which could cause significant contention.
> > + */
>
> If we are concerned about read side ordering, it seems a memory barrier o=
n the
> read side would suffice (and would not cause contention).
>
> There's a much more recent memory barrier already than the ProcArrayLock
> during snapshot acquisition - the pinning of the visibilitymap is a barri=
er.
>
> Which is good, because we should really remove the ProcArrayLock acquisit=
ion
> when we are able to reuse snapshots... And it'd be nasty if that introduc=
ed a
> hard to hit memory ordering issue.

Once again, these are direct copies of the existing comments.

Can you suggest a replacement comment block that comprehensively
addresses all your concerns? I think that these are all independent
issues. As far as I know there's no reason to think that the rules in
this area have changed (or need to change) due to our
structural/layering revisions.

> > +static void
> > +heapam_batch_resolve_visibility(IndexScanDesc scan, IndexScanBatch bat=
ch,
> > +                                                             BatchRing=
ItemPos *pos)
> > +{
>
> FWIW, for me this is not inlined into the caller, which means that there'=
s a
> function call even for the case where the visibility information would al=
ready
> be cached.

> I'd at least mark this branch likely, so that the compiler can see that i=
t'd
> make sense to move this check into the callsites. At least on gcc that se=
ems
> to suffice for the hot path to not need the function call.

I'll have something like that in the next revision.

> Hm. It'll be pretty common for the block number of two consecutive items =
to be
> on the same page.  The comments here suggest that the VM lookup is a rele=
vant
> performance factor, so why not check if the last VM_ALL_VISIBLE() was for=
 the
> same block?
>
> While visibilitymap_get_status() isn't particularly expensive, it still i=
s at
> least two external function calls (visibilitymap_get_status() and
> BufferGetBlockNumber()) and some nontrivial math.
>
>
> Heh, indeed: A quick hacky implementation makes
>   SELECT count(*) FROM pgbench_accounts
> about 16% faster.

I'm aware of several ways to speed up the bulk VM lookups, but I held
off because it didn't seem particularly crucial to this patch. For
example, Matthias's bug fix for GiST index-only scans actually adds
bulk VM lookups that use SIMD instructions.

Seems like independent work? I can add a patch like this if you think
that it's important.

> > +     /*
> > +      * It's safe to drop the batch's buffer pin as soon as we've reso=
lved the
> > +      * visibility status of all of its items
> > +      */
> > +     if (allbatchitemvisible && scan->MVCCScan)
> > +     {
> > +             Assert(batch->visInfo[batch->firstItem] & BATCH_VIS_CHECK=
ED);
> > +             Assert(batch->visInfo[batch->lastItem] & BATCH_VIS_CHECKE=
D);
> > +
> > +             ReleaseBuffer(batch->buf);
> > +             batch->buf =3D InvalidBuffer;
> > +     }
>
> It doesn't seem like it can be right that heapam releases an index buffer
> directly here.

I think that it makes perfect sense that the AM table controls this.
After all, holding on to buffer pins exists purely for heapam's
benefit -- it has nothing to do with index AM considerations.
(Actually, it kinda does with the _bt_killitems safety stuff, but IMV
that's just a table AM problem disguised as an index AM problem. And
one that can be fixed in a table-AM-agnostic way by always using the
LSN trick.)

The TID recycling hazards we're worried about here are inescapably
tied to how heapam VACUUM sets LP_DEAD stub line pointers in heap
pages to LP_UNUSED. It's entirely possible that another table AM will
unconditionally free each batch's buffer pin (which is close to what
heapam actually does). Or that such a table AM will have a completely
different policy.

> What, e.g., if the indexam requires multiple buffers to be
> pinned for correctness?

I agree that's not unusual. In fact, we often require this within hash
index scans.

That is handled in the later hash index patch by calling
IncrBufferRefCount against a to-be-returned batch's buffer just before
it is returned, as needed. This seems like a fairly elegant solution.
It allows the index AM to hold its own pin reference, which can be
released either before or after the table AM's reference, without any
special care.

> > +static pg_attribute_hot bool
> > +heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
> > +                                               TupleTableSlot *slot)
> > +{

> It might be worth putting this function into a static inline with a
> usebatchring argument, and calling it with a constant true/false from the
> top-level of heapam_index_getnext_slot().  Right now the code is a fair b=
it
> less tight due to both branches being supported inside the loop.
>
> Or even just storing scan->usebatchring in a local var (and passing it to
> index_fetch_heap()), so the compiler at least can realize that it won't c=
hange
> between iterations.

I experimented with these ideas yesterday, but for whatever reason
didn't see much of any win. I'll revisit it, though.

> > +                             tid =3D index_getnext_tid(scan, direction=
);
> > +
> > +                             if (tid !=3D NULL && scan->xs_want_itup)
> > +                                     scan->xs_visible =3D VM_ALL_VISIB=
LE(scan->heapRelation,
> > +                                                                      =
                                 ItemPointerGetBlockNumber(tid),
> > +                                                                      =
                                 &hscan->vmbuf);
>
> I'm a bit confused about the need for xs_visible (and also xs_heaptid, bu=
t
> that's old).  Afaict it's just used to store very transient information t=
hat
> could just be passed around upwards?

Yes. I agree that that's a little weird.

> Storing and then shortly after reading data can lead to increased latency=
 due
> to store-forwarding. And reading from memory is more expensive than just =
doing
> it via the register. And the compiler can't optimize the memory writes ou=
t in
> this kinda code, because it won't be able to proof it's never read later.

Again, I didn't see much success when I experimented with this
yesterday. But it could be a question of needing the right query to
see any benefit.

> > +                     else
> > +                     {
> > +                             /*
> > +                              * We didn't access the heap, so we'll ne=
ed to take a
> > +                              * predicate lock explicitly, as if we ha=
d.  For now we do
> > +                              * that at page level.
> > +                              */
> > +                             PredicateLockPage(hscan->xs_base.rel,
> > +                                                               ItemPoi=
nterGetBlockNumber(tid),
> > +                                                               scan->x=
s_snapshot);
> > +                     }
>
> FWIW, this does show up in profiles here (partially due to overhead of th=
e
> call, partially due to it being annoyingly expensive to get the block fro=
m a
> tid).

I've seen that too.

> Seems we could amortise this from once-per-tuple to once-per-block for
> the pretty common case of there being multiple tids for the same block in=
 a
> row?

It's not obvious that it'd be a win, since there's no convenient choke
point. We can't use IndexFetchHeapData.xs_blk for this, since that is
only set when the item was not all-visible (whereas this code path is
only used when it is). So we'd need to add a new field to
IndexFetchHeapData, just for this. We'd need to call
ItemPointerGetBlockNumber, to check if there's been any change in this
"all-visible heap block" field.

You might think that we could piggy-back some of this work on the read
stream callback. But this code is only used in the happy path, where
tuples are all-visible -- where we don't expect the read stream
callback to be called.

The most promising approach might also be the simplest: We could gate
the call to PredicateLockPage using something similar to
"MySerializableXact =3D=3D InvalidSerializableXact)". Maybe it'd be
possible to store that information only once, in IndexFetchHeapData,
or maybe in IndexScanDesc?

> > +                     /*
> > +                      * Return matching index tuple now set in scan->x=
s_itup (or return
> > +                      * matching heap tuple now set in scan->xs_hitup)=
.
> > +                      *
> > +                      * Note: we won't usually have fetched a heap tup=
le into caller's
> > +                      * table slot.  This is per the table_index_getne=
xt_slot contract
> > +                      * for scan->xs_want_itup callers.
> > +                      */
> > +                     return true;
>
> That contract is ... not very clear to me.  table_index_getnext_slot() sa=
ys:
>
>  * Fetch the next tuple from an index scan into `slot`, scanning in the
>  * specified direction. Returns true if a tuple was found, false otherwis=
e.
>  *
>  * The index scan should have been started via table_index_fetch_begin().
>  * Callers must check scan->xs_recheck and recheck scan keys if required.
>  *
>  * Index-only scan callers (that pass xs_want_itup=3Dtrue to index_begins=
can)
>  * can consume index tuple results by examining IndexScanDescData fields =
such
>  * as xs_itup and xs_hitup.  The table AM won't usually fetch a heap tupl=
e
>  * into the provided slot in the case of xs_want_itup=3Dtrue callers.
>
>
> I don't know what "won't usually" here really means and waht the caller i=
s
> supposed to do with that.

Most table_index_getnext_slot callers don't need to do anything about
this. But nodeIndexonlyscan.c and similar callers (basically all
index-only scan style callers) expect to examine certain IndexScanDesc
fields. They want to examine IndexScanDesc.xs_itup for the "slot that
was set by table_index_getnext_slot", and other similar items.

This is admittedly a bit odd. These callers do really need a table
slot -- that's required when heap accesses turn out to be needed on
the heapam side. But they don't really use the table slot directly
themselves.

Maybe it would make sense to invent a new
table_index_getnext_slot-like function, used only by the
IndexScanDesc.xs_itup callers. Passing the same table slot again and
again wouldn't be required with this function, which would be clearer
(the underlying heapam.c code would still require a table slot, but we
wouldn't be pretending to return heap tuples using that slot anymore).

> It's decidedly not free to store a tuple in a slot, so it seems a bit sil=
ly
> that we do so unnecessarily if we have to verify tuple visibility.

We're not actually doing so for these callers
(IndexScanDescData.xs_want_itup callers), except when we need to do
heap fetches.

> > From 34ad62ef8f43550ccc2c0b2e2c41f30205d37716 Mon Sep 17 00:00:00 2001
> > From: Peter Geoghegan <pg@bowt.ie>
> > Date: Thu, 12 Feb 2026 14:06:47 -0500
> > Subject: [PATCH v11 04/12] Don't allocate _bt_search stack
>
> IIC we've talked about this one for a while. Seems we could pull it ahead=
 of
> the larger 03?  Not pointlessly allocating memory only needed for modific=
ation
> during read-only searching seems like a hard-to-argue about improvement.

Will do. Maybe I can commit this one right after the earlier patches.

> > Subject: [PATCH v11 05/12] Limit get_actual_variable_range to scan one =
index
> >  leaf page.
>
> I skipped looking at this, as it seems like a separate axis. I have marke=
d it
> as something to look at in the other thread though.

It is at least somewhat related to the IndexScanDescData.xs_want_itup
callers I just went into.

> > From 3fbfb87b6f3ca3c4246c1d3145b452b25f398421 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <thomas.munro@gmail.com>
> > Date: Tue, 13 Jan 2026 20:44:14 +0100
> > Subject: [PATCH v11 06/12] Introduce read_stream_{pause,resume,yield}()=
.
>
>
> > Note: I (pgeoghegan) have extended read_stream_yield to support yieldin=
g
> > that avoids affecting the io_combine_limit mechanism.
>
> I doubt that goes far enough into fixing the problem with yielding
> early. Submitting a pending read before yielding makes sense, but it does=
n't
> really help with actually issuing IO early enough to actually get benefit=
s
> from AIO.

I agree.

> > +     /*
> > +      * Finally, remember new scan direction.
> > +      *
> > +      * Note: we needed to set xs_read_stream_dir to NoMovementScanDir=
ection
> > +      * momentarily to avoid spuriously prefetching more blocks from w=
ithin the
> > +      * read stream callback.  Once we return, the read stream can be =
used to
> > +      * fetch blocks in the opposite scan direction.
>
> I don't follow, if we are resetting, nothing should look at the
> xs_read_stream_dir.   Ah, I see that you elsewhere have a comment saying =
this
> is just needed due to a bug.

I fixed that bug in the past few days. It'll be in the next revision.


> Minor nit:  I find it pretty odd that we check for yielding if xs_yield_c=
heck
> is false.

> Where is that 4 coming from? It doesn't seem like having processed more t=
han 4
> batches is a particularly good proxy for anything?

> I don't really understand what logic behind yielding is.

(...Various other well-justified criticisms of yielding omitted...)

I like your idea of fixing the problems in this area (in the most
recent email, that I have yet to begin to respond to) by inventing a
new READ_STREAM_SLOW_START, and using it selectively, as indicated by
a hint passed in by the executor. Since I'm going to pursue that
approach next, there isn't much point in responding to your specific
points about yielding now.

> What's the workload showing the need for yielding most extremely? Is that=
 the
> merge join case we talked about before? I'd like to experiment with a few
> other approaches.

Yes.

> Ran out of time & energy for the moment, more later.

Thanks for the review!

--
Peter Geoghegan