MIME-Version: 1.0
References: <d2d4qofb5ajg2ftvm6h56oi4utdwpzkqfjd7z2y4vod5qaub4h@ixyotvfut3mg>
 <CAH2-WznoD7vhjZNDj-5OrLp+1fjvW6ypEUwZ1=ieadefgWaTDQ@mail.gmail.com>
 <ayjpwpm5cn6ng2bgedhz3ckbjrxocbsbywhlghwxxz2p6a5tgr@jubomhsjkvcl>
 <CAH2-Wznxu+AFz-EBOG-XiRA_R3nXLp45NEiGSD3ebx3h=OKPAw@mail.gmail.com>
 <vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx@mof5q6li3xzv>
 <CAH2-Wzn1j2a0p3OqmqrV6zADtWA_QpG82U6F9yCYG1Uschm_fA@mail.gmail.com>
 <CAH2-WzmCH+N2-H2oGSQcbn2fArbk7GXyD6rQN6kn5P=FX9R-_g@mail.gmail.com>
 <CAH2-WzkyG01682zwqyUTwV=Zq+M_qGgi1NbXwp1H-piRSfJsgQ@mail.gmail.com>
 <CAH2-Wz=HJc+QV2AZ9mUY43aKL+n+a1JQ-7OGE=MOkqSAtoKJug@mail.gmail.com>
 <t6mtqbv2mbfhjni4bvwdgoecppjmxvbyfwl6utovzv76xc2672@k3o5ryevaeqv>
 <bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh>
In-Reply-To: <bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh>
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 21 Mar 2026 21:17:08 -0400
Message-ID: 
 <CAH2-Wz=9Wc7T6xMxynHNu6majG_Q+=1v_OXCyYC-PbvagsaTrQ@mail.gmail.com>
Subject: Re: index prefetching
To: Andres Freund <andres@anarazel.de>
Cc: Tomas Vondra <tomas@vondra.me>,
 Alexandre Felipe <o.alexandre.felipe@gmail.com>,
	Thomas Munro <thomas.munro@gmail.com>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>,
	Robert Haas <robertmhaas@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAH2-Wz%3D9Wc7T6xMxynHNu6majG_Q%2B%3D1v_OXCyYC-PbvagsaTrQ%40mail.gmail.com>
Precedence: bulk

On Sat, Mar 21, 2026 at 3:54=E2=80=AFPM Andres Freund <andres@anarazel.de> =
wrote:
> I'm still wondering about whether 0003 can be split apart - it's just hug=
e and
> does a lot of things at once. I find it hard to get patches of that size =
into
> a mergeable state at once.
>
> What about splitting out the changes to the indexscan API, i.e. moving IO=
S
> handling to indexam.c / heapam_handler.c into its own commit?  I feel lik=
e
> that's a sizeable chunk that needs to touch a lot of places that don't ne=
ed to
> be touched in the commit adding batching. And it's such an architectural
> improvement that it's a worthwhile thing to do on its own.

I'll try to get that done for v18 (the version *after* the next
version I'll post, not the next version/v17). I want to commit the 2
patches that are now ready ahead of producing a v17. I don't want to
wait for a solution to this unrelated problem (nor do I want to leave
CFTester unhappy).

> One advantage of that would be that we could merge the VISITED_PAGES_LIMI=
T
> before the introduction of batches.

Good point.

> If the the amgetbatch introduction didn't also replace ammarkpos/amrestrp=
os
> with amposreset (see also question in previous email), I'd say that I'd s=
plit
> off the introduction of the batch interface from the conversion of btree =
to
> it. But it looks like it'd be not entirely trivial to keep both interface=
s
> around, so maybe that's not worth it.

Right. The whole concept of a batch kinda exists in nbtree already,
and has for a long time: there's always been markPos, which overwrites
the scan's currPos on restore. This shares many of the same features
(e.g., an independent markTuples array is used during index-only
scans). So there's just no practical way to separate the
btgetbatch/nbtree changes from the core changes that add functionality
to indexbatch.c.

> > +     /*
> > +      * Limit on distinct heap pages visited before giving up (0 =3D n=
o limit).
> > +      * Used by selfuncs.c to bound the cost of get_actual_variable_en=
dpoint().
> > +      */
> > +     uint8           xs_visited_pages_limit;
>
> I'd maybe phrase it a bit more vaguely, e.g
>
>     An approximate limit on the amount of work, measured in pages touched=
,
>     imposed on the index scan. The default, 0, means no limit. Used by
>     selfuncs.c to bound the cost of get_actual_variable_endpoint().
>
> so that, in the near future (presumably 20), we can just start to also ta=
ke
> the number of index pages visited into account.

Fixed, using your wording now.

> Because I do think your
> concern that leaving out the number of index pages from the limit is
> correct. The easiest way there seems to be to just have one shared limit.

Yeah, there's just no question that it's a real problem.

> > @@ -824,6 +824,7 @@ heapam_index_only_getnext_slot(IndexScanDesc scan, =
ScanDirection direction,
> >                               heap_continue =3D false;
> >
> >       Assert(scan->xs_want_itup);
> > +     Assert(scan->xs_visited_pages_limit =3D=3D 0);

> Why do we not need to support this for amgettuple indexes?

Fixed -- amgettuple index AMs will support the xs_visited_pages_limit
stuff in the next version.

> I wonder if the accounting for the number of visited pages wouldn't bette=
r be
> done in heapam_index_fetch_tuple(), as that already has a dedicated branc=
h for
> the cross-heap-page-boundary case.  I guess the current location was
> convenient because it provides for an easy control flow for ending the sc=
an?

That's right, that's what makes it convenient.

> > @@ -7180,6 +7180,7 @@ get_actual_variable_endpoint(Relation heapRel,
> >                                                                &Snapsho=
tNonVacuumable, NULL,
> >                                                                1, 0);
> >       Assert(index_scan->xs_want_itup);
> > +     index_scan->xs_visited_pages_limit =3D VISITED_PAGES_LIMIT;
> >       index_rescan(index_scan, scankeys, 1, NULL, 0);
>
> Wonder if it'd be worth doing this via a wrapper like index_limit_effort(=
)
> that could check that the scan is an IOS, uses batch mode etc.

We could always add it in the future. Maybe when fixing the issue with
scanning too many index leaf pages.

> > Testing has shown that index prefetching can make index scans much
> > faster.  Large range scans that return many tuples can be as much as 35=
x
> > faster.
>
> I think on networked storage (unfortunately what most postgres instances =
run
> on these days), there's easily much bigger wins :)

I'm excited to see what users do with this. That's such an enormous
difference that it likely enables using the system in ways that
wouldn't have made sense in the past.

> Perhaps worth mentioning that the wins are bigger with higher latency sto=
rage?

Will fix for next version.

> > The heuristic used to decide when to begin prefetching -- delaying unti=
l
> > the scan's second batch -- is imperfect.
>
> Perhaps worth introducing the heuristic, and the need for it, in a bit mo=
re
> detail?

Will do. I need to mention that the heuristic doesn't handle certain
cases well (which we've discussed privately). These are nestloop join
queries with a limit on the inner side of the scan that the prior
batch heuristic doesn't quite save. (Previously, the tuples_needed
patch helped with this, but since we're removing that patch, it makes
sense to frame this as a deficiency of the priorbatch heuristic.)

> > Selective index scans that
> > access randomly-ordered heap pages will benefit from prefetching, but
>
> s/will/would/?

Will fix.

> Continuing to think that a bunch of this really doesn't belong into relsc=
an.h
> but should be in a dedicated indexbatch.h or such.

It'll work that way in the next revision.

> Just reminded me: At some point soon we're going to have to adjust the co=
sting
> model to account for prefetching. I think there will be a fewer cases whe=
re
> bitmap scans are the correct choice than there are today...

I think so too.

Obviously, the main benefit here is significantly improved
performance. But ISTM that there's a second order benefit: the
uncertainty about how much slower an index scan will be in the worst
case is likely much lower. This is particularly true of index-only
scans that require at least some heap fetches; it's difficult to
predict how many a given scan will perform. While it isn't all that
bad when the heap pages are in shared_buffers, it's much much slower
when they aren't.

The optimizer has only a very basic understanding of how much data is
likely cached in shared_buffers. It's very hard to improve that model
because it relies so much on things that can change quickly. Making
the true cost profile of index scans simpler and more predictable
might make the optimizer's cost model much more accurate (perhaps with
a little work on the cost function side). It's just easier to fit a
cost function to that profile because being wrong (about IoS heap
fetches or correlation) matters significantly less.

> I suspect there's a good argument for moving the decision about whether t=
o
> enable prefetching to the planner instead of doing it here. But doing it =
here
> doesn't have a lot of architectural impact for now, so we don't necessari=
ly
> have to move the point of decision making now.

I think that it makes sense for the optimizer to give advice without
being prescriptive. Basically something a bit like the tuples_needed
patch, that table AMs are free to ignore when they find the hint to be
wrong.

> I think heapam_getnext_stream() is too generally named. For one it could =
just
> as well be for a sequential scan, for another it sounds like it's a funct=
ion
> like heap_getnext() or such.  Maybe heapam_index_scan_next_block() or suc=
h?

Fixed, will use your name in the next version.

> > +     /* Reset read stream direction unconditionally */
> > +     hscan->xs_read_stream_dir =3D NoMovementScanDirection;
>
> That's obviously the case given the assignment, but it doesn't even hint =
at
> why?

It's just defensive. I'll add a comment to that effect.

> > @@ -192,7 +210,14 @@ heapam_index_fetch_tuple(struct IndexFetchTableDat=
a *scan,
> >               if (BufferIsValid(hscan->xs_cbuf))
> >                       ReleaseBuffer(hscan->xs_cbuf);
> >
> > -             hscan->xs_cbuf =3D ReadBuffer(hscan->xs_base.rel, hscan->=
xs_blk);
> > +             /*
> > +              * When using a read stream, the stream will already know=
 which block
> > +              * number comes next (though an assertion will verify a m=
atch below)
> > +              */
> > +             if (hscan->xs_read_stream)
> > +                     hscan->xs_cbuf =3D read_stream_next_buffer(hscan-=
>xs_read_stream, NULL);
> > +             else
> > +                     hscan->xs_cbuf =3D ReadBuffer(hscan->xs_base.rel,=
 hscan->xs_blk);
> >
> >               /*
> >                * Prune page when it is pinned for the first time
>
> The assertion mentioned is only after the call to heap_pageprune_opt(). W=
hich
> is probably fine, but perhaps it's worth just doing it immediately after =
the
> above if block, so if the assertion is triggered one can figure out quick=
ly
> it's the fault of the read stream getting confused.

But that means that it won't be triggered when we don't enter the "if
(hscan->xs_blk !=3D ItemPointerGetBlockNumber(tid))" block that contains
all this code. Besides, it just doesn't seem possible that
heap_page_prune_opt would release its caller's pin.

> I think it'd be worth also explaining why it's safe to use the cached VM =
bits
> in the case where the tuples on an all-vsibile pge were only deleted afte=
r
> after we fetched the VM. I.e. that it's ok for this scan to think the pag=
e is
> all visible, because for the purpose of the tids it already had fetched f=
rom
> the index, it was all visible, i.e. all those tids will also be visible t=
o the
> current mvcc snapshot.

Added a small additional paragraph to that effect to my pending v17 of
the patch set.

> > @@ -558,6 +609,35 @@ heapam_batch_getnext(IndexScanDesc scan, ScanDirec=
tion direction,
>
> I'm also worried a bit about the namespacing of these functions (were
> introduced in an earlier commit, but ...). Not at all clear that a functi=
on of
> that name couldn't be part of a seqscan or such.
>
> Maybe we should prefix all of these in something like heapam_iscan_ or su=
ch?

I'm going to rename things as follows, which is both self-consistent
and as consistent with established convention as possible:

heapam_batch_getnext_tid -> heapam_index_getnext_tid
heapam_batch_getnext -> heapam_index_batch_getnext
heapam_batch_resolve_visibility -> heapam_index_resolve_visibility
heapam_batch_return_tid -> heapam_index_return_tid
heapam_dirchange_readstream_reset -> heapam_index_dirchange_reset

That'll be in the next version.

> > +             if (!hscan->xs_read_stream && priorBatch && scan->MVCCSca=
n &&
> > +                     hscan->xs_blk !=3D InvalidBlockNumber &&  /* for =
index-only scans */
> > +                     enable_indexscan_prefetch)
>
> I wonder if it'd be worth moving this into its own helper function
> (e.g. heapam_index_consider_prefetching()), just to make it a bit more ob=
vious
> that that's an important decision point.  I'd also expect the logic to be=
come
> a bit more complicated in the future.

That'll be in the next revision, too.

> It's too bad efficiency wise that we can't safely use
> READ_STREAM_USE_BATCHING.  For some workloads it looks like it's a decent
> improvement.  We could probably do the work to make it safe (basically ex=
iting
> batchmode in heapam_getnext_stream() when we have to do index IO and then
> re-enabling it afterwards), but that's clearly for later.

Agreed. Though I wasn't able to see much benefit here, when I tried that.

> > + */
> > +static pg_noinline void
> > +heapam_dirchange_readstream_reset(IndexFetchHeapData *hscan,

> Comparing this to heapam_index_fetch_reset():
> Should this also invalidate hscan->xs_prefetch_block?

It isn't necessary, because hscan->xs_prefetch_block is initialized
from scratch when the read stream callback is called following our
invalidating prefetchPos.

> And the other way round, should heapam_index_fetch_reset() not reset
> batchringbuf->prefetchPos.valid?

Nope, for the same reason.

> > +             /*
> > +              * If we're about to release the batch that prefetchPos c=
urrently
> > +              * points to, just invalidate prefetchPos.  We'll reiniti=
alize it
> > +              * using scanPos if and when heapam_getnext_stream is nex=
t called. (We
> > +              * must avoid confusing a prefetchPos->batch that's actua=
lly before
> > +              * headBatch with one that's after nextBatch due to uint8=
 overflow;
> > +              * simplest way is to invalidate prefetchPos like this.)
> > +              */
> > +             if (prefetchPos->valid &&
> > +                     prefetchPos->batch =3D=3D batchringbuf->headBatch=
)
> > +                     prefetchPos->valid =3D false;
> > +
>
> What causes us to reach this condition?

As a general rule, prefetchPos can only advance in the read stream
callback. It is mostly up to the read stream callback to notice when
prefetchPos has fallen behind so it can reinitialize from scanPos. But
we need this extra check to make its approach robust against
prefetchPos->batch getting so far behind that uint8 wrap makes it look
like it's ahead instead of behind.

Importantly, while prefetchPos can fall behind scanPos, it's only
possible in a limited, trivial sense: it can only fall behind for as
long as we go without calling the read stream callback. Once the read
stream callback is called, we can rely on its postcondition: when it
returns, prefetchPos must be >=3D scanPos. So, in effect, everything
behaves exactly as it would if it really was impossible for
prefetchPos to be < scanPos for even a picosecond (we don't want to
have to check prefetchPos needlessly in hot code paths).

> Seems like we should take care that
> the scan position doesn't overtake the prefetch position?

That's precisely what we're doing here. We're just not checking
prefetchPos on every item returned (we only do so when advancing to
the next batch) to save cycles.

> Does this only
> happen when paused?

This "prefetchPos->valid =3D false" stuff is approximately the opposite
of pausing. Pausing resolves the problem of prefetchPos getting so far
ahead of scanPos that the batch ring buffer runs out of slots. Whereas
this prefetchPos invalidation code helps the read stream deal with
prefetchPos falling behind scanPos.

This is a fairly rare edge-case, that doesn't actually need
consideration in the high-level design -- in a certain sense the high
level design pretends that prefetchPos cannot fall behind scanPos for
even one picosecond (it can, but not in a way that actually
matters/affects the invariants for the read stream callback).

Note that prefetchPos falling behind scanPos is most likely during
index-only scans that require some heap fetches, but not too many. The
read stream is seldom called, but is called with such a scan. So
prefetchPos could fall far behind scanPos (or would, if we didn't have
this code).

> > +     /*
> > +      * prefetchPos might not yet be valid.  It might have also fallen=
 behind
> > +      * scanPos.  Deal with both.
> > +      *
> > +      * If prefetchPos has not been initialized yet, that typically in=
dicates
> > +      * that this is the first call here for the entire scan.  We init=
ialize
> > +      * prefetchPos using the current scanPos, since the current scanB=
atch
> > +      * item's TID should have its block number returned by the read s=
tream
> > +      * first.
>
> Just for my understanding: I assume right now we'll always at the start o=
f a
> scan batch when prefetchPos is invalid?

Not quite.

We might have invalidated prefetchPos explicitly in that code path
that you just asked about. We deal with prefetchPos falling behind
scanPos there, but also here, in the read stream -- which has a
"index_scan_pos_cmp(prefetchPos, scanPos, xs_read_stream_dir) < 0"
check for that.

> I guess this could (not necessarily should) be avoided by having
> heapam_getnext_stream() advance the prefetchPos beyond a block of runs wi=
th
> the same buffer before returning?  If we did that, we should never fall b=
ehind
> anymore, is that right?

The problem isn't what the read stream callback does or doesn't do. As
I said, we can rely on its postcondition: when it returns, prefetchPos
must be >=3D scanPos. However, there's no way to be sure the read stream
callback will get called again before we consume many more batches --
particularly during index-only scans that need only a few heap fetches
(but also with plain index scans that see a run of leaf pages that
each have items that all point to the same heap block, which is
unlikely but certainly possible).

> Wonder if it's worth somehow asserting that after this the page is actual=
ly
> unguarded after the call.

We used to, but the new layering forced me to remove it. Any ideas
about how to add it back?

> > +             /*
> > +              * We never call amgetbatch without immediately releasing=
 the batch's
> > +              * index AM resources (which requires special care during=
 index-only
> > +              * scans).  The read stream is sensitive to buffer shorta=
ges, so we
> > +              * defensively avoid anything that visibly affects the pe=
r-backend
> > +              * buffer limit.
> > +              */
>
> It's probably just me (not the read stream) being sensitive, but I'd say =
that
> the read stream tries to be careful about not pinning too many buffers, a=
nd
> that's harder to do reliably if there are variable numbers of pins taken
> without such care...

Will use that wording.

> > +                             hscan->xs_paused =3D true;
> > +                             return read_stream_pause(stream);
> > +                     }
>
> Just for my understanding: What is the easiest way to hit this?  I assume=
 a
> well correlated indexscan with small heap tuples (and thus few heap pages=
) is
> the most obvious path?

Depends on what you mean. If you just want to test it quickly, then
you can easily do so by temporarily reducing INDEX_SCAN_MAX_BATCHES to
2 (the minimum usable value). If you meant to ask the most likely way
this will happen in the real world, I can think of a couple of things.

Possibly, the easiest way is to perform an index-only scan that
requires only a few heap fetches. These fetches can be spaced out such
that we have a hard time maintaining an adequate prefetch distance,
and have to read many leaf pages just to do one more fetch. (In
general, these kinds of index-only scans might be the best way to
create extreme conditions in the read stream callback.)

> > +                     prefetchBatch =3D heapam_batch_getnext(scan, xs_r=
ead_stream_dir,
> > +                                                                      =
                        prefetchBatch, prefetchPos);
> > +                     if (!prefetchBatch)
> > +                     {
> > +                             /*
> > +                              * Failed to load next batch, so all the =
batches that the scan
> > +                              * will ever require (barring a change in=
 scan direction) are
> > +                              * now loaded
> > +                              */
> > +                             return InvalidBlockNumber;
> > +                     }
>
> I assume failed means that we reached the end of the scan? If so failed s=
ounds
> a bit odd.

Fixed.

> > From 01164d1c646d441b508fb7b67fc8bea7583cb222 Mon Sep 17 00:00:00 2001
> > From: Peter Geoghegan <pg@bowt.ie>
> > Date: Sun, 18 Jan 2026 11:32:52 -0500
> > Subject: [PATCH v16 16/17] Add fake LSN support to hash index AM.

> If you reformulated the commit message to not reference the commit I see =
no
> real reason to hold off pushing this?

I'll commit this one ahead of the next revision, then.

> Admittedly I only skimmed it, but it looks like it looks pretty much the =
same
> as when I last looked at it (which is fine, I think it was ready to commi=
t
> then, except for the dependency on XLogGetFakeLSN, which hadn't yet been
> committed)?

Right.

> > @@ -510,7 +512,11 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf,=
 Buffer ovflbuf,
> ..
> > +     else                                            /* !RelationNeeds=
WAL(rel) */
> > +     {
> > +             recptr =3D XLogGetFakeLSN(rel);
> > +
> > +             /* Determine if wbuf is modified */
> > +             if (nitups > 0)
> > +                     mod_wbuf =3D true;
> > +             else if (is_prev_bucket_same_wrt)
> > +                     mod_wbuf =3D true;
> > +     }
>
> Any reason to not just compute mod_wbuf before the if (NeedsWal()), just =
like
> you did for is_prim_bucket_same_wrt etc?

No good reason. Will fix.

Thanks again for the review!
--=20
Peter Geoghegan