From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 24 Mar 2026 21:28:23 -0400
Message-ID: 
 <CAH2-WzkgAN5C7kf6h5DbQzoKjfL3OA94PQos+yXX7uoHPOwQRg@mail.gmail.com>
Subject: Re: index prefetching
To: Andres Freund <andres@anarazel.de>
Cc: Tomas Vondra <tomas@vondra.me>,
 Alexandre Felipe <o.alexandre.felipe@gmail.com>,
	Thomas Munro <thomas.munro@gmail.com>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>,
	Robert Haas <robertmhaas@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAH2-WzkgAN5C7kf6h5DbQzoKjfL3OA94PQos%2ByXX7uoHPOwQRg%40mail.gmail.com>
Precedence: bulk

On Tue, Mar 24, 2026 at 12:26 PM Andres Freund <andres@anarazel.de> wrote:
> > > > This is preparatory work for an upcoming commit that will need xs_blk
> > > > to manage buffer pin transfers between the scan and the executor slot.
> > >
> > > A subsequent commit adds an earlier ExecClearTuple(slot) to make the buffer
> > > refcount decrement cheaper (due to hitting the one-element cache in
> > > bufmgr.c). I wonder if it's worth pulling that into this commit? Mostly to
> > > make that larger commit smaller.
> >
> > Is it really worth doing that without also doing the
> > xs_lastinblock/ExecStorePinnedBufferHeapTuple stuff? We need batches
> > to do the latter.
>
> I see perf benefits from it alone, yes.

Does that mean it should be in a separate commit?

The upcoming v18 will significantly change the patch set's structure
by breaking up the big patch so that the slot interface changes are in
their own commit, along with pushing the VM accesses down into heapam.

That feels like a big enough change on its own. There is bound to be a
v19, so to me it makes sense to defer a decision on this kind of thing
until you see v18.
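
For readers following along, the one-element cache point quoted above
can be illustrated with a minimal standalone sketch. Every name here
(ToyRefCounts, toy_pin, toy_unpin, and so on) is hypothetical; this is
not the bufmgr.c code and not code from the patch set. It assumes only
what the quoted text says: a refcount decrement is cheaper when it hits
a cache that remembers a single entry. The presumed benefit of an
earlier ExecClearTuple(slot) is that the unpin happens while that
buffer is still the remembered entry.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NSLOTS 8

/* Hypothetical per-backend refcount entry; buffer == 0 means "unused". */
typedef struct ToyRefCountEntry
{
    int buffer;
    int refcount;
} ToyRefCountEntry;

typedef struct ToyRefCounts
{
    ToyRefCountEntry slots[TOY_NSLOTS];
    ToyRefCountEntry *last;     /* one-element cache: last entry touched */
    int nslowlookups;           /* how often the cache missed */
} ToyRefCounts;

/* Find (or create) the entry for a buffer, trying the cache first. */
ToyRefCountEntry *
toy_lookup(ToyRefCounts *rc, int buffer)
{
    ToyRefCountEntry *unused = NULL;

    assert(buffer > 0);

    /* Fast path: same buffer as the previous lookup */
    if (rc->last != NULL && rc->last->buffer == buffer)
        return rc->last;

    /* Slow path: scan every slot, remembering the first unused one */
    rc->nslowlookups++;
    for (int i = 0; i < TOY_NSLOTS; i++)
    {
        if (rc->slots[i].buffer == buffer)
        {
            rc->last = &rc->slots[i];
            return rc->last;
        }
        if (unused == NULL && rc->slots[i].buffer == 0)
            unused = &rc->slots[i];
    }

    assert(unused != NULL);     /* the toy table never overflows here */
    unused->buffer = buffer;
    unused->refcount = 0;
    rc->last = unused;
    return unused;
}

void
toy_pin(ToyRefCounts *rc, int buffer)
{
    toy_lookup(rc, buffer)->refcount++;
}

void
toy_unpin(ToyRefCounts *rc, int buffer)
{
    ToyRefCountEntry *entry = toy_lookup(rc, buffer);

    assert(entry->refcount > 0);
    if (--entry->refcount == 0)
        entry->buffer = 0;      /* release the slot for reuse */
}
```

In this toy model, unpinning a buffer right after pinning it never takes
the slow path, while unpinning it after some other buffer has been
pinned always does.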

> > > The commit message doesn't mention that this affects ammarkpos/amrestrpos.
> >
> > It does not. But FWIW there's a prominent "Note" about it in the SGML docs.
>
> Approximately nobody looking at the commit to see what they need to change
> will see that...

In any case v17 updated the commit message to point out the removal of
ammarkpos/amrestrpos.

> > No, I didn't try. I doubt saving space is worthwhile, since we'll need
> > a relatively large allocation for batches used during index-only scans
> > regardless.
>
> Seems fine to not care for now. But, FWIW, the motivating reason wouldn't be
> to really save memory, it'd be to make it more likely the data fits into a
> higher level of the cache.

Understood.

> > We're now advertising that indexam_util_batch_unlock is optional, and
> > that index AMs can go their own way when needed.
>
> Fair enough.

Cool.

> > > > +
> > > > +     /*
> > > > +      * heap blocks fetched counts (incremented by index_getnext_slot calls
> > > > +      * within table AMs, though only during index-only scans)
> > > > +      */
> > > > +     uint64          nheapfetches;
> > > >  } IndexScanInstrumentation;
> > >
> > > s/heap/table/ for anything new imo.
> >
> > Usually I'd agree, but here we're using the same name as the one
> > presented in EXPLAIN ANALYZE. IMV this should match that. (Maybe the
> > EXPLAIN ANALYZE output should change, and this field name along with
> > it, but that's another discussion entirely.)
>
> I think if we add it into more and more places it'll get harder and harder to
> eventually fix...

Okay, I'll make this change for v18.

> Code reuse and making it easier for other AMs to adapt this...

Of course, that was what I meant.

> > > I also suspect it'd be worth creating a new heapam.c file for this new
> > > code. heapam_index.c or such.
> >
> > I had thought about that myself. How would that be structured, in
> > terms of the commits?
>
> I'm imagining something like heapam_iscan.c or heapam_indexfetch.c or such.

> I'd introduce it by adding it in a commit that moves heap_hot_search_buffer(),
> and heapam_index_fetch_{begin,reset,end}() into it.
>
> Moving heap_hot_search_buffer() into the same file will be nice because it'll
> allow partial inlining of it into some really performance sensitive functions.

I'm not opposed, of course. But let's leave that question until after
I post v18.

> I think we should move the seq/tid scan stuff into its own file too, but
> that's obviously a separate thread.

Makes sense.

> > > Any reason this isn't in index_beginscan_internal(), given both
> > > index_beginscan() and index_beginscan_parallel() need it?  I realize you'd
> > > need to add arguments to index_beginscan_internal(), but I don't see a problem
> > > with that.  Alternatively a helper for this seems like a possibility too.
> >
> > Fixed, by moving much more of the initialization done by each variant
> > (index_beginscan, index_beginscan_bitmap, index_beginscan_parallel)
> > into index_beginscan_internal itself.
>
> Nice.  Haven't checked out your new version yet - are you doing that as a
> separate commit?

Maybe. Again, let's see how things shake out in v18, and then revisit.

> I don't think we necessarily need the coverage of your full torture test suite
> in core, but I feel some basic sanity tests really ought to be in the core
> tests. There's very little, from what I can tell.  Even just making sure we
> have coverage for index [only] scans going forward and backward, and the
> same for merge & nestloop joins would be quite a win.

We probably have that level of coverage. What we lack is coverage for
things that are actually tricky.

For example, the regression tests lack coverage for an nbtree scan
that uses a scrollable cursor + an SAOP scan key that backs up across
a page boundary, relying directly on a call to the new amposreset
routine for correct behavior. Nor do we have the equivalent SAOP merge
join that restores a mark across a page/batch boundary (note that we
can safely skip the amposreset in the restore "scanBatch == markBatch"
happy path).
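
As a rough illustration of that happy path, here is a minimal
standalone sketch. The types and functions (ToyScan, toy_markpos,
toy_restrpos, toy_posreset) are hypothetical stand-ins and are not
taken from the patch; the only thing assumed from the text above is
that restoring a mark within the batch the scan is already on can skip
the amposreset-style call, while restoring across a batch boundary
cannot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one batch (roughly, one index page of items). */
typedef struct ToyBatch
{
    int id;
} ToyBatch;

/* Hypothetical scan state; none of these field names come from the patch. */
typedef struct ToyScan
{
    ToyBatch *scanBatch;    /* batch the scan is currently reading */
    int scanItem;           /* current offset within scanBatch */
    ToyBatch *markBatch;    /* batch holding the mark, or NULL if none */
    int markItem;           /* marked offset within markBatch */
    int nposresets;         /* counts calls to the amposreset stand-in */
} ToyScan;

/* Stand-in for telling the index AM that the scan position moved. */
void
toy_posreset(ToyScan *scan)
{
    scan->nposresets++;
}

/* Remember the current position. */
void
toy_markpos(ToyScan *scan)
{
    assert(scan->scanBatch != NULL);
    scan->markBatch = scan->scanBatch;
    scan->markItem = scan->scanItem;
}

/* Return to the marked position; returns false if nothing was marked. */
bool
toy_restrpos(ToyScan *scan)
{
    assert(scan != NULL);
    if (scan->markBatch == NULL)
        return false;

    if (scan->scanBatch != scan->markBatch)
    {
        /* Restore crosses a batch boundary: the AM must be told */
        scan->scanBatch = scan->markBatch;
        toy_posreset(scan);
    }
    /* else: "scanBatch == markBatch" happy path, no reset needed */

    scan->scanItem = scan->markItem;
    return true;
}
```

A test that only ever restores within a single batch exercises the
else branch alone, so it would pass even if the cross-batch branch
were wrong.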

The underlying difficulty is that these things tend to "accidentally
fail to fail". Some of these cases were rather difficult to write any
kind of test for, even for my own purposes.


> > IOW, when we're called through index_batchscan_end (specifically, when
> > we're called through index_batchscan_reset when it is called by
> > index_batchscan_end) we don't want to use the cache. It seemed to make
> > sense to implement this in a way that didn't require any special
> > handling from within indexam_util_batch_release (since it's not a
> > concern of index AMs).
>
> I am not really following.  Right now there's two close copies of this code:

> Yes, one of them has the additional "if (scan->xs_heapfetch)" condition, but
> that hardly seems like a real problem preventing sharing the code.

I'll try to address this for v18.

Thanks
-- 
Peter Geoghegan