public inbox for [email protected]
help / color / mirror / Atom feedFrom: SATYANARAYANA NARLAPURAM <[email protected]>
To: Peter Geoghegan <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Alexandre Felipe <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Konstantin Knizhnik <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Sat, 28 Mar 2026 23:15:55 -0700
Message-ID: <CAHg+QDdsOe_7wyN_fdeKQmGkCTGkhnkSUCSNGOFWQZZTO0CdUQ@mail.gmail.com> (raw)
In-Reply-To: <CAH2-Wz=kMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G=zrw@mail.gmail.com>
References: <CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com>
<d2d4qofb5ajg2ftvm6h56oi4utdwpzkqfjd7z2y4vod5qaub4h@ixyotvfut3mg>
<CAH2-WznoD7vhjZNDj-5OrLp+1fjvW6ypEUwZ1=ieadefgWaTDQ@mail.gmail.com>
<ayjpwpm5cn6ng2bgedhz3ckbjrxocbsbywhlghwxxz2p6a5tgr@jubomhsjkvcl>
<CAH2-Wznxu+AFz-EBOG-XiRA_R3nXLp45NEiGSD3ebx3h=OKPAw@mail.gmail.com>
<vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx@mof5q6li3xzv>
<CAH2-Wzn1j2a0p3OqmqrV6zADtWA_QpG82U6F9yCYG1Uschm_fA@mail.gmail.com>
<CAH2-WzmCH+N2-H2oGSQcbn2fArbk7GXyD6rQN6kn5P=FX9R-_g@mail.gmail.com>
<CAH2-WzkyG01682zwqyUTwV=Zq+M_qGgi1NbXwp1H-piRSfJsgQ@mail.gmail.com>
<CAH2-Wz=HJc+QV2AZ9mUY43aKL+n+a1JQ-7OGE=MOkqSAtoKJug@mail.gmail.com>
<t6mtqbv2mbfhjni4bvwdgoecppjmxvbyfwl6utovzv76xc2672@k3o5ryevaeqv>
<CAH2-Wz=D4Lru9BkvqaRnFRPDaZbfTOdWcxw13zyG6GVFTtz_vw@mail.gmail.com>
<CAH2-Wz=Vxsgas35ZzOJJW1ceqp9TJ2DFhKmXULwUAcVpfD73xA@mail.gmail.com>
<CAH2-Wz=kMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G=zrw@mail.gmail.com>
Hi Peter,
On Fri, Mar 27, 2026 at 5:36 PM Peter Geoghegan <[email protected]> wrote:
> On Sun, Mar 22, 2026 at 9:14 PM Peter Geoghegan <[email protected]> wrote:
> > V17 is attached. This addresses most of your feedback, but defers
> > dealing with the trickier parts for now.
>
> Attached is v18, which addresses the tricky issues I skipped in v17.
>
> Highlights:
>
> * New commit (the first) that adds a heapam_indexscan.c file, and
> moves all of the relevant existing heapam_handler.c functions over to
> this new file. This commit is strictly mechanical, and has no
> functional impact.
>
> * Another new commit (the second) adds the new slot-based index scan
> interface, and pushes the VM lookups that currently take place in
> nodeIndexonlyScan.c into heapam. This includes the instrumentation
> changes that were previously in their own small patch. It also
> includes the VISITED_PAGES_LIMIT stuff -- there's no longer any commit
> that breaks VISITED_PAGES_LIMIT, nor is there a later commit that
> subsequently restores the functionality.
>
> * We now extract one key piece of infrastructure that other table AMs
> can now reuse to manage their own index scan batches: a new inline
> function called tableam_util_fetch_next_batch now appears as an
> indexbatch.h inline function.
>
> This function (which appeared under a different name in v17) manages
> deciding whether to call amgetbatch (performing the amgetbatch call
> when needed) or to simply return the next batch in line (because such
> a batch already exists/is already loaded in the ring buffer from
> earlier). It also detects and handles cross-batch changes in scan
> direction, and sets/reads the fields that track when a batch is
> definitely the last one in the given scan direction.
>
> I tried to go further by also placing the bulk of the code in
> heapam_index_getnext_scanbatch_pos in indexbatch.h, but ultimately
> concluded that it wasn't worth it. There isn't much code in
> heapam_index_getnext_scanbatch_pos to generalize. Some of that code
> coordinates quite directly with the heapam read stream callback.
>
> * Added an "isGuarded" field to IndexScanBatchData. That way we can
> assert that any batch loaded within the read stream callback is
> already unguarded/has no batch index page buffer pin still held. The
> field isn't just for assertions, though. We also rely on it in one or
> two places during release builds, which seemed clearer to me. For
> example, we no longer require that amunguardbatch routines be
> idempotent, which seems like a minor improvement.
>
> * Other work is broken out into its own commit, to better highlight
> issues that aren't strictly related to the amgetbatch changes.
> Specifically, a new commit adds the "don't unpin on rescan" behavior.
> These changes, which first appeared in the main amgetbatch commit, aim
> to minimize regressions with cached workloads (this was discussed
> briefly upthread, plus Andres and I discussed it over IM some weeks
> back).
>
> * Replaced what was previously an assert with an equivalent
> pg_assume(), within heapam_index_getnext_scanbatch_pos -- which is a
> very hot function. We now "pg_assume(batchringbuf->headBatch ==
> scanPos->batch)", allowing the compiler to exploit this existing
> invariant. This seems to help the compiler generate better code,
> presumably by allowing more efficient register allocation. I haven't
> looked at the disassembly just yet, but it helps with certain
> microbenchmarks enough to matter.
>
> * The "Add UnlockBufferGetLSN utility function" commit now uses an
> atomic operation (in the common case where it actually drops the index
> page's pin), just like UnlockReleaseBuffer itself following Andres'
> commit f39cb8c011062d65e146c1e9d1aae221e96d8320 from today.
>
> I completed this item quickly, after pulling from master a little
> while ago. The expectation for UnlockBufferGetLSN is that it be at
> least as good as a BufferGetLSNAtomic followed by a
> UnlockReleaseBuffer, so IMV it made sense to include this last minute
> addition in v18.
>
> * The "don't wait" patches are no longer included, since Andres
> committed them to the master branch just now. Thanks to both you
> (Andres) and Melanie for taking care of this very thorny issue!
>
> I believe I have now acted on all of Andres' feedback, with one minor
> exception: Andres disliked how we use 'if (scan->xs_heapfetch)' to
> determine whether to use the scan's batch cache (we don't want to use
> it at all if the scan is ending). I just ran out of steam, so this
> version doesn't address the problem. But I'm still tracking it as an
> open item (I don't disagree with Andres, but this problem is slightly
> awkward). Note that I *did* create a helper function here, so we at
> least avoid having 2 copies of the loop that looks for a free cache
> batch slot.
>
I am still reviewing / understanding the patch, a couple quick comments
based on my review.
1. Looks like off-by-one in the for loop in the
patch v18-0015-WIP-aio-io_uring-Use-IO-size-not-IO-queue-to-tri.patch
@@ -730,7 +711,39 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct
io_uring_sqe *sqe)
ioh->op_data.read.iov_length,
ioh->op_data.read.offset);
+ for (int i = 0; i <= ioh->op_data.read.iov_length; i++, iov++)
+ io_size += iov->iov_len;
2. NIT comment: _bt_endpoint return type changed to IndexScanBatch from
bool in the patch
v18-0005-Add-interfaces-that-enable-index-prefetching.patch.
But the places it is returning false would be good to replace with NULL.
Though false is treat as 0/NULL and no compiler errors, it improves the
readability of the code.
if (!BufferIsValid(btfirstbatch->buf))
{
/*
* Empty index. Lock the whole relation, as nothing finer to lock
* exists.
*/
PredicateLockRelation(rel, scan->xs_snapshot);
_bt_parallel_done(scan);
return false;
}
Thanks,
Satya
view thread (367+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Subject: Re: index prefetching
In-Reply-To: <CAHg+QDdsOe_7wyN_fdeKQmGkCTGkhnkSUCSNGOFWQZZTO0CdUQ@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox