Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w6jRT-004b2S-2Z for pgsql-hackers@arkaria.postgresql.org; Sun, 29 Mar 2026 06:16:11 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w6jRS-00FoKx-0Y for pgsql-hackers@arkaria.postgresql.org; Sun, 29 Mar 2026 06:16:10 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w6jRR-00FoKj-22 for pgsql-hackers@lists.postgresql.org; Sun, 29 Mar 2026 06:16:10 +0000 Received: from mail-vs1-xe32.google.com ([2607:f8b0:4864:20::e32]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w6jRP-00000001bUU-1Iqk for pgsql-hackers@lists.postgresql.org; Sun, 29 Mar 2026 06:16:08 +0000 Received: by mail-vs1-xe32.google.com with SMTP id ada2fe7eead31-5ffc8987050so2911771137.0 for ; Sat, 28 Mar 2026 23:16:07 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1774764966; cv=none; d=google.com; s=arc-20240605; b=QV9x6uxnV1P+BuCITZsb3ba7GiQnkH/0Vwevz96qYuovWQ5oQREsnNQaGB11zgehYj F3/fEldDWueuun2DFpmQysJ6N7Q1PjwjN6tnAMfztnn4GKy9tC0as/7bxCvoW+TysKJW XaETYsnvvB18NhScGi9pXFjmqX/g8R7eb7RbGlYcfVBCvpQ7SE+x95bR0i1LYlUdIftl DN5im34a/6wcEMVJp+ixS69UiT6ZwPcgcevpjOv++fqLi4lX9lZ3NLts4L+u8mN2X6dP Qs+tcUL2GmwPiSA7mJjFL9wQvBYusmcL7XK6F1TnLed253/fwfJIDXi3QZFgN7A5hI15 GlrQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=2O2hvybkue0zDCyr3tfT7shL6v/OU4dyyjkbDHz/l3w=; fh=jcmGwl+DoUMQoscFenHhV+oAo4vDeQ0UBkJmgrG7CO4=; b=kIDRctQhEqGrFBIjtGouVYRdx869KT2h7b2oZ382JZ2sZ9WnWuu86dV6RR6QvF51Pg WCBcfGdm4NFN1dyQqjhzIx4X30RjPgfcSl+roF85hGjHsd/IU1KCWqmmibEfN/tUQeuF nkKrMupJHxyVU39yJPZ+iUbLRF+mBWncxjnoPMiNQ0d0BjVkLpkZVW9M7Dv7bJCSUHeL 0HQUjzoX7GQeC60mV5BAtxa1aQnQ79K4h1SFo5eyU6F6f4STYvgjnvpJW3TeEHagBS9U j6F17E86kIOsUzXqMW9kodQnZpaJgj9CH9KWn0xyMh4dBjmejLoh35N6P0VMeDaTOyuY T4GA==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1774764966; x=1775369766; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=2O2hvybkue0zDCyr3tfT7shL6v/OU4dyyjkbDHz/l3w=; b=FXvqF8aqiPPLjm51nYvhefm+wQRpUStEoRVdEbSVQTqCt/HLKeNOZDn2ggy+64iD9Q GxpzrNkLYjoUMpsuL1W8gHW+RMqb4u5pqZFkHcU9vpH+aN7dGAtXBzCe19ctvuqBxY6O 06xxYN2qmYoIom471xxZZLAdThg2V+gvpjvdPPfxswwf+eMt1LY1zIeB/hBhFPKQLhJh j3ndImmTFGUkoIuzIuYOo+dbQrVgPnZ/Do7zM/RPDwfTadTqBMiPVU9aWYgqi59paPnP j5SxHm8wTPJTuIWWvonm+YX8I2xRxEPWRFLBNSNhTKoe2oJdmCea9k0CEftqqBGYPrFp llAA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1774764966; x=1775369766; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=2O2hvybkue0zDCyr3tfT7shL6v/OU4dyyjkbDHz/l3w=; b=kV4yHd2Fnu+7PR93lfPQ+MfHbJZBpefrbFPHE5m6DSok0hplW1/ayjnX4DvzG8Z20L L2EmM3qUZqV+cRczlYKRy+MRxZUMTWDRFOzetM0ld3efgavolporFQtdfYrT9LceowNy 2KHWuOZlkGLxIxK3/nJfpMdk5UpcQbP1IFHONhX8aKL7C8qnijpfuA6LmymTUNAm/TNg HQ+ElRfgRA7oJX+1Q5RUU1MNzh5rpXBHtzn1i501dvMl3rrr6u4XfY5nOiPUJSUA+ek1 JEE0lNOyB6ODZ2hbnwvOc1hyuMb3JjEosI3FFyZ7aAxvA5Ncmb61hfx5edRY5GJiSvFk bsjg== X-Forwarded-Encrypted: i=1; AJvYcCWqK4fYsQomnf0CdRjXzdQb2Sj6Um1VcJs6OzyKyrFuNgchQZR7yEsJWTt+1qHFpip+ky8BaMT23/m2rJE5@lists.postgresql.org X-Gm-Message-State: AOJu0Yz0ipXd1oQQBZNTa9ldc6/4tyNZnJkqGvm1IfUXeqBqv4JFrLCe TLvswW/YQyMUD67141viAhXyDpaH+jfh62ClJj4Hp2w+7yAgM3zClhGuwaPDI1BmHVjq+MQ0Jf2 R0ERdfwBCXtBFhhznSiXf1JXKiz0tkXU= X-Gm-Gg: ATEYQzxxu6X43cKQflL5Q5E5bK+JDRSUmShnIOfkqw5Dxk29uFNHvZRovKC85y7fit6 wBB0g4nxaq04iPmr0j4GeQ6x3vsG2rxj73xpc79r37aJoXpAgWzRAxCPs66MkmLU7f4eN7r3NeW JagVj6a86B4SzlUJv9a2VqdRk65yxOZINs2hNxG/kU9naWuKHhVZMNw0cIS21wKDI4skzxRji2d 6sy5QmIZqFP2/9VjYSjZ1hxcQG9/7V2HI6uCi4GtNkun9Q3AdLTWQRRLyJNTSVC0Z1oFCfnRdf4 nHd6la4= X-Received: by 2002:a05:6102:6c3:b0:602:b868:e597 with SMTP id ada2fe7eead31-604f9278b7emr3442357137.17.1774764966450; Sat, 28 Mar 2026 23:16:06 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: SATYANARAYANA NARLAPURAM Date: Sat, 28 Mar 2026 23:15:55 -0700 X-Gm-Features: AQROBzBGv5nKKodZlbLKw3vMXZAh39xT_ORnhwDOLkr8TVL7hAF-nvivs60FYR0 Message-ID: Subject: Re: index prefetching To: Peter Geoghegan Cc: Andres Freund , Tomas Vondra , Alexandre Felipe , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Content-Type: multipart/alternative; boundary="000000000000629c70064e23ab29" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000629c70064e23ab29 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Peter, On Fri, Mar 27, 2026 at 5:36=E2=80=AFPM Peter Geoghegan wrote: > On Sun, Mar 22, 2026 at 9:14=E2=80=AFPM Peter Geoghegan wrot= e: > > V17 is attached. This addresses most of your feedback, but defers > > dealing with the trickier parts for now. > > Attached is v18, which addresses the tricky issues I skipped in v17. > > Highlights: > > * New commit (the first) that adds a heapam_indexscan.c file, and > moves all of the relevant existing heapam_handler.c functions over to > this new file. This commit is strictly mechanical, and has no > functional impact. > > * Another new commit (the second) adds the new slot-based index scan > interface, and pushes the VM lookups that currently take place in > nodeIndexonlyScan.c into heapam. This includes the instrumentation > changes that were previously in their own small patch. It also > includes the VISITED_PAGES_LIMIT stuff -- there's no longer any commit > that breaks VISITED_PAGES_LIMIT, nor is there a later commit that > subsequently restores the functionality. > > * We now extract one key piece of infrastructure that other table AMs > can now reuse to manage their own index scan batches: a new inline > function called tableam_util_fetch_next_batch now appears as an > indexbatch.h inline function. > > This function (which appeared under a different name in v17) manages > deciding whether to call amgetbatch (performing the amgetbatch call > when needed) or to simply return the next batch in line (because such > a batch already exists/is already loaded in the ring buffer from > earlier). It also detects and handles cross-batch changes in scan > direction, and sets/reads the fields that track when a batch is > definitely the last one in the given scan direction. > > I tried to go further by also placing the bulk of the code in > heapam_index_getnext_scanbatch_pos in indexbatch.h, but ultimately > concluded that it wasn't worth it. There isn't much code in > heapam_index_getnext_scanbatch_pos to generalize. Some of that code > coordinates quite directly with the heapam read stream callback. > > * Added an "isGuarded" field to IndexScanBatchData. That way we can > assert that any batch loaded within the read stream callback is > already unguarded/has no batch index page buffer pin still held. The > field isn't just for assertions, though. We also rely on it in one or > two places during release builds, which seemed clearer to me. For > example, we no longer require that amunguardbatch routines be > idempotent, which seems like a minor improvement. > > * Other work is broken out into its own commit, to better highlight > issues that aren't strictly related to the amgetbatch changes. > Specifically, a new commit adds the "don't unpin on rescan" behavior. > These changes, which first appeared in the main amgetbatch commit, aim > to minimize regressions with cached workloads (this was discussed > briefly upthread, plus Andres and I discussed it over IM some weeks > back). > > * Replaced what was previously an assert with an equivalent > pg_assume(), within heapam_index_getnext_scanbatch_pos -- which is a > very hot function. We now "pg_assume(batchringbuf->headBatch =3D=3D > scanPos->batch)", allowing the compiler to exploit this existing > invariant. This seems to help the compiler generate better code, > presumably by allowing more efficient register allocation. I haven't > looked at the disassembly just yet, but it helps with certain > microbenchmarks enough to matter. > > * The "Add UnlockBufferGetLSN utility function" commit now uses an > atomic operation (in the common case where it actually drops the index > page's pin), just like UnlockReleaseBuffer itself following Andres' > commit f39cb8c011062d65e146c1e9d1aae221e96d8320 from today. > > I completed this item quickly, after pulling from master a little > while ago. The expectation for UnlockBufferGetLSN is that it be at > least as good as a BufferGetLSNAtomic followed by a > UnlockReleaseBuffer, so IMV it made sense to include this last minute > addition in v18. > > * The "don't wait" patches are no longer included, since Andres > committed them to the master branch just now. Thanks to both you > (Andres) and Melanie for taking care of this very thorny issue! > > I believe I have now acted on all of Andres' feedback, with one minor > exception: Andres disliked how we use 'if (scan->xs_heapfetch)' to > determine whether to use the scan's batch cache (we don't want to use > it at all if the scan is ending). I just ran out of steam, so this > version doesn't address the problem. But I'm still tracking it as an > open item (I don't disagree with Andres, but this problem is slightly > awkward). Note that I *did* create a helper function here, so we at > least avoid having 2 copies of the loop that looks for a free cache > batch slot. > I am still reviewing / understanding the patch, a couple quick comments based on my review. 1. Looks like off-by-one in the for loop in the patch v18-0015-WIP-aio-io_uring-Use-IO-size-not-IO-queue-to-tri.patch @@ -730,7 +711,39 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe) ioh->op_data.read.iov_length, ioh->op_data.read.offset); + for (int i =3D 0; i <=3D ioh->op_data.read.iov_length; i++, iov++) + io_size +=3D iov->iov_len; 2. NIT comment: _bt_endpoint return type changed to IndexScanBatch from bool in the patch v18-0005-Add-interfaces-that-enable-index-prefetching.patch. But the places it is returning false would be good to replace with NULL. Though false is treat as 0/NULL and no compiler errors, it improves the readability of the code. if (!BufferIsValid(btfirstbatch->buf)) { /* * Empty index. Lock the whole relation, as nothing finer to lock * exists. */ PredicateLockRelation(rel, scan->xs_snapshot); _bt_parallel_done(scan); return false; } Thanks, Satya --000000000000629c70064e23ab29 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Peter,



On Fri, Mar 27, 2026 at 5:36=E2=80=AFPM Peter Geoghegan= <pg@bowt.ie> wrote:
On Sun, Mar 22, 2026 at 9:14= =E2=80=AFPM Peter Geoghegan <pg@bowt.ie> wrote:
> V17 is attached. This addresses most of your feedback, but defers
> dealing with the trickier parts for now.

Attached is v18, which addresses the tricky issues I skipped in v17.

Highlights:

* New commit (the first) that adds a heapam_indexscan.c file, and
moves all of the relevant existing heapam_handler.c functions over to
this new file. This commit is strictly mechanical, and has no
functional impact.

* Another new commit (the second) adds the new slot-based index scan
interface, and pushes the VM lookups that currently take place in
nodeIndexonlyScan.c into heapam. This includes the instrumentation
changes that were previously in their own small patch. It also
includes the VISITED_PAGES_LIMIT stuff -- there's no longer any commit<= br> that breaks VISITED_PAGES_LIMIT, nor is there a later commit that
subsequently restores the functionality.

* We now extract one key piece of infrastructure that other table AMs
can now reuse to manage their own index scan batches: a new inline
function called tableam_util_fetch_next_batch now appears as an
indexbatch.h inline function.

This function (which appeared under a different name in v17) manages
deciding whether to call amgetbatch (performing the amgetbatch call
when needed) or to simply return the next batch in line (because such
a batch already exists/is already loaded in the ring buffer from
earlier). It also detects and handles cross-batch changes in scan
direction, and sets/reads the fields that track when a batch is
definitely the last one in the given scan direction.

I tried to go further by also placing the bulk of the code in
heapam_index_getnext_scanbatch_pos in indexbatch.h, but ultimately
concluded that it wasn't worth it. There isn't much code in
heapam_index_getnext_scanbatch_pos to generalize. Some of that code
coordinates quite directly with the heapam read stream callback.

* Added an "isGuarded" field to IndexScanBatchData. That way we c= an
assert that any batch loaded within the read stream callback is
already unguarded/has no batch index page buffer pin still held. The
field isn't just for assertions, though. We also rely on it in one or two places during release builds, which seemed clearer to me. For
example, we no longer require that amunguardbatch routines be
idempotent, which seems like a minor improvement.

* Other work is broken out into its own commit, to better highlight
issues that aren't strictly related to the amgetbatch changes.
Specifically, a new commit adds the "don't unpin on rescan" b= ehavior.
These changes, which first appeared in the main amgetbatch commit, aim
to minimize regressions with cached workloads (this was discussed
briefly upthread, plus Andres and I discussed it over IM some weeks
back).

* Replaced what was previously an assert with an equivalent
pg_assume(), within heapam_index_getnext_scanbatch_pos -- which is a
very hot function.=C2=A0 We now "pg_assume(batchringbuf->headBatch = =3D=3D
scanPos->batch)", allowing the compiler to exploit this existing invariant. This seems to help the compiler generate better code,
presumably by allowing more efficient register allocation. I haven't looked at the disassembly just yet, but it helps with certain
microbenchmarks enough to matter.

* The "Add UnlockBufferGetLSN utility function" commit now uses a= n
atomic operation (in the common case where it actually drops the index
page's pin), just like UnlockReleaseBuffer itself following Andres'=
commit f39cb8c011062d65e146c1e9d1aae221e96d8320 from today.

I completed this item quickly, after pulling from master a little
while ago. The expectation for UnlockBufferGetLSN is that it be at
least as good as a BufferGetLSNAtomic followed by a
UnlockReleaseBuffer, so IMV it made sense to include this last minute
addition in v18.

* The "don't wait" patches are no longer included, since Andr= es
committed them to the master branch just now. Thanks to both you
(Andres) and Melanie for taking care of this very thorny issue!

I believe I have now acted on all of Andres' feedback, with one minor exception: Andres disliked how we use 'if (scan->xs_heapfetch)' = to
determine whether to use the scan's batch cache (we don't want to u= se
it at all if the scan is ending). I just ran out of steam, so this
version doesn't address the problem. But I'm still tracking it as a= n
open item (I don't disagree with Andres, but this problem is slightly awkward). Note that I *did* create a helper function here, so we at
least avoid having 2 copies of the loop that looks for a free cache
batch slot.

I am still reviewing / unde= rstanding the patch, a couple quick comments based on my review.
=
1. Looks like off-by-one in the for loop in the patch=C2=A0v= 18-0015-WIP-aio-io_uring-Use-IO-size-not-IO-queue-to-tri.patch
@@ -730,7 +711,39 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struc= t io_uring_sqe *sqe)
=C2=A0 ioh->op_data.read.iov_length,
= =C2=A0 ioh->op_data.read.offset);
=C2=A0
+ for (int i = =3D 0; i <=3D ioh->op_data.read.iov_length; i++, iov++)
+ = io_size +=3D iov->iov_len;=C2=A0

2. NIT commen= t: _bt_endpoint return type changed to IndexScanBatch from bool in the patc= h v18-0005-Add-interfaces-that-enable-index-prefetching.patch.
Bu= t the places it is returning false would be=C2=A0 good to replace with NULL= . Though false is treat as 0/NULL and no compiler errors, it improves the r= eadability of the code.

=C2=A0if (!BufferIsValid(b= tfirstbatch->buf))
{
/*
* Empty index. Lock the whole rel= ation, as nothing finer to lock
* exists.
*/
PredicateLock= Relation(rel, scan->xs_snapshot);
_bt_parallel_done(scan);
ret= urn false;
}

Thanks,
Satya


--000000000000629c70064e23ab29--