MIME-Version: 1.0
References: 
 <CAH2-Wzn7vqmt=qE_hDrOx4NETkUoCbdn74G1gswMXi1APUuYrA@mail.gmail.com>
 <b8ab7410-0b96-495f-88e6-6b41db39d290@vondra.me>
 <CAH2-Wz==zkbOumyX-M4quqbX9GCLcjW_zXmdsaK37q-55rj_fQ@mail.gmail.com>
 <dc4c7ace-f3c4-4325-a63c-b1f9aa5541ba@vondra.me>
 <CAH2-Wz=G_HKYZ34-UkBMQjk7SCNtu9O5V4t=A0u82=WT-rtBuw@mail.gmail.com>
 <CA+TgmoaEvBnx2npWycGt1ChTe8m800wMiioUCspaSb0qzc3=Kg@mail.gmail.com>
 <CAH2-WznFwgU3AddTqnvJABX5xo-9upG6NiX+2s0eaFhFj6tRAg@mail.gmail.com>
 <CA+Tgmobav+-oR9-jJUGbHj3j7bhwPpz7qVkfr_9zUSF-kens9A@mail.gmail.com>
 <CAH2-WzkqnVGLEQ31W1vm8T_uzy-ma-6A8QL-C56=0QUqs12b=Q@mail.gmail.com>
 <CAH2-WznmrgwFShyKuKjF2v7M_Eid6VzKd+SRPjh4-y68T6uCDw@mail.gmail.com>
 <esftck6ayqkkdtzijd736oazhve577sp7hthnwouyg2stlwlqj@rmhohbhl7tuz>
 <57d0e292-73d5-4ab9-9855-110ee9cbd90a@vondra.me>
 <efac3238-6f34-41ea-a393-26cc0441b506@vondra.me>
 <32c15a30-6e25-4f6d-9191-76a19482c556@vondra.me>
In-Reply-To: <32c15a30-6e25-4f6d-9191-76a19482c556@vondra.me>
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 12 Jul 2025 19:50:48 -0400
Message-ID: 
 <CAH2-WzmVvU2KmKyq8sUeYXrc_roA5LfOgDE1-vdtorpk_M3DfA@mail.gmail.com>
Subject: Re: index prefetching
To: Tomas Vondra <tomas@vondra.me>
Cc: Andres Freund <andres@anarazel.de>, Robert Haas <robertmhaas@gmail.com>,
	Melanie Plageman <melanieplageman@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
	Thomas Munro <thomas.munro@gmail.com>,
 Konstantin Knizhnik <knizhnik@garret.ru>,
	Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAH2-WzmVvU2KmKyq8sUeYXrc_roA5LfOgDE1-vdtorpk_M3DfA%40mail.gmail.com>
Precedence: bulk

On Thu, May 1, 2025 at 7:02=E2=80=AFPM Tomas Vondra <tomas@vondra.me> wrote=
:
> There's two "fix" patches trying to make this work - it does not crash,
> and almost all the "incorrect" query results are actually stats about
> buffer hits etc. And that is expected to change with prefetching, not a
> bug. But then there are a bunch of explains where the number of index
> scans changed, e.g. like
>
> -         Index Searches: 5
> +         Index Searches: 4
>
> And that is almost certainly a bug.
>
> I haven't figured this out yet, and I feel a bit lost again :-(

For the benefit of other people reading this thread: I sent Tomas a
revised version of this "complex" patch this week, fixing all these
bugs. It only took me a few hours, and I regret not doing that work
sooner.

I also cleaned up nbtree aspects of the "complex" patch considerably.
The nbtree footprint was massively reduced:

17 files changed, 422 insertions(+), 685 deletions(-)

So there's a net negative nbtree code footprint. We're effectively
just moving things out of nbtree that are already completely
index-AM-generic. I think that the amount of code that can be removed
from nbtree (and other AMs that currently use amgettuple) will be even
higher if we go this way.

> The one real limitation of the simpler approach is that prefetching is
> limited to a single leaf page - we can't prefetch from the next one,
> until the scan advances to it. But based on experiments comparing this
> simpler and the "complex" approach, I don't think that really matters
> that much. I haven't seen any difference for regular queries.

Did you model/benchmark it?

> The one case where I think it might matter is queries with array keys,
> where each array key matches a single tuple on a different leaf page.
> The complex patch might prefetch tuples for later array values, while
> the simpler patch won't be able to do that. If an array key matches
> multiple tuples, the simple patch can prefetch those just fine, of
> course. I don't know which case is more likely.

We discussed this in Montreal, but I'd like to respond to this point
again on list:

I don't think that array keys are in any way relevant to the design of
this patch. Nothing I've said about this project has anything to do
with array keys, except when I was concerned about specific bugs in
the patch. (Bugs that I've now fixed in a way that is wholly confined
to nbtree.)

The overarching goal of my work on nbtree array scans was to make them
work just like other scans to the maximum extent possible. Array scans
"where each array key matches a single tuple on a different leaf page"
are virtually identical to any other scan that'll return only one or
two tuples from each neighboring page. You could see a similar pattern
with literally any kind of key.

Again, what I'm concerned about is coming up with a design that gives
scans maximum freedom to reorder work (not necessarily in the first
committed version), so that we can keep the read stream busy by giving
it sufficiently many heap pages to read: a truly adaptive design, that
weighs all relevant costs. Sometimes that'll necessitate eagerly
reading leaf pages. There is nothing fundamentally complicated about
that idea. Nothing in index AMs cares about how or when heap accesses
take place.

Again, it just *makes sense* to centralize the code that controls the
progress of ordered/amgettuple scans. Every affected index AM is
already doing virtually the same thing as each other. They're all
following the rules around index locking/pinning for amgettuple [1].
Individual index AMs are *already* required to read leaf pages a
certain way, in a certain order *relative to the heap accesses*. All
for the benefit of scan correctness (to avoid breaking things in a way
that relates to heapam implementation details).

Why wouldn't we want to relieve all AMs of that responsibility?
Leaving it up to index AMs has resulted in subtle bugs [2][3], and
AFAICT has no redeeming quality. If affected index AMs were *forced*
to do *exactly* the same thing as each other (not just *oblidged* to
do *almost* the same thing), it would make life easier for everybody.

[1] https://www.postgresql.org/docs/current/index-locking.html
[2] https://commitfest.postgresql.org/patch/5721/
[3] https://commitfest.postgresql.org/patch/5542/
--=20
Peter Geoghegan