MIME-Version: 1.0
References: 
 <CAH2-Wzn7vqmt=qE_hDrOx4NETkUoCbdn74G1gswMXi1APUuYrA@mail.gmail.com>
 <23f490f4-8325-408c-91a0-a6757ab2441c@vondra.me>
 <CA+hUKG+WWr4-8TYemyU=ucQsNe6bUBN_Sq3mCnBoBtxaJ9w3ug@mail.gmail.com>
 <CA+hUKGLvL408O2ss6YZQsdupryvBqjhuej_hyOYU9fobhdHkTQ@mail.gmail.com>
 <CAN55FZ0y57uEK+Ts8s3NV9Gyg=YiAV-y610XJpfS+jdCh_7f5g@mail.gmail.com>
 <c333af4f-6448-4494-aaf7-9dea3f0afb2e@vondra.me>
 <CAH2-Wzk+ba5oHf1ghC34Z8eoCvum5yHGMxHeFCbC2ibcQ+J7uw@mail.gmail.com>
 <1bebfd1b-aea5-4d41-80a6-eae64b8f9eaf@vondra.me>
 <CAH2-Wzm8AOhY83jPBrPDOO6dauoE9kDcef=b6-4TFSv6AkiNog@mail.gmail.com>
 <8ed1d326-5c6e-476e-b3fd-30d3da210546@vondra.me>
 <38b865bd-2ae9-4a94-a788-6e3dc99ccd70@vondra.me>
 <CAH2-WzmER9kc4OtmkDh+h51QV=v6Yc5BGsJikwJHtucf1C1HWw@mail.gmail.com>
 <306fc8c0-c882-4602-86f5-a106b9ace603@vondra.me>
 <CAH2-Wz=0enySZ5g0k0BLY3tHRs=wyG=7yXDYP=Abt=6GM=7XkQ@mail.gmail.com>
 <152ea782-5bd4-4435-b021-0ab2da61e63d@vondra.me>
 <CAH2-WzkX2fwtiNOX4RrBR8=XKta999NM_5+ghTUnwUBkeyfcHQ@mail.gmail.com>
 <7c2f6350-6fca-4e39-b0a8-8ac735f5d58a@vondra.me>
In-Reply-To: <7c2f6350-6fca-4e39-b0a8-8ac735f5d58a@vondra.me>
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 5 Aug 2025 17:35:08 -0400
Message-ID: 
 <CAH2-WzkPh+L2u8_4jG=NgGgzFNqW7ZZhSxGb6mJR=2YdouL1_Q@mail.gmail.com>
Subject: Re: index prefetching
To: Tomas Vondra <tomas@vondra.me>
Cc: Nazir Bilal Yavuz <byavuz81@gmail.com>,
 Thomas Munro <thomas.munro@gmail.com>,
	Andres Freund <andres@anarazel.de>, Robert Haas <robertmhaas@gmail.com>,
	Melanie Plageman <melanieplageman@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAH2-WzkPh%2BL2u8_4jG%3DNgGgzFNqW7ZZhSxGb6mJR%3D2YdouL1_Q%40mail.gmail.com>
Precedence: bulk

On Tue, Aug 5, 2025 at 4:56=E2=80=AFPM Tomas Vondra <tomas@vondra.me> wrote=
:
> Probably. It was hard to predict which values will be interesting, maybe
> we can pick some subset now. I'll start by just doing larger steps, I
> think. Maybe increase by 4x rather than 2x, that'll reduce the number of
> combinations a lot. Also, I plan to stick to fillfactor=3D20, it doesn't
> seem to have a lot of impact anyway.

I don't think that fillfactor matters all that much, either way. A low
setting provides a simple way of simulating "wide" heap tuples, but
that probably isn't going to make the crucial difference.

It's not like the TPC-C index I used in my own recent testing (which
showed that the complex patch was almost 3x faster than the simple
patch) has all that strong of a pg_stats.correlation. You can probably
come up with indexes/test cases where groups of related TIDs that each
point to the same heap block appear together, even though in general
the index tuple heap TIDs appear completely out of order. It probably
isn't even that different to a simple pgbench_accounts_pkey from a
prefetching POV, though, in spite of these rather conspicuous
differences. In time we might find that just using
pgbench_accounts_pkey directly works just as well for our purposes
(unsure of that, but seems possible).

> So what other index scan variations would you suggest to test? I can
> imagine e.g. IN () conditions with variable list length, maybe
> multi-column indexes, and/or skip scan cases. Any other ideas?

The only thing that's really interesting about IN() conditions is that
they provide an easy way to write a query that only returns a subset
of all index tuples from every leaf page read. You can get a similar
access pattern from other types of quals, but that's not quite as
intuitive.

I really don't think that IN() conditions are all that special.
They're perfectly fine as a way of getting this general access
pattern.

I like to look for and debug "behavioral inconsistencies". For
example, I have an open item in my notes (which I sent to you over IM
a short while ago) about a backwards scan that is significantly slower
than an "equivalent" forwards scan. This involves
pgbench_accounts_pkey. It's quite likely that the underlying problem
has nothing much to do with backwards scans. I suspect that the
underlying problem is a more general one, that could also be seen with
the right forwards scan test case.

In general, it might make the most sense to look for pairs of
similar-ish queries that are inconsistent in a way that doesn't make
sense intuitively, in order to understand and fix the inconsistency.
Since chances are that it's actually just some kind of performance bug
that accidentally doesn't happen in only one variant of the query.

I bet that there's at least a couple of not-that-noticeable
performance bugs, for example due to some hard to pin down issue with
prefetch distance getting out of hand. Possibly because the read
stream doesn't get to see contiguous requests for TIDs that point to
the same heap page, but does see it when things are slightly out of
order. Two different queries that have approximately the same accesses
should have approximately the same performance -- minor variations in
leaf page layout or heap page layout or scan direction shouldn't be
confounding.

> FWIW I'm not planning to keep testing simple vs complex patches. We've
> seen the complex patch can do much better in certain workloads cases,
> the fact that we can discover more such cases does not change much.
>
> I'm much more interested in benchmarking master vs. complex patch.

Great!

> > It'd also be good to just not test "sync" anymore, at some point. And
> > maybe to standardize on testing either "worker" or "io_uring" for most
> > individual tests. There's just too many tests right now.
> >
>
> Agreed.

Might also make sense to standardize on direct I/O when testing the
patch (but probably not when testing master). The fact that we can't
get any OS readahead is likely to be useful.

> > Andres recently told me that he isn't expecting to be able to simulate
> > read-ahead with direct I/O. It seems possible that read-ahead
> > eventually won't be used at all, which argues for the complex patch.
> >
>
> True, the complex patch could prefetch the leaf pages.

What I meant was that the complex patch can make up for the fact that
direct I/O presumably won't ever have an equivalent to simple
read-ahead. Just by having a very flexible prefetching implementation
(and without any special sequential access heuristics ever being
required).

> > BTW, I experimented with using READ_STREAM_USE_BATCHING (not
> > READ_STREAM_DEFAULT) in the complex patch. That's probably
> > deadlock-prone, but I suspect that it works well enough to get a good
> > sense of what is possible. What I saw (with that same TPC-C test
> > query) was that "I/O Timings" was about 10x lower, even though the
> > query runtime didn't change at all. This suggests to me that "I/O
> > Timings" is an independently interesting measure: getting it lower
> > might not visibly help when only one query runs, but it'll likely
> > still lead to more efficient use of available I/O bandwidth in the
> > aggregate (when many queries run at the same time).
> >
>
> Interesting. Does that mean we should try enabling batching in some
> cases? Or just that there's room for improvement?

I don't know what it means myself. I never got as far as even starting
to understand what it would take to make READ_STREAM_USE_BATCHING
work.

AFAIK it wouldn't be hard to make that work here at all, in which case
we should definitely use it. OTOH, maybe it's really hard. I just
don't know right now.

> Could we do the next_block callbacks in a way that make deadlocks
> impossible?
>
> I'm not that familiar with the batch mode - how would the deadlock even
> happen in index scans?

I have no idea. Maybe it's already safe. I didn't notice any problems
(but didn't look for them, beyond running my tests plus the regression
tests).

> I think the only way is to try reworking some of the index AMs to use
> the new interface. For some AMs (e.g. hash) it's going to be very
> similar to what you did with btree, because it basically works like a
> btree. For others (GiST/SP-GiST) it may be more work.

The main difficulty with GiST may be that we may be obligated to fix
existing (unfixed!) bugs that affect index-only scans. The master
branch is subtly broken, but we can't in good conscience ignore those
problems while making these kinds of changes.

> It doesn't need to be committable, just good enough to be reasonably
> certain it's possible.

That's what I have in mind, too. If we have support for a second index
AM, then we're much less likely to over-optimize for nbtree in a way
that doesn't really make sense.

> Understood, and I agree in principle. It's just that given the fuzziness
> I find it hard how it should look like.

I suspect that index AMs are much more similar for the purposes of
prefetching than they are in other ways.


--
Peter Geoghegan