MIME-Version: 1.0
References: 
 <CAH2-Wzn7vqmt=qE_hDrOx4NETkUoCbdn74G1gswMXi1APUuYrA@mail.gmail.com>
 <efac3238-6f34-41ea-a393-26cc0441b506@vondra.me>
 <32c15a30-6e25-4f6d-9191-76a19482c556@vondra.me>
 <CAH2-WzmVvU2KmKyq8sUeYXrc_roA5LfOgDE1-vdtorpk_M3DfA@mail.gmail.com>
 <ec4d8c7b-bbc7-4805-919f-8fdeb11b3d34@vondra.me>
 <64c8b824-6203-46a3-b045-5e95b796feee@vondra.me>
 <CAH2-Wzm-u6b4gDbLNP=1pkfqJbEyPyey9M-8wG0C+QOTit963Q@mail.gmail.com>
 <03dcc1a9-c5d0-4965-889c-684dc0a7580c@vondra.me>
 <aa46af80-5219-47e6-a7d0-7628106965a6@vondra.me>
 <CA+hUKG+P0RnG_c4vL6=d8ACwpmux5ZF91FO4UpX8PDu1WqEg9w@mail.gmail.com>
 <23f490f4-8325-408c-91a0-a6757ab2441c@vondra.me>
 <CA+hUKG+WWr4-8TYemyU=ucQsNe6bUBN_Sq3mCnBoBtxaJ9w3ug@mail.gmail.com>
 <CA+hUKGLvL408O2ss6YZQsdupryvBqjhuej_hyOYU9fobhdHkTQ@mail.gmail.com>
 <CAN55FZ0y57uEK+Ts8s3NV9Gyg=YiAV-y610XJpfS+jdCh_7f5g@mail.gmail.com>
 <c333af4f-6448-4494-aaf7-9dea3f0afb2e@vondra.me>
 <CAH2-Wzk+ba5oHf1ghC34Z8eoCvum5yHGMxHeFCbC2ibcQ+J7uw@mail.gmail.com>
 <1bebfd1b-aea5-4d41-80a6-eae64b8f9eaf@vondra.me>
In-Reply-To: <1bebfd1b-aea5-4d41-80a6-eae64b8f9eaf@vondra.me>
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 22 Jul 2025 17:35:26 -0400
Message-ID: 
 <CAH2-Wzm8AOhY83jPBrPDOO6dauoE9kDcef=b6-4TFSv6AkiNog@mail.gmail.com>
Subject: Re: index prefetching
To: Tomas Vondra <tomas@vondra.me>
Cc: Nazir Bilal Yavuz <byavuz81@gmail.com>,
 Thomas Munro <thomas.munro@gmail.com>,
	Andres Freund <andres@anarazel.de>, Robert Haas <robertmhaas@gmail.com>,
	Melanie Plageman <melanieplageman@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAH2-Wzm8AOhY83jPBrPDOO6dauoE9kDcef%3Db6-4TFSv6AkiNog%40mail.gmail.com>
Precedence: bulk

On Tue, Jul 22, 2025 at 4:50=E2=80=AFPM Tomas Vondra <tomas@vondra.me> wrot=
e:
> > Obviously, whatever advantage that the "complex" patch has is bound to
> > be limited to cases where index characteristics are naturally the
> > limiting factor. For example, with the pgbench_accounts_pkey table
> > there are only ever 6 distinct heap blocks on each leaf page. I bet
> > that your "linear" test more or less looks like that, too.
> >
>
> Yes. It's definitely true we could construct examples where the complex
> patch beats the simple one for this reason.

It's literally the only possible valid reason why the complex patch could w=
in!

The sole performance justification for the complex patch is that it
can prevent the heap prefetching from getting bottlenecked on factors
tied to physical index characteristics (when it's possible in
principle to avoid getting bottlenecked in that way). Unsurprisingly,
if you assume that that'll never happen, then yeah, the complex patch
has no performance advantage over the simple one.

I happen to think that that's a very unrealistic assumption. Most
standard benchmarks have indexes that almost all look fairly similar
to pgbench_accounts_pkey, from the point of view of "heap page blocks
per leaf page". There are exceptions, of course (e.g., the TPC-C order
table's primary key suffers from fragmentation).

> And I believe some of those
> examples could be quite realistic, even if not very common (like when
> very few index tuples fit on a leaf page).

I don't think cases like that matter very much at all. The only thing
that *really* matters on the index AM side is the logical/physical
correlation. Which your testing seems largely unconcerned with.

> However, I'm not sure the pgbench example with only 6 heap blocks per
> leaf is very significant. Sure, the simple patch can't prefetch TIDs
> from the following leaf, but AFAICS the complex patch won't do that
> either.

Why not?

> Not because it couldn't, but because with that many hits the
> distance will drop to ~1 (or close to it). (It'll probably prefetch a
> couple TIDs from the next leaf at the very end of the page, but I don't
> think that matters overall.)

Then why do your own test results continue to show such a big
advantage for the complex patch, over the simple patch?

> I'm not sure what prefetch distances will be sensible in queries that do
> other stuff. The queries in the benchmark do just the index scan, but if
> the query does something with the tuple (in the nodes on top), that
> shortens the required prefetch distance. Of course, simple queries will
> benefit from prefetching far ahead.

Doing *no* prefetching will usually be the right thing to do. Does
that make index prefetching pointless in general?

> Thanks. I wonder how difficult would it be to add something like this to
> pgstattuple. I mean, it shouldn't be difficult to look at leaf pages and
> count distinct blocks, right? Seems quite useful.

I agree that that would be quite useful.

> > Did you restore the distance for the "complex" patch, too? I think
> > that it might well matter there too.
> >
>
> No, I did not. I did consider it, but it seemed to me it can't really
> make a difference (for these data sets), because each leaf has ~300
> items, and the patch limits the prefetch to 64 leafs. That means it can
> prefetch ~20k TIDs ahead, and each heap page has ~20 items. So this
> should be good enough for eic=3D1000. It should never hit stream reset.

It looks like the complex patch can reset the read stream for a couple
of reasons, which I don't fully understand right now.

I'm mostly thinking of this stuff:

            /*
             * If we advanced to the next batch, release the batch we no
             * longer need. The positions is the "read" position, and we ca=
n
             * compare it to firstBatch.
             */
            if (pos->batch !=3D scan->batchState->firstBatch)
            {
                batch =3D INDEX_SCAN_BATCH(scan, scan->batchState->firstBat=
ch);
                Assert(batch !=3D NULL);

                /*
                 * XXX When advancing readPos, the streamPos may get behind=
 as
                 * we're only advancing it when actually requesting heap
                 * blocks. But we may not do that often enough - e.g. IOS m=
ay
                 * not need to access all-visible heap blocks, so the
                 * read_next callback does not get invoked for a long time.
                 * It's possible the stream gets so mucu behind the positio=
n
                 * gets invalid, as we already removed the batch. But that
                 * means we don't need any heap blocks until the current re=
ad
                 * position - if we did, we would not be in this situation =
(or
                 * it's a sign of a bug, as those two places are expected t=
o
                 * be in sync). So if the streamPos still points at the bat=
ch
                 * we're about to free, just reset the position - we'll set=
 it
                 * to readPos in the read_next callback later.
                 *
                 * XXX This can happen after the queue gets full, we "pause=
"
                 * the stream, and then reset it to continue. But I think t=
hat
                 * just increases the probability of hitting the issue, it'=
s
                 * just more chance to to not advance the streamPos, which
                 * depends on when we try to fetch the first heap block aft=
er
                 * calling read_stream_reset().
                 */
                if (scan->batchState->streamPos.batch =3D=3D
scan->batchState->firstBatch)
                    index_batch_pos_reset(scan, &scan->batchState->streamPo=
s);

> > Isn't the obvious explanation that the complex patch benefits from
> > being able to prefetch without being limited by index
> > characteristics/leaf page boundaries, while the simple patch doesn't?
> >
>
> That's a valid interpretation, yes. Although the benefit comes mostly

The benefit comes mostly from....?

> Yes, there's some similarity. Attached is the script I use to create the
> tables and load the data.

Another issue with the testing that biases it against the complex
patch: heap fill factor is set to only 25 (but you use the default
index fill-factor).

> The "linear" is a table with a simple sequence of values (0 to 100k).
> More or less - the value is a floating point, and there are 10M rows.
> But you get the idea.
>
> The "linear_X" variants mean the value has a noise of X% of the range.
> So with "linear_1" you get the "linear" value, and then random(0,1000),
> with normal distribution.

I don't get why this is helpful to test, except perhaps as a general smoke =
test.

If I zoom into any given "linear_1" leaf page, I see TIDs that appear
in an order that isn't technically uniformly random order, but is
fairly close to it. At least in a practical sense. At least for the
purposes of prefetching.

For example:

pg@regression:5432 [104789]=3D# select
  itemoffset,
  htid
from
  bt_page_items('linear_1_a_idx', 4);
=E2=94=8C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=AC=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=90
=E2=94=82 itemoffset =E2=94=82   htid    =E2=94=82
=E2=94=9C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=BC=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=A4
=E2=94=82          1 =E2=94=82 =E2=88=85         =E2=94=82
=E2=94=82          2 =E2=94=82 (10,18)   =E2=94=82
=E2=94=82          3 =E2=94=82 (463,9)   =E2=94=82
=E2=94=82          4 =E2=94=82 (66,8)    =E2=94=82
=E2=94=82          5 =E2=94=82 (79,9)    =E2=94=82
=E2=94=82          6 =E2=94=82 (594,7)   =E2=94=82
=E2=94=82          7 =E2=94=82 (289,13)  =E2=94=82
=E2=94=82          8 =E2=94=82 (568,2)   =E2=94=82
=E2=94=82          9 =E2=94=82 (237,2)   =E2=94=82
=E2=94=82         10 =E2=94=82 (156,10)  =E2=94=82
=E2=94=82         11 =E2=94=82 (432,9)   =E2=94=82
=E2=94=82         12 =E2=94=82 (372,17)  =E2=94=82
=E2=94=82         13 =E2=94=82 (554,6)   =E2=94=82
=E2=94=82         14 =E2=94=82 (1698,11) =E2=94=82
=E2=94=82         15 =E2=94=82 (389,6)   =E2=94=82
*** SNIP ***
=E2=94=82        288 =E2=94=82 (1264,5)  =E2=94=82
=E2=94=82        289 =E2=94=82 (738,16)  =E2=94=82
=E2=94=82        290 =E2=94=82 (1143,3)  =E2=94=82
=E2=94=82        291 =E2=94=82 (400,1)   =E2=94=82
=E2=94=82        292 =E2=94=82 (1157,10) =E2=94=82
=E2=94=82        293 =E2=94=82 (266,2)   =E2=94=82
=E2=94=82        294 =E2=94=82 (502,9)   =E2=94=82
=E2=94=82        295 =E2=94=82 (85,15)   =E2=94=82
=E2=94=82        296 =E2=94=82 (282,2)   =E2=94=82
=E2=94=82        297 =E2=94=82 (453,5)   =E2=94=82
=E2=94=82        298 =E2=94=82 (396,6)   =E2=94=82
=E2=94=82        299 =E2=94=82 (267,18)  =E2=94=82
=E2=94=82        300 =E2=94=82 (733,15)  =E2=94=82
=E2=94=82        301 =E2=94=82 (108,8)   =E2=94=82
=E2=94=82        302 =E2=94=82 (356,16)  =E2=94=82
=E2=94=82        303 =E2=94=82 (235,10)  =E2=94=82
=E2=94=82        304 =E2=94=82 (812,18)  =E2=94=82
=E2=94=82        305 =E2=94=82 (675,1)   =E2=94=82
=E2=94=82        306 =E2=94=82 (258,13)  =E2=94=82
=E2=94=82        307 =E2=94=82 (1187,9)  =E2=94=82
=E2=94=82        308 =E2=94=82 (185,2)   =E2=94=82
=E2=94=82        309 =E2=94=82 (179,2)   =E2=94=82
=E2=94=82        310 =E2=94=82 (951,2)   =E2=94=82
=E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=B4=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=98
(310 rows)

There's actually 55,556 heap blocks in total in the underlying table.
So clearly there is some correlation here. Just not enough to ever
matter very much to prefetching. Again, the sole test case that has
that quality to it is the "linear" test case.

--=20
Peter Geoghegan