MIME-Version: 1.0
References: <vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx@mof5q6li3xzv>
 <CAH2-Wzn1j2a0p3OqmqrV6zADtWA_QpG82U6F9yCYG1Uschm_fA@mail.gmail.com>
 <CAH2-WzmCH+N2-H2oGSQcbn2fArbk7GXyD6rQN6kn5P=FX9R-_g@mail.gmail.com>
 <CAH2-WzkyG01682zwqyUTwV=Zq+M_qGgi1NbXwp1H-piRSfJsgQ@mail.gmail.com>
 <CAH2-Wz=HJc+QV2AZ9mUY43aKL+n+a1JQ-7OGE=MOkqSAtoKJug@mail.gmail.com>
 <t6mtqbv2mbfhjni4bvwdgoecppjmxvbyfwl6utovzv76xc2672@k3o5ryevaeqv>
 <bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh>
 <CAH2-Wz=9Wc7T6xMxynHNu6majG_Q+=1v_OXCyYC-PbvagsaTrQ@mail.gmail.com>
 <tshb2mmworpxvtmhehr45mhdvvp7wsceqbf5iycv7tisn73nq4@2ewrw6equ3cj>
 <CAH2-Wz=r13YutKUoz+dt8hJQGQzrqO2U6A3XY=zScDd=qP+odA@mail.gmail.com>
 <62qc7j3mvsyz6ucd7xh7pv7w3u7rhztevsmrzsig7fyzv6yvol@uoyjq4eelcnz>
In-Reply-To: <62qc7j3mvsyz6ucd7xh7pv7w3u7rhztevsmrzsig7fyzv6yvol@uoyjq4eelcnz>
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Mar 2026 18:02:22 -0400
Message-ID: 
 <CAH2-Wzk25zLGP8NGUP+RwGRL_Gq3fA254XCCuLKR=uEt9P7HBg@mail.gmail.com>
Subject: Re: index prefetching
To: Andres Freund <andres@anarazel.de>
Cc: Tomas Vondra <tomas@vondra.me>,
 Alexandre Felipe <o.alexandre.felipe@gmail.com>,
	Thomas Munro <thomas.munro@gmail.com>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>,
	Robert Haas <robertmhaas@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAH2-Wzk25zLGP8NGUP%2BRwGRL_Gq3fA254XCCuLKR%3DuEt9P7HBg%40mail.gmail.com>
Precedence: bulk

On Thu, Mar 26, 2026 at 5:47 PM Andres Freund <andres@anarazel.de> wrote:
> > I must admit I'm unsure how to evaluate the maximum number of batches.
> > It can make sense to pursue diminishing returns. But up to what point,
> > and according to what principle?
>
> I think the theoretical amount of required IO concurrency can be calculated
> based on the storage latency and IOPS.  IIRC it is
>
> iops_qd1 = (1000 / latency_ms)
> queue_depth = IOPS / iops_qd1
> queue_depth = IOPS / (1000 / latency_ms)

> So, to be able to fully utilize current hardware with one query, we need to be
> able to reach queue depth in the low hundreds, in the case of striped cheap
> cloud SSDs. That's when a backend *just* does IO, nothing else.

That sounds like a useful starting point. But empirically, as far as I
can tell, the relationship between query latency and how close you are
to fully saturating I/O is not linear, or anything like it. Maybe it's
an S-curve?
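
For concreteness, the quoted formula can be worked through in a few lines.
This is only a sketch: the device names, IOPS figures, and latencies below
are assumed round numbers for illustration, not measurements from this
thread, and required_queue_depth is a hypothetical helper name.

```python
def required_queue_depth(iops: float, latency_ms: float) -> float:
    """Queue depth needed to sustain `iops` at the given per-I/O latency."""
    iops_qd1 = 1000.0 / latency_ms  # I/Os per second at queue depth 1
    return iops / iops_qd1


# Assumed, illustrative device profiles (not measured).
profiles = [
    ("local NVMe (assumed)", 500_000, 0.1),
    ("striped cloud SSD (assumed)", 200_000, 1.0),
]

for name, iops, latency_ms in profiles:
    depth = required_queue_depth(iops, latency_ms)
    print(f"{name}: queue depth ~{depth:.0f}")
```

With the assumed cloud profile the result lands in the "low hundreds"
range mentioned in the quoted text. The formula gives only the depth
needed at saturation; it says nothing about how query latency behaves
below that point.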

It seems that a fully I/O-bound query isn't remotely close to twice as
fast as the same query when it is restricted to using only half the
number of batches (half the number required to reach that saturation
point). OTOH, using more batches than strictly necessary usually isn't
much of a problem. So I don't think that we can rely on a precise
formula, even if we're willing to make fixed assumptions about data
layout (which we're not).

If there were a good enough reason for index prefetching to use an
unbounded number of batches, we could surely figure out a way to
support that requirement. It'd be messy, and relatively hard to test.
And I'd worry a bit about there being zero backstop for index-only
scans. But it can be done. If we did things that way (which doesn't
seem like a good idea right now), we wouldn't have to model I/O
saturation at all. Which tbh makes me wonder if that kind of modelling
has much practical use either way.

> Something like an index scan, will have its own limit to how much it can
> process in a second. If we can only do 100k IOPS while searching the index,
> fetching the heap tuples and processing them, we don't need to support the
> queue depths to support doing 1M IOPS within one backend.
>
> That's something that can presumably be quite easily experimentally
> ballparked:
>
> A fully cached, completely uncorrelated, index scan seems to be able to fetch
> about 1.5M page fetches on my ~6 YO server CPU with turbo boost disabled, when
> never looking at the results (i.e. using OFFSET) or immediately filtering away
> the row.

> So I'd guess the limit on newer CPUs in SKUs optimized for clock
> speed and boost enabled, is north of 2.5M pages/sec, higher than I'd have
> thought!  That's without doing any IO though.

We've done good work on nbtree's ability to avoid provably unnecessary
work in recent years; see _bt_set_startikey. What that means is that
the majority of the index scans used to test the patch probably have
_bt_readpage calls that spend most of their time simply collecting all
of the TIDs from the leaf page, without any scan key overhead (barring
an initial precheck within _bt_set_startikey once per _bt_readpage, to
prove that the optimization is safe).

With large posting list tuples, we'll do even less work, since they're
just an array of ItemPointerData.

> With correlated scans the limit is much lower, maybe 150k, just because
> there's so many more tuples per page (and processing them trivially becomes
> the bottleneck).
>
>
> So, to support actually utilizing the full IO capability, we need to allow
> for enough batches to keep a few hundred IOs in flight at the very extreme
> end.  I'd assume you have a much better idea to how many batches that
> translates to?

I can give you a range. The problem is that it's a range starting from
"absurdly optimistic" through to "absurdly pessimistic". Neither
extreme is very unlikely (there's a wide natural variation in
workloads), and it's hard to argue usefully about what will be true in
most cases. In short, I can tell you plenty, but nothing that seems
particularly useful for determining how many batches we should cap the
ring buffer at.

I don't think there's anything fundamentally objectionable about our
deriving the current maximum of 64 through trial and error. I assume
that INDEX_SCAN_MAX_BATCHES must be constrained to a low-ish power of
two so that the ring buffer maintenance routines avoid DIV
instructions (from the use of a modulo operator that the compiler
cannot optimize into a bitwise AND). There just aren't that many
integers that even qualify as candidates!
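
The power-of-two constraint can be illustrated with a small sketch. The
value 64 is the current maximum mentioned above; ring_slot and
is_power_of_two are hypothetical helper names, not functions from the
patch, and Python is used here only to show the arithmetic.

```python
INDEX_SCAN_MAX_BATCHES = 64  # current maximum per the text above


def is_power_of_two(n: int) -> bool:
    """True when n has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def ring_slot(batch_no: int, size: int = INDEX_SCAN_MAX_BATCHES) -> int:
    """Hypothetical helper: map an ever-increasing batch number to a slot."""
    if not is_power_of_two(size):
        raise ValueError("mask trick requires a power-of-two ring size")
    return batch_no & (size - 1)  # equals batch_no % size for such sizes


# The mask and the modulo agree for every non-negative batch number.
for batch_no in (0, 63, 64, 65, 1000):
    assert ring_slot(batch_no) == batch_no % INDEX_SCAN_MAX_BATCHES
```

The mask form is only valid when the ring size is a power of two, which
is why the realistic candidates reduce to a short list such as 32, 64,
128, and 256.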

I'm pretty sure that 32 is likely too low (though it's hard to tell
with buffered I/O on a fast local SSD). 128 might still be too low in
extreme corner cases involving high latency and few matches per batch
(though I doubt it). 256 seems too implausibly high to ever make sense
(but I've been wrong before).
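
A back-of-the-envelope sketch shows why the range is so wide. It rests
on a loud assumption that is not from the patch: that each batch
contributes a fixed number of heap I/Os, so the batches needed to keep a
target number of I/Os in flight is simply the ceiling of their ratio.
The target of 200 and the per-batch figures are made-up inputs, and
batches_needed is a hypothetical helper.

```python
import math


def batches_needed(target_queue_depth: int, ios_per_batch: float) -> int:
    """Batches required to keep target_queue_depth I/Os in flight,
    assuming (simplistically) a fixed number of heap I/Os per batch."""
    return math.ceil(target_queue_depth / ios_per_batch)


TARGET = 200  # assumed target queue depth, illustrative only

# Assumed per-batch I/O counts, from few matches per batch to many.
for ios_per_batch in (1, 4, 50, 300):
    n = batches_needed(TARGET, ios_per_batch)
    print(f"{ios_per_batch:>3} I/Os per batch -> {n} batches")
```

Under these assumptions a batch that yields a single I/O would need more
than 128 batches to reach the target, while a batch that yields hundreds
of I/Os needs only one. Real workloads vary within and across scans, so
this is an illustration of the spread rather than a way to choose the cap.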

--
Peter Geoghegan