MIME-Version: 1.0
References: <f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu>
 <CAAKRu_bfwBzg7=Zy88st6gBJf97Wkd3k=+m1ecApn=59SwmKSw@mail.gmail.com>
 <pj4kgtdrevvkfbmlri6p27belctxru7ytyprcb6v74c7zbh3l6@m4dcu2rljedv>
In-Reply-To: <pj4kgtdrevvkfbmlri6p27belctxru7ytyprcb6v74c7zbh3l6@m4dcu2rljedv>
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 3 Apr 2026 11:46:59 -0400
Message-ID: 
 <CAAKRu_b+tymZQc2vv__k8oPHsVQppg9QmaeQLiupZ3t2AO=z3A@mail.gmail.com>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
To: Andres Freund <andres@anarazel.de>
Cc: pgsql-hackers@postgresql.org, Thomas Munro <thomas.munro@gmail.com>,
	Peter Geoghegan <pg@bowt.ie>, Tomas Vondra <tv@fuzzy.cz>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAAKRu_b%2BtymZQc2vv__k8oPHsVQppg9QmaeQLiupZ3t2AO%3Dz3A%40mail.gmail.com>
Precedence: bulk

On Thu, Apr 2, 2026 at 11:47=E2=80=AFAM Andres Freund <andres@anarazel.de> =
wrote:
>
> What do you think about the updated patch to achieve that that I posted?

here is some review on 0005 and 0006 earlier posted
concrete things:
-------
- I=E2=80=99d reorder stream=E2=86=92distance =3D=3D 0 in read_stream_look_=
ahead() (and
issue), I found myself asking why it wasn=E2=80=99t first

- I agree with bilal that
if (stream->pinned_buffers + stream->pending_read_nblocks >=3D
stream->max_pinned_buffers)
    return false;
should be in the other commit because it isn=E2=80=99t required with the
current code because of how distance is set and is more confusing.
wasn=E2=80=99t it an assert before?

- perhaps the new heuristic for allowing further look ahead if we are
building an IO and it isn=E2=80=99t big enough yet should be in its own com=
mit

- why have read_stream_pause() save combine_distance if it isn=E2=80=99t go=
ing
to zero it out?
- I think the comments you added to the read stream struct for
combine_distance and readahead_distance should indicate units (i.e.
blocks)

- why not remove the combine_distance requirement from the fast path
entry criteria (could save resume_combine_distance in the fast path
and restore it after a miss)

- can we move the fast path to where we found out we got a hit?

- should_issue_now has a lot of overlap with should_read_ahead which
makes it confusing to figure out how they are different, but I think
that is related to readahead_distance being in units of blocks and not
IOs (which I talk about later)

- There's also some assorted typos and comment awkwardness that I
assume you will clean up in the polish step
(e.g. "if we have an the process" -> "if we are in the process",
"oveflow" -> "overflow", The NB comment still says stream->distance =3D=3D
0 but should say stream->readahead_distance =3D=3D 0)

some more ambiguous stuff:
-------
> In my experiments the batching was primarily useful to allow to reduce th=
e
> command submission & interrupt overhead when interacting with storage, i.=
e.
> when actually doing IO, not just copying from the page cache.

It still seems to me that batching would help guarantee parallel
memcpys for workers when data is in the kernel buffer cache because if
the IO is quick, the same worker may pick up the next IO.
---
You mentioned that we don't want to read too far ahead (including for
a single combined IO) in part because:

> The resowner and private refcount mechanisms take more CPU cycles if you =
have
> more buffers pinned

But I don't see how either distance is responding to this or
self-calibrating with this in mind
---
>> I liked the idea of being more aggressive to do IO combining. What is
>> the reason for gradually increasing combine_distance, is it to not do
>> unnecessary IOs at the start?
>
> Yea. It'd perhaps not be too bad with the existing users, but it'd *reall=
y*
> hurt with index scan prefetching, because of query plans where we only co=
nsume
> part of the index scan (e.g. a nested loop antijoin).

You said we have to ramp up combine_distance because in index
prefetching when we don=E2=80=99t need all the blocks, it can hurt. But if =
we
ramp it up unconditionally (assuming we did IO), then I don=E2=80=99t see h=
ow
this would solve the problem as it will ramp up regardless after just
a few IOs anyway.
---
> I've been experimenting with going the other way, by having
> read_stream_should_look_ahead() do a check like

    /*
     * If we already have IO in flight, but are close enough to to the
     * distance limit that we would not start a fully sized IO, don't even
     * start a pending read until later.
     *
     * This avoids calling the call thing the next block callback in cases =
we
     * would not start the pending read anyway. For some users the work to
     * determine the next block is non-trivial, so we don't want to do so
     * earlier than necessary.
     *
     * A secondary benefit of this is that some callers use parallel worker=
s
     * with each their own read stream to process a global list of blocks, =
and
     * only calling the next block callback when ready to actually issue IO
     * makes it more likely for one backend to get consecutive blocks.
     */
    if (stream->pinned_buffers > 0 &&
        stream->pending_read_nblocks =3D=3D 0 &&
        stream->pinned_buffers + stream->combine_distance >=3D
stream->readahead_distance)
        return false;

So, as you say, with, for example, a large io_combine_limit and small
effective_io_concurrency, this would result in much lower IO
concurrency than we want. But, I think this speaks to the central
tension I see with the new combine_distance and readahead_distance: it
seems like readahead_distance should now be in units of IO and
combine_distance in units of blocks. If that were the case, this
heuristic wouldn't have a downside.

Obviously ramping readahead_distance up and down when it is in units
of IO becomes much more fraught though.

But our criteria for wanting to make bigger IOs is different than our
criteria for wanting more IOs in flight.
---
I think it is weird to have combine_distance only be relevant when
readahead_distance is low. You said:

> We could, but I don't think there would be a benefit in doing so. In my m=
ind,
> what combine_distance is used for, is to control how far to look ahead to
> allow for IO combining when the readahead_distance is too low to allow fo=
r IO
> combining.  But if pinned_buffers > 0, we already have another IO in flig=
ht,
> so the combine_distance mechanism doesn't have to kick in.

But it seems like for a completely uncached workload bigger IOs is
still beneficial. We may want to avoid reading ahead to get bigger IOs
when there is a chance we only need a few blocks. And, we want to
parallelize copies across workers, so we want a combine_distance that
is small enough that, relative to effective_io_concurrency, we can
split IOs across workers some. But, otherwise, we want maximally sized
IOs. In the medium term, perhaps we can try to use executor hints to
handle the former. And, assuming you think readahead_distance in IOs
is a non-starter, maybe we could use a heuristic that tries to balance
IOs by dividing readahead_distance by combine_distance when
combine_distance is sufficiently high or something like that.

- Melanie