MIME-Version: 1.0
References: <f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu>
In-Reply-To: <f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu>
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 31 Mar 2026 16:59:14 -0400
Message-ID: 
 <CAAKRu_ZcJnnxgDQaXjuhd37bnc-jKARBU4EDi+LUqgs+ZjmrgQ@mail.gmail.com>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
To: Andres Freund <andres@anarazel.de>
Cc: pgsql-hackers@postgresql.org, Thomas Munro <thomas.munro@gmail.com>,
	Peter Geoghegan <pg@bowt.ie>, Tomas Vondra <tv@fuzzy.cz>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAAKRu_ZcJnnxgDQaXjuhd37bnc-jKARBU4EDi%2BLUqgs%2BZjmrgQ%40mail.gmail.com>
Precedence: bulk

On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres@anarazel.de> wrote:
>
> 0005+0006:  Only increase distance when waiting for IO

In "aio: io_uring: Trigger async processing for large IOs" (0005), the
first sentence of the commit message is incomplete.
Is there any reason for both the io size and inflight IOs threshold to
be 4? If they should be the same, I think it would be better if this
was a macro.
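
To make the macro suggestion concrete, here is a minimal sketch. The
macro name, the function, and the exact comparisons are my assumptions
for illustration, not the code in 0005; the only thing taken from the
patch is that both limits are currently 4.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical name: one shared limit instead of two literal 4s. */
#define SKETCH_ASYNC_THRESHOLD 4

/*
 * Decide whether a submission should be forced to async processing.
 * The comparison operators are assumed; only the shared value is the point.
 */
static bool
sketch_force_async(size_t io_blocks, size_t in_flight)
{
    return io_blocks >= SKETCH_ASYNC_THRESHOLD ||
        in_flight >= SKETCH_ASYNC_THRESHOLD;
}
```

If the two limits are only the same by coincidence, two separately named
macros would document that just as well.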

This may not matter, but the old code checked in_flight_before > 5
before incrementing it for the current IO. The new code counts it
after pushing the current IO onto the submission list. So the new way
is slightly more aggressive.
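
A toy illustration of the ordering difference, using hypothetical helper
names and a generic threshold (this is not the code from
method_io_uring.c): with the same limit, counting after the push trips
one IO earlier.

```c
#include <stdbool.h>

/* Old order as described above: test the count, then add the current IO. */
static bool
sketch_check_before(int *in_flight, int threshold)
{
    bool force_async = *in_flight > threshold;

    (*in_flight)++;
    return force_async;
}

/* New order as described above: add the current IO, then test the count. */
static bool
sketch_check_after(int *in_flight, int threshold)
{
    (*in_flight)++;
    return *in_flight > threshold;
}
```

Starting from 5 IOs in flight with a threshold of 5, the first variant
does not force async for the current IO and the second one does.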

0006 ("read_stream: Only increase distance when waiting for IO") looks
good to me from a code perspective. I don't yet have ideas for
handling potential parallel bitmapheapscan regressions.

>     Unfortunately with io_uring the situation is more complicated, because
>     io_uring performs reads synchronously during submission if the data is in
>     the kernel page cache.  This can reduce performance substantially compared
>     to worker, because it prevents parallelizing the copy from the page cache.
>     There is an existing heuristic for that in method_io_uring.c that adds a
>     flag to the IO submissions forcing the IO to be processed asynchronously,
>     allowing for parallelism.  Unfortunately the heuristic is triggered by the
>     number of IOs in flight - which will never become big enough to trigger
>     after using "needed to wait" to control how far to read ahead.
>
>     So 0005 expands the io_uring heuristic to also trigger based on the sizes
>     of IOs - but that's decidedly not perfect, we e.g. have some experiments
>     showing it regressing some parallel bitmap heap scan cases.  It may be
>     better to somehow tweak the logic to only trigger for worker.

Which logic would trigger only for worker? Do you mean only increasing
the distance when waiting?

>     As is this has another issue, which is that it prevents IO combining in
>     situations where it shouldn't, because right now using the distance to
>     control both. See 0008 for an attempt at splitting those concerns.

Even if you can't combine into a single IO, it seems like a low
distance is problematic because it degrades batching and causes us to
have to call io_uring_enter for every block (I think). At least when I
was experimenting with this, the syscall overhead seemed
non-negligible. It's also true that this meant the memcpys couldn't be
parallelized, but system call overhead also seems to have been a
factor.

Setting aside more complicated prefetching systems, what it seems like
we are saying is that for all "miss" cases (not in SB) a distance of
above 1 is advantageous (unless we are only doing 1 IO). I wonder if
there is something hacky we can do like not decaying distance below
io_combine_limit if there has been a recent miss or growing it up to
at least io_combine_limit if we aren't getting all hits.
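
A rough sketch of what I mean by that hack. Everything here is
hypothetical (the struct, the field names, and the "recent miss" window
of 16 hits are made up for illustration); it is not how read_stream.c
tracks distance today.

```c
#include <stdbool.h>

/* Hypothetical window: how many consecutive hits still count as "recent miss". */
#define SKETCH_RECENT_MISS_WINDOW 16

typedef struct SketchStream
{
    int distance;        /* current look-ahead distance, in blocks */
    int max_distance;    /* upper bound on distance */
    int combine_limit;   /* stand-in for io_combine_limit */
    int hits_since_miss; /* consecutive buffer hits since the last miss */
} SketchStream;

/* Floor for decay: combine_limit (capped at max_distance) after a recent miss. */
static int
sketch_distance_floor(const SketchStream *s)
{
    int floor = s->combine_limit;

    if (floor > s->max_distance)
        floor = s->max_distance;
    return (s->hits_since_miss < SKETCH_RECENT_MISS_WINDOW) ? floor : 1;
}

/* On a shared-buffers hit: decay by one block, but not below the floor. */
static void
sketch_on_hit(SketchStream *s)
{
    int floor = sketch_distance_floor(s);

    s->hits_since_miss++;
    if (s->distance > floor)
        s->distance--;
}

/* On a miss: grow to at least combine_limit, capped at max_distance. */
static void
sketch_on_miss(SketchStream *s)
{
    s->hits_since_miss = 0;
    if (s->distance < s->combine_limit)
        s->distance = s->combine_limit;
    if (s->distance > s->max_distance)
        s->distance = s->max_distance;
}
```

The effect is that a single miss keeps the distance at io_combine_limit
for the next several hits, and the usual decay toward 1 only resumes
once the stream has seen a full window of hits.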

> 0007: Make read_stream_reset()/end() not wait for IO
>
>     This is a quite experimental, not really correct as-is, patch to avoid
>     unnecessarily waiting for in-flight IO when read_stream_reset() is done
>     while there's in-flight IO.  This is useful for things like nestloop
>     antijoins with quals on the inner side (without the qual we'd not trigger
>     any readahead, as that's deferred in the index prefetching patch).
>
>     As-is this will leave IOs visible in pg_aios for a while, potentially
>     until the backends exit. That's not right.

Separating the problems: the handle slot exhaustion seems like it
could be solved by having the backend process discarded IOs when it
needs a handle and none is free. Or is that not work we want to do in
a hot path?

The pg_aios view problems seem solvable with a flag on the IO like
"DISCARDED". But the buffers staying pinned is different. It seems
like you'll need the backend to process the discarded IOs at some
point. Maybe it should do that before idling waiting for input?

When discarding IOs, I don't really understand why the foreign IO
path doesn't just clear its own wait ref (not the buffer descriptor
one) and move on -- instead you have it wait.

I haven't finished reviewing 0008 yet.

> One thing that's really annoying around this is that we have no
> infrastructure for testing that the heuristics keep working. It's very easy
> to improve one thing while breaking something else, without noticing,
> because everything keeps working.

Agreed that something here would be useful.

- Melanie