Feedback-ID: id4a34324:Fastmail
Date: Wed, 1 Apr 2026 19:20:40 -0400
From: Andres Freund <andres@anarazel.de>
To: Melanie Plageman <melanieplageman@gmail.com>
Cc: pgsql-hackers@postgresql.org, Thomas Munro <thomas.munro@gmail.com>,
	Peter Geoghegan <pg@bowt.ie>, Tomas Vondra <tv@fuzzy.cz>,
	Nazir Bilal Yavuz <byavuz81@gmail.com>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
Message-ID: <b7cxivohlk7hfl6qcqxbutrpoukunawxvarr2g2o6jicwkyx5o@qtzdj36ihyz5>
References: <f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu>
 <CAAKRu_ZcJnnxgDQaXjuhd37bnc-jKARBU4EDi+LUqgs+ZjmrgQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="ik3l34cft6pr4f53"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: 
 <CAAKRu_ZcJnnxgDQaXjuhd37bnc-jKARBU4EDi+LUqgs+ZjmrgQ@mail.gmail.com>
Archived-At: 
 <https://www.postgresql.org/message-id/b7cxivohlk7hfl6qcqxbutrpoukunawxvarr2g2o6jicwkyx5o%40qtzdj36ihyz5>
Precedence: bulk


--ik3l34cft6pr4f53
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Hi,


On 2026-03-31 16:59:14 -0400, Melanie Plageman wrote:
> On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > 0005+0006:  Only increase distance when waiting for IO
> 
> In "aio: io_uring: Trigger async processing for large IOs" (0005), the
> first sentence of the commit message is incomplete.

Oops.


> Is there any reason for both the io size and inflight IOs threshold to
> be 4? If they should be the same, I think it would be better if this
> was a macro.

No, there's no real real they have to be the same.  It's just what a bit of
experimenting showed working independently for either.


> This may not matter, but the old code checked in_flight_before > 5
> before incrementing if for the current IO. The new code counts it
> after pushing the current IO onto the submission list. So the new way
> is slightly more aggressive.

Hm. True.  Not sure it matters. I didn't really see a significant difference
for anything between 3 and 7, it was only outside of that that I saw worse
performance.

I have done more validation with the new cutoff value than with the old one,
so I'm ever so mildly inclined to use the value currently in the patch, but I
won't at all insist on it.


> >     Unfortuntely with io_uring the situation is more complicated, because
> >     io_uring performs reads synchronously during submission if the data is the
> >     kernel page cache.  This can reduce performance substantially compared to
> >     worker, because it prevents parallelizing the copy from the page cache.
> >     There is an existing heuristic for that in method_io_uring.c that adds a
> >     flag to the IO submissions forcing the IO to be processed asynchronously,
> >     allowing for parallelism.  Unfortunately the heuristic is triggered by the
> >     number of IOs in flight - which will never become big enough to tgrigger
> >     after using "needed to wait" to control how far to read ahead.
> >
> >     So 0005 expands the io_uring heuristic to also trigger based on the sizes
> >     of IOs - but that's decidedly not perfect, we e.g. have some experiments
> >     showing it regressing some parallel bitmap heap scan cases.  It may be
> >     better to somehow tweak the logic to only trigger for worker.
> 
> Trigger which logic only for worker, you mean only increasing the
> distance when waiting?

Yea.


> >     As is this has another issue, which is that it prevents IO combining in
> >     situations where it shouldn't, because right now using the distance to
> >     control both. See 0008 for an attempt at splitting those concerns.
> 
> Even if you can't combine into a single IO, it seems like a low
> distance is problematic because it degrades batching and causes us to
> have to call io_uring_enter for every block (I think).

I don't think it actually does change the situation around that significantly,
because we already end up with "too few" IOs once we hit the distance maximum,
as we'll submit another IO as soon as we can.

I think we will eventually need some logic to only start submitting again once
multiple IOs are possible. But that's another set of heuristics, so a project
for another day :)


In my experiments the batching was primarily useful to allow to reduce the
command submission & interrupt overhead when interacting with storage,
i.e. when actually doing IO, not just copying from the page cache.

I have seen it help due to reducing syscalls too, but the amount of batching
and/or combining seems to have a relatively low ceiling at which it stops
helping.


> Setting aside more complicated prefetching systems, what it seems like
> we are saying is that for all "miss" cases (not in SB) a distance of
> above 1 is advantageous (unless we are only doing 1 IO). I wonder if
> there is something hacky we can do like not decaying distance below
> io_combine_limit if there has been a recent miss or growing it up to
> at least io_combine_limit if we aren't getting all hits.

I think it's true that if IO execution was all that mattered, we would want a
bit more IO in flight at all time.  However looking ahead quite deeply also
has costs:

1) The resowner and private refcount mechanisms take more CPU cycles if you
   have more buffers pinned

2) The CPU cache hit ratio goes down if there's a longer time between copying
   data into s_b and consuming it

3) If you have a scan that won't be consumed to completion, you're wasting
   more the deeper you look ahead


This is actually not hard to show:

SET max_parallel_workers_per_gather = 0;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM pgbench_accounts WHERE abalance = 3;
(on a tree that has the explain stuff, but not the patches in this thread applied)

eic	time in ms
1       1326.525
2       1325.240
4       1335.073
8       1343.440
16      1346.189
32      1356.598
64      1398.326
128     1635.081
256     1674.685
512     1677.264
1000    1680.050

This one mainly shows 2) from above, I think, but the others are measurable in
other workloads.


> > 0007: Make read_stream_reset()/end() not wait for IO
> >
> >     This is a quite experimental, not really correct as-is, patch to avoid
> >     unnecessarily waiting for in-flight IO when read_stream_reset() is done
> >     while there's in-flight IO.  This is useful for things like nestloop
> >     antioins with quals on the inner side (without the qual we'd not trigger
> >     any readahead, as that's deferred in the index prefetching patch).
> >
> >     As-is this will leave IOs visible in pg_aios for a while, potentially
> >     until the backends exit. That's not right.
> 
> Separating the problems: the handle slot exhaustion seems like it
> could be solved by having the backend process discard IOs when it
> needs one and there isn't any.

I don't think it could lead to exhaustion of handles, pgaio_io_acquire() will
call pgaio_io_wait_for_free() which will wait for the oldest IO.


> The pg_aios view problems seem solvable with a flag on the IO like
> "DISCARDED". But the buffers staying pinned is different. It seems
> like you'll need the backend to process the discarded IOs at some
> point. Maybe it should do that before idling waiting for input?

The easiest way would be to actually leave the IOs registered with the
resowner and have it wait for completion at command or transaction end if not
already done.  But we currently don't really do that with resowners in the
!abort case. I'm not sure if anybody would mind doing differently here.

Another approach would be to do it in AtEOXact_Aio(), but that would mean the
IOs could hang around for a while.

A third approach could be to do one of the above, but add some additional that
go through in flight IOs and check if they completed, e.g. in
pgaio_io_acquire_nb().


> When discarding IOs, I don't really understand why the foreign IO
> path, doesn't just clear its own wait ref (not the buffer descriptor
> one) and move on -- instead you have it wait.

That'd be ok to do, I just didn't want to think about that more complicated
and less common case :)


> I haven't finished reviewing 0008 yet.

I've attached a version of what was 0008 split into two, one to introduce the
new helpers, one to introduce the separate combine distance.


I've pushed what was 0001 and 0002.  Will push the former 0003 shortly.


Greetings,

Andres Freund

--ik3l34cft6pr4f53
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment;
	filename="v4-0001-read_stream-Issue-IO-synchronously-while-in-fast-.patch"