AIO / read stream heuristics adjustments for index prefetching

public inbox for [email protected]  
help / color / mirror / Atom feed

AIO / read stream heuristics adjustments for index prefetching
23+ messages / 3 participants
[nested] [flat]

* AIO / read stream heuristics adjustments for index prefetching
@ 2026-03-31 16:02 Andres Freund <[email protected]>
  2026-03-31 19:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-03-31 20:59 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-01 14:52 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  0 siblings, 4 replies; 23+ messages in thread

From: Andres Freund @ 2026-03-31 16:02 UTC (permalink / raw)
  To: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Melanie Plageman <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,

The index prefetching patchset [1] contains a few adjustments to the read
stream logic for readahead.  It seemed better to discuss them separately than
in that already very large thread.

The first two patches are also a dependency of the explain read stream
patches [2].


There are two main areas that prefetching of table data during an index scan
is more sensitive to than existing read stream users:

1) Prefetching for index scans is much more sensitive to doing too aggressive
   read ahead due to plans that involve running index scans to partial
   completion, rather than full completion. Consider e.g. nestloop antijoins
   or such, where the scan on the inner side will be started but often not
   completed.  If we unnecessarily read ahead too aggressively, a lot of IO
   could be wasted.

   While it's of course possible to have partially consumed read streams with
   sequential scans or bitmap heap scans, it's not as common / cost sensitive.

   For seqscans it likely mostly happens with a LIMIT above the seqscan, but
   that probably won't be happening many times within a query on a table of
   any size.

   For bitmap heap scans it's not as common because the startup cost, i.e. the
   building of the bitmap, is far from cheap, doing that over and over does
   not make a lot of sense.


2) Prefetching for index scans is much more likely to have complicated mixes
   of hits and misses.

   Whereas a seqscan or a bitmap heap scan accesses each table block exactly
   once, with index scans its very common to have repeated accesses to some
   table blocks, while still having misses on other blocks.  This means that
   index scans are more sensitive to patterns of hits and misses decreasing
   the readahead distance so much that we don't do aggressive enough readahead
   to avoid waiting for IO anymore.

   While more pronounced with index prefetching, it was already an issue with
   the existing users, particularly for bitmap heap scans. In fact, a similar
   patch to what's included here was first discussed somewhere around the BHS
   prefetching work.


There's a few sets of changes here:

0001+0002:  Return whether WaitReadBuffers() needed to wait

    The first patch allows pgaio_wref_check_done() to work more reliably with
    io_uring. Until now it only was able to return true if userspace already
    had consumed the kernel's completion event, but returned false otherwise.
    That's not really incorrect, just suboptimal.

    The second patch returns whether WaitReadBuffers() needed to wait for
    IO. This is useful for a) instrumentation like in [2] and b) to provide
    information to the read_stream heuristics to control how aggressive to
    perform read ahead.


0003:  read_stream: Issue IO synchronously while in fast path

    When read stream is in fast path mode (where it short-circuits the read
    ahead logic, to reduce CPU overhead in s_b resident workloads) and
    encounters a miss, we until now performed the read asynchronously.

    Unfortunately, with worker, that can lead to slowdowns, because
    dispatching to workers has a latency impact. When doing "real" readahead,
    that's a price worth paying, because the latency should be hidden by
    issuing the reads early enough. But when just coming out of fast path
    mode, we're not ahead of what's needed, so the dispatch latency can't be
    hidden.

    We already have infrastructure to mark IOs to be executed
    synchronously. So we just need to use that here.


0004:  read_stream: Prevent distance from decaying too quickly

    This, quite simple, patch reduces issue 2) from above, by preventing the
    look-ahead distance from being reduced for #maximum lookahead distance
    blocks after each miss.  While this may seem overly aggressive, a single
    effectively synchronous read can take a long time compared to the CPU time
    needed for processing pages hits.  On cloud storage the IO latency is
    somewhere between 0.5ms and 4ms. A halfway modern CPU can do a few
    heap_hot_search_buffer()s on 1000s of pages within 1 ms.

    While this one is my patch, several others have written variations of it
    before. We should probably have committed one already.


    There are two minor questions here:
    - Should read_stream_pause()/read_stream_resume() restore the "holdoff"
      counter?  I doubt it matters for the prospective user, since it will
      only be used when the lookahead distance is very large.

    - For how long to hold off distance reductions?  Initially I was torn
      between using "max_pinned_buffers" (Min(max_ios * io_combine_limit,
      cap)) and "max_ios" ([maintenance_]effective_io_concurrency). But I
      think the former makes more sense, as we otherwise won't allow for far
      enough readahead when doing IO combining, and it does seem to make sense
      to hold off decay for long enough that the maximum lookahead could not
      theoretically allow us to start an IO.


0005+0006:  Only increase distance when waiting for IO

    Until now we have increased the read ahead distance whenever there we
    needed to do IO (doubling the distance every miss). But that will often be
    way too aggressive, with the IO subsystem being able to keep up with a
    much lower distance.

    The idea here is to use information about whether we needed to wait for IO
    before returning the buffer in read_stream_next_buffer() to control
    whether we should increase the readahead distance.

    This seems to work extremely well for worker.

    Unfortuntely with io_uring the situation is more complicated, because
    io_uring performs reads synchronously during submission if the data is the
    kernel page cache.  This can reduce performance substantially compared to
    worker, because it prevents parallelizing the copy from the page cache.
    There is an existing heuristic for that in method_io_uring.c that adds a
    flag to the IO submissions forcing the IO to be processed asynchronously,
    allowing for parallelism.  Unfortunately the heuristic is triggered by the
    number of IOs in flight - which will never become big enough to tgrigger
    after using "needed to wait" to control how far to read ahead.

    So 0005 expands the io_uring heuristic to also trigger based on the sizes
    of IOs - but that's decidedly not perfect, we e.g. have some experiments
    showing it regressing some parallel bitmap heap scan cases.  It may be
    better to somehow tweak the logic to only trigger for worker.

    As is this has another issue, which is that it prevents IO combining in
    situations where it shouldn't, because right now using the distance to
    control both. See 0008 for an attempt at splitting those concerns.


0007: Make read_stream_reset()/end() not wait for IO

    This is a quite experimental, not really correct as-is, patch to avoid
    unnecessarily waiting for in-flight IO when read_stream_reset() is done
    while there's in-flight IO.  This is useful for things like nestloop
    antioins with quals on the inner side (without the qual we'd not trigger
    any readahead, as that's deferred in the index prefetching patch).

    As-is this will leave IOs visible in pg_aios for a while, potentially
    until the backends exit. That's not right.


0008: WIP: read stream: Split decision about look ahead for AIO and combining

    Until now read stream has used a single look-ahead distance to control
    lookahead for both IO combining and read-ahead. That's sub-optimal, as we
    want to do IO combining even when we don't need to do any readahead, as
    avoiding the syscall overhead is important to reduce CPU overhead when
    data is in the kernel page cache.

    This is a prototype for what it could look like to split those
    decisions. Thereby fixing the regression mentioned in 0006.



One thing that's really annoying around this is that we have no infrastructure
for testing that the heuristics keep working. It's very easy to improve one
thing while breaking something else, without noticing, because everything
keeps working.

I'm wondering about something like a READ_STREAM_DEBUG_INSTRUMENT flag which
would trigger providing information about the IOs and their schedule via the
the per-buffer-data mechanism.  That would allow test_aio's
read_stream_for_blocks() to return that information, which in turn could be
used to verify that we are doing IO combining and looking ahead far enough in
some situations.


Greetings,

Andres Freund

[1] https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
[2] https://postgr.es/m/6f541abf-f9e1-4830-93cc-4a849dbf2ecf%40vondra.me


^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-03-31 19:31 ` Melanie Plageman <[email protected]>
  3 siblings, 0 replies; 23+ messages in thread

From: Melanie Plageman @ 2026-03-31 19:31 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <[email protected]> wrote:
>
> 0001+0002:  Return whether WaitReadBuffers() needed to wait
>
>     The first patch allows pgaio_wref_check_done() to work more reliably with
>     io_uring. Until now it only was able to return true if userspace already
>     had consumed the kernel's completion event, but returned false otherwise.
>     That's not really incorrect, just suboptimal.
>
>     The second patch returns whether WaitReadBuffers() needed to wait for
>     IO. This is useful for a) instrumentation like in [2] and b) to provide
>     information to the read_stream heuristics to control how aggressive to
>     perform read ahead.

These both look good to me except in 0001 you left an XXX in
pgaio_wref_check_done() that I think is the very thing that commit
does.

> 0003:  read_stream: Issue IO synchronously while in fast path

LGTM

> 0004:  read_stream: Prevent distance from decaying too quickly
>
>     There are two minor questions here:
>     - Should read_stream_pause()/read_stream_resume() restore the "holdoff"
>       counter?  I doubt it matters for the prospective user, since it will
>       only be used when the lookahead distance is very large.

I don't really understand this. We have to do this with distance
because we set it to 0 and use distance == 0 to indicate stream end.
read_stream_pause() doesn't set the distance_decay_holoff to 0. If you
mean, should we reset holdoff to its initial value, then I don't think
so. I imagine that users doing a lot of pause and resume may not have
high distance.

>     - For how long to hold off distance reductions?  Initially I was torn
>       between using "max_pinned_buffers" (Min(max_ios * io_combine_limit,
>       cap)) and "max_ios" ([maintenance_]effective_io_concurrency). But I
>       think the former makes more sense, as we otherwise won't allow for far
>       enough readahead when doing IO combining, and it does seem to make sense
>       to hold off decay for long enough that the maximum lookahead could not
>       theoretically allow us to start an IO.

I agree. 0004 LGTM otherwise.

- Melanie





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-03-31 20:59 ` Melanie Plageman <[email protected]>
  2026-04-01 23:20   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  3 siblings, 1 reply; 23+ messages in thread

From: Melanie Plageman @ 2026-03-31 20:59 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <[email protected]> wrote:
>
> 0005+0006:  Only increase distance when waiting for IO

In "aio: io_uring: Trigger async processing for large IOs" (0005), the
first sentence of the commit message is incomplete.
Is there any reason for both the io size and inflight IOs threshold to
be 4? If they should be the same, I think it would be better if this
was a macro.

This may not matter, but the old code checked in_flight_before > 5
before incrementing if for the current IO. The new code counts it
after pushing the current IO onto the submission list. So the new way
is slightly more aggressive.

0006 "(read_stream: Only increase distance when waiting for IO)" looks
good to me from a code perspective. I don't yet have ideas for
handling potential parallel bitmapheapscan regressions.

>     Unfortuntely with io_uring the situation is more complicated, because
>     io_uring performs reads synchronously during submission if the data is the
>     kernel page cache.  This can reduce performance substantially compared to
>     worker, because it prevents parallelizing the copy from the page cache.
>     There is an existing heuristic for that in method_io_uring.c that adds a
>     flag to the IO submissions forcing the IO to be processed asynchronously,
>     allowing for parallelism.  Unfortunately the heuristic is triggered by the
>     number of IOs in flight - which will never become big enough to tgrigger
>     after using "needed to wait" to control how far to read ahead.
>
>     So 0005 expands the io_uring heuristic to also trigger based on the sizes
>     of IOs - but that's decidedly not perfect, we e.g. have some experiments
>     showing it regressing some parallel bitmap heap scan cases.  It may be
>     better to somehow tweak the logic to only trigger for worker.

Trigger which logic only for worker, you mean only increasing the
distance when waiting?

>     As is this has another issue, which is that it prevents IO combining in
>     situations where it shouldn't, because right now using the distance to
>     control both. See 0008 for an attempt at splitting those concerns.

Even if you can't combine into a single IO, it seems like a low
distance is problematic because it degrades batching and causes us to
have to call io_uring_enter for every block (I think). At least when I
was experimenting with this, the syscall overhead seemed
non-negligible. It's also true that this meant the memcpys couldn't be
parallelized, but system call overhead also seems to have been a
factor.

Setting aside more complicated prefetching systems, what it seems like
we are saying is that for all "miss" cases (not in SB) a distance of
above 1 is advantageous (unless we are only doing 1 IO). I wonder if
there is something hacky we can do like not decaying distance below
io_combine_limit if there has been a recent miss or growing it up to
at least io_combine_limit if we aren't getting all hits.

> 0007: Make read_stream_reset()/end() not wait for IO
>
>     This is a quite experimental, not really correct as-is, patch to avoid
>     unnecessarily waiting for in-flight IO when read_stream_reset() is done
>     while there's in-flight IO.  This is useful for things like nestloop
>     antioins with quals on the inner side (without the qual we'd not trigger
>     any readahead, as that's deferred in the index prefetching patch).
>
>     As-is this will leave IOs visible in pg_aios for a while, potentially
>     until the backends exit. That's not right.

Separating the problems: the handle slot exhaustion seems like it
could be solved by having the backend process discard IOs when it
needs one and there isn't any. Or is that not work we want to do in a
hot path?

The pg_aios view problems seem solvable with a flag on the IO like
"DISCARDED". But the buffers staying pinned is different. It seems
like you'll need the backend to process the discarded IOs at some
point. Maybe it should do that before idling waiting for input?

When discarding IOs, I don't really understand why the foreign IO
path, doesn't just clear its own wait ref (not the buffer descriptor
one) and move on -- instead you have it wait.

I haven't finished reviewing 0008 yet.

> One thing that's really annoying around this is that we have no infrastructure
> for testing that the heuristics keep working. It's very easy to improve one
> thing while breaking something else, without noticing, because everything
> keeps working.

Agreed that something here would be useful.

- Melanie

^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-03-31 20:59 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
@ 2026-04-01 23:20   ` Andres Freund <[email protected]>
  2026-04-02 12:30     ` Re: AIO / read stream heuristics adjustments for index prefetching Nazir Bilal Yavuz <[email protected]>
  0 siblings, 1 reply; 23+ messages in thread

From: Andres Freund @ 2026-04-01 23:20 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,


On 2026-03-31 16:59:14 -0400, Melanie Plageman wrote:
> On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <[email protected]> wrote:
> >
> > 0005+0006:  Only increase distance when waiting for IO
> 
> In "aio: io_uring: Trigger async processing for large IOs" (0005), the
> first sentence of the commit message is incomplete.

Oops.


> Is there any reason for both the io size and inflight IOs threshold to
> be 4? If they should be the same, I think it would be better if this
> was a macro.

No, there's no real real they have to be the same.  It's just what a bit of
experimenting showed working independently for either.


> This may not matter, but the old code checked in_flight_before > 5
> before incrementing if for the current IO. The new code counts it
> after pushing the current IO onto the submission list. So the new way
> is slightly more aggressive.

Hm. True.  Not sure it matters. I didn't really see a significant difference
for anything between 3 and 7, it was only outside of that that I saw worse
performance.

I have done more validation with the new cutoff value than with the old one,
so I'm ever so mildly inclined to use the value currently in the patch, but I
won't at all insist on it.


> >     Unfortuntely with io_uring the situation is more complicated, because
> >     io_uring performs reads synchronously during submission if the data is the
> >     kernel page cache.  This can reduce performance substantially compared to
> >     worker, because it prevents parallelizing the copy from the page cache.
> >     There is an existing heuristic for that in method_io_uring.c that adds a
> >     flag to the IO submissions forcing the IO to be processed asynchronously,
> >     allowing for parallelism.  Unfortunately the heuristic is triggered by the
> >     number of IOs in flight - which will never become big enough to tgrigger
> >     after using "needed to wait" to control how far to read ahead.
> >
> >     So 0005 expands the io_uring heuristic to also trigger based on the sizes
> >     of IOs - but that's decidedly not perfect, we e.g. have some experiments
> >     showing it regressing some parallel bitmap heap scan cases.  It may be
> >     better to somehow tweak the logic to only trigger for worker.
> 
> Trigger which logic only for worker, you mean only increasing the
> distance when waiting?

Yea.



> >     As is this has another issue, which is that it prevents IO combining in
> >     situations where it shouldn't, because right now using the distance to
> >     control both. See 0008 for an attempt at splitting those concerns.
> 
> Even if you can't combine into a single IO, it seems like a low
> distance is problematic because it degrades batching and causes us to
> have to call io_uring_enter for every block (I think).

I don't think it actually does change the situation around that significantly,
because we already end up with "too few" IOs once we hit the distance maximum,
as we'll submit another IO as soon as we can.

I think we will eventually need some logic to only start submitting again once
multiple IOs are possible. But that's another set of heuristics, so a project
for another day :)


In my experiments the batching was primarily useful to allow to reduce the
command submission & interrupt overhead when interacting with storage,
i.e. when actually doing IO, not just copying from the page cache.

I have seen it help due to reducing syscalls too, but the amount of batching
and/or combining seems to have a relatively low ceiling at which it stops
helping.


> Setting aside more complicated prefetching systems, what it seems like
> we are saying is that for all "miss" cases (not in SB) a distance of
> above 1 is advantageous (unless we are only doing 1 IO). I wonder if
> there is something hacky we can do like not decaying distance below
> io_combine_limit if there has been a recent miss or growing it up to
> at least io_combine_limit if we aren't getting all hits.

I think it's true that if IO execution was all that mattered, we would want a
bit more IO in flight at all time.  However looking ahead quite deeply also
has costs:

1) The resowner and private refcount mechanisms take more CPU cycles if you
   have more buffers pinned

2) The CPU cache hit ratio goes down if there's a longer time between copying
   data into s_b and consuming it

3) If you have a scan that won't be consumed to completion, you're wasting
   more the deeper you look ahead


This is actually not hard to show:

SET max_parallel_workers_per_gather = 0;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM pgbench_accounts WHERE abalance = 3;
(on a tree that has the explain stuff, but not the patches in this thread applied)

eic	time in ms
1       1326.525
2       1325.240
4       1335.073
8       1343.440
16      1346.189
32      1356.598
64      1398.326
128     1635.081
256     1674.685
512     1677.264
1000    1680.050

This one mainly shows 2) from above, I think, but the others are measurable in
other workloads.



> > 0007: Make read_stream_reset()/end() not wait for IO
> >
> >     This is a quite experimental, not really correct as-is, patch to avoid
> >     unnecessarily waiting for in-flight IO when read_stream_reset() is done
> >     while there's in-flight IO.  This is useful for things like nestloop
> >     antioins with quals on the inner side (without the qual we'd not trigger
> >     any readahead, as that's deferred in the index prefetching patch).
> >
> >     As-is this will leave IOs visible in pg_aios for a while, potentially
> >     until the backends exit. That's not right.
> 
> Separating the problems: the handle slot exhaustion seems like it
> could be solved by having the backend process discard IOs when it
> needs one and there isn't any.

I don't think it could lead to exhaustion of handles, pgaio_io_acquire() will
call pgaio_io_wait_for_free() which will wait for the oldest IO.


> The pg_aios view problems seem solvable with a flag on the IO like
> "DISCARDED". But the buffers staying pinned is different. It seems
> like you'll need the backend to process the discarded IOs at some
> point. Maybe it should do that before idling waiting for input?

The easiest way would be to actually leave the IOs registered with the
resowner and have it wait for completion at command or transaction end if not
already done.  But we currently don't really do that with resowners in the
!abort case. I'm not sure if anybody would mind doing differently here.

Another approach would be to do it in AtEOXact_Aio(), but that would mean the
IOs could hang around for a while.

A third approach could be to do one of the above, but add some additional that
go through in flight IOs and check if they completed, e.g. in
pgaio_io_acquire_nb().



> When discarding IOs, I don't really understand why the foreign IO
> path, doesn't just clear its own wait ref (not the buffer descriptor
> one) and move on -- instead you have it wait.

That'd be ok to do, I just didn't want to think about that more complicated
and less common case :)


> I haven't finished reviewing 0008 yet.

I've attached a version of what was 0008 split into two, one to introduce the
new helpers, one to introduce the separate combine distance.


I've pushed what was 0001 and 0002.  Will push the former 0003 shortly.



Greetings,

Andres Freund


^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-03-31 20:59 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-01 23:20   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-04-02 12:30     ` Nazir Bilal Yavuz <[email protected]>
  2026-04-02 13:33       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  0 siblings, 1 reply; 23+ messages in thread

From: Nazir Bilal Yavuz @ 2026-04-02 12:30 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

On Thu, 2 Apr 2026 at 02:20, Andres Freund <[email protected]> wrote:
>
> I've pushed what was 0001 and 0002.  Will push the former 0003 shortly.

0001 LGTM.


0002:  read_stream: Prevent distance from decaying too quickly

+        /*
+         * As we needed IO, prevent distance from being reduced within our
+         * maximum look-ahead window. This avoids having distance collapse too
+         * quickly in workloads where most of the required blocks are cached,
+         * but where the remaining IOs are a sufficient enough factor to cause
+         * a substantial slowdown if executed synchronously.
+         *
+         * There are valid arguments for preventing decay for max_ios or for
+         * max_pinned_buffers.  But the argument for max_pinned_buffers seems
+         * clearer - if we can't see any misses within the maximum look-ahead
+         * distance, we can't do any useful read-ahead.
+         */
+        stream->distance_decay_holdoff = stream->max_pinned_buffers;

That is already committed but I have a question. Did you think about
setting stream->distance_decay_holdoff to current stream->distance?
This will also fix 'miss followed by a hit' (it won't fix double
missed followed by a hit, though).


0003:  aio: io_uring: Trigger async processing for large IOs

There is a small optimization opportunity, we don't need to calculate
io_size for the DIO since pgaio_uring_should_use_async() will always
return false. Do you think it is worth implementing this? Other than
that, LGTM.


0004:  read_stream: Only increase distance when waiting for IO

LGTM.


0005:  WIP: read_stream: Move logic about IO combining & issuing to helpers

    /* never pin more buffers than allowed */
    if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)
        return false;

    /*
     * Don't start more read-ahead if that'd put us over the distance limit
     * for doing read-ahead.
     */
    if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->distance)
        return false;

AFAIK, stream->distance can't be higher than
stream->max_pinned_buffers [1], so I think we can remove the first if
block. If we are not sure about [1], stream->max_pinned_buffers most
likely will be higher than stream->distance, it would make sense to
change the order of these.

Aha, I understood this change after looking at 0006. It is nitpick but
'if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)' block can be added in 0006. Other than
these, LGTM.


0006:  WIP: read stream: Split decision about look ahead for AIO and combining

I liked the idea of being more aggressive to do IO combining. What is
the reason for gradually increasing combine_distance, is it to not do
unnecessary IOs at the start?

+                /*
+                 * XXX: Should we actually reduce this at any time other than
+                 * a reset? For now we have to, as this is also a condition
+                 * for re-enabling fast_path.
+                 */
+                if (stream->combine_distance > 1)
+                    stream->combine_distance--;

I don't think we need to reduce this other than reset.

+    /*
+     * Allow looking further ahead if we have an the process of building a
+     * larger IO, the IO is not yet big enough and we don't yet have IO in
+     * flight.  Note that this is allowed even if we are reaching the
+     * read-ahead limit (but not the buffer pin limit).
+     *
+     * This is important for cases where either effective_io_concurrency is
+     * low or we never need to wait for IO and thus are not increasing the
+     * distance. Without this we would end up with lots of small IOs.
+     */
+    if (stream->pending_read_nblocks > 0 &&
+        stream->pinned_buffers == 0 &&
+        stream->pending_read_nblocks < stream->combine_distance)
+        return true;

Do we need to check 'stream->pinned_buffers == 0' here? I think we can
continue to look ahead although there are pinned buffers as we already
know 'stream->pinned_buffers + stream->pending_read_nblocks <
stream->max_pinned_buffers'.

+    /* same if capped not by io_combine_limit but combine_distance */
+    if (stream->combine_distance > 0 &&
+        pending_read_nblocks >= stream->combine_distance)
+        return true;

I think we don't need to check 'stream->combine_distance > 0', it
shouldn't be less or equal than zero when the
'stream->readahead_distance' is not zero, right?

+    {
+        stream->readahead_distance = Min(max_pinned_buffers,
stream->io_combine_limit);
+        stream->combine_distance = stream->io_combine_limit;
+    }

I think the stream->combine_distance should be equal to
stream->readahead_distance as we don't want to surpass
max_pinned_buffers.

0007:  Hacky implementation of making read_stream_reset()/end() not wait for IO

Read stream parts LGTM. I don't have a comment on the FIXME part, though.


-- 
Regards,
Nazir Bilal Yavuz
Microsoft





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-03-31 20:59 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-01 23:20   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 12:30     ` Re: AIO / read stream heuristics adjustments for index prefetching Nazir Bilal Yavuz <[email protected]>
@ 2026-04-02 13:33       ` Andres Freund <[email protected]>
  2026-04-03 16:45         ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 23+ messages in thread

From: Andres Freund @ 2026-04-02 13:33 UTC (permalink / raw)
  To: Nazir Bilal Yavuz <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

On 2026-04-02 15:30:26 +0300, Nazir Bilal Yavuz wrote:
> 0002:  read_stream: Prevent distance from decaying too quickly
>
> +        /*
> +         * As we needed IO, prevent distance from being reduced within our
> +         * maximum look-ahead window. This avoids having distance collapse too
> +         * quickly in workloads where most of the required blocks are cached,
> +         * but where the remaining IOs are a sufficient enough factor to cause
> +         * a substantial slowdown if executed synchronously.
> +         *
> +         * There are valid arguments for preventing decay for max_ios or for
> +         * max_pinned_buffers.  But the argument for max_pinned_buffers seems
> +         * clearer - if we can't see any misses within the maximum look-ahead
> +         * distance, we can't do any useful read-ahead.
> +         */
> +        stream->distance_decay_holdoff = stream->max_pinned_buffers;
>
> That is already committed but I have a question. Did you think about
> setting stream->distance_decay_holdoff to current stream->distance?
> This will also fix 'miss followed by a hit' (it won't fix double
> missed followed by a hit, though).

I am fairly certain that would not work well, because the whole point of the
mechanism it to prevent the distance from staying too small when the distance
is low.

We want the distance to grow for cases like mis, hit{10}, miss, ... because
that will typically still be bottlenecked by IO latency. And something like
setting the decay holdoff to the current distance wouldn't work for that.


> 0003:  aio: io_uring: Trigger async processing for large IOs
>
> There is a small optimization opportunity, we don't need to calculate
> io_size for the DIO since pgaio_uring_should_use_async() will always
> return false. Do you think it is worth implementing this? Other than
> that, LGTM.

I doubt it. Compared to actually doing IO the cost of this is neglegible.


> 0005:  WIP: read_stream: Move logic about IO combining & issuing to helpers
>
>     /* never pin more buffers than allowed */
>     if (stream->pinned_buffers + stream->pending_read_nblocks >=
> stream->max_pinned_buffers)
>         return false;
>
>     /*
>      * Don't start more read-ahead if that'd put us over the distance limit
>      * for doing read-ahead.
>      */
>     if (stream->pinned_buffers + stream->pending_read_nblocks >=
> stream->distance)
>         return false;
>
> AFAIK, stream->distance can't be higher than
> stream->max_pinned_buffers [1], so I think we can remove the first if
> block. If we are not sure about [1], stream->max_pinned_buffers most
> likely will be higher than stream->distance, it would make sense to
> change the order of these.
>
> Aha, I understood this change after looking at 0006. It is nitpick but
> 'if (stream->pinned_buffers + stream->pending_read_nblocks >=
> stream->max_pinned_buffers)' block can be added in 0006. Other than
> these, LGTM.

Hm. I guess that could make sense.


> 0006:  WIP: read stream: Split decision about look ahead for AIO and combining
>
> I liked the idea of being more aggressive to do IO combining. What is
> the reason for gradually increasing combine_distance, is it to not do
> unnecessary IOs at the start?

Yea. It'd perhaps not be too bad with the existing users, but it'd *really*
hurt with index scan prefetching, because of query plans where we only consume
part of the index scan (e.g. a nested loop antijoin).


> +                /*
> +                 * XXX: Should we actually reduce this at any time other than
> +                 * a reset? For now we have to, as this is also a condition
> +                 * for re-enabling fast_path.
> +                 */
> +                if (stream->combine_distance > 1)
> +                    stream->combine_distance--;
>
> I don't think we need to reduce this other than reset.

Hm. I go back and forth on that one :)


> +    /*
> +     * Allow looking further ahead if we have an the process of building a
> +     * larger IO, the IO is not yet big enough and we don't yet have IO in
> +     * flight.  Note that this is allowed even if we are reaching the
> +     * read-ahead limit (but not the buffer pin limit).
> +     *
> +     * This is important for cases where either effective_io_concurrency is
> +     * low or we never need to wait for IO and thus are not increasing the
> +     * distance. Without this we would end up with lots of small IOs.
> +     */
> +    if (stream->pending_read_nblocks > 0 &&
> +        stream->pinned_buffers == 0 &&
> +        stream->pending_read_nblocks < stream->combine_distance)
> +        return true;
>
> Do we need to check 'stream->pinned_buffers == 0' here? I think we can
> continue to look ahead although there are pinned buffers as we already
> know 'stream->pinned_buffers + stream->pending_read_nblocks <
> stream->max_pinned_buffers'.

We could, but I don't think there would be a benefit in doing so. In my mind,
what combine_distance is used for, is to control how far to look ahead to
allow for IO combining when the readahead_distance is too low to allow for IO
combining.  But if pinned_buffers > 0, we already have another IO in flight,
so the combine_distance mechanism doesn't have to kick in.

I've been experimenting with going the other way, by having
read_stream_should_look_ahead() do a check like

    /*
     * If we already have IO in flight, but are close enough to to the
     * distance limit that we would not start a fully sized IO, don't even
     * start a pending read until later.
     *
     * This avoids calling the call thing the next block callback in cases we
     * would not start the pending read anyway. For some users the work to
     * determine the next block is non-trivial, so we don't want to do so
     * earlier than necessary.
     *
     * A secondary benefit of this is that some callers use parallel workers
     * with each their own read stream to process a global list of blocks, and
     * only calling the next block callback when ready to actually issue IO
     * makes it more likely for one backend to get consecutive blocks.
     */
    if (stream->pinned_buffers > 0 &&
        stream->pending_read_nblocks == 0 &&
        stream->pinned_buffers + stream->combine_distance >= stream->readahead_distance)
        return false;


I do see that improve performance with some bitmap heap scan queries
(e.g. Tomas' "cyclic" ones). But I'm worried it'd hurt other queries when
using a low effective_io_concurrency.  So that's probably something for later.


> +    /* same if capped not by io_combine_limit but combine_distance */
> +    if (stream->combine_distance > 0 &&
> +        pending_read_nblocks >= stream->combine_distance)
> +        return true;
>
> I think we don't need to check 'stream->combine_distance > 0', it
> shouldn't be less or equal than zero when the
> 'stream->readahead_distance' is not zero, right?

Yea. I guess we could just assert that.


> +    {
> +        stream->readahead_distance = Min(max_pinned_buffers,
> stream->io_combine_limit);
> +        stream->combine_distance = stream->io_combine_limit;
> +    }
>
> I think the stream->combine_distance should be equal to
> stream->readahead_distance as we don't want to surpass
> max_pinned_buffers.

I don't think it could lead to exceeding max_pinned_buffers, due to the
    /* never pin more buffers than allowed */

check in read_stream_should_look_ahead(), but there's no reason to rely on
that.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-03-31 20:59 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-01 23:20   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 12:30     ` Re: AIO / read stream heuristics adjustments for index prefetching Nazir Bilal Yavuz <[email protected]>
  2026-04-02 13:33       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-04-03 16:45         ` Melanie Plageman <[email protected]>
  2026-04-03 19:01           ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  0 siblings, 1 reply; 23+ messages in thread

From: Melanie Plageman @ 2026-04-03 16:45 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Thu, Apr 2, 2026 at 9:33 AM Andres Freund <[email protected]> wrote:
>
> > +                /*
> > +                 * XXX: Should we actually reduce this at any time other than
> > +                 * a reset? For now we have to, as this is also a condition
> > +                 * for re-enabling fast_path.
> > +                 */
> > +                if (stream->combine_distance > 1)
> > +                    stream->combine_distance--;
> >
> > I don't think we need to reduce this other than reset.
>
> Hm. I go back and forth on that one :)

Separate from the fast-path enablement, we also probably want to
decrease combine distance when we decrease readahead_distance because
there is a point where we still want to parallelize the IOs even when
the distance is lower and to do that, we need to make smaller IOs. I'm
not sure where this point is, but I wonder if a few 256kB IOs is
faster than 1 1MB IO (could test that with fio actually). I imagine
that there is some size where that is true because of peculiarities in
how drives (and cloud storage) issue/break up IOs after they are a
certain size, etc.

- Melanie





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-03-31 20:59 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-01 23:20   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 12:30     ` Re: AIO / read stream heuristics adjustments for index prefetching Nazir Bilal Yavuz <[email protected]>
  2026-04-02 13:33       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 16:45         ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
@ 2026-04-03 19:01           ` Andres Freund <[email protected]>
  0 siblings, 0 replies; 23+ messages in thread

From: Andres Freund @ 2026-04-03 19:01 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

On 2026-04-03 12:45:50 -0400, Melanie Plageman wrote:
> On Thu, Apr 2, 2026 at 9:33 AM Andres Freund <[email protected]> wrote:
> >
> > > +                /*
> > > +                 * XXX: Should we actually reduce this at any time other than
> > > +                 * a reset? For now we have to, as this is also a condition
> > > +                 * for re-enabling fast_path.
> > > +                 */
> > > +                if (stream->combine_distance > 1)
> > > +                    stream->combine_distance--;
> > >
> > > I don't think we need to reduce this other than reset.
> >
> > Hm. I go back and forth on that one :)
> 
> Separate from the fast-path enablement, we also probably want to
> decrease combine distance when we decrease readahead_distance because
> there is a point where we still want to parallelize the IOs even when
> the distance is lower and to do that, we need to make smaller IOs.

I'm not sure that's something we really need to worry about at this point. If
readahead_distance is so small that it does not allow enough IO concurrency,
we will have to wait for IO completion, which in turn will lead to the
readahead distance being increased again.

I can see some corner cases where this would not suffice, e.g. if you have a
rather low pin limit, but I doubt those are relevant in practice?


> I'm not sure where this point is, but I wonder if a few 256kB IOs is faster
> than 1 1MB IO (could test that with fio actually).

Yes, that point definitely exists. But I think the mechanism for that is to
configure io_combine_limit at or below the threshold at which even bigger IOs
hurt.


> I imagine that there is some size where that is true because of
> peculiarities in how drives (and cloud storage) issue/break up IOs after
> they are a certain size, etc.

It's even true for synchronous copies from the kernel page cache, due to some
hardware issue I have yet to fully understand. On both Intel and AMD CPUs,
unless SMAP is disabled, larger copies from kernel to userspace start to to be
substantially slower, somewhere around 1-4MBs per IO.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-04-01 14:52 ` Melanie Plageman <[email protected]>
  2026-04-01 15:49   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  3 siblings, 1 reply; 23+ messages in thread

From: Melanie Plageman @ 2026-04-01 14:52 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <[email protected]> wrote:
>
> 0008: WIP: read stream: Split decision about look ahead for AIO and combining
>
>     Until now read stream has used a single look-ahead distance to control
>     lookahead for both IO combining and read-ahead. That's sub-optimal, as we
>     want to do IO combining even when we don't need to do any readahead, as
>     avoiding the syscall overhead is important to reduce CPU overhead when
>     data is in the kernel page cache.
>
>     This is a prototype for what it could look like to split those
>     decisions. Thereby fixing the regression mentioned in 0006.

I wonder if we need to keep the combine_limit member in the read
stream. Could we just use io_combine_limit without ramping up and
down? This is mainly for code complexity reasons. Perhaps to allow
fast path reentry, we could use distance_decay_holdoff == 0 and
ios_in_progress == 0 instead of combine_distance == 0.

- Melanie





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-01 14:52 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
@ 2026-04-01 15:49   ` Andres Freund <[email protected]>
  0 siblings, 0 replies; 23+ messages in thread

From: Andres Freund @ 2026-04-01 15:49 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,

On 2026-04-01 10:52:03 -0400, Melanie Plageman wrote:
> On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <[email protected]> wrote:
> >
> > 0008: WIP: read stream: Split decision about look ahead for AIO and combining
> >
> >     Until now read stream has used a single look-ahead distance to control
> >     lookahead for both IO combining and read-ahead. That's sub-optimal, as we
> >     want to do IO combining even when we don't need to do any readahead, as
> >     avoiding the syscall overhead is important to reduce CPU overhead when
> >     data is in the kernel page cache.
> >
> >     This is a prototype for what it could look like to split those
> >     decisions. Thereby fixing the regression mentioned in 0006.
>
> I wonder if we need to keep the combine_limit member in the read
> stream. Could we just use io_combine_limit without ramping up and
> down? This is mainly for code complexity reasons.

I thought so at first too, but it unfortunately leads to substantial
regressions with index prefetching, due to reading ahead unnecessarily far in
cases where we really just needed one block.


> Perhaps to allow fast path reentry, we could use distance_decay_holdoff == 0
> and ios_in_progress == 0 instead of combine_distance == 0.

Somewhat orthogonal: I really dislike the code to re-enter fastpath. I've now
broken it a few times without noticing. Especially when using a lower
distance, it's easy for the gating conditions to be fulfilled if
read_stream_look_ahead() decided to not *yet* do look ahead, because there's
still a pinned buffer and the distance is low.

ISTM that it really should only be checked after we did a lookahead and found
it to be a buffer hit.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-04-02 14:31 ` Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  3 siblings, 1 reply; 23+ messages in thread

From: Melanie Plageman @ 2026-04-02 14:31 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <[email protected]> wrote:
>
> 0005+0006:  Only increase distance when waiting for IO
>
>     Until now we have increased the read ahead distance whenever there we
>     needed to do IO (doubling the distance every miss). But that will often be
>     way too aggressive, with the IO subsystem being able to keep up with a
>     much lower distance.
>
>     The idea here is to use information about whether we needed to wait for IO
>     before returning the buffer in read_stream_next_buffer() to control
>     whether we should increase the readahead distance.
>
>     This seems to work extremely well for worker.
>
>     Unfortuntely with io_uring the situation is more complicated, because
>     io_uring performs reads synchronously during submission if the data is the
>     kernel page cache.  This can reduce performance substantially compared to
>     worker, because it prevents parallelizing the copy from the page cache.
>     There is an existing heuristic for that in method_io_uring.c that adds a
>     flag to the IO submissions forcing the IO to be processed asynchronously,
>     allowing for parallelism.  Unfortunately the heuristic is triggered by the
>     number of IOs in flight - which will never become big enough to tgrigger
>     after using "needed to wait" to control how far to read ahead.

On some level, relying on worker mode overhead feels fragile. If
worker overhead decreases—say, by moving to IO worker threads—we won't
be able to rely on this to keep the distance to an advantageous level.

If io_uring async copying is advantageous even when the consumer never
needs to wait, then it seems like parallelizing copying to/from the
kernel buffer cache will always be advantageous to do at some level.

The case where it is not (as you've stated before) is when the
consumer doesn't need the extra blocks, so it is just wasted time
spent acquiring them.

So, it feels odd to try and find a heuristic that allows the readahead
distance to increase even when the consumer is not having to wait. I'm
not saying we should do this for this release, but I'm just wondering
if in the medium term, we should try to find a better way to identify
the situation where async processing is not beneficial because the
blocks won't be needed.

>     So 0005 expands the io_uring heuristic to also trigger based on the sizes
>     of IOs - but that's decidedly not perfect, we e.g. have some experiments
>     showing it regressing some parallel bitmap heap scan cases.  It may be
>     better to somehow tweak the logic to only trigger for worker.
>
>     As is this has another issue, which is that it prevents IO combining in
>     situations where it shouldn't, because right now using the distance to
>     control both. See 0008 for an attempt at splitting those concerns.

Yea, I think running ahead far enough to get bigger IOs needs to
happen and can't be based on the consumer having to wait.

- Melanie





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
@ 2026-04-02 15:47   ` Andres Freund <[email protected]>
  2026-04-02 21:13     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 15:46     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  0 siblings, 2 replies; 23+ messages in thread

From: Andres Freund @ 2026-04-02 15:47 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,

On 2026-04-02 10:31:50 -0400, Melanie Plageman wrote:
> On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <[email protected]> wrote:
> >
> > 0005+0006:  Only increase distance when waiting for IO
> >
> >     Until now we have increased the read ahead distance whenever there we
> >     needed to do IO (doubling the distance every miss). But that will often be
> >     way too aggressive, with the IO subsystem being able to keep up with a
> >     much lower distance.
> >
> >     The idea here is to use information about whether we needed to wait for IO
> >     before returning the buffer in read_stream_next_buffer() to control
> >     whether we should increase the readahead distance.
> >
> >     This seems to work extremely well for worker.
> >
> >     Unfortuntely with io_uring the situation is more complicated, because
> >     io_uring performs reads synchronously during submission if the data is the
> >     kernel page cache.  This can reduce performance substantially compared to
> >     worker, because it prevents parallelizing the copy from the page cache.
> >     There is an existing heuristic for that in method_io_uring.c that adds a
> >     flag to the IO submissions forcing the IO to be processed asynchronously,
> >     allowing for parallelism.  Unfortunately the heuristic is triggered by the
> >     number of IOs in flight - which will never become big enough to tgrigger
> >     after using "needed to wait" to control how far to read ahead.
> 
> On some level, relying on worker mode overhead feels fragile. If
> worker overhead decreases—say, by moving to IO worker threads—we won't
> be able to rely on this to keep the distance to an advantageous level.

I don't see why lower overhead would prevent this from working?


> If io_uring async copying is advantageous even when the consumer never
> needs to wait, then it seems like parallelizing copying to/from the
> kernel buffer cache will always be advantageous to do at some level.

It's not universally advantageous, unfortunately - there's a nontrivial
increase in latency (and also some CPU) due to it. Which matters mostly when
having a shallow look-ahead depth (like at the start of a stream), because
then the latency impact will directly influence query performance.

Setup:

CREATE EXTENSION IF NOT EXISTS test_aio;
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
DROP TABLE IF EXISTS pattern_random_pgbench;
CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, 500)) AS pattern;

workload:

SET io_combine_limit = 1;

SET effective_io_concurrency=1;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;

(and then repeated for eic 2,4,8,16)

eic	time plain ms    time w/ forced async
1       2.331            5.366
2       2.164            3.210
4       2.151            2.677
8       2.155            2.749
16      2.151            2.742
32      2.141            2.732
64      2.161            2.739
128     2.153            2.652

Note that forced async never quite catches up.

If I instead make the pattern 50k blocks long:

eic	time plain ms    time w/ forced async
1       210.678          454.132
2       209.210          281.452
4       208.775          198.496
8       208.755          198.131
16      209.477          195.799
32      203.497          183.297
64      203.002          173.799
128     202.885          166.548



> The case where it is not (as you've stated before) is when the
> consumer doesn't need the extra blocks, so it is just wasted time
> spent acquiring them.

That's one reason, but as shown above, it's also that the increase in latency
can hurt, particularly in the first few blocks (where we are ramping up the
distance) and when effective_io_concurrency is too low to allow for a deep
enough read-ahead to allow to hide the latency increase.


> So, it feels odd to try and find a heuristic that allows the readahead
> distance to increase even when the consumer is not having to wait.

Do you still feel like that with the added context from the above?


> I'm not saying we should do this for this release, but I'm just wondering if
> in the medium term, we should try to find a better way to identify the
> situation where async processing is not beneficial because the blocks won't
> be needed.

I think we certainly can do better than today with some help, e.g. from the
planner, to identify cases where we should be more careful about reading ahead
too far, e.g. due to being on the inner side of an nestloop antijoin.



> >     So 0005 expands the io_uring heuristic to also trigger based on the sizes
> >     of IOs - but that's decidedly not perfect, we e.g. have some experiments
> >     showing it regressing some parallel bitmap heap scan cases.  It may be
> >     better to somehow tweak the logic to only trigger for worker.
> >
> >     As is this has another issue, which is that it prevents IO combining in
> >     situations where it shouldn't, because right now using the distance to
> >     control both. See 0008 for an attempt at splitting those concerns.
> 
> Yea, I think running ahead far enough to get bigger IOs needs to
> happen and can't be based on the consumer having to wait.

What do you think about the updated patch to achieve that that I posted?

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-04-02 21:13     ` Melanie Plageman <[email protected]>
  2026-04-02 21:30       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  1 sibling, 1 reply; 23+ messages in thread

From: Melanie Plageman @ 2026-04-02 21:13 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <[email protected]> wrote:
>
> > On some level, relying on worker mode overhead feels fragile. If
> > worker overhead decreases—say, by moving to IO worker threads—we won't
> > be able to rely on this to keep the distance to an advantageous level.
>
> I don't see why lower overhead would prevent this from working?

needed_wait has to be true to increase the readahead distance and for
io_uring, when data was in the kernel buffer cache, needed_wait is
false, meaning the distance doesn't increase. Worker mode didn't have
this problem because of overhead. So needed_wait is true for workers.
But, now that we will have combine_distance, I guess we don't need to
rely on workers having overhead. So we are saying that
readahead_distance is completely irrelevant for copying from the
kernel buffer cache and only combine_distance matters for that now,
right?

> > Yea, I think running ahead far enough to get bigger IOs needs to
> > happen and can't be based on the consumer having to wait.
>
> What do you think about the updated patch to achieve that that I posted?

Will post separately.

- Melanie





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 21:13     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
@ 2026-04-02 21:30       ` Andres Freund <[email protected]>
  0 siblings, 0 replies; 23+ messages in thread

From: Andres Freund @ 2026-04-02 21:30 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,

On 2026-04-02 17:13:34 -0400, Melanie Plageman wrote:
> On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <[email protected]> wrote:
> >
> > > On some level, relying on worker mode overhead feels fragile. If
> > > worker overhead decreases—say, by moving to IO worker threads—we won't
> > > be able to rely on this to keep the distance to an advantageous level.
> >
> > I don't see why lower overhead would prevent this from working?
> 
> needed_wait has to be true to increase the readahead distance and for
> io_uring, when data was in the kernel buffer cache, needed_wait is
> false, meaning the distance doesn't increase. Worker mode didn't have
> this problem because of overhead. So needed_wait is true for workers.
> But, now that we will have combine_distance, I guess we don't need to
> rely on workers having overhead.

I think we still do, but that it will continue to work, even if the overhead
is much smaller than today.  The workers will complete the IOs only after the
memory copy is finished (duh). Therefore, if the distance is too small to
allow workers to complete the copy, the distance will be increased, due to
needed_wait.


> So we are saying that readahead_distance is completely irrelevant for
> copying from the kernel buffer cache and only combine_distance matters for
> that now, right?

I don't think so! The combine_distance thing is crucial to allow for IO
combining, and, indirectly, for triggering the size based "async" heuristics
with io_uring. Once the io_uring async heuristic is triggered, the needed_wait
mechanism works to further increase the distance.

That does mean that for random BLCKSZ sized IOs that are in the page cache the
async mechanism won't typically be triggered - but from what I can tell that's
ok, because lots of 8kB IOs is also where the dispatch overhead to the kernel
threads doing the copying is the highest.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-04-03 15:46     ` Melanie Plageman <[email protected]>
  2026-04-03 17:30       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 20:41       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  1 sibling, 2 replies; 23+ messages in thread

From: Melanie Plageman @ 2026-04-03 15:46 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <[email protected]> wrote:
>
> What do you think about the updated patch to achieve that that I posted?

here is some review on 0005 and 0006 earlier posted
concrete things:
-------
- I’d reorder stream→distance == 0 in read_stream_look_ahead() (and
issue), I found myself asking why it wasn’t first

- I agree with bilal that
if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)
    return false;
should be in the other commit because it isn’t required with the
current code because of how distance is set and is more confusing.
wasn’t it an assert before?

- perhaps the new heuristic for allowing further look ahead if we are
building an IO and it isn’t big enough yet should be in its own commit

- why have read_stream_pause() save combine_distance if it isn’t going
to zero it out?
- I think the comments you added to the read stream struct for
combine_distance and readahead_distance should indicate units (i.e.
blocks)

- why not remove the combine_distance requirement from the fast path
entry criteria (could save resume_combine_distance in the fast path
and restore it after a miss)

- can we move the fast path to where we found out we got a hit?

- should_issue_now has a lot of overlap with should_read_ahead which
makes it confusing to figure out how they are different, but I think
that is related to readahead_distance being in units of blocks and not
IOs (which I talk about later)

- There's also some assorted typos and comment awkwardness that I
assume you will clean up in the polish step
(e.g. "if we have an the process" -> "if we are in the process",
"oveflow" -> "overflow", The NB comment still says stream->distance ==
0 but should say stream->readahead_distance == 0)

some more ambiguous stuff:
-------
> In my experiments the batching was primarily useful to allow to reduce the
> command submission & interrupt overhead when interacting with storage, i.e.
> when actually doing IO, not just copying from the page cache.

It still seems to me that batching would help guarantee parallel
memcpys for workers when data is in the kernel buffer cache because if
the IO is quick, the same worker may pick up the next IO.
---
You mentioned that we don't want to read too far ahead (including for
a single combined IO) in part because:

> The resowner and private refcount mechanisms take more CPU cycles if you have
> more buffers pinned

But I don't see how either distance is responding to this or
self-calibrating with this in mind
---
>> I liked the idea of being more aggressive to do IO combining. What is
>> the reason for gradually increasing combine_distance, is it to not do
>> unnecessary IOs at the start?
>
> Yea. It'd perhaps not be too bad with the existing users, but it'd *really*
> hurt with index scan prefetching, because of query plans where we only consume
> part of the index scan (e.g. a nested loop antijoin).

You said we have to ramp up combine_distance because in index
prefetching when we don’t need all the blocks, it can hurt. But if we
ramp it up unconditionally (assuming we did IO), then I don’t see how
this would solve the problem as it will ramp up regardless after just
a few IOs anyway.
---
> I've been experimenting with going the other way, by having
> read_stream_should_look_ahead() do a check like

    /*
     * If we already have IO in flight, but are close enough to to the
     * distance limit that we would not start a fully sized IO, don't even
     * start a pending read until later.
     *
     * This avoids calling the call thing the next block callback in cases we
     * would not start the pending read anyway. For some users the work to
     * determine the next block is non-trivial, so we don't want to do so
     * earlier than necessary.
     *
     * A secondary benefit of this is that some callers use parallel workers
     * with each their own read stream to process a global list of blocks, and
     * only calling the next block callback when ready to actually issue IO
     * makes it more likely for one backend to get consecutive blocks.
     */
    if (stream->pinned_buffers > 0 &&
        stream->pending_read_nblocks == 0 &&
        stream->pinned_buffers + stream->combine_distance >=
stream->readahead_distance)
        return false;

So, as you say, with, for example, a large io_combine_limit and small
effective_io_concurrency, this would result in much lower IO
concurrency than we want. But, I think this speaks to the central
tension I see with the new combine_distance and readahead_distance: it
seems like readahead_distance should now be in units of IO and
combine_distance in units of blocks. If that were the case, this
heuristic wouldn't have a downside.

Obviously ramping readahead_distance up and down when it is in units
of IO becomes much more fraught though.

But our criteria for wanting to make bigger IOs is different than our
criteria for wanting more IOs in flight.
---
I think it is weird to have combine_distance only be relevant when
readahead_distance is low. You said:

> We could, but I don't think there would be a benefit in doing so. In my mind,
> what combine_distance is used for, is to control how far to look ahead to
> allow for IO combining when the readahead_distance is too low to allow for IO
> combining.  But if pinned_buffers > 0, we already have another IO in flight,
> so the combine_distance mechanism doesn't have to kick in.

But it seems like for a completely uncached workload bigger IOs is
still beneficial. We may want to avoid reading ahead to get bigger IOs
when there is a chance we only need a few blocks. And, we want to
parallelize copies across workers, so we want a combine_distance that
is small enough that, relative to effective_io_concurrency, we can
split IOs across workers some. But, otherwise, we want maximally sized
IOs. In the medium term, perhaps we can try to use executor hints to
handle the former. And, assuming you think readahead_distance in IOs
is a non-starter, maybe we could use a heuristic that tries to balance
IOs by dividing readahead_distance by combine_distance when
combine_distance is sufficiently high or something like that.

- Melanie





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 15:46     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
@ 2026-04-03 17:30       ` Andres Freund <[email protected]>
  2026-04-03 19:04         ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  1 sibling, 1 reply; 23+ messages in thread

From: Andres Freund @ 2026-04-03 17:30 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,

On 2026-04-03 11:46:59 -0400, Melanie Plageman wrote:
> On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <[email protected]> wrote:
> >
> > What do you think about the updated patch to achieve that that I posted?
>
> here is some review on 0005 and 0006 earlier posted
> concrete things:
> -------
> - I’d reorder stream→distance == 0 in read_stream_look_ahead() (and
> issue), I found myself asking why it wasn’t first
>
> - I agree with bilal that
> if (stream->pinned_buffers + stream->pending_read_nblocks >=
> stream->max_pinned_buffers)
>     return false;
> should be in the other commit because it isn’t required with the
> current code because of how distance is set and is more confusing.
> wasn’t it an assert before?

Will do.


> - perhaps the new heuristic for allowing further look ahead if we are
> building an IO and it isn’t big enough yet should be in its own commit

Isn't that the whole point of the commit? What else is there?


> - why have read_stream_pause() save combine_distance if it isn’t going
> to zero it out?

It should zero it out.


> - I think the comments you added to the read stream struct for
> combine_distance and readahead_distance should indicate units (i.e.
> blocks)

Will do.


> - why not remove the combine_distance requirement from the fast path
> entry criteria (could save resume_combine_distance in the fast path
> and restore it after a miss)

Because entering the fast path prevents IO combining from happening. So it's
absolutely crucial that we do *not* enter it while we would combine.


> - can we move the fast path to where we found out we got a hit?

Yea, I think we should. Seems like a separate project though.


> - should_issue_now has a lot of overlap with should_read_ahead which
> makes it confusing to figure out how they are different

Yea. I'm not sure how to do it better though.


> but I think that is related to readahead_distance being in units of blocks
> and not IOs (which I talk about later)

I agree that we should change that, but it seems like a separate project to
me.


> - There's also some assorted typos and comment awkwardness that I
> assume you will clean up in the polish step
> (e.g. "if we have an the process" -> "if we are in the process",
> "oveflow" -> "overflow", The NB comment still says stream->distance ==
> 0 but should say stream->readahead_distance == 0)

Yea.


> some more ambiguous stuff:
> -------
> > In my experiments the batching was primarily useful to allow to reduce the
> > command submission & interrupt overhead when interacting with storage, i.e.
> > when actually doing IO, not just copying from the page cache.
>
> It still seems to me that batching would help guarantee parallel
> memcpys for workers when data is in the kernel buffer cache because if
> the IO is quick, the same worker may pick up the next IO.

When you say workers, do you mean for io_method=worker? Or the internal kernel
threads of io_uring for async IOs?

> ---
> You mentioned that we don't want to read too far ahead (including for
> a single combined IO) in part because:
>
> > The resowner and private refcount mechanisms take more CPU cycles if you have
> > more buffers pinned
>
> But I don't see how either distance is responding to this or
> self-calibrating with this in mind

Using the minimal required distance to avoid needing to wait for IO completion
is responding to that, no?  Without these patches we read ahead as far as
possible, even if all the data is in the page cache, which makes this issue
way worse (without these patches it's a major source of regressions in the
index prefetching patch).


> ---
> >> I liked the idea of being more aggressive to do IO combining. What is
> >> the reason for gradually increasing combine_distance, is it to not do
> >> unnecessary IOs at the start?
> >
> > Yea. It'd perhaps not be too bad with the existing users, but it'd *really*
> > hurt with index scan prefetching, because of query plans where we only consume
> > part of the index scan (e.g. a nested loop antijoin).
>
> You said we have to ramp up combine_distance because in index
> prefetching when we don’t need all the blocks, it can hurt. But if we
> ramp it up unconditionally (assuming we did IO), then I don’t see how
> this would solve the problem as it will ramp up regardless after just
> a few IOs anyway.

It limits the percentage of "wasted IO". So it does not entirely solve it - I
think that's impossible - but reduces it enough to be ok.

If the consumer stops consuming after e.g. 1 IOs, an immediate rampup would
lead to 15 IOs being wasted (assuming io_combine_limit 16). So only ~6% of the
read in blocks are used.

If instead the consumer stops only after already having ramped up to the max
combine distance, it'd have to have consumed a lot more buffers, as the
distance is increased only after consuming a buffer that required IO.

Think it's something like:

io of size 1, block 0
consume buffer #1 -> combine_distance 2
io of size 2, block 1-2
consume buffer #2 -> combine_distance 4
consume buffer #3
io of size 4, block 3-6
consume buffer #4 -> combine_distance 8
consume buffer #4-#7
io of size 8, block 7-14
consume buffer #8 -> combine_distance 16
consume buffer #9-#15
io of size 16, block 15-30
consume buffer #16 -> combine distance can't be increased further
<reset>

Consumer consumed 16 buffers, we did IO for 31 blocks, wasted 15. So we used
~51% of the read in blocks are used.

While 51% is not perfect, it still a considerable improvement in worst-case
wasteage. It's also less of a problem because more CPU was spent in the
consumer, given that 16 buffers had to be processed to get to this point.


> ---
> > I've been experimenting with going the other way, by having
> > read_stream_should_look_ahead() do a check like
>
>     /*
>      * If we already have IO in flight, but are close enough to to the
>      * distance limit that we would not start a fully sized IO, don't even
>      * start a pending read until later.
>      *
>      * This avoids calling the call thing the next block callback in cases we
>      * would not start the pending read anyway. For some users the work to
>      * determine the next block is non-trivial, so we don't want to do so
>      * earlier than necessary.
>      *
>      * A secondary benefit of this is that some callers use parallel workers
>      * with each their own read stream to process a global list of blocks, and
>      * only calling the next block callback when ready to actually issue IO
>      * makes it more likely for one backend to get consecutive blocks.
>      */
>     if (stream->pinned_buffers > 0 &&
>         stream->pending_read_nblocks == 0 &&
>         stream->pinned_buffers + stream->combine_distance >=
> stream->readahead_distance)
>         return false;
>
> So, as you say, with, for example, a large io_combine_limit and small
> effective_io_concurrency, this would result in much lower IO
> concurrency than we want. But, I think this speaks to the central
> tension I see with the new combine_distance and readahead_distance: it
> seems like readahead_distance should now be in units of IO and
> combine_distance in units of blocks. If that were the case, this
> heuristic wouldn't have a downside.
>
> Obviously ramping readahead_distance up and down when it is in units
> of IO becomes much more fraught though.

I think we might want to move to that world, but I think that'll require a lot
more work to validate. In cases where you have a mix of hits and misses, with
the misses requiring actual IO, we'd end up with a lot less readahead than
today.


> ---
> I think it is weird to have combine_distance only be relevant when
> readahead_distance is low. You said:
>
> > We could, but I don't think there would be a benefit in doing so. In my mind,
> > what combine_distance is used for, is to control how far to look ahead to
> > allow for IO combining when the readahead_distance is too low to allow for IO
> > combining.  But if pinned_buffers > 0, we already have another IO in flight,
> > so the combine_distance mechanism doesn't have to kick in.
>
> But it seems like for a completely uncached workload bigger IOs is
> still beneficial.

Massively so - the storage layer getting too small IOs really hurts.

But, as the code stands, we *do* end up with large IOs in that case, because
we will not issue the IO until it is "complete". If we need to do actual IO,
the readahead_distance will be larger, and allow multiple full sized IOs to be
issued.


/*
 * We don't start the pending read just because we've hit the distance limit,
 * preferring to give it another chance to grow to full io_combine_limit size
 * once more buffers have been consumed.  But this is not desirable in all
 * situations - see below.
 */
static inline bool
read_stream_should_issue_now(ReadStream *stream)
{
...
	/*
	 * If we've already reached io_combine_limit, there's no chance of growing
	 * the read further.
	 */
	if (pending_read_nblocks >= stream->io_combine_limit)
		return true;

	/* same if capped not by io_combine_limit but combine_distance */
	if (stream->combine_distance > 0 &&
		pending_read_nblocks >= stream->combine_distance)
		return true;

	/*
	 * If we currently have no reads in flight or prepared, issue the IO once
	 * we are not looking ahead further. This ensures there's always at least
	 * one IO prepared.
	 */
	if (stream->pinned_buffers == 0 &&
		!read_stream_should_look_ahead(stream))
		return true;

	return false;

So, unless we are not reading ahead, we are not issuing IOs until they are
fully sized (or can't be combined, obviously).


Am I misunderstanding what you're talking about?


Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 15:46     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 17:30       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-04-03 19:04         ` Melanie Plageman <[email protected]>
  2026-04-03 20:36           ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  0 siblings, 1 reply; 23+ messages in thread

From: Melanie Plageman @ 2026-04-03 19:04 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

On Fri, Apr 3, 2026 at 1:30 PM Andres Freund <[email protected]> wrote:
>
> > - why not remove the combine_distance requirement from the fast path
> > entry criteria (could save resume_combine_distance in the fast path
> > and restore it after a miss)
>
> Because entering the fast path prevents IO combining from happening. So it's
> absolutely crucial that we do *not* enter it while we would combine.

But if it is a buffer hit, obviously we can't do IO combining anyway,
or am I misunderstanding the fast path's common case?

> > > In my experiments the batching was primarily useful to allow to reduce the
> > > command submission & interrupt overhead when interacting with storage, i.e.
> > > when actually doing IO, not just copying from the page cache.
> >
> > It still seems to me that batching would help guarantee parallel
> > memcpys for workers when data is in the kernel buffer cache because if
> > the IO is quick, the same worker may pick up the next IO.
>
> When you say workers, do you mean for io_method=worker? Or the internal kernel
> threads of io_uring for async IOs?

I'm talking about io_method worker.

> > You mentioned that we don't want to read too far ahead (including for
> > a single combined IO) in part because:
> >
> > > The resowner and private refcount mechanisms take more CPU cycles if you have
> > > more buffers pinned
> >
> > But I don't see how either distance is responding to this or
> > self-calibrating with this in mind
>
> Using the minimal required distance to avoid needing to wait for IO completion
> is responding to that, no?  Without these patches we read ahead as far as
> possible, even if all the data is in the page cache, which makes this issue
> way worse (without these patches it's a major source of regressions in the
> index prefetching patch).

But we aren't using the minimal distance to avoid needing to wait for
IO completion. We are also using a higher distance to try and get IO
combining and  toallow for async copying into the kernel buffer cache,
etc, etc. There's a lot of different considerations; it isn't just two
opposing forces. And, I'd imagine that the relationship between the
number of buffers pinned and CPU cycles required for resowner/refcount
isn't perfectly linear.

> > I think it is weird to have combine_distance only be relevant when
> > readahead_distance is low. You said:
> >
> > > We could, but I don't think there would be a benefit in doing so. In my mind,
> > > what combine_distance is used for, is to control how far to look ahead to
> > > allow for IO combining when the readahead_distance is too low to allow for IO
> > > combining.  But if pinned_buffers > 0, we already have another IO in flight,
> > > so the combine_distance mechanism doesn't have to kick in.
> >
> > But it seems like for a completely uncached workload bigger IOs is
> > still beneficial.
>
> Massively so - the storage layer getting too small IOs really hurts.
>
> But, as the code stands, we *do* end up with large IOs in that case, because
> we will not issue the IO until it is "complete". If we need to do actual IO,
> the readahead_distance will be larger, and allow multiple full sized IOs to be
> issued.
>
> /*
>  * We don't start the pending read just because we've hit the distance limit,
>  * preferring to give it another chance to grow to full io_combine_limit size
>  * once more buffers have been consumed.  But this is not desirable in all
>  * situations - see below.
>  */
> static inline bool
> read_stream_should_issue_now(ReadStream *stream)
> {
> ...
>         /*
>          * If we've already reached io_combine_limit, there's no chance of growing
>          * the read further.
>          */
>         if (pending_read_nblocks >= stream->io_combine_limit)
>                 return true;
>
>         /* same if capped not by io_combine_limit but combine_distance */
>         if (stream->combine_distance > 0 &&
>                 pending_read_nblocks >= stream->combine_distance)
>                 return true;
>
>         /*
>          * If we currently have no reads in flight or prepared, issue the IO once
>          * we are not looking ahead further. This ensures there's always at least
>          * one IO prepared.
>          */
>         if (stream->pinned_buffers == 0 &&
>                 !read_stream_should_look_ahead(stream))
>                 return true;
>
>         return false;
>
> So, unless we are not reading ahead, we are not issuing IOs until they are
> fully sized (or can't be combined, obviously).
>
> Am I misunderstanding what you're talking about?

I'm not saying that we don't do IO combining at high distances, I'm
more saying that it is confusing that combine_distance controls how
far we look ahead when readahead_distance is low but when
readahead_distance is high, it controls when we issue the IO and not
how far we look ahead. I don't think we should change course now, but
I wanted to call out that this felt a little uncomfortable to me.

- Melanie





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 15:46     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 17:30       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 19:04         ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
@ 2026-04-03 20:36           ` Andres Freund <[email protected]>
  2026-04-03 23:10             ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-04 00:06             ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  0 siblings, 2 replies; 23+ messages in thread

From: Andres Freund @ 2026-04-03 20:36 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,

On 2026-04-03 15:04:51 -0400, Melanie Plageman wrote:
> On Fri, Apr 3, 2026 at 1:30 PM Andres Freund <[email protected]> wrote:
> >
> > > - why not remove the combine_distance requirement from the fast path
> > > entry criteria (could save resume_combine_distance in the fast path
> > > and restore it after a miss)
> >
> > Because entering the fast path prevents IO combining from happening. So it's
> > absolutely crucial that we do *not* enter it while we would combine.
>
> But if it is a buffer hit, obviously we can't do IO combining anyway,
> or am I misunderstanding the fast path's common case?

It's true that we can't do combining in the fast path, but the problem is that
with eic=0/1 (or a recent history that leaves us with low distances or a low
pin limit), we will not start the next IO until there are no more buffers
pinned.

Imagine that we started one 16 block IO and have a readahead_distance of
1. After consuming 15 buffers, we will have one more buffer pinned, but
read_stream_look_ahead() will not yet start another IO, due to the
readahead_distance condition (or max_pinned_buffers or ...).  Without the
stream->combine_distance == 1 check, the subsequent check for
read_stream_next_buffer() would consider this a valid case for entering
fast-path.

> > > You mentioned that we don't want to read too far ahead (including for
> > > a single combined IO) in part because:
> > >
> > > > The resowner and private refcount mechanisms take more CPU cycles if you have
> > > > more buffers pinned
> > >
> > > But I don't see how either distance is responding to this or
> > > self-calibrating with this in mind
> >
> > Using the minimal required distance to avoid needing to wait for IO completion
> > is responding to that, no?  Without these patches we read ahead as far as
> > possible, even if all the data is in the page cache, which makes this issue
> > way worse (without these patches it's a major source of regressions in the
> > index prefetching patch).
>
> But we aren't using the minimal distance to avoid needing to wait for
> IO completion. We are also using a higher distance to try and get IO
> combining and  toallow for async copying into the kernel buffer cache,
> etc, etc.

My testing suggests that doing IO combining for a reasonble io_combine_limit
is pretty much always a win in a steady-state stream (i.e. not a short one
that's not fully consumed), the gain from avoiding the larger amounts of
syscalls sufficiently large.

One we start doing async copying from the kernel page cache, we will have to
wait for the completion of that async work, which will lead to
readahead_distance being increased if necessary.

> There's a lot of different considerations; it isn't just two opposing
> forces.

It's not, but I think always performing io_combine_limit sized IOs after a
ramp-up and increasing the distance based on needing to wait is a pretty
decent heuristic.

For best results it does require pgaio_uring_should_use_async() to trigger, as
otherwise we do not get get the parallelized memory copy. Which means it may
never trigger if we don't occasionally reach the size based condition. Luckily
it does not seem like using async is beneficial for small IOs.

> And, I'd imagine that the relationship between the
> number of buffers pinned and CPU cycles required for resowner/refcount
> isn't perfectly linear.

It's definitely not.

> I'm not saying that we don't do IO combining at high distances, I'm
> more saying that it is confusing that combine_distance controls how
> far we look ahead when readahead_distance is low but when
> readahead_distance is high, it controls when we issue the IO and not
> how far we look ahead. I don't think we should change course now, but
> I wanted to call out that this felt a little uncomfortable to me.

I'm not sure I see an alternative.  I tried to at least improve the comments
around this.

Attached are a revised set of commits. The largest changes are:

- Reordered the series to put
  "read_stream: Only increase read-ahead distance when waiting for IO"
  after
  "stream: Split decision about look ahead for AIO and combining"

  Previously I thought it'd be too awkward from a comment perspective, but
  there's only one comment where it is a bit odd.

  Think it's much clearer this way.

- Largely rewrote "Hacky implementation of making read_stream_reset()/end()
  not wait for IO". Looks a lot saner now.

  Think this needs a few more tests, in particular for the read stream and
  foreign_io paths. Will do that in the next version.

- Tried to address most of Bilal's and Melanie's feedback

- Removed some redundant checks from read_stream_should_issue_now()

- Lots of comment polishing, including revising the top-level read_stream.c
  comment

Greetings,

Andres Freund

^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 15:46     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 17:30       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 19:04         ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 20:36           ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-04-03 23:10             ` Andres Freund <[email protected]>
  1 sibling, 0 replies; 23+ messages in thread

From: Andres Freund @ 2026-04-03 23:10 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,

There are a bunch of heuristics mentioned in the following proposed commit:

On 2026-04-03 16:36:03 -0400, Andres Freund wrote:
> Subject: [PATCH v5 1/5] aio: io_uring: Trigger async processing for large IOs
>
> io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
> once the IO depth is a bit larger. That heuristic is important when doing
> buffered IO from the kernel page cache, to allow parallelizing of the memory
> copy, as otherwise io_method=io_uring would be a lot slower than
> io_method=worker in that case.
>
> An upcoming commit will make read_stream.c only increase the read-ahead
> distance if we needed to wait for IO to complete. If to-be-read data is in the
> kernel page cache, io_uring will synchronously execute IO, unless the IO is
> flagged as async.  Therefore the aforementioned change in read_stream.c
> heuristic would lead to a substantial performance regression with io_uring
> when data is in the page cache, as we would never reach a deep enough queue to
> actually trigger the existing heuristic.
>
> Parallelizing the copy from the page cache is mainly important when doing a
> lot of IO, which commonly is only possible when doing largely sequential IO.
>
> The reason we don't just mark all io_uring IOs as asynchronous is that the
> dispatch to a kernel thread has overhead. This overhead is mostly noticeable
> with small random IOs with a low queue depth, as in that case the gain from
> parallelizing the memory copy is small and the latency cost high.
>
> The facts from the two prior paragraphs show a way out: Use the size of the IO
> in addition to the depth of the queue to trigger asynchronous processing.
>
> One might think that just using the IO size might be enough, but
> experimentation has shown that not to be the case - with deep look-ahead
> distances being able to parallelize the memory copy is important even with
> smaller IOs.

> +/*
> + * io_uring executes IO in process context if possible. That's generally good,
> + * as it reduces context switching. When performing a lot of buffered IO that
> + * means that copying between page cache and userspace memory happens in the
> + * foreground, as it can't be offloaded to DMA hardware as is possible when
> + * using direct IO. When executing a lot of buffered IO this causes io_uring
> + * to be slower than worker mode, as worker mode parallelizes the
> + * copying. io_uring can be told to offload work to worker threads instead.
> + *
> + * If the IOs are small, we only benefit from forcing things into the
> + * background if there is a lot of IO, as otherwise the overhead from context
> + * switching is higher than the gain.
> + *
> + * If IOs are large, there is benefit from asynchronous processing at lower
> + * queue depths, as IO latency is less of a crucial factor and parallelizing
> + * memory copies is more important.  In addition, it is important to trigger
> + * asynchronous processing even at low queue depth, as with foreground
> + * processing we might never actually reach deep enough IO depths to trigger
> + * asynchronous processing, which in turn would deprive readahead control
> + * logic of information about whether a deeper look-ahead distance would be
> + * advantageous.
> + *
> + * We have done some basic benchmarking to validate the thresholds used, but
> + * it's quite plausible that there are better values.

Thought it'd be useful to actually have an email to point to in the above
comment, with details about what benchmark I ran.

Previously I'd just manually run fio with different options, I made it a bit
more systematic with the attached (only halfway hand written) script.

I attached two different results, once when allowing access to multiple cores,
and once with a single core (simulating a very busy machine).

(nblocks is in multiples of 8KB)

Multi-core:

nblocks	iod	async	bw_gib_s	lat_usec
1	1	0	4.2075	1.5802
1	1	1	1.0462	6.9652
1	2	0	4.1362	3.4533
1	2	1	1.9284	7.6040
1	4	0	4.0030	7.3720
1	4	1	4.2713	6.9086
1	8	0	4.1653	14.4072
1	8	1	4.3301	13.8365
1	16	0	4.1829	28.9216
1	16	1	4.3006	28.1261
1	32	0	4.0735	59.6232
1	32	1	4.3248	56.1614

I.e at nblocks=1, there's pretty much no gain from async, and the latency
increases markedly at the low end and just about catches up at the high end.

Around an iodepth 4 the loss from async nonexistant or minimal.


2	1	0	5.7289	2.4261
2	1	1	1.8708	7.7466
2	2	0	5.7964	5.0144
2	2	1	3.3749	8.7417
2	4	0	5.8434	10.2023
2	4	1	7.9783	7.3977
2	8	0	5.8166	20.7226
2	8	1	8.2545	14.5431
2	16	0	5.8215	41.6613
2	16	1	8.2354	29.3879
2	32	0	5.6530	86.0286
2	32	1	8.3218	58.3826

With nblocks=2, there start to be gains at higher IO depths, but they're still
somewhat limited.  Latency already starts to be better at iodepth 4.


4	1	0	7.4131	3.8807
4	1	1	3.2133	9.1827
4	2	0	7.3150	8.0854
4	2	1	5.4983	10.8039
4	4	0	7.2784	16.5097
4	4	1	11.2717	10.5699
4	8	0	7.2873	33.2331
4	8	1	16.6299	14.4164
4	16	0	7.1606	67.8777
4	16	1	16.9794	28.4981
4	32	0	6.2954	154.6834
4	32	1	16.3686	59.3610

With nblocks=4, async shows much more substantial gains. Latency of async at
the high end is also much improved.


8	1	0	8.0403	7.3503
8	1	1	4.6038	12.7202
8	2	0	8.0052	14.9161
8	2	1	8.5176	13.9987
8	4	0	8.1519	29.6698
8	4	1	14.8211	16.1640
8	8	0	7.8525	61.8612
8	8	1	27.5860	17.4434
8	16	0	6.8887	141.3268
8	16	1	34.1307	28.3463
8	32	0	6.9031	282.2350
8	32	1	38.2430	50.7700

With nblocks=8, async is faster already at iodepth 2.


64	1	0	9.1983	52.6768
64	1	1	8.1505	59.5486

128	1	0	7.5442	128.8704
128	1	1	7.3481	132.2355

Somewhere nblocks=64 and 128, we reach the point where there's basically no
loss at iodepth 1.


This seems to validate setting IOSQE_ASYNC around a block size of >= 4 and a
queue depth of > 4. I guess it could make sense to reduce it from > 4 to >= 4
based on these numbers, but I don't think it matters terribly.



Obviously with just one core there will only ever be a loss from doing an
asynchronous / concurrent copy from the page cache. But it's interesting to
see where the overhead of async starts to be less of a factor.

At iodepth 1 (worse case, a context switch for every IO)

nblocks	iod	async	bw_gib_s	lat_usec
1	1	0	4.2324	1.5692
1	1	1	1.7883	3.9574
2.36x bw regression

2	1	0	5.7914	2.4004
2	1	1	2.9585	4.8417
1.96x bw regression

4	1	0	7.3171	3.9242
4	1	1	4.2450	6.8171
1.7x bw regression

8	1	0	8.1162	7.2674
8	1	1	5.7536	10.2948
1.4x bw regression

16	1	0	8.8559	13.5212
16	1	1	7.1163	16.8277
1.6x bw regression


But the IO depth would not stay at 1 in the case of postgres with the proposed
changes, it'd ramp up due to needing to wait for the kernel to complete those
IOs asynchronously.

Therefore comparing that to a deeper IO depth.

nblocks	iod	async	bw_gib_s	lat_usec
1	16	0	4.1094	29.4339
1	16	1	3.3922	35.7044
1.21x bw regression

2	16	0	5.8381	41.5402
2	16	1	4.8104	50.4571
1.21x bw regression

4	16	0	7.1204	68.2612
4	16	1	5.6479	86.0973
1.26x bw regression

8	16	0	7.0780	137.5520
8	16	1	6.1687	157.8805
1.14x bw regression

16	16	0	7.4523	261.4281
16	16	1	6.7192	290.0837
1.10x bw regression


This assumes a very extreme scenario (no cycles whatsoever available for
parallelism), so I'm just looking for the worst case regression here.


I don't think there's very clear indicators for what cutoffs to use in the
onecpu data. Clearly we shouldn't go for async for single block IOs, but we
aren't.  With the default io_combine_limit=16 effective_io_concurrency=16,
we'd end up with 1.10x regression in the extreme case of only having a single
core available (but that one fully!) and doing nothing other than IO.

Seems ok to me.


I ran it on three other machines (newer workstation, laptop, old laptop) as
well, with similarly shaped results (although considerably higher & lower
throughputs across the board, depending on the machine).

Zen 4 Laptop:
nblocks	iod	async	bw_gib_s	lat_usec
1	1	0	6.0989	1.1408
1	1	1	1.4477	5.1246
1	2	0	6.9600	2.0827
1	2	1	2.8750	5.1711
1	4	0	7.0283	4.2307
1	4	1	8.9174	3.3169

Suprisingly bigger difference between sync/async at iod=1, but it's again
similar around iod=4 blocks.


4	1	0	14.5638	1.9616
4	1	1	5.1245	5.8016
4	2	0	14.8867	3.9607
4	2	1	12.1841	4.8662
4	4	0	14.8678	8.0764
4	4	1	21.5077	5.5417

Similar.


16	1	0	21.0754	5.5891
16	1	1	12.6180	9.4753
16	2	0	20.2770	11.8353
16	2	1	24.3277	9.8172

At nblocks=16, iod=2 starts already starts to be faster.



Greetings,

Andres Freund


Attachments:

  [text/x-python] bench_async_uring.py (3.0K, 2-bench_async_uring.py)
  download | inline:
#!/usr/bin/env python3
import argparse
import json
import subprocess
import sys


def run_fio(directory, nblocks, iodepth, force_async,
            size,
            runtime):
    bs = nblocks * 8 * 1024
    cmd = [
        "fio",
        f"--directory={directory}",
        f"--size={size}",
        "--name=read",
        "--invalidate=0",
        "--rw=read",
        "--direct=0",
        "--buffered=1",
        "--time_based=1",
        f"--runtime={runtime}",
        "--ioengine=io_uring",
        f"--iodepth={iodepth}",
        f"--force_async={force_async}",
        f"--bs={bs}",
        "--output-format=json",
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def extract_metrics(data):
    """
    Extract bandwidth (GiB/s) and average latency (µs) from fio JSON.
    fio JSON reports.:
    """
    read_stats = data["jobs"][0]["read"]

    bw_kibs = read_stats["bw"]              # KiB/s
    bw_gibs = bw_kibs / 1024**2            # KiB/s → GiB/s

    lat_ns = read_stats["lat_ns"]["mean"]  # nanoseconds
    lat_usec = lat_ns / 1000.0             # → µs

    return bw_gibs, lat_usec


def main():
    parser = argparse.ArgumentParser(
        description="Run fio sequential read benchmarks across parameter combos."
    )
    parser.add_argument(
        "--directory", default="/srv/fio",
        help="fio test directory (default: /srv/fio)",
    )
    parser.add_argument(
        "--size", default="4GiB",
        help="fio file size (default: 4GiB)",
    )
    parser.add_argument(
        "--runtime", type=int, default=1,
        help="Seconds per test (default: 1)",
    )
    parser.add_argument(
        "--nblocks", type=int, nargs="+",
        default=[1, 2, 4, 8, 16, 32, 64, 128],
        help="Block-count values to test (bs = nblocks * 8 KiB)",
    )
    parser.add_argument(
        "--iodepths", type=int, nargs="+",
        default=[1, 2, 4, 8, 16, 32],
        help="iodepth values to test",
    )
    args = parser.parse_args()

    print("nblocks\tiod\tasync\tbw_gib_s\tlat_usec")

    for nblocks in args.nblocks:
        for iodepth in args.iodepths:
            for force_async in [0, 1]:
                try:
                    data = run_fio(
                        directory=args.directory,
                        nblocks=nblocks,
                        iodepth=iodepth,
                        force_async=force_async,
                        size=args.size,
                        runtime=args.runtime,
                    )
                    bw_gibs, lat_usec = extract_metrics(data)
                    print(f"{nblocks}\t{iodepth}\t{force_async}\t{bw_gibs:.4f}\t{lat_usec:.4f}")
                    sys.stdout.flush()
                except subprocess.CalledProcessError as exc:
                    print(f"# ERROR nblocks={nblocks} iod={iodepth} async={force_async}: {exc}",
                          file=sys.stderr)
                    sys.exit(1)


if __name__ == "__main__":
    main()

  [text/tab-separated-values] results_manycore.tsv (2.1K, 3-results_manycore.tsv)
  download | inline:
nblocks	iod	async	bw_gib_s	lat_usec
1	1	0	4.2075	1.5802
1	1	1	1.0462	6.9652
1	2	0	4.1362	3.4533
1	2	1	1.9284	7.6040
1	4	0	4.0030	7.3720
1	4	1	4.2713	6.9086
1	8	0	4.1653	14.4072
1	8	1	4.3301	13.8365
1	16	0	4.1829	28.9216
1	16	1	4.3006	28.1261
1	32	0	4.0735	59.6232
1	32	1	4.3248	56.1614
2	1	0	5.7289	2.4261
2	1	1	1.8708	7.7466
2	2	0	5.7964	5.0144
2	2	1	3.3749	8.7417
2	4	0	5.8434	10.2023
2	4	1	7.9783	7.3977
2	8	0	5.8166	20.7226
2	8	1	8.2545	14.5431
2	16	0	5.8215	41.6613
2	16	1	8.2354	29.3879
2	32	0	5.6530	86.0286
2	32	1	8.3218	58.3826
4	1	0	7.4131	3.8807
4	1	1	3.2133	9.1827
4	2	0	7.3150	8.0854
4	2	1	5.4983	10.8039
4	4	0	7.2784	16.5097
4	4	1	11.2717	10.5699
4	8	0	7.2873	33.2331
4	8	1	16.6299	14.4164
4	16	0	7.1606	67.8777
4	16	1	16.9794	28.4981
4	32	0	6.2954	154.6834
4	32	1	16.3686	59.3610
8	1	0	8.0403	7.3503
8	1	1	4.6038	12.7202
8	2	0	8.0052	14.9161
8	2	1	8.5176	13.9987
8	4	0	8.1519	29.6698
8	4	1	14.8211	16.1640
8	8	0	7.8525	61.8612
8	8	1	27.5860	17.4434
8	16	0	6.8887	141.3268
8	16	1	34.1307	28.3463
8	32	0	6.9031	282.2350
8	32	1	38.2430	50.7700
16	1	0	8.8933	13.4650
16	1	1	6.2728	18.9827
16	2	0	8.9076	27.1436
16	2	1	11.7220	20.5300
16	4	0	8.7233	55.6745
16	4	1	18.7624	25.7505
16	8	0	7.4987	129.8232
16	8	1	34.5575	27.9686
16	16	0	7.3837	263.8333
16	16	1	41.0612	47.2465
16	32	0	7.3259	531.7938
16	32	1	39.9109	97.4890
32	1	0	9.2683	26.0485
32	1	1	7.6606	31.5179
32	2	0	9.0020	53.9343
32	2	1	13.1658	36.4441
32	4	0	7.5486	128.9595
32	4	1	22.4968	43.0776
32	8	0	7.6493	254.6875
32	8	1	39.0059	49.6149
32	16	0	7.5547	515.7824
32	16	1	41.4617	93.7583
32	32	0	6.6947	1162.7013
32	32	1	36.8926	211.2120
64	1	0	9.1983	52.6768
64	1	1	8.1505	59.5486
64	2	0	7.6384	127.3833
64	2	1	13.9811	69.1716
64	4	0	7.5413	258.3375
64	4	1	25.5018	76.2306
64	8	0	7.3730	528.3678
64	8	1	41.5893	93.4699
64	16	0	6.7242	1157.6784
64	16	1	35.1358	221.7170
64	32	0	5.2491	2958.7474
64	32	1	29.6408	526.3284
128	1	0	7.5442	128.8704
128	1	1	7.3481	132.2355
128	2	0	7.5959	256.3192
128	2	1	14.3860	135.3077
128	4	0	7.5891	513.4345
128	4	1	26.2082	148.3515
128	8	0	6.7218	1158.2197
128	8	1	39.5513	196.9944
128	16	0	5.1950	2990.2587
128	16	1	28.7749	542.1595
128	32	0	4.8389	6388.4904
128	32	1	27.5857	1131.2934

  [text/tab-separated-values] results_onecore.tsv (2.1K, 4-results_onecore.tsv)
  download | inline:
nblocks	iod	async	bw_gib_s	lat_usec
1	1	0	4.2324	1.5692
1	1	1	1.7883	3.9574
1	2	0	4.0756	3.5073
1	2	1	2.0574	7.1336
1	4	0	4.0813	7.2245
1	4	1	2.5805	11.5444
1	8	0	4.1485	14.4645
1	8	1	3.1191	19.3035
1	16	0	4.1094	29.4339
1	16	1	3.3922	35.7044
1	32	0	4.1652	58.3173
1	32	1	3.5813	67.8628
2	1	0	5.7914	2.4004
2	1	1	2.9585	4.8417
2	2	0	5.8205	5.0033
2	2	1	3.2484	9.0866
2	4	0	5.8692	10.1507
2	4	1	4.1805	14.3272
2	8	0	5.8241	20.7047
2	8	1	4.5100	26.7865
2	16	0	5.8381	41.5402
2	16	1	4.8104	50.4571
2	32	0	5.7680	84.3214
2	32	1	4.8923	99.4498
4	1	0	7.3171	3.9242
4	1	1	4.2450	6.8171
4	2	0	7.3149	8.0876
4	2	1	4.6114	12.9498
4	4	0	7.3564	16.3417
4	4	1	5.2204	23.0800
4	8	0	7.3753	32.8332
4	8	1	5.5436	43.7378
4	16	0	7.1204	68.2612
4	16	1	5.6479	86.0973
4	32	0	6.2542	155.6801
4	32	1	5.4395	179.0695
8	1	0	8.1162	7.2674
8	1	1	5.7536	10.2948
8	2	0	8.1180	14.7826
8	2	1	5.7472	20.9051
8	4	0	8.0499	30.0124
8	4	1	6.2692	38.6020
8	8	0	7.9290	61.2775
8	8	1	6.3000	77.1385
8	16	0	7.0780	137.5520
8	16	1	6.1687	157.8805
8	32	0	6.9722	279.4301
8	32	1	6.2175	313.5523
16	1	0	8.8559	13.5212
16	1	1	7.1163	16.8277
16	2	0	8.8395	27.3402
16	2	1	7.1646	33.7138
16	4	0	8.6280	56.2576
16	4	1	6.8501	70.9189
16	8	0	7.5552	128.8521
16	8	1	6.5925	147.6890
16	16	0	7.4523	261.4281
16	16	1	6.7192	290.0837
16	32	0	7.3669	528.8536
16	32	1	6.7170	580.7891
32	1	0	9.2169	26.2060
32	1	1	8.1352	29.6627
32	2	0	8.9783	54.0723
32	2	1	7.4881	64.8356
32	4	0	7.7601	125.4378
32	4	1	7.0137	138.7156
32	8	0	7.7013	252.9703
32	8	1	7.0259	277.4011
32	16	0	7.6469	509.4715
32	16	1	7.0901	550.0196
32	32	0	6.7442	1154.2708
32	32	1	6.4801	1204.9072
64	1	0	9.1398	53.0560
64	1	1	8.4876	57.0746
64	2	0	7.7871	124.9611
64	2	1	7.2915	133.3649
64	4	0	7.7413	251.6239
64	4	1	7.2876	267.2993
64	8	0	7.7091	505.3741
64	8	1	7.1725	543.5188
64	16	0	6.8242	1140.7509
64	16	1	6.6159	1179.8322
64	32	0	5.1921	2991.8800
64	32	1	5.3160	2934.9295
128	1	0	7.7871	124.8521
128	1	1	7.4671	130.0443
128	2	0	7.6681	253.8789
128	2	1	7.3989	263.0230
128	4	0	7.6174	511.4664
128	4	1	7.3589	529.6611
128	8	0	6.8233	1141.0293
128	8	1	6.6625	1170.2110
128	16	0	5.1618	3009.7684
128	16	1	5.4797	2848.2109
128	32	0	4.8204	6413.3049
128	32	1	4.9751	6274.0151

^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 15:46     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 17:30       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 19:04         ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 20:36           ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-04-04 00:06             ` Melanie Plageman <[email protected]>
  2026-04-04 01:24               ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  1 sibling, 1 reply; 23+ messages in thread

From: Melanie Plageman @ 2026-04-04 00:06 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

On Fri, Apr 3, 2026 at 4:36 PM Andres Freund <[email protected]> wrote:
>
> Attached are a revised set of commits.

0001 aio: io_uring: Trigger async processing for large IOs
LGTM

0002 read_stream: Move logic about IO combining & issuing to helpers
LGTM

0003   read stream: Split decision about look ahead for AIO and combining

Stale comment reference to stream->distance around line 505

Typo above ReadStream->combine_distance: "The limits for read-ahead an
combining" should be "and combining".

typo in comment in read_stream_look_ahead()
"even if we are reached the read-ahead limit" I think you meant "even
if we have reached the read-ahead limit"

I didn't really notice this before
     * doing full io_combine_limit sized reads.
     */
    if (flags & READ_STREAM_FULL)
-       stream->distance = Min(max_pinned_buffers, stream->io_combine_limit);
+   {
+       stream->readahead_distance = Min(max_pinned_buffers,
stream->io_combine_limit);
+       stream->combine_distance = Min(max_pinned_buffers,
stream->io_combine_limit);
+   }

that distance started out at io_combine_limit for READ_STREAM_FULL.
That doesn't make that much sense to me, wouldn't we set
readahead_distance to max_pinned_buffers() for READ_STREAM_FULL?

+   /*
+    * Allow looking further ahead if we are in the process of building a
+    * larger IO, the IO is not yet big enough and we don't yet have IO in
+    * flight.  Note that this is allowed even if we are reached the
+    * read-ahead limit (but not the buffer pin limit, combine_distance is
+    * capped by it and we are checking for pinned_buffers == 0).
+    *
+    * The reason this is restricted to not yet having an IO in flight is that
+    * once we are actually reading ahead, we will not issue IOs before they
+    * have reached the full size (or can't be grown further). But we *have*
+    * to issue an IO once pinned_buffers == 0, otherwise there won't be a
+    * buffer to return to the caller.
+    *
+    * This is important for cases where either effective_io_concurrency is
+    * low or we never need to wait for IO and thus are not increasing the
+    * distance. Without this we would end up with lots of small IOs.
+    */
+   if (stream->pending_read_nblocks > 0 &&
+       stream->pinned_buffers == 0 &&
+       stream->pending_read_nblocks < stream->combine_distance)
+       return true;
+

I'd strongly recommend an oxford comma here for clarity:

"larger IO, the IO is not yet big enough and we don't yet have IO in" ->
"larger IO, the IO is not yet big enough, and we don't yet have IO in"

when you say "combine_distance is capped by it" -- what is it here? Do
you mean max_pinned_buffers? It is ambiguous like maybe you could mean
the read-ahead limit you just mentioned

Maybe rephrase that whole parenthetical like
"Note that this can exceed the read-ahead limit. It cannot exceed the
buffer pin limit because pinned_buffers == 0 and combine_distance is
capped by max_pinned_buffers."

And then the next paragraph also is confusing

     * The reason this is restricted to not yet having an IO in flight is that
     * once we are actually reading ahead, we will not issue IOs before they
     * have reached the full size (or can't be grown further). But we *have*
     * to issue an IO once pinned_buffers == 0, otherwise there won't be a
     * buffer to return to the caller.

"restricted to not yet having an IO in flight is awkward"
Maybe just say which part you are referring to (pinned_buffers == 0)

My understanding about the pinned_buffers == 0 criteria, is that we
specifically want to perform the look-ahead even if we would exceed
readahead_distance when this is the first IO -- and we haven't
submitted other I/Os yet. Like when eic is low but we still want to do
read combining. But this comment makes it sound like it is a control
flow issue (e.g. that there "won't be a buffer to return to the
caller")

I asked AI to come up with something better, and this is what it said:

"The pinned_buffers == 0 restriction is important: if we already have
buffers to return to the caller, we can let the normal issue logic in
read_stream_should_issue_now() decide when to submit. But if we have
no pinned buffers, we must keep building the pending IO until it
reaches combine_distance (or can't grow further), because otherwise
we'd submit a small IO and have nothing to return while waiting for
more."

I'm not sure it's 100% correct, but maybe you can rework your version
of the paragraph a bit.

0004 read_stream: Only increase read-ahead distance when waiting for IO

As of this commit READ_STREAM_FULL will never actually get up to the
max_pinned_buffers if needed_wait is never true. That's probably fine,
but it is different than before. Anyway, LGTM.

0005 - Allow read_stream_reset() to not wait for IO completion

I can't comment on whether or not doing this in resowner cleanup is a
good architectural decision, as I don't know enough about that part of
postgres.

    /* Stop looking ahead. */
-   stream->readahead_distance = 0;
-   stream->combine_distance = 0;
+   stream->readahead_distance = -1;
+   stream->combine_distance = -1;

I don't remember why we couldn't have a separate boolean field for
end-of-stream/stream-paused/stream-reset-in-progress? Did you ever
figure it out?

Separately, maybe we should forbid calling pgaio_wref_abandon() on IOs
that haven't been submitted yet. You can get a wref before the IO is
submitted. And I feel like weird stuff could happen if we let people
call pgaio_wref_abandon() on defined or staged IOs. For example,
resowner cleanup may take a very long time because of submitting and
waiting for a bunch of IOs. And IO handles are released before locks,
and I think we don't want to spend a long time waiting cleaning up IO
handles while holding locks.

Then I was thinking maybe you _do_ want to allow calling
pgaio_wref_abandon() on IOs that haven't been submitted because
pgaio_io_release() only works on PGAIO_HS_HANDED_OUT IOs and not on
defined or staged ones

If you don't forbid calling pgaio_wref_abandon on unsubmitted IOs,
then you'd probably want to restructure the code in
ResourceOwnerReleaseInternal() to loop through and submit any
unsubmitted IO first and then wait for IOs after.

pgaio_wref_abandon() does not invalidate the wait ref (which seems
fine to have the caller do it), but you should mention that and make
it clear that it doesn't do anything to the wait ref itself. This
isn't totally clear especially because your comment says

 * Declare that a wait reference to an IO, started by this backend, is not of
 * interest to this backend anymore.

really what it is doing is declaring that the caller is no longer
interested in receiving the IO result

and a weird future caller might want to abandon checking the result
and still wait for completion separately on the wait ref, so making it
sound like this function disables that isn't right either

In pgaio_wref_abandon(), it seems like you could get rid of this guard
        if (ioh->report_return)
for clarity. in the context of pgaio_wref_abandon() it shouldn't ever
be NULL, right? Then it doesn't save you anything and implying it
could be NULL here is confusing

+   HOLD_INTERRUPTS();
+
+   /*
+    * It is safe to perform this check before checking if the IO was recycled
+    * as the owner of an IO cannot change.
+    */
+   if (!am_owner)
+       elog(ERROR, "only IOs owned by current backend can be abandoned");
+

no reason for the error to be after the HOLD_INTERRUPTS(), right? I
thought it made it harder to understand the comment about why we need
to hold interrupts here and if it can just be moved above, then that's
better

In AbandonReadBuffers() header comment:
   "This needs to called" -> "This needs to be called"

I do find it a little weird that if someone calls AbandonReadBuffers()
and doesn't remember to release the buffer pin, it won't get
released...
and in our case, it doesn't just happen in the caller -- it happens in
the caller's caller

I wonder if we are going to want the option for someone to abandon all
the buffers in read_stream_pause()? Right now if you call
read_stream_reset() it overrides the resume_distance, so if you wanted
to restart the stream abandoning all existing IOs but keep up the
distance, you won't be able to do that. It will revert down to 1.

@@ -538,7 +539,7 @@ read_stream_should_issue_now(ReadStream *stream)
     * If the callback has signaled end-of-stream, start the pending read
     * immediately. There is no further potential for IO combining.
     */
-   if (stream->readahead_distance == 0)
+   if (stream->readahead_distance <= 0)
        return true;

I don't think you should get here when readahead_distance < 0. Maybe
we should assert that?

I didn't look at any of the test stuff yet.

- Melanie

^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 15:46     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 17:30       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 19:04         ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 20:36           ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-04 00:06             ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
@ 2026-04-04 01:24               ` Andres Freund <[email protected]>
  2026-04-05 05:24                 ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  0 siblings, 1 reply; 23+ messages in thread

From: Andres Freund @ 2026-04-04 01:24 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,

On 2026-04-03 20:06:32 -0400, Melanie Plageman wrote:
> On Fri, Apr 3, 2026 at 4:36 PM Andres Freund <[email protected]> wrote:
> typo in comment in read_stream_look_ahead()
> "even if we are reached the read-ahead limit" I think you meant "even
> if we have reached the read-ahead limit"
>
> I didn't really notice this before
>      * doing full io_combine_limit sized reads.
>      */
>     if (flags & READ_STREAM_FULL)
> -       stream->distance = Min(max_pinned_buffers, stream->io_combine_limit);
> +   {
> +       stream->readahead_distance = Min(max_pinned_buffers,
> stream->io_combine_limit);
> +       stream->combine_distance = Min(max_pinned_buffers,
> stream->io_combine_limit);
> +   }
>
> that distance started out at io_combine_limit for READ_STREAM_FULL.
> That doesn't make that much sense to me, wouldn't we set
> readahead_distance to max_pinned_buffers() for READ_STREAM_FULL?

Yea, I agree, it doesn't seem quite right to me. I raised that a while ago
too. But it's something nontrivial that I think should be changed separately.


> +   /*
> +    * Allow looking further ahead if we are in the process of building a
> +    * larger IO, the IO is not yet big enough and we don't yet have IO in
> +    * flight.  Note that this is allowed even if we are reached the
> +    * read-ahead limit (but not the buffer pin limit, combine_distance is
> +    * capped by it and we are checking for pinned_buffers == 0).
> +    *
> +    * The reason this is restricted to not yet having an IO in flight is that
> +    * once we are actually reading ahead, we will not issue IOs before they
> +    * have reached the full size (or can't be grown further). But we *have*
> +    * to issue an IO once pinned_buffers == 0, otherwise there won't be a
> +    * buffer to return to the caller.
> +    *
> +    * This is important for cases where either effective_io_concurrency is
> +    * low or we never need to wait for IO and thus are not increasing the
> +    * distance. Without this we would end up with lots of small IOs.
> +    */
> +   if (stream->pending_read_nblocks > 0 &&
> +       stream->pinned_buffers == 0 &&
> +       stream->pending_read_nblocks < stream->combine_distance)
> +       return true;
> +
>
> I'd strongly recommend an oxford comma here for clarity:
>
> "larger IO, the IO is not yet big enough and we don't yet have IO in" ->
> "larger IO, the IO is not yet big enough, and we don't yet have IO in"

Done.


> when you say "combine_distance is capped by it" -- what is it here? Do
> you mean max_pinned_buffers? It is ambiguous like maybe you could mean
> the read-ahead limit you just mentioned

max_pinned_buffers, yes.


> Maybe rephrase that whole parenthetical like
> "Note that this can exceed the read-ahead limit. It cannot exceed the
> buffer pin limit because pinned_buffers == 0 and combine_distance is
> capped by max_pinned_buffers."

> And then the next paragraph also is confusing
>
>      * The reason this is restricted to not yet having an IO in flight is that
>      * once we are actually reading ahead, we will not issue IOs before they
>      * have reached the full size (or can't be grown further). But we *have*
>      * to issue an IO once pinned_buffers == 0, otherwise there won't be a
>      * buffer to return to the caller.
>
> "restricted to not yet having an IO in flight is awkward"
> Maybe just say which part you are referring to (pinned_buffers == 0)

(I rewrote the entire comment to make the goal clear, based on your question
below)


> My understanding about the pinned_buffers == 0 criteria, is that we
> specifically want to perform the look-ahead even if we would exceed
> readahead_distance when this is the first IO -- and we haven't
> submitted other I/Os yet.

This is much more common than just the first IO.  If we end up not needing to
wait for IO (e.g. because the scan does something CPU intensive with the
returned pages, or because io_method=io_uring executes IO synchronously, or
...), readahead_distance just stays at 1.  But we do want to do IO combining
to the full extent, and may do so, because combine_distance ==
io_combine_limit after a few IOs.

If readahead_distance == 1, read_tream_next_buffer()->read_stream_look_ahead()
will not start another IO until pinned_buffers == 0.  Without the
combine_distance logic, we would then issue an IO of exactly size 1. But with
the combine_distance logic, we will at that point build a combine_distance
sized IO.

We should not do this once pinned_buffers > 0:

- It's not needed, because we won't issue the pending IO until it is big
  enough, as long as pinned_buffers > 0.  That may mean that IO concurrency is
  "too low", but if so, we will notice that due to WaitReadBuffers() returning
  true, leading to an increased readahead_distance.

- It'd lead to reading further ahead than there's a need to. If we e.g. have a
  readahead_distance of 16, and already have 15 buffers pinned,
  read_stream_should_look_ahead() will start a pending read, but (with the
  code in the patch or in master) won't actually issue the read, until the
  caller has consumed more of the buffers (reducing max_pinned_buffers,
  allowing to increase the size of the pending read).  We'd do this until we
  either grew the IO to the max size or the next block can't be combined.

  What's the argument for issuing eagerly in this case?




> 0005 - Allow read_stream_reset() to not wait for IO completion
>
> I can't comment on whether or not doing this in resowner cleanup is a
> good architectural decision, as I don't know enough about that part of
> postgres.

I think I'm pretty OK with that part now, after chewing on it for a good
while.

We already did things like TerminateBufferIO() as part of resowner cleanup
before AIO (c.f. ResOwnerReleaseBufferIO()->AbortBufferIO()). Waiting for
in-flight IOs is basically the equivalent thing in an AIO world.


>     /* Stop looking ahead. */
> -   stream->readahead_distance = 0;
> -   stream->combine_distance = 0;
> +   stream->readahead_distance = -1;
> +   stream->combine_distance = -1;
>
> I don't remember why we couldn't have a separate boolean field for
> end-of-stream/stream-paused/stream-reset-in-progress? Did you ever
> figure it out?

I don't think there really is a good reason, but it's a not entirely trivial
change.


> Separately, maybe we should forbid calling pgaio_wref_abandon() on IOs
> that haven't been submitted yet.

Yea, that should probably assert out. I can't see any reason for that to be
needed.


> And I feel like weird stuff could happen if we let people call
> pgaio_wref_abandon() on defined or staged IOs. For example, resowner cleanup
> may take a very long time because of submitting and waiting for a bunch of
> IOs. And IO handles are released before locks, and I think we don't want to
> spend a long time waiting cleaning up IO handles while holding locks.

I don't find that a concern with the current users of AIO - outside of error
paths we simply never have IOs that live beyond the resowner.

I think the argument for holding a lock is actually a *pro* argument for
waiting during resowner cleanup: That way the AIO subsystem will not hold onto
buffer pins for relations that the issuing bckend does not hold a lock for.


> Then I was thinking maybe you _do_ want to allow calling
> pgaio_wref_abandon() on IOs that haven't been submitted because
> pgaio_io_release() only works on PGAIO_HS_HANDED_OUT IOs and not on
> defined or staged ones

I don't know what the potential use case for that would be. Why would you
stage an IO that you then immediately (after all you're not allowed to do very
much when in batch mode) want to abandon.  If that's needed, just submit the
IO.


> pgaio_wref_abandon() does not invalidate the wait ref (which seems
> fine to have the caller do it), but you should mention that and make
> it clear that it doesn't do anything to the wait ref itself.

Hm. We don't state that either for pgaio_wref_wait().

I wonder if we should make the iow argument for pgaio_wref_wait(),
pgaio_wref_check_done(), pgaio_wref_abandon(), pgaio_wref_valid(),
pgaio_wref_get_id() const to make that clear on a type-system level?


> This isn't totally clear especially because your comment says

>  * Declare that a wait reference to an IO, started by this backend, is not of
>  * interest to this backend anymore.
>
> really what it is doing is declaring that the caller is no longer
> interested in receiving the IO result

Do you think we'd need a comment in addition to the const?


> and a weird future caller might want to abandon checking the result
> and still wait for completion separately on the wait ref, so making it
> sound like this function disables that isn't right either

It's not even that crazy to do that and it's kinda done today: Even after
abandoning the IO as part of a read stream reset, we might need to wait for
the buffer to be ready as part of a different ReadBuffersOperation().


> In pgaio_wref_abandon(), it seems like you could get rid of this guard
>         if (ioh->report_return)
> for clarity. in the context of pgaio_wref_abandon() it shouldn't ever
> be NULL, right? Then it doesn't save you anything and implying it
> could be NULL here is confusing

It can be NULL here. It's legit to simply pass ret=NULL to
pgaio_io_acquire_nb().

But of course that doesn't mean we can't unconditionally set it to NULL. I'm
slightly hesitant to do that just in pgaio_wref_abandon(), as
pgaio_io_release_resowner() already has the NULL check before this commit.


> +   HOLD_INTERRUPTS();
> +
> +   /*
> +    * It is safe to perform this check before checking if the IO was recycled
> +    * as the owner of an IO cannot change.
> +    */
> +   if (!am_owner)
> +       elog(ERROR, "only IOs owned by current backend can be abandoned");
> +
>
> no reason for the error to be after the HOLD_INTERRUPTS(), right?

Not that I can see.


> I thought it made it harder to understand the comment about why we need to
> hold interrupts here and if it can just be moved above, then that's better

K.


> In AbandonReadBuffers() header comment:
>    "This needs to called" -> "This needs to be called"
>
> I do find it a little weird that if someone calls AbandonReadBuffers()
> and doesn't remember to release the buffer pin, it won't get
> released...

You do get a warning about it if you do forget (about buffer pins not being
released at the end of the statement).  I don't think it'd make any code
simpler if we made AbandonReadBuffers() release the pins, as read_stream.c
already needs to think about the buffer pins it holds for buffers it did not
abandon (e.g. because no IO was necessary).  I think the control flow to
transport that upwards whether we need to release pins or not would end up
quite awkward.




> I wonder if we are going to want the option for someone to abandon all
> the buffers in read_stream_pause()?

I can't see what that would be good for?  The buffers that would be returned
by the read stream wouldn't necessarily be BM_VALID, and I don't think code
outside of low level stuff like bufmgr.c and read_stream.c should have to deal
with !BM_VALID buffers.


> Right now if you call read_stream_reset() it overrides the resume_distance,
> so if you wanted to restart the stream abandoning all existing IOs but keep
> up the distance, you won't be able to do that. It will revert down to 1.

I think if that's an issue we'd want to have a reset variant that keeps the
distance, not make pause extremely hard to use...


> @@ -538,7 +539,7 @@ read_stream_should_issue_now(ReadStream *stream)
>      * If the callback has signaled end-of-stream, start the pending read
>      * immediately. There is no further potential for IO combining.
>      */
> -   if (stream->readahead_distance == 0)
> +   if (stream->readahead_distance <= 0)
>         return true;
>
> I don't think you should get here when readahead_distance < 0. Maybe
> we should assert that?

I think we do right now, if there is an incomplete pending read.  It looks not
entirely trivial to change that...


Thanks for the review!


Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 15:46     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 17:30       ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 19:04         ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-03 20:36           ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-04 00:06             ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-04 01:24               ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
@ 2026-04-05 05:24                 ` Andres Freund <[email protected]>
  0 siblings, 0 replies; 23+ messages in thread

From: Andres Freund @ 2026-04-05 05:24 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,

I polished the first four commits a good bit more.  Very little in the way of
code changes, other than comments.

Will send a version of
    "Allow read_stream_reset() to not wait for IO completion"
with extended tests tomorrow.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

* Re: AIO / read stream heuristics adjustments for index prefetching
  2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-02 14:31 ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
  2026-04-02 15:47   ` Re: AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
  2026-04-03 15:46     ` Re: AIO / read stream heuristics adjustments for index prefetching Melanie Plageman <[email protected]>
@ 2026-04-03 20:41       ` Andres Freund <[email protected]>
  1 sibling, 0 replies; 23+ messages in thread

From: Andres Freund @ 2026-04-03 20:41 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: pgsql-hackers; Thomas Munro <[email protected]>; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>

Hi,

Forgot to send this one earlier:

On 2026-04-03 11:46:59 -0400, Melanie Plageman wrote:
> On Thu, Apr 2, 2026 at 11:47 AM Andres Freund <[email protected]> wrote:
> >
> > What do you think about the updated patch to achieve that that I posted?
> 
> here is some review on 0005 and 0006 earlier posted
> concrete things:
> -------
> - I’d reorder stream→distance == 0 in read_stream_look_ahead() (and
> issue), I found myself asking why it wasn’t first

It'd not be correct in should_issue_now(). We can't move it before

	/* there is no pending IO that could be issued */
	if (pending_read_nblocks == 0)
		return false;

because then we'd trigger a call to read_stream_start_pending_read(), without
there being a pending read.

We can't move it before

	/* never start more IOs than our cap */
	if (stream->ios_in_progress >= stream->max_ios)
		return false;

because that could lead us to exceeding max_ios.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 23+ messages in thread

end of thread, other threads:[~2026-04-05 05:24 UTC | newest]

Thread overview: 23+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-03-31 16:02 AIO / read stream heuristics adjustments for index prefetching Andres Freund <[email protected]>
2026-03-31 19:31 ` Melanie Plageman <[email protected]>
2026-03-31 20:59 ` Melanie Plageman <[email protected]>
2026-04-01 23:20   ` Andres Freund <[email protected]>
2026-04-02 12:30     ` Nazir Bilal Yavuz <[email protected]>
2026-04-02 13:33       ` Andres Freund <[email protected]>
2026-04-03 16:45         ` Melanie Plageman <[email protected]>
2026-04-03 19:01           ` Andres Freund <[email protected]>
2026-04-01 14:52 ` Melanie Plageman <[email protected]>
2026-04-01 15:49   ` Andres Freund <[email protected]>
2026-04-02 14:31 ` Melanie Plageman <[email protected]>
2026-04-02 15:47   ` Andres Freund <[email protected]>
2026-04-02 21:13     ` Melanie Plageman <[email protected]>
2026-04-02 21:30       ` Andres Freund <[email protected]>
2026-04-03 15:46     ` Melanie Plageman <[email protected]>
2026-04-03 17:30       ` Andres Freund <[email protected]>
2026-04-03 19:04         ` Melanie Plageman <[email protected]>
2026-04-03 20:36           ` Andres Freund <[email protected]>
2026-04-03 23:10             ` Andres Freund <[email protected]>
2026-04-04 00:06             ` Melanie Plageman <[email protected]>
2026-04-04 01:24               ` Andres Freund <[email protected]>
2026-04-05 05:24                 ` Andres Freund <[email protected]>
2026-04-03 20:41       ` Andres Freund <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox