Don't synchronously wait for already-in-progress IO in read stream

public inbox for [email protected]  
help / color / mirror / Atom feed

Don't synchronously wait for already-in-progress IO in read stream
31+ messages / 6 participants
[nested] [flat]

* Don't synchronously wait for already-in-progress IO in read stream
@ 2025-09-11 21:46 Andres Freund <[email protected]>
  2025-11-09 19:51 ` Re: Don't synchronously wait for already-in-progress IO in read stream Peter Geoghegan <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  0 siblings, 2 replies; 31+ messages in thread

From: Andres Freund @ 2025-09-11 21:46 UTC (permalink / raw)
  To: pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>; Thomas Munro <[email protected]>

Hi,

In the index prefetching thread we discovered that read stream performance
suffers rather substantially when a read stream is reading blocks multiple
times within the readahead distance.

The problem leading to that is that we are currently synchronously waiting for
IO on a buffer when AsyncReadBuffers() encounters a buffer already undergoing
IO. If a block is read twice, that means we won't actually have enough IOs in
flight to have good performance. What's worse, currently the wait is not
attributed to IO wait (since we're waiting in WaitIO, rather than waiting for
IO).

This does not commonly occur with in-tree users of read streams, as users like
seqscans, bitmap heap scans, vacuum, ... will never try to read the same block
twice. However with index prefetching that is a more common case.

It is possible to encounter a version of this in 18/master: If multiple scans
for the same table are in progress, they can end up waiting synchronously for
each other. However it's a much less severe issue, as the scan that is
"further ahead" will not be blocked.


To fix it, the attached patch has AsyncReadBuffers() check if the "target"
buffer already has IO in progress. If so, it assing the buffer's IO wait
reference to the ReadBuffersOperation. That allows WaitReadBuffers() to wait
for the IO. To make that work correctly, the buffer stats etc have to be
updated in that case in WaitReadBuffers().


I did not feel like I was sufficiently confident in making this work without
tests. However, it's not exactly trivial to test some versions of this, due to
the way multiple processes need to be coordinated. It took way way longer to
write tests than to make the code actually work :/.

The attached tests add a new read_stream_for_blocks() function to test_aio. I
found it also rather useful to reproduce the performance issue without the
index prefetching patch applied.  To test the cross process case the injection
point infrastructure in test_aio had to be extended a bit.


Attached are three patches:

0001: Introduces a TestAio package and splits out some existing tests out of
      001_aio.pl

0002: Add read_stream test infrastructure & tests

      Note that the tests don't test that we don't unnecessarily wait, as
      described above, just that IO works correctly.

0003: Improve performance of read stream when re-encountering blocks


To reproduce the issue, the read_stream_for_blocks() function added to
test_aio can be used, in combination with debug_io_direct=data (it's also
possible without DIO, it'd just be more work).

prep:
CREATE EXTENSION  test_aio;
CREATE TABLE large AS SELECT i, repeat(random()::text, 5) FROM generate_series(1, 1000000) g(i);

test:
SELECT pg_buffercache_evict_relation('large');
EXPLAIN ANALYZE SELECT * FROM read_stream_for_blocks('large', ARRAY(SELECT i + generate_series(0, 3) FROM generate_series(1, 10000) g(i)));


Without 0003 applied:
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                   QUERY PLAN                                                                    │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Function Scan on read_stream_for_blocks  (cost=975.00..985.00 rows=1000 width=12) (actual time=673.647..675.254 rows=40000.00 loops=1)          │
│   Buffers: shared hit=29997 read=10003                                                                                                          │
│   I/O Timings: shared read=16.116                                                                                                               │
│   InitPlan 1                                                                                                                                    │
│     ->  Result  (cost=0.00..975.00 rows=40000 width=4) (actual time=0.556..7.575 rows=40000.00 loops=1)                                         │
│           ->  ProjectSet  (cost=0.00..375.00 rows=40000 width=8) (actual time=0.556..4.804 rows=40000.00 loops=1)                               │
│                 ->  Function Scan on generate_series g  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.554..0.988 rows=10000.00 loops=1) │
│ Planning Time: 0.060 ms                                                                                                                         │
│ Execution Time: 676.436 ms                                                                                                                      │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

Time: 676.753 ms


With 0003:

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                   QUERY PLAN                                                                    │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Function Scan on read_stream_for_blocks  (cost=975.00..985.00 rows=1000 width=12) (actual time=87.730..89.453 rows=40000.00 loops=1)            │
│   Buffers: shared hit=29997 read=10003                                                                                                          │
│   I/O Timings: shared read=50.467                                                                                                               │
│   InitPlan 1                                                                                                                                    │
│     ->  Result  (cost=0.00..975.00 rows=40000 width=4) (actual time=0.541..7.496 rows=40000.00 loops=1)                                         │
│           ->  ProjectSet  (cost=0.00..375.00 rows=40000 width=8) (actual time=0.540..4.772 rows=40000.00 loops=1)                               │
│                 ->  Function Scan on generate_series g  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.539..0.965 rows=10000.00 loops=1) │
│ Planning Time: 0.046 ms                                                                                                                         │
│ Execution Time: 90.661 ms                                                                                                                       │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

Time: 90.887 ms

Greetings,

Andres Freund


^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2025-11-09 19:51 ` Peter Geoghegan <[email protected]>
  1 sibling, 0 replies; 31+ messages in thread

From: Peter Geoghegan @ 2025-11-09 19:51 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Tomas Vondra <[email protected]>; Thomas Munro <[email protected]>

On Thu, Sep 11, 2025 at 5:46 PM Andres Freund <[email protected]> wrote:
> The problem leading to that is that we are currently synchronously waiting for
> IO on a buffer when AsyncReadBuffers() encounters a buffer already undergoing
> IO. If a block is read twice, that means we won't actually have enough IOs in
> flight to have good performance. What's worse, currently the wait is not
> attributed to IO wait (since we're waiting in WaitIO, rather than waiting for
> IO).

This patch no longer cleanly applies. Can you post a new version?

Thanks
-- 
Peter Geoghegan





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2025-11-09 22:20 ` Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  1 sibling, 1 reply; 31+ messages in thread

From: Thomas Munro @ 2025-11-09 22:20 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Fri, Sep 12, 2025 at 9:46 AM Andres Freund <[email protected]> wrote:
+     * It's possible that another backend starts IO on the buffer between this
+     * check and the ReadBuffersCanStartIO(nowait = false) below. In that case
+     * we will synchronously wait for the IO below, but the window for that is
+     * small enough that it won't happen often enough to have a significant
+     * performance impact.
+     */
+    if (ReadBuffersIOAlreadyInProgress(operation, buffers[nblocks_done]))
...
     /*
      * Check if we can start IO on the first to-be-read buffer.
      *
-     * If an I/O is already in progress in another backend, we want to wait
-     * for the outcome: either done, or something went wrong and we will
-     * retry.
+     * If a synchronous I/O is in progress in another backend (it can't be
+     * this backend), we want to wait for the outcome: either done, or
+     * something went wrong and we will retry.
      */
     if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))

"..., or an asynchronous IO was started after
ReadBuffersIOAlreadyInProgress() (unlikely), ..."?

I suppose (or perhaps vaguely recall from an off-list discussion?)
that you must have considered merging the new
is-it-already-in-progress check into ReadBuffersCanStartIO().  I
suppose the nowait argument would become a tri-state argument with a
value that means "don't wait for an in-progress read, just give me the
IO handle so I can 'join' it as a foreign waiter", with a new output
argument to receive the handle, or something along those lines, and I
guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
name.  That'd remove the double-check (extra header lock-unlock cycle)
and associated race that can cause that rare synchronous wait (which
must still happen sometimes in the duelling concurrent scan use
case?), at the slight extra cost of having to allocate and free a
handle in the case of repeated blocks (eg the index->heap scan use
case), but at least that's just backend-local list pushups and doesn't
do extra work otherwise.  Is there some logical problem with that
approach?  Is the code just too clumsy?

^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
@ 2026-01-23 21:03   ` Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-02-23 19:27     ` Re: Don't synchronously wait for already-in-progress IO in read stream Peter Geoghegan <[email protected]>
  0 siblings, 2 replies; 31+ messages in thread

From: Melanie Plageman @ 2026-01-23 21:03 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: Andres Freund <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Sun, Nov 9, 2025 at 5:21 PM Thomas Munro <[email protected]> wrote:
>
> I suppose (or perhaps vaguely recall from an off-list discussion?)
> that you must have considered merging the new
> is-it-already-in-progress check into ReadBuffersCanStartIO().  I
> suppose the nowait argument would become a tri-state argument with a
> value that means "don't wait for an in-progress read, just give me the
> IO handle so I can 'join' it as a foreign waiter", with a new output
> argument to receive the handle, or something along those lines, and I
> guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
> name.  That'd remove the double-check (extra header lock-unlock cycle)
> and associated race that can cause that rare synchronous wait (which
> must still happen sometimes in the duelling concurrent scan use
> case?), at the slight extra cost of having to allocate and free a
> handle in the case of repeated blocks (eg the index->heap scan use
> case), but at least that's just backend-local list pushups and doesn't
> do extra work otherwise.  Is there some logical problem with that
> approach?  Is the code just too clumsy?

Attached v3 basically does what you suggested above. Now, we should
only have to wait if the backend encounters a buffer after another
backend has set BM_IO_IN_PROGRESS but before that other backend has
set the buffer descriptor's wait reference.

0001 and 0002 are Andres' test-related patches. 0003 is a change I
think is required to make one of the tests stable (esp on the BSDs).
0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
IO concept but with your suggested structure and my suggested styling.
I could potentially break out more into smaller refactoring commits,
but I don't think it's too bad the way it is.

A few things about the patch that I'm not sure about:

- I don't know if pgaio_submit_staged() is in all the right places
(and not in too many places). I basically do it before we would wait
when starting read IO on the buffer. In the permanent buffers case,
that's now only when BM_IO_IN_PROGRESS is set but the wait reference
isn't valid yet. This can't happen in the temporary buffers case, so
I'm not sure we need to call pgaio_submit_staged().

- StartBufferIO() is no longer invoked in the AsyncReadBuffers() path.
We could refactor it so that it works for AsyncReadBuffers(), but that
would involve returning something that distinguishes between
IO_IN_PROGRESS and IO already done.  And StartBufferIO()'s comment
explicitly says it wants to avoid that.
If we keep my structure, with AsyncReadBuffers() using its own helper
(PrepareNewReadBufferIO()) instead of StartBufferIO(), then it seems
like we need some way to make it clear what StartBufferIO() is for.
I'm not sure what would collectively describe its current users,
though. It also now has no non-test callers passing nowait as true.
However, once we add write combining, it will, so it seems like we
should leave it the way it is to avoid churn. However, other
developers might be confused in the interim.

- In the 004_read_stream tests, I wonder if there is a way to test
that we don't wait for foreign IO until WaitReadBuffers(). We have
tests for the stream accessing the same block, which in some cases
will exercise the foreign IO path. But it doesn't distinguish between
the old behavior -- waiting for the IO to complete when starting read
IO on it -- and the new behavior -- not waiting until
WaitReadBuffers(). That may not be possible to test, though.

- Melanie


Attachments:

  [text/x-patch] v3-0001-aio-Refactor-tests-in-preparation-for-more-tests.patch (10.7K, 2-v3-0001-aio-Refactor-tests-in-preparation-for-more-tests.patch)
  download | inline diff:
From 1340e52fe88ebddfabcd8285e4bcc48ca21722ed Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 9 Sep 2025 10:14:34 -0400
Subject: [PATCH v3 1/5] aio: Refactor tests in preparation for more tests

In a future commit more AIO related tests are due to be introduced. However
001_aio.pl already is fairly large.

This commit introduces a new TestAio package with helpers for writing AIO
related tests. Then it uses the new helpers to simplify the existing
001_aio.pl by iterating over all supported io_methods. This will be
particularly helpful because additional methods already have been submitted.

Additionally this commit splits out testing of initdb using a non-default
method into its own test. While that test is somewhat important, it's fairly
slow and doesn't break that often. For development velocity it's helpful for
001_aio.pl to be faster.

While particularly the latter could benefit from being its own commit, it
seems to introduce more back-and-forth than it's worth.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/test/modules/test_aio/meson.build     |   1 +
 src/test/modules/test_aio/t/001_aio.pl    | 140 +++++++---------------
 src/test/modules/test_aio/t/003_initdb.pl |  71 +++++++++++
 src/test/modules/test_aio/t/TestAio.pm    |  90 ++++++++++++++
 4 files changed, 203 insertions(+), 99 deletions(-)
 create mode 100644 src/test/modules/test_aio/t/003_initdb.pl
 create mode 100644 src/test/modules/test_aio/t/TestAio.pm

diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
index fefa25bc5ab..18a797f3a3b 100644
--- a/src/test/modules/test_aio/meson.build
+++ b/src/test/modules/test_aio/meson.build
@@ -32,6 +32,7 @@ tests += {
     'tests': [
       't/001_aio.pl',
       't/002_io_workers.pl',
+      't/003_initdb.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_aio/t/001_aio.pl b/src/test/modules/test_aio/t/001_aio.pl
index 5c634ec3ca9..27ee96898e0 100644
--- a/src/test/modules/test_aio/t/001_aio.pl
+++ b/src/test/modules/test_aio/t/001_aio.pl
@@ -7,126 +7,55 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+use FindBin;
+use lib $FindBin::RealBin;
 
-###
-# Test io_method=worker
-###
-my $node_worker = create_node('worker');
-$node_worker->start();
-
-test_generic('worker', $node_worker);
-SKIP:
-{
-	skip 'Injection points not supported by this build', 1
-	  unless $ENV{enable_injection_points} eq 'yes';
-	test_inject_worker('worker', $node_worker);
-}
+use TestAio;
 
-$node_worker->stop();
+my %nodes;
 
 
 ###
-# Test io_method=io_uring
+# Create and configure one instance for each io_method
 ###
 
-if (have_io_uring())
+foreach my $method (TestAio::supported_io_methods())
 {
-	my $node_uring = create_node('io_uring');
-	$node_uring->start();
-	test_generic('io_uring', $node_uring);
-	$node_uring->stop();
-}
-
-
-###
-# Test io_method=sync
-###
-
-my $node_sync = create_node('sync');
+	my $node = PostgreSQL::Test::Cluster->new($method);
 
-# just to have one test not use the default auto-tuning
+	$nodes{$method} = $node;
+	$node->init();
+	$node->append_conf('postgresql.conf', "io_method=$method");
+	TestAio::configure($node);
+}
 
-$node_sync->append_conf(
+# Just to have one test not use the default auto-tuning
+$nodes{'sync'}->append_conf(
 	'postgresql.conf', qq(
-io_max_concurrency=4
+ io_max_concurrency=4
 ));
 
-$node_sync->start();
-test_generic('sync', $node_sync);
-$node_sync->stop();
-
-done_testing();
-
 
 ###
-# Test Helpers
+# Execute the tests for each io_method
 ###
 
-sub create_node
+foreach my $method (TestAio::supported_io_methods())
 {
-	local $Test::Builder::Level = $Test::Builder::Level + 1;
-
-	my $io_method = shift;
+	my $node = $nodes{$method};
 
-	my $node = PostgreSQL::Test::Cluster->new($io_method);
-
-	# Want to test initdb for each IO method, otherwise we could just reuse
-	# the cluster.
-	#
-	# Unfortunately Cluster::init() puts PG_TEST_INITDB_EXTRA_OPTS after the
-	# options specified by ->extra, if somebody puts -c io_method=xyz in
-	# PG_TEST_INITDB_EXTRA_OPTS it would break this test. Fix that up if we
-	# detect it.
-	local $ENV{PG_TEST_INITDB_EXTRA_OPTS} = $ENV{PG_TEST_INITDB_EXTRA_OPTS};
-	if (defined $ENV{PG_TEST_INITDB_EXTRA_OPTS}
-		&& $ENV{PG_TEST_INITDB_EXTRA_OPTS} =~ m/io_method=/)
-	{
-		$ENV{PG_TEST_INITDB_EXTRA_OPTS} .= " -c io_method=$io_method";
-	}
-
-	$node->init(extra => [ '-c', "io_method=$io_method" ]);
-
-	$node->append_conf(
-		'postgresql.conf', qq(
-shared_preload_libraries=test_aio
-log_min_messages = 'DEBUG3'
-log_statement=all
-log_error_verbosity=default
-restart_after_crash=false
-temp_buffers=100
-));
+	$node->start();
+	test_io_method($method, $node);
+	$node->stop();
+}
 
-	# Even though we used -c io_method=... above, if TEMP_CONFIG sets
-	# io_method, it'd override the setting persisted at initdb time. While
-	# using (and later verifying) the setting from initdb provides some
-	# verification of having used the io_method during initdb, it's probably
-	# not worth the complication of only appending if the variable is set in
-	# in TEMP_CONFIG.
-	$node->append_conf(
-		'postgresql.conf', qq(
-io_method=$io_method
-));
+done_testing();
 
-	ok(1, "$io_method: initdb");
 
-	return $node;
-}
+###
+# Test Helpers
+###
 
-sub have_io_uring
-{
-	# To detect if io_uring is supported, we look at the error message for
-	# assigning an invalid value to an enum GUC, which lists all the valid
-	# options. We need to use -C to deal with running as administrator on
-	# windows, the superuser check is omitted if -C is used.
-	my ($stdout, $stderr) =
-	  run_command [qw(postgres -C invalid -c io_method=invalid)];
-	die "can't determine supported io_method values"
-	  unless $stderr =~ m/Available values: ([^\.]+)\./;
-	my $methods = $1;
-	note "supported io_method values are: $methods";
-
-	return ($methods =~ m/io_uring/) ? 1 : 0;
-}
 
 sub psql_like
 {
@@ -1490,8 +1419,8 @@ SELECT read_rel_block_ll('tbl_cs_fail', 3, nblocks=>1, zero_on_error=>true);),
 }
 
 
-# Run all tests that are supported for all io_methods
-sub test_generic
+# Run all tests that for the specified node / io_method
+sub test_io_method
 {
 	my $io_method = shift;
 	my $node = shift;
@@ -1526,10 +1455,23 @@ CHECKPOINT;
 	test_ignore_checksum($io_method, $node);
 	test_checksum_createdb($io_method, $node);
 
+	# generic injection tests
   SKIP:
 	{
 		skip 'Injection points not supported by this build', 1
 		  unless $ENV{enable_injection_points} eq 'yes';
 		test_inject($io_method, $node);
 	}
+
+	# worker specific injection tests
+	if ($io_method eq 'worker')
+	{
+	  SKIP:
+		{
+			skip 'Injection points not supported by this build', 1
+			  unless $ENV{enable_injection_points} eq 'yes';
+
+			test_inject_worker($io_method, $node);
+		}
+	}
 }
diff --git a/src/test/modules/test_aio/t/003_initdb.pl b/src/test/modules/test_aio/t/003_initdb.pl
new file mode 100644
index 00000000000..c03ae58d00a
--- /dev/null
+++ b/src/test/modules/test_aio/t/003_initdb.pl
@@ -0,0 +1,71 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+#
+# Test initdb for each IO method. This is done separately from 001_aio.pl, as
+# it isn't fast. This way the more commonly failing / hacked-on 001_aio.pl can
+# be iterated on more quickly.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use TestAio;
+
+
+foreach my $method (TestAio::supported_io_methods())
+{
+	test_create_node($method);
+}
+
+done_testing();
+
+
+sub test_create_node
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $io_method = shift;
+
+	my $node = PostgreSQL::Test::Cluster->new($io_method);
+
+	# Want to test initdb for each IO method, otherwise we could just reuse
+	# the cluster.
+	#
+	# Unfortunately Cluster::init() puts PG_TEST_INITDB_EXTRA_OPTS after the
+	# options specified by ->extra, if somebody puts -c io_method=xyz in
+	# PG_TEST_INITDB_EXTRA_OPTS it would break this test. Fix that up if we
+	# detect it.
+	local $ENV{PG_TEST_INITDB_EXTRA_OPTS} = $ENV{PG_TEST_INITDB_EXTRA_OPTS};
+	if (defined $ENV{PG_TEST_INITDB_EXTRA_OPTS}
+		&& $ENV{PG_TEST_INITDB_EXTRA_OPTS} =~ m/io_method=/)
+	{
+		$ENV{PG_TEST_INITDB_EXTRA_OPTS} .= " -c io_method=$io_method";
+	}
+
+	$node->init(extra => [ '-c', "io_method=$io_method" ]);
+
+	TestAio::configure($node);
+
+	# Even though we used -c io_method=... above, if TEMP_CONFIG sets
+	# io_method, it'd override the setting persisted at initdb time. While
+	# using (and later verifying) the setting from initdb provides some
+	# verification of having used the io_method during initdb, it's probably
+	# not worth the complication of only appending if the variable is set in
+	# in TEMP_CONFIG.
+	$node->append_conf(
+		'postgresql.conf', qq(
+io_method=$io_method
+));
+
+	ok(1, "$io_method: initdb");
+
+	$node->start();
+	$node->stop();
+	ok(1, "$io_method: start & stop");
+
+	return $node;
+}
diff --git a/src/test/modules/test_aio/t/TestAio.pm b/src/test/modules/test_aio/t/TestAio.pm
new file mode 100644
index 00000000000..5bc80a9b130
--- /dev/null
+++ b/src/test/modules/test_aio/t/TestAio.pm
@@ -0,0 +1,90 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+=pod
+
+=head1 NAME
+
+TestAio - helpers for writing AIO related tests
+
+=cut
+
+package TestAio;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+
+=pod
+
+=head1 METHODS
+
+=over
+
+=item TestAio::supported_io_methods()
+
+Return an array of all the supported values for the io_method GUC
+
+=cut
+
+sub supported_io_methods()
+{
+	my @io_methods = ('worker');
+
+	push(@io_methods, "io_uring") if have_io_uring();
+
+	# Return sync last, as it will least commonly fail
+	push(@io_methods, "sync");
+
+	return @io_methods;
+}
+
+
+=item TestAio::configure()
+
+Prepare a cluster for AIO test
+
+=cut
+
+sub configure
+{
+	my $node = shift;
+
+	$node->append_conf(
+		'postgresql.conf', qq(
+shared_preload_libraries=test_aio
+log_min_messages = 'DEBUG3'
+log_statement=all
+log_error_verbosity=default
+restart_after_crash=false
+temp_buffers=100
+));
+
+}
+
+
+=pod
+
+=item TestAio::have_io_uring()
+
+Return if io_uring is supported
+
+=cut
+
+sub have_io_uring
+{
+	# To detect if io_uring is supported, we look at the error message for
+	# assigning an invalid value to an enum GUC, which lists all the valid
+	# options. We need to use -C to deal with running as administrator on
+	# windows, the superuser check is omitted if -C is used.
+	my ($stdout, $stderr) =
+	  run_command [qw(postgres -C invalid -c io_method=invalid)];
+	die "can't determine supported io_method values"
+	  unless $stderr =~ m/Available values: ([^\.]+)\./;
+	my $methods = $1;
+	note "supported io_method values are: $methods";
+
+	return ($methods =~ m/io_uring/) ? 1 : 0;
+}
+
+1;
-- 
2.43.0



  [text/x-patch] v3-0002-test_aio-Add-read_stream-test-infrastructure-test.patch (22.8K, 3-v3-0002-test_aio-Add-read_stream-test-infrastructure-test.patch)
  download | inline diff:
From a88039dc34144abc6ad742938435538d8dc70f8c Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Wed, 10 Sep 2025 14:00:02 -0400
Subject: [PATCH v3 2/5] test_aio: Add read_stream test infrastructure & tests

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/test/modules/test_aio/meson.build         |   1 +
 .../modules/test_aio/t/004_read_stream.pl     | 282 ++++++++++++++
 src/test/modules/test_aio/test_aio--1.0.sql   |  26 +-
 src/test/modules/test_aio/test_aio.c          | 344 +++++++++++++++---
 src/tools/pgindent/typedefs.list              |   1 +
 5 files changed, 602 insertions(+), 52 deletions(-)
 create mode 100644 src/test/modules/test_aio/t/004_read_stream.pl

diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
index 18a797f3a3b..909f81d96c1 100644
--- a/src/test/modules/test_aio/meson.build
+++ b/src/test/modules/test_aio/meson.build
@@ -33,6 +33,7 @@ tests += {
       't/001_aio.pl',
       't/002_io_workers.pl',
       't/003_initdb.pl',
+      't/004_read_stream.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_aio/t/004_read_stream.pl b/src/test/modules/test_aio/t/004_read_stream.pl
new file mode 100644
index 00000000000..89cfabbb1d3
--- /dev/null
+++ b/src/test/modules/test_aio/t/004_read_stream.pl
@@ -0,0 +1,282 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use TestAio;
+
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+$node->init();
+
+$node->append_conf(
+	'postgresql.conf', qq(
+shared_preload_libraries=test_aio
+log_min_messages = 'DEBUG3'
+log_statement=all
+log_error_verbosity=default
+restart_after_crash=false
+temp_buffers=100
+max_connections=8
+io_method=worker
+));
+
+$node->start();
+test_setup($node);
+$node->stop();
+
+
+foreach my $method (TestAio::supported_io_methods())
+{
+	$node->adjust_conf('postgresql.conf', 'io_method', 'worker');
+	$node->start();
+	test_io_method($method, $node);
+	$node->stop();
+}
+
+done_testing();
+
+
+sub test_setup
+{
+	my $node = shift;
+
+	$node->safe_psql(
+		'postgres', qq(
+CREATE EXTENSION test_aio;
+
+CREATE TABLE largeish(k int not null) WITH (FILLFACTOR=10);
+INSERT INTO largeish(k) SELECT generate_series(1, 10000);
+));
+	ok(1, "setup");
+}
+
+
+sub test_repeated_blocks
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	my $psql = $node->background_psql('postgres', on_error_stop => 0);
+
+	# Preventing larger reads makes testing easier
+	$psql->query_safe(
+		qq/
+SET io_combine_limit = 1;
+/);
+
+	# test miss of the same block twice in a row
+	$psql->query_safe(
+		qq/
+SELECT evict_rel('largeish');
+/);
+	$psql->query_safe(
+		qq/
+SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 2, 4, 4]);
+/);
+	ok(1, "$io_method: stream missing the same block repeatedly");
+
+	$psql->query_safe(
+		qq/
+SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 2, 4, 4]);
+/);
+	ok(1, "$io_method: stream hitting the same block repeatedly");
+
+	# test hit of the same block twice in a row
+	$psql->query_safe(
+		qq/
+SELECT evict_rel('largeish');
+/);
+	$psql->query_safe(
+		qq/
+SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0]);
+/);
+	ok(1, "$io_method: stream accessing same block");
+
+	$psql->quit();
+}
+
+
+sub test_inject_foreign
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	my $psql_a = $node->background_psql('postgres', on_error_stop => 0);
+	my $psql_b = $node->background_psql('postgres', on_error_stop => 0);
+
+	my $pid_a = $psql_a->query_safe(qq/SELECT pg_backend_pid();/);
+
+
+	###
+	# Test read stream encountering buffers undergoing IO in another backend,
+	# with the other backend's reads succeeding.
+	###
+	$psql_a->query_safe(
+		qq/
+SELECT evict_rel('largeish');
+/);
+
+	$psql_b->query_safe(
+		qq/
+SELECT inj_io_completion_wait(pid=>pg_backend_pid(), relfilenode=>pg_relation_filenode('largeish'));
+/);
+
+	$psql_b->{stdin} .= qq/
+SELECT read_rel_block_ll('largeish', blockno=>5, nblocks=>1);
+/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until(
+		'postgres', qq/
+SELECT wait_event FROM pg_stat_activity WHERE wait_event = 'completion_wait';
+/,
+		'completion_wait');
+
+	$psql_a->{stdin} .= qq/
+SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 5, 7]);
+/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres', qq/SELECT inj_io_completion_continue()/);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,5,7\}/);
+
+	ok(1,
+		qq/$io_method: read stream encounters succeeding IO by another backend/
+	);
+
+
+	###
+	# Test read stream encountering buffers undergoing IO in another backend,
+	# with the other backend's reads failing.
+	###
+	$psql_a->query_safe(
+		qq/
+SELECT evict_rel('largeish');
+/);
+
+	$psql_b->query_safe(
+		qq/
+SELECT inj_io_completion_wait(pid=>pg_backend_pid(), relfilenode=>pg_relation_filenode('largeish'));
+/);
+
+	$psql_b->query_safe(
+		qq/
+SELECT inj_io_short_read_attach(-errno_from_string('EIO'), pid=>pg_backend_pid(), relfilenode=>pg_relation_filenode('largeish'));
+/);
+
+	$psql_b->{stdin} .= qq/
+SELECT read_rel_block_ll('largeish', blockno=>5, nblocks=>1);
+/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until(
+		'postgres', qq/
+SELECT wait_event FROM pg_stat_activity WHERE wait_event = 'completion_wait';
+/,
+		'completion_wait');
+
+	$psql_a->{stdin} .= qq/
+SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 5, 7]);
+/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres', qq/SELECT inj_io_completion_continue()/);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,5,7\}/);
+
+	$psql_b->{run}->pump_nb();
+	like(
+		$psql_b->{stderr},
+		qr/.*ERROR.*could not read blocks 5..5.*$/,
+		"$io_method: injected error occurred");
+	$psql_b->{stderr} = '';
+	$psql_b->query_safe(qq/SELECT inj_io_short_read_detach();/);
+
+
+	ok(1,
+		qq/$io_method: read stream encounters failing IO by another backend/);
+
+
+	###
+	# Test read stream encountering two buffers that are undergoing the same
+	# IO, started by another backend
+	###
+	$psql_a->query_safe(
+		qq/
+SELECT evict_rel('largeish');
+/);
+
+	$psql_b->query_safe(
+		qq/
+SELECT inj_io_completion_wait(pid=>pg_backend_pid(), relfilenode=>pg_relation_filenode('largeish'));
+/);
+
+	$psql_b->{stdin} .= qq/
+SELECT read_rel_block_ll('largeish', blockno=>2, nblocks=>3);
+/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until(
+		'postgres', qq/
+SELECT wait_event FROM pg_stat_activity WHERE wait_event = 'completion_wait';
+/,
+		'completion_wait');
+
+	$psql_a->{stdin} .= qq/
+SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 4]);
+/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres', qq/SELECT inj_io_completion_continue()/);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,4\}/);
+
+	ok(1, qq/$io_method: read stream encounters two buffer read in one IO/);
+
+
+	$psql_a->quit();
+	$psql_b->quit();
+}
+
+
+sub test_io_method
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	test_repeated_blocks($io_method, $node);
+
+  SKIP:
+	{
+		skip 'Injection points not supported by this build', 1
+		  unless $ENV{enable_injection_points} eq 'yes';
+		test_inject_foreign($io_method, $node);
+	}
+}
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
index e495481c41e..da7cc03829a 100644
--- a/src/test/modules/test_aio/test_aio--1.0.sql
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -33,6 +33,10 @@ CREATE FUNCTION read_rel_block_ll(
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
+CREATE FUNCTION evict_rel(rel regclass)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
 CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
@@ -41,7 +45,7 @@ CREATE FUNCTION buffer_create_toy(rel regclass, blockno int4)
 RETURNS pg_catalog.int4 STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
-CREATE FUNCTION buffer_call_start_io(buffer int, for_input bool, nowait bool)
+CREATE FUNCTION buffer_call_start_io(buffer int, for_input bool, nowait bool, assign_io bool DEFAULT false)
 RETURNS pg_catalog.bool STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
@@ -50,6 +54,14 @@ RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
 
+/*
+ * Read stream related functions
+ */
+CREATE FUNCTION read_stream_for_blocks(rel regclass, blocks int4[], OUT blockoff int4, OUT blocknum int4, OUT buf int4)
+RETURNS SETOF record STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
 
 /*
  * Handle related functions
@@ -91,8 +103,16 @@ AS 'MODULE_PATHNAME' LANGUAGE C;
 /*
  * Injection point related functions
  */
-CREATE FUNCTION inj_io_short_read_attach(result int)
-RETURNS pg_catalog.void STRICT
+CREATE FUNCTION inj_io_completion_wait(pid int DEFAULT NULL, relfilenode oid DEFAULT 0)
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_completion_continue()
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_attach(result int, pid int DEFAULT NULL, relfilenode oid DEFAULT 0)
+RETURNS pg_catalog.void
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
 CREATE FUNCTION inj_io_short_read_detach()
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
index b1aa8af9ec0..911a7102a34 100644
--- a/src/test/modules/test_aio/test_aio.c
+++ b/src/test/modules/test_aio/test_aio.c
@@ -20,16 +20,23 @@
 
 #include "access/relation.h"
 #include "fmgr.h"
+#include "funcapi.h"
 #include "storage/aio.h"
 #include "storage/aio_internal.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/checksum.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/procnumber.h"
+#include "storage/read_stream.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/injection_point.h"
 #include "utils/rel.h"
+#include "utils/wait_event.h"
 
 
 PG_MODULE_MAGIC;
@@ -37,13 +44,30 @@ PG_MODULE_MAGIC;
 
 typedef struct InjIoErrorState
 {
+	ConditionVariable cv;
+
 	bool		enabled_short_read;
 	bool		enabled_reopen;
 
+	bool		enabled_completion_wait;
+	Oid			completion_wait_relfilenode;
+	pid_t		completion_wait_pid;
+	uint32		completion_wait_event;
+
 	bool		short_read_result_set;
+	Oid			short_read_relfilenode;
+	pid_t		short_read_pid;
 	int			short_read_result;
 } InjIoErrorState;
 
+typedef struct BlocksReadStreamData
+{
+	int			nblocks;
+	int			curblock;
+	uint32	   *blocks;
+} BlocksReadStreamData;
+
+
 static InjIoErrorState *inj_io_error_state;
 
 /* Shared memory init callbacks */
@@ -85,10 +109,13 @@ test_aio_shmem_startup(void)
 		inj_io_error_state->enabled_short_read = false;
 		inj_io_error_state->enabled_reopen = false;
 
+		ConditionVariableInit(&inj_io_error_state->cv);
+		inj_io_error_state->completion_wait_event = WaitEventInjectionPointNew("completion_wait");
+
 #ifdef USE_INJECTION_POINTS
 		InjectionPointAttach("aio-process-completion-before-shared",
 							 "test_aio",
-							 "inj_io_short_read",
+							 "inj_io_completion_hook",
 							 NULL,
 							 0);
 		InjectionPointLoad("aio-process-completion-before-shared");
@@ -384,7 +411,7 @@ read_rel_block_ll(PG_FUNCTION_ARGS)
 	if (nblocks <= 0 || nblocks > PG_IOV_MAX)
 		elog(ERROR, "nblocks is out of range");
 
-	rel = relation_open(relid, AccessExclusiveLock);
+	rel = relation_open(relid, AccessShareLock);
 
 	for (int i = 0; i < nblocks; i++)
 	{
@@ -458,6 +485,27 @@ read_rel_block_ll(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+PG_FUNCTION_INFO_V1(evict_rel);
+Datum
+evict_rel(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	int32		buffers_evicted,
+				buffers_flushed,
+				buffers_skipped;
+
+	rel = relation_open(relid, AccessExclusiveLock);
+
+	EvictRelUnpinnedBuffers(rel, &buffers_evicted, &buffers_flushed,
+							&buffers_skipped);
+
+	relation_close(rel, AccessExclusiveLock);
+
+
+	PG_RETURN_VOID();
+}
+
 PG_FUNCTION_INFO_V1(invalidate_rel_block);
 Datum
 invalidate_rel_block(PG_FUNCTION_ARGS)
@@ -610,6 +658,86 @@ buffer_call_terminate_io(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+
+static BlockNumber
+read_stream_for_blocks_cb(ReadStream *stream,
+						  void *callback_private_data,
+						  void *per_buffer_data)
+{
+	BlocksReadStreamData *stream_data = callback_private_data;
+
+	if (stream_data->curblock >= stream_data->nblocks)
+		return InvalidBlockNumber;
+	return stream_data->blocks[stream_data->curblock++];
+}
+
+PG_FUNCTION_INFO_V1(read_stream_for_blocks);
+Datum
+read_stream_for_blocks(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	ArrayType  *blocksarray = PG_GETARG_ARRAYTYPE_P(1);
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Relation	rel;
+	BlocksReadStreamData stream_data;
+	ReadStream *stream;
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	/*
+	 * We expect the input to be an N-element int4 array; verify that. We
+	 * don't need to use deconstruct_array() since the array data is just
+	 * going to look like a C array of N int4 values.
+	 */
+	if (ARR_NDIM(blocksarray) != 1 ||
+		ARR_HASNULL(blocksarray) ||
+		ARR_ELEMTYPE(blocksarray) != INT4OID)
+		elog(ERROR, "expected 1 dimensional int4 array");
+
+	stream_data.curblock = 0;
+	stream_data.nblocks = ARR_DIMS(blocksarray)[0];
+	stream_data.blocks = (uint32 *) ARR_DATA_PTR(blocksarray);
+
+	rel = relation_open(relid, AccessShareLock);
+
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										NULL,
+										rel,
+										MAIN_FORKNUM,
+										read_stream_for_blocks_cb,
+										&stream_data,
+										0);
+
+	for (int i = 0; i < stream_data.nblocks; i++)
+	{
+		Buffer		buf = read_stream_next_buffer(stream, NULL);
+		Datum		values[3] = {0};
+		bool		nulls[3] = {0};
+
+		if (!BufferIsValid(buf))
+			elog(ERROR, "read_stream_next_buffer() call %d is unexpectedly invalid", i);
+
+		values[0] = Int32GetDatum(i);
+		values[1] = UInt32GetDatum(stream_data.blocks[i]);
+		values[2] = UInt32GetDatum(buf);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+
+		ReleaseBuffer(buf);
+	}
+
+	if (read_stream_next_buffer(stream, NULL) != InvalidBuffer)
+		elog(ERROR, "read_stream_next_buffer() call %d is unexpectedly valid",
+			 stream_data.nblocks + 1);
+
+	read_stream_end(stream);
+
+	relation_close(rel, NoLock);
+
+	return (Datum) 0;
+}
+
+
 PG_FUNCTION_INFO_V1(handle_get);
 Datum
 handle_get(PG_FUNCTION_ARGS)
@@ -680,15 +808,98 @@ batch_end(PG_FUNCTION_ARGS)
 }
 
 #ifdef USE_INJECTION_POINTS
-extern PGDLLEXPORT void inj_io_short_read(const char *name,
-										  const void *private_data,
-										  void *arg);
+extern PGDLLEXPORT void inj_io_completion_hook(const char *name,
+											   const void *private_data,
+											   void *arg);
 extern PGDLLEXPORT void inj_io_reopen(const char *name,
 									  const void *private_data,
 									  void *arg);
 
-void
-inj_io_short_read(const char *name, const void *private_data, void *arg)
+static bool
+inj_io_short_read_matches(PgAioHandle *ioh)
+{
+	PGPROC	   *owner_proc;
+	int32		owner_pid;
+	PgAioTargetData *td;
+
+	if (!inj_io_error_state->enabled_short_read)
+		return false;
+
+	if (!inj_io_error_state->short_read_result_set)
+		return false;
+
+	owner_proc = GetPGProcByNumber(pgaio_io_get_owner(ioh));
+	owner_pid = owner_proc->pid;
+
+	if (inj_io_error_state->short_read_pid != 0 &&
+		inj_io_error_state->short_read_pid != owner_pid)
+		return false;
+
+	td = pgaio_io_get_target_data(ioh);
+
+	if (inj_io_error_state->short_read_relfilenode != InvalidOid &&
+		td->smgr.rlocator.relNumber != inj_io_error_state->short_read_relfilenode)
+		return false;
+
+	/*
+	 * Only shorten reads that are actually longer than the target size,
+	 * otherwise we can trigger over-reads.
+	 */
+	if (inj_io_error_state->short_read_result >= ioh->result)
+		return false;
+
+	return true;
+}
+
+static bool
+inj_io_completion_wait_matches(PgAioHandle *ioh)
+{
+	PGPROC	   *owner_proc;
+	int32		owner_pid;
+	PgAioTargetData *td;
+
+	if (!inj_io_error_state->enabled_completion_wait)
+		return false;
+
+	owner_proc = GetPGProcByNumber(pgaio_io_get_owner(ioh));
+	owner_pid = owner_proc->pid;
+
+	if (inj_io_error_state->completion_wait_pid != owner_pid)
+		return false;
+
+	td = pgaio_io_get_target_data(ioh);
+
+	if (inj_io_error_state->completion_wait_relfilenode != InvalidOid &&
+		td->smgr.rlocator.relNumber != inj_io_error_state->completion_wait_relfilenode)
+		return false;
+
+	return true;
+}
+
+static void
+inj_io_completion_wait_hook(const char *name, const void *private_data, void *arg)
+{
+	PgAioHandle *ioh = (PgAioHandle *) arg;
+
+	if (!inj_io_completion_wait_matches(ioh))
+		return;
+
+	ConditionVariablePrepareToSleep(&inj_io_error_state->cv);
+
+	while (true)
+	{
+		if (!inj_io_completion_wait_matches(ioh))
+			break;
+
+		ConditionVariableSleep(&inj_io_error_state->cv,
+							   inj_io_error_state->completion_wait_event);
+	}
+
+	ConditionVariableCancelSleep();
+}
+
+static void
+inj_io_short_read_hook(const char *name, const void *private_data, void *arg)
 {
 	PgAioHandle *ioh = (PgAioHandle *) arg;
 
@@ -697,58 +908,56 @@ inj_io_short_read(const char *name, const void *private_data, void *arg)
 				   inj_io_error_state->enabled_reopen),
 			errhidestmt(true), errhidecontext(true));
 
-	if (inj_io_error_state->enabled_short_read)
+	if (inj_io_short_read_matches(ioh))
 	{
+		struct iovec *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+		int32		old_result = ioh->result;
+		int32		new_result = inj_io_error_state->short_read_result;
+		int32		processed = 0;
+
+		ereport(LOG,
+				errmsg("short read inject point, changing result from %d to %d",
+					   old_result, new_result),
+				errhidestmt(true), errhidecontext(true));
+
 		/*
-		 * Only shorten reads that are actually longer than the target size,
-		 * otherwise we can trigger over-reads.
+		 * The underlying IO actually completed OK, and thus the "invalid"
+		 * portion of the IOV actually contains valid data. That can hide a
+		 * lot of problems, e.g. if we were to wrongly mark a buffer, that
+		 * wasn't read according to the shortened-read, IO as valid, the
+		 * contents would look valid and we might miss a bug.
+		 *
+		 * To avoid that, iterate through the IOV and zero out the "failed"
+		 * portion of the IO.
 		 */
-		if (inj_io_error_state->short_read_result_set
-			&& ioh->op == PGAIO_OP_READV
-			&& inj_io_error_state->short_read_result <= ioh->result)
+		for (int i = 0; i < ioh->op_data.read.iov_length; i++)
 		{
-			struct iovec *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
-			int32		old_result = ioh->result;
-			int32		new_result = inj_io_error_state->short_read_result;
-			int32		processed = 0;
-
-			ereport(LOG,
-					errmsg("short read inject point, changing result from %d to %d",
-						   old_result, new_result),
-					errhidestmt(true), errhidecontext(true));
-
-			/*
-			 * The underlying IO actually completed OK, and thus the "invalid"
-			 * portion of the IOV actually contains valid data. That can hide
-			 * a lot of problems, e.g. if we were to wrongly mark a buffer,
-			 * that wasn't read according to the shortened-read, IO as valid,
-			 * the contents would look valid and we might miss a bug.
-			 *
-			 * To avoid that, iterate through the IOV and zero out the
-			 * "failed" portion of the IO.
-			 */
-			for (int i = 0; i < ioh->op_data.read.iov_length; i++)
+			if (processed + iov[i].iov_len <= new_result)
+				processed += iov[i].iov_len;
+			else if (processed <= new_result)
 			{
-				if (processed + iov[i].iov_len <= new_result)
-					processed += iov[i].iov_len;
-				else if (processed <= new_result)
-				{
-					uint32		ok_part = new_result - processed;
-
-					memset((char *) iov[i].iov_base + ok_part, 0, iov[i].iov_len - ok_part);
-					processed += iov[i].iov_len;
-				}
-				else
-				{
-					memset((char *) iov[i].iov_base, 0, iov[i].iov_len);
-				}
-			}
+				uint32		ok_part = new_result - processed;
 
-			ioh->result = new_result;
+				memset((char *) iov[i].iov_base + ok_part, 0, iov[i].iov_len - ok_part);
+				processed += iov[i].iov_len;
+			}
+			else
+			{
+				memset((char *) iov[i].iov_base, 0, iov[i].iov_len);
+			}
 		}
+
+		ioh->result = new_result;
 	}
 }
 
+void
+inj_io_completion_hook(const char *name, const void *private_data, void *arg)
+{
+	inj_io_completion_wait_hook(name, private_data, arg);
+	inj_io_short_read_hook(name, private_data, arg);
+}
+
 void
 inj_io_reopen(const char *name, const void *private_data, void *arg)
 {
@@ -762,6 +971,39 @@ inj_io_reopen(const char *name, const void *private_data, void *arg)
 }
 #endif
 
+PG_FUNCTION_INFO_V1(inj_io_completion_wait);
+Datum
+inj_io_completion_wait(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+	inj_io_error_state->enabled_completion_wait = true;
+	inj_io_error_state->completion_wait_pid =
+		PG_ARGISNULL(0) ? 0 : PG_GETARG_INT32(0);
+	inj_io_error_state->completion_wait_relfilenode =
+		PG_ARGISNULL(1) ? InvalidOid : PG_GETARG_OID(1);
+#else
+	elog(ERROR, "injection points not supported");
+#endif
+
+	PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_completion_continue);
+Datum
+inj_io_completion_continue(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+	inj_io_error_state->enabled_completion_wait = false;
+	inj_io_error_state->completion_wait_pid = 0;
+	inj_io_error_state->completion_wait_relfilenode = InvalidOid;
+	ConditionVariableBroadcast(&inj_io_error_state->cv);
+#else
+	elog(ERROR, "injection points not supported");
+#endif
+
+	PG_RETURN_VOID();
+}
+
 PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
 Datum
 inj_io_short_read_attach(PG_FUNCTION_ARGS)
@@ -771,6 +1013,10 @@ inj_io_short_read_attach(PG_FUNCTION_ARGS)
 	inj_io_error_state->short_read_result_set = !PG_ARGISNULL(0);
 	if (inj_io_error_state->short_read_result_set)
 		inj_io_error_state->short_read_result = PG_GETARG_INT32(0);
+	inj_io_error_state->short_read_pid =
+		PG_ARGISNULL(1) ? 0 : PG_GETARG_INT32(1);
+	inj_io_error_state->short_read_relfilenode =
+		PG_ARGISNULL(2) ? 0 : PG_GETARG_OID(2);
 #else
 	elog(ERROR, "injection points not supported");
 #endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1c8610fd46c..db583985813 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -302,6 +302,7 @@ BlockSampler
 BlockSamplerData
 BlockedProcData
 BlockedProcsData
+BlocksReadStreamData
 BlocktableEntry
 BloomBuildState
 BloomFilter
-- 
2.43.0



  [text/x-patch] v3-0003-fix-test.patch (1.2K, 4-v3-0003-fix-test.patch)
  download | inline diff:
From e9e0bc1c73de0edc23a391a59f48ea3ee64cf707 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 23 Jan 2026 13:09:14 -0500
Subject: [PATCH v3 3/5] fix test

---
 src/test/modules/test_aio/t/004_read_stream.pl | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/src/test/modules/test_aio/t/004_read_stream.pl b/src/test/modules/test_aio/t/004_read_stream.pl
index 89cfabbb1d3..64567de37c2 100644
--- a/src/test/modules/test_aio/t/004_read_stream.pl
+++ b/src/test/modules/test_aio/t/004_read_stream.pl
@@ -205,15 +205,13 @@ SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 5
 		$psql_a->{run}, $psql_a->{timeout},
 		\$psql_a->{stdout}, qr/\{0,2,5,7\}/);
 
-	$psql_b->{run}->pump_nb();
-	like(
-		$psql_b->{stderr},
-		qr/.*ERROR.*could not read blocks 5..5.*$/,
-		"$io_method: injected error occurred");
+	pump_until(
+		$psql_b->{run}, $psql_b->{timeout},
+		\$psql_b->{stderr}, qr/ERROR.*could not read blocks 5\.\.5/);
+	ok(1, "$io_method: injected error occurred");
 	$psql_b->{stderr} = '';
 	$psql_b->query_safe(qq/SELECT inj_io_short_read_detach();/);
 
-
 	ok(1,
 		qq/$io_method: read stream encounters failing IO by another backend/);
 
-- 
2.43.0



  [text/x-patch] v3-0004-Make-buffer-hit-helper.patch (5.8K, 5-v3-0004-Make-buffer-hit-helper.patch)
  download | inline diff:
From 276450fe9f84c409164cbd0f33971a224644eb79 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 23 Jan 2026 13:54:02 -0500
Subject: [PATCH v3 4/5] Make buffer hit helper

Already two places count buffer hits, requiring quite a few lines of
code since we do accounting in so many places. Future commits will add
more locations, so refactor into a helper.
---
 src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++--------------
 1 file changed, 56 insertions(+), 55 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f935648ae9..bad8894011a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -638,6 +638,10 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 									  bool *foundPtr, IOContext io_context);
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
+
+static void ProcessBufferHit(BufferAccessStrategy strategy,
+							 Relation rel, char persistence, SMgrRelation smgr,
+							 ForkNumber forknum, BlockNumber blocknum);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
@@ -1216,8 +1220,6 @@ PinBufferForBlock(Relation rel,
 				  bool *foundPtr)
 {
 	BufferDesc *bufHdr;
-	IOContext	io_context;
-	IOObject	io_object;
 
 	Assert(blockNum != P_NEW);
 
@@ -1226,17 +1228,6 @@ PinBufferForBlock(Relation rel,
 			persistence == RELPERSISTENCE_PERMANENT ||
 			persistence == RELPERSISTENCE_UNLOGGED));
 
-	if (persistence == RELPERSISTENCE_TEMP)
-	{
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-	}
-	else
-	{
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-	}
-
 	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
 									   smgr->smgr_rlocator.locator.spcOid,
 									   smgr->smgr_rlocator.locator.dbOid,
@@ -1244,18 +1235,11 @@ PinBufferForBlock(Relation rel,
 									   smgr->smgr_rlocator.backend);
 
 	if (persistence == RELPERSISTENCE_TEMP)
-	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
-		if (*foundPtr)
-			pgBufferUsage.local_blks_hit++;
-	}
 	else
-	{
 		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
-							 strategy, foundPtr, io_context);
-		if (*foundPtr)
-			pgBufferUsage.shared_blks_hit++;
-	}
+							 strategy, foundPtr, IOContextForStrategy(strategy));
+
 	if (rel)
 	{
 		/*
@@ -1264,22 +1248,10 @@ PinBufferForBlock(Relation rel,
 		 * zeroed instead), the per-relation stats always count them.
 		 */
 		pgstat_count_buffer_read(rel);
-		if (*foundPtr)
-			pgstat_count_buffer_hit(rel);
 	}
-	if (*foundPtr)
-	{
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  true);
-	}
+	if (*foundPtr)
+		ProcessBufferHit(strategy, rel, persistence, smgr, forkNum, blockNum);
 
 	return BufferDescriptorGetBuffer(bufHdr);
 }
@@ -1685,6 +1657,51 @@ ReadBuffersCanStartIO(Buffer buffer, bool nowait)
 	return ReadBuffersCanStartIOOnce(buffer, nowait);
 }
 
+/*
+ * We track various stats related to buffer hits. Because this is done in a
+ * few separate places, this helper exists for convenience.
+ */
+static void
+ProcessBufferHit(BufferAccessStrategy strategy,
+				 Relation rel, char persistence, SMgrRelation smgr,
+				 ForkNumber forknum, BlockNumber blocknum)
+{
+	IOContext	io_context;
+	IOObject	io_object;
+
+	if (persistence == RELPERSISTENCE_TEMP)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum,
+									  blocknum,
+									  smgr->smgr_rlocator.locator.spcOid,
+									  smgr->smgr_rlocator.locator.dbOid,
+									  smgr->smgr_rlocator.locator.relNumber,
+									  smgr->smgr_rlocator.backend,
+									  true);
+
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_hit += 1;
+	else
+		pgBufferUsage.shared_blks_hit += 1;
+
+	if (rel)
+		pgstat_count_buffer_hit(rel);
+
+	pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
+
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageHit;
+}
+
 /*
  * Helper for WaitReadBuffers() that processes the results of a readv
  * operation, raising an error if necessary.
@@ -1980,25 +1997,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		 * must have started out as a miss in PinBufferForBlock(). The other
 		 * backend will track this as a 'read'.
 		 */
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
-										  operation->smgr->smgr_rlocator.locator.spcOid,
-										  operation->smgr->smgr_rlocator.locator.dbOid,
-										  operation->smgr->smgr_rlocator.locator.relNumber,
-										  operation->smgr->smgr_rlocator.backend,
-										  true);
-
-		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_hit += 1;
-		else
-			pgBufferUsage.shared_blks_hit += 1;
-
-		if (operation->rel)
-			pgstat_count_buffer_hit(operation->rel);
-
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
-
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		ProcessBufferHit(operation->strategy, operation->rel, persistence,
+						 operation->smgr, forknum,
+						 blocknum + operation->nblocks_done);
 	}
 	else
 	{
-- 
2.43.0



  [text/x-patch] v3-0005-Don-t-wait-for-already-in-progress-IO.patch (20.6K, 6-v3-0005-Don-t-wait-for-already-in-progress-IO.patch)
  download | inline diff:
From fb9ba6b67df5060bcd788cbd72988734718c6a7d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 23 Jan 2026 14:00:31 -0500
Subject: [PATCH v3 5/5] Don't wait for already in-progress IO
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When a backend attempts to start a read on a buffer and finds that I/O
is already in progress, it previously waited for that I/O to complete
before initiating reads for any other buffers. Although the backend must
still wait for the I/O to finish when later acquiring the buffer, it
should not need to wait at read start time. Other buffers may be
available for I/O, and in some workloads this waiting significantly
reduces concurrency.

For example, index scans may repeatedly request the same heap block. If
the backend waits each time it encounters an in-progress read, the
access pattern effectively degenerates into synchronous I/O. By
introducing the concept of foreign I/O operations, a backend can record
the buffer’s wait reference and defer waiting until WaitReadBuffers()
when it actually acquires the buffer.

In rare cases, a backend may still need to wait when starting a read if
it encounters a buffer after another backend has set BM_IO_IN_PROGRESS
but before the buffer descriptor’s wait reference has been set. Such
windows should be brief and uncommon.
---
 src/backend/storage/buffer/bufmgr.c | 481 ++++++++++++++++++----------
 src/include/storage/bufmgr.h        |   1 +
 src/tools/pgindent/typedefs.list    |   1 +
 3 files changed, 320 insertions(+), 163 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bad8894011a..55c77e10a81 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -169,6 +169,21 @@ typedef struct SMgrSortArray
 	SMgrRelation srel;
 } SMgrSortArray;
 
+
+/*
+ * In AsyncReadBuffers(), when preparing a buffer for reading and setting
+ * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may
+ * already contain the desired block. AsyncReadBuffers() must distinguish
+ * between these cases (and the case where it should initiate I/O) so it can
+ * mark an in-progress buffer as foreign I/O rather than waiting on it.
+ */
+typedef enum PrepareReadBuffer_Status
+{
+	READ_BUFFER_ALREADY_DONE,
+	READ_BUFFER_IN_PROGRESS,
+	READ_BUFFER_READY_FOR_IO,
+} PrepareReadBuffer_Status;
+
 /* GUC variables */
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
@@ -1618,45 +1633,6 @@ CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete)
 #endif
 }
 
-/* helper for ReadBuffersCanStartIO(), to avoid repetition */
-static inline bool
-ReadBuffersCanStartIOOnce(Buffer buffer, bool nowait)
-{
-	if (BufferIsLocal(buffer))
-		return StartLocalBufferIO(GetLocalBufferDescriptor(-buffer - 1),
-								  true, nowait);
-	else
-		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
-}
-
-/*
- * Helper for AsyncReadBuffers that tries to get the buffer ready for IO.
- */
-static inline bool
-ReadBuffersCanStartIO(Buffer buffer, bool nowait)
-{
-	/*
-	 * If this backend currently has staged IO, we need to submit the pending
-	 * IO before waiting for the right to issue IO, to avoid the potential for
-	 * deadlocks (and, more commonly, unnecessary delays for other backends).
-	 */
-	if (!nowait && pgaio_have_staged())
-	{
-		if (ReadBuffersCanStartIOOnce(buffer, true))
-			return true;
-
-		/*
-		 * Unfortunately StartBufferIO() returning false doesn't allow to
-		 * distinguish between the buffer already being valid and IO already
-		 * being in progress. Since IO already being in progress is quite
-		 * rare, this approach seems fine.
-		 */
-		pgaio_submit_staged();
-	}
-
-	return ReadBuffersCanStartIOOnce(buffer, nowait);
-}
-
 /*
  * We track various stats related to buffer hits. Because this is done in a
  * few separate places, this helper exists for convenience.
@@ -1806,7 +1782,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			 *
 			 * we first check if we already know the IO is complete.
 			 */
-			if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
+			if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
 				!pgaio_wref_check_done(&operation->io_wref))
 			{
 				instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
@@ -1825,11 +1801,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				Assert(pgaio_wref_check_done(&operation->io_wref));
 			}
 
-			/*
-			 * We now are sure the IO completed. Check the results. This
-			 * includes reporting on errors if there were any.
-			 */
-			ProcessReadBuffersResult(operation);
+			if (unlikely(operation->foreign_io))
+			{
+				Buffer		buffer = operation->buffers[operation->nblocks_done];
+				BufferDesc *desc = BufferIsLocal(buffer) ?
+					GetLocalBufferDescriptor(-buffer - 1) :
+					GetBufferDescriptor(buffer - 1);
+				uint32		buf_state = pg_atomic_read_u64(&desc->state);
+
+				if (buf_state & BM_VALID)
+				{
+					operation->nblocks_done += 1;
+					Assert(operation->nblocks_done <= operation->nblocks);
+
+					ProcessBufferHit(operation->strategy,
+									 operation->rel, operation->persistence,
+									 operation->smgr, operation->forknum,
+									 operation->blocknum + operation->nblocks_done);
+				}
+			}
+			else
+			{
+				/*
+				 * We now are sure the IO completed. Check the results. This
+				 * includes reporting on errors if there were any.
+				 */
+				ProcessReadBuffersResult(operation);
+			}
 		}
 
 		/*
@@ -1860,6 +1858,159 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	/* NB: READ_DONE tracepoint was already executed in completion callback */
 }
 
+/*
+ * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
+ * avoid an external function call.
+ */
+static PrepareReadBuffer_Status
+PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
+							Buffer buffer)
+{
+	BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
+	uint64		buf_state = pg_atomic_read_u64(&desc->state);
+
+	/* Already valid, no work to do */
+	if (buf_state & BM_VALID)
+	{
+		pgaio_wref_clear(&operation->io_wref);
+		return READ_BUFFER_ALREADY_DONE;
+	}
+
+	pgaio_submit_staged();
+
+	if (pgaio_wref_valid(&desc->io_wref))
+	{
+		operation->io_wref = desc->io_wref;
+		operation->foreign_io = true;
+		return READ_BUFFER_IN_PROGRESS;
+	}
+
+	return READ_BUFFER_READY_FOR_IO;
+}
+
+/*
+ * Try to start IO on the first buffer in a new run of blocks. If AIO is in
+ * progress, be it in this backend or another backend, we just associate the
+ * wait reference with the operation and wait in WaitReadBuffers(). This turns
+ * out to be important for performance in two workloads:
+ *
+ * 1) A read stream that has to read the same block multiple times within the
+ *    readahead distance. This can happen e.g. for the table accesses of an
+ *    index scan.
+ *
+ * 2) Concurrent scans by multiple backends on the same relation.
+ *
+ * If we were to synchronously wait for the in-progress IO, we'd not be able
+ * to keep enough I/O in flight.
+ *
+ * If we do find there is ongoing I/O for the buffer, we set up a 1-block
+ * ReadBuffersOperation that WaitReadBuffers then can wait on.
+ *
+ * It's possible that another backend has started IO on the buffer but not yet
+ * set its wait reference. In this case, we have no choice but to wait for
+ * either the wait reference to be valid or the IO to be done.
+ */
+static PrepareReadBuffer_Status
+PrepareNewReadBufferIO(ReadBuffersOperation *operation,
+					   Buffer buffer)
+{
+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+		return PrepareNewLocalReadBufferIO(operation, buffer);
+
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	desc = GetBufferDescriptor(buffer - 1);
+
+	for (;;)
+	{
+		buf_state = LockBufHdr(desc);
+
+		/* Already valid, no work to do */
+		if (buf_state & BM_VALID)
+		{
+			UnlockBufHdr(desc);
+			pgaio_wref_clear(&operation->io_wref);
+			return READ_BUFFER_ALREADY_DONE;
+		}
+
+		if (buf_state & BM_IO_IN_PROGRESS)
+		{
+			/* Join existing read */
+			if (pgaio_wref_valid(&desc->io_wref))
+			{
+				operation->io_wref = desc->io_wref;
+				operation->foreign_io = true;
+				UnlockBufHdr(desc);
+				return READ_BUFFER_IN_PROGRESS;
+			}
+
+			/*
+			 * If the wait ref is not valid but the IO is in progress, someone
+			 * else started IO but hasn't set the wait ref yet. We have no
+			 * choice but to wait until the wait ref is set or the IO
+			 * completes.
+			 */
+			UnlockBufHdr(desc);
+			pgaio_submit_staged();
+			WaitIO(desc);
+			continue;
+		}
+
+		/*
+		 * No IO in progress and not already valid; We will start IO. It's
+		 * possible that the IO was in progress and never became valid because
+		 * the IO errored out. We'll do the IO ourselves.
+		 */
+		UnlockBufHdrExt(desc, buf_state, BM_IO_IN_PROGRESS, 0, 0);
+		ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+									  BufferDescriptorGetBuffer(desc));
+
+		return READ_BUFFER_READY_FOR_IO;
+	}
+}
+
+
+/*
+ * When building a new IO from multiple buffers, we won't include buffers
+ * that are already valid or already in progress. This function should only be
+ * used for additional adjacent buffers following the head buffer in a new IO.
+ *
+ * Returns true if the buffer was successfully prepared for IO and false if it
+ * is rejected and the read IO should not include this buffer.
+*/
+static bool
+PrepareAdditionalReadBuffer(Buffer buffer)
+{
+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+	{
+		desc = GetLocalBufferDescriptor(-buffer - 1);
+		buf_state = pg_atomic_read_u64(&desc->state);
+		/* Local buffers don't use BM_IO_IN_PROGRESS */
+		if (buf_state & BM_VALID || pgaio_wref_valid(&desc->io_wref))
+			return false;
+	}
+	else
+	{
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		desc = GetBufferDescriptor(buffer - 1);
+		buf_state = LockBufHdr(desc);
+		if (buf_state & (BM_VALID | BM_IO_IN_PROGRESS))
+		{
+			UnlockBufHdr(desc);
+			return false;
+		}
+		UnlockBufHdrExt(desc, buf_state, BM_IO_IN_PROGRESS, 0, 0);
+		ResourceOwnerRememberBufferIO(CurrentResourceOwner, buffer);
+	}
+
+	return true;
+}
+
 /*
  * Initiate IO for the ReadBuffersOperation
  *
@@ -1893,7 +2044,75 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	void	   *io_pages[MAX_IO_COMBINE_LIMIT];
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		did_start_io;
+	instr_time	io_start;
+	PrepareReadBuffer_Status status;
+
+	/*
+	 * We must get an IO handle before StartNewBufferReadIO(), as
+	 * pgaio_io_acquire() might block, which we don't want after setting
+	 * IO_IN_PROGRESS. If we don't need to do the IO, we'll release the
+	 * handle.
+	 *
+	 * If we need to wait for IO before we can get a handle, submit
+	 * already-staged IO first, so that other backends don't need to wait.
+	 * There wouldn't be a deadlock risk, as pgaio_io_acquire() just needs to
+	 * wait for already submitted IO, which doesn't require additional locks,
+	 * but it could still cause undesirable waits.
+	 *
+	 * A secondary benefit is that this would allow us to measure the time in
+	 * pgaio_io_acquire() without causing undue timer overhead in the common,
+	 * non-blocking, case.  However, currently the pgstats infrastructure
+	 * doesn't really allow that, as it a) asserts that an operation can't
+	 * have time without operations b) doesn't have an API to report
+	 * "accumulated" time.
+	 */
+	ioh = pgaio_io_acquire_nb(CurrentResourceOwner, &operation->io_return);
+	if (unlikely(!ioh))
+	{
+		pgaio_submit_staged();
+		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
+	}
+
+	operation->foreign_io = false;
+
+	/* Check if we can start IO on the first to-be-read buffer */
+	if ((status = PrepareNewReadBufferIO(operation, buffers[nblocks_done])) <
+		READ_BUFFER_READY_FOR_IO)
+	{
+		pgaio_io_release(ioh);
+		*nblocks_progress = 1;
+		if (status == READ_BUFFER_ALREADY_DONE)
+		{
+			/*
+			 * Someone else has already completed this block, we're done.
+			 *
+			 * When IO is necessary, ->nblocks_done is updated in
+			 * ProcessReadBuffersResult(), but that is not called if no IO is
+			 * necessary. Thus update here.
+			 */
+			operation->nblocks_done += 1;
+			Assert(operation->nblocks_done <= operation->nblocks);
+
+			/*
+			 * Report and track this as a 'hit' for this backend, even though
+			 * it must have started out as a miss in PinBufferForBlock(). The
+			 * other backend will track this as a 'read'.
+			 */
+			ProcessBufferHit(operation->strategy,
+							 operation->rel, operation->persistence,
+							 operation->smgr, operation->forknum,
+							 operation->blocknum + operation->nblocks_done);
+			return false;
+		}
+
+		/* The IO is already in-progress */
+		Assert(status == READ_BUFFER_IN_PROGRESS);
+		CheckReadBuffersOperation(operation, false);
+		return true;
+	}
+
+	/* We can read in at least the head buffer . */
+	Assert(status == READ_BUFFER_READY_FOR_IO);
 
 	/*
 	 * When this IO is executed synchronously, either because the caller will
@@ -1944,138 +2163,74 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	 */
 	pgstat_prepare_report_checksum_failure(operation->smgr->smgr_rlocator.locator.dbOid);
 
-	/*
-	 * Get IO handle before ReadBuffersCanStartIO(), as pgaio_io_acquire()
-	 * might block, which we don't want after setting IO_IN_PROGRESS.
-	 *
-	 * If we need to wait for IO before we can get a handle, submit
-	 * already-staged IO first, so that other backends don't need to wait.
-	 * There wouldn't be a deadlock risk, as pgaio_io_acquire() just needs to
-	 * wait for already submitted IO, which doesn't require additional locks,
-	 * but it could still cause undesirable waits.
-	 *
-	 * A secondary benefit is that this would allow us to measure the time in
-	 * pgaio_io_acquire() without causing undue timer overhead in the common,
-	 * non-blocking, case.  However, currently the pgstats infrastructure
-	 * doesn't really allow that, as it a) asserts that an operation can't
-	 * have time without operations b) doesn't have an API to report
-	 * "accumulated" time.
-	 */
-	ioh = pgaio_io_acquire_nb(CurrentResourceOwner, &operation->io_return);
-	if (unlikely(!ioh))
-	{
-		pgaio_submit_staged();
-
-		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
-	}
+	Assert(io_buffers[0] == buffers[nblocks_done]);
+	io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
+	io_buffers_len = 1;
 
 	/*
-	 * Check if we can start IO on the first to-be-read buffer.
-	 *
-	 * If an I/O is already in progress in another backend, we want to wait
-	 * for the outcome: either done, or something went wrong and we will
-	 * retry.
+	 * How many neighboring-on-disk blocks can we scatter-read into other
+	 * buffers at the same time?  In this case we don't wait if we see an I/O
+	 * already in progress.  We already set BM_IO_IN_PROGRESS for the head
+	 * block, so we should get on with that I/O as soon as possible.
 	 */
-	if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
+	for (int i = nblocks_done + 1; i < operation->nblocks; i++)
 	{
-		/*
-		 * Someone else has already completed this block, we're done.
-		 *
-		 * When IO is necessary, ->nblocks_done is updated in
-		 * ProcessReadBuffersResult(), but that is not called if no IO is
-		 * necessary. Thus update here.
-		 */
-		operation->nblocks_done += 1;
-		*nblocks_progress = 1;
-
-		pgaio_io_release(ioh);
-		pgaio_wref_clear(&operation->io_wref);
-		did_start_io = false;
+		if (!PrepareAdditionalReadBuffer(buffers[i]))
+			break;
+		/* Must be consecutive block numbers. */
+		Assert(BufferGetBlockNumber(buffers[i - 1]) ==
+			   BufferGetBlockNumber(buffers[i]) - 1);
+		Assert(io_buffers[io_buffers_len] == buffers[i]);
 
-		/*
-		 * Report and track this as a 'hit' for this backend, even though it
-		 * must have started out as a miss in PinBufferForBlock(). The other
-		 * backend will track this as a 'read'.
-		 */
-		ProcessBufferHit(operation->strategy, operation->rel, persistence,
-						 operation->smgr, forknum,
-						 blocknum + operation->nblocks_done);
+		io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
 	}
-	else
-	{
-		instr_time	io_start;
-
-		/* We found a buffer that we need to read in. */
-		Assert(io_buffers[0] == buffers[nblocks_done]);
-		io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
-		io_buffers_len = 1;
-
-		/*
-		 * How many neighboring-on-disk blocks can we scatter-read into other
-		 * buffers at the same time?  In this case we don't wait if we see an
-		 * I/O already in progress.  We already set BM_IO_IN_PROGRESS for the
-		 * head block, so we should get on with that I/O as soon as possible.
-		 */
-		for (int i = nblocks_done + 1; i < operation->nblocks; i++)
-		{
-			if (!ReadBuffersCanStartIO(buffers[i], true))
-				break;
-			/* Must be consecutive block numbers. */
-			Assert(BufferGetBlockNumber(buffers[i - 1]) ==
-				   BufferGetBlockNumber(buffers[i]) - 1);
-			Assert(io_buffers[io_buffers_len] == buffers[i]);
 
-			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
-		}
+	/* get a reference to wait for in WaitReadBuffers() */
+	pgaio_io_get_wref(ioh, &operation->io_wref);
 
-		/* get a reference to wait for in WaitReadBuffers() */
-		pgaio_io_get_wref(ioh, &operation->io_wref);
+	/* provide the list of buffers to the completion callbacks */
+	pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
 
-		/* provide the list of buffers to the completion callbacks */
-		pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
+	pgaio_io_register_callbacks(ioh,
+								persistence == RELPERSISTENCE_TEMP ?
+								PGAIO_HCB_LOCAL_BUFFER_READV :
+								PGAIO_HCB_SHARED_BUFFER_READV,
+								flags);
 
-		pgaio_io_register_callbacks(ioh,
-									persistence == RELPERSISTENCE_TEMP ?
-									PGAIO_HCB_LOCAL_BUFFER_READV :
-									PGAIO_HCB_SHARED_BUFFER_READV,
-									flags);
+	pgaio_io_set_flag(ioh, ioh_flags);
 
-		pgaio_io_set_flag(ioh, ioh_flags);
+	/* ---
+	* Even though we're trying to issue IO asynchronously, track the time
+	* in smgrstartreadv():
+	* - if io_method == IOMETHOD_SYNC, we will always perform the IO
+	*   immediately
+	* - the io method might not support the IO (e.g. worker IO for a temp
+	*   table)
+	* ---
+	*/
+	io_start = pgstat_prepare_io_time(track_io_timing);
+	smgrstartreadv(ioh, operation->smgr, forknum,
+				   blocknum + nblocks_done,
+				   io_pages, io_buffers_len);
+	pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
+							io_start, 1, io_buffers_len * BLCKSZ);
 
-		/* ---
-		 * Even though we're trying to issue IO asynchronously, track the time
-		 * in smgrstartreadv():
-		 * - if io_method == IOMETHOD_SYNC, we will always perform the IO
-		 *   immediately
-		 * - the io method might not support the IO (e.g. worker IO for a temp
-		 *   table)
-		 * ---
-		 */
-		io_start = pgstat_prepare_io_time(track_io_timing);
-		smgrstartreadv(ioh, operation->smgr, forknum,
-					   blocknum + nblocks_done,
-					   io_pages, io_buffers_len);
-		pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
-								io_start, 1, io_buffers_len * BLCKSZ);
-
-		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_read += io_buffers_len;
-		else
-			pgBufferUsage.shared_blks_read += io_buffers_len;
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_read += io_buffers_len;
+	else
+		pgBufferUsage.shared_blks_read += io_buffers_len;
 
-		/*
-		 * Track vacuum cost when issuing IO, not after waiting for it.
-		 * Otherwise we could end up issuing a lot of IO in a short timespan,
-		 * despite a low cost limit.
-		 */
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	/*
+	 * Track vacuum cost when issuing IO, not after waiting for it. Otherwise
+	 * we could end up issuing a lot of IO in a short timespan, despite a low
+	 * cost limit.
+	 */
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
 
-		*nblocks_progress = io_buffers_len;
-		did_start_io = true;
-	}
+	*nblocks_progress = io_buffers_len;
 
-	return did_start_io;
+	return true;
 }
 
 /*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a40adf6b2a8..1358fc7fa64 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -147,6 +147,7 @@ struct ReadBuffersOperation
 	int			flags;
 	int16		nblocks;
 	int16		nblocks_done;
+	bool		foreign_io;
 	PgAioWaitRef io_wref;
 	PgAioReturn io_return;
 };
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index db583985813..6c6bdc8ac4f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2341,6 +2341,7 @@ PredicateLockData
 PredicateLockTargetType
 PrefetchBufferResult
 PrepParallelRestorePtrType
+PrepareReadBuffer_Status
 PrepareStmt
 PreparedStatement
 PresortedKeyData
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-02-05 16:56     ` Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  1 sibling, 1 reply; 31+ messages in thread

From: Nazir Bilal Yavuz @ 2026-02-05 16:56 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Thomas Munro <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

Thank you for working on this!

On Sat, 24 Jan 2026 at 00:04, Melanie Plageman
<[email protected]> wrote:
>
> On Sun, Nov 9, 2025 at 5:21 PM Thomas Munro <[email protected]> wrote:
> >
> > I suppose (or perhaps vaguely recall from an off-list discussion?)
> > that you must have considered merging the new
> > is-it-already-in-progress check into ReadBuffersCanStartIO().  I
> > suppose the nowait argument would become a tri-state argument with a
> > value that means "don't wait for an in-progress read, just give me the
> > IO handle so I can 'join' it as a foreign waiter", with a new output
> > argument to receive the handle, or something along those lines, and I
> > guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
> > name.  That'd remove the double-check (extra header lock-unlock cycle)
> > and associated race that can cause that rare synchronous wait (which
> > must still happen sometimes in the duelling concurrent scan use
> > case?), at the slight extra cost of having to allocate and free a
> > handle in the case of repeated blocks (eg the index->heap scan use
> > case), but at least that's just backend-local list pushups and doesn't
> > do extra work otherwise.  Is there some logical problem with that
> > approach?  Is the code just too clumsy?
>
> Attached v3 basically does what you suggested above. Now, we should
> only have to wait if the backend encounters a buffer after another
> backend has set BM_IO_IN_PROGRESS but before that other backend has
> set the buffer descriptor's wait reference.
>
> 0001 and 0002 are Andres' test-related patches. 0003 is a change I
> think is required to make one of the tests stable (esp on the BSDs).
> 0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
> IO concept but with your suggested structure and my suggested styling.
> I could potentially break out more into smaller refactoring commits,
> but I don't think it's too bad the way it is.

I confirm that I am able to produce the regression that Andres
mentioned with the patches excluding 0005, and 0005 fixes the
regression.

> A few things about the patch that I'm not sure about:
>
> - I don't know if pgaio_submit_staged() is in all the right places
> (and not in too many places). I basically do it before we would wait
> when starting read IO on the buffer. In the permanent buffers case,
> that's now only when BM_IO_IN_PROGRESS is set but the wait reference
> isn't valid yet. This can't happen in the temporary buffers case, so
> I'm not sure we need to call pgaio_submit_staged().

I agree with you, I think we don't need to call pgaio_submit_staged()
for the temporary buffers case.

> - StartBufferIO() is no longer invoked in the AsyncReadBuffers() path.
> We could refactor it so that it works for AsyncReadBuffers(), but that
> would involve returning something that distinguishes between
> IO_IN_PROGRESS and IO already done.  And StartBufferIO()'s comment
> explicitly says it wants to avoid that.
> If we keep my structure, with AsyncReadBuffers() using its own helper
> (PrepareNewReadBufferIO()) instead of StartBufferIO(), then it seems
> like we need some way to make it clear what StartBufferIO() is for.
> I'm not sure what would collectively describe its current users,
> though. It also now has no non-test callers passing nowait as true.
> However, once we add write combining, it will, so it seems like we
> should leave it the way it is to avoid churn. However, other
> developers might be confused in the interim.

I don't have a comment for this.

> - In the 004_read_stream tests, I wonder if there is a way to test
> that we don't wait for foreign IO until WaitReadBuffers(). We have
> tests for the stream accessing the same block, which in some cases
> will exercise the foreign IO path. But it doesn't distinguish between
> the old behavior -- waiting for the IO to complete when starting read
> IO on it -- and the new behavior -- not waiting until
> WaitReadBuffers(). That may not be possible to test, though.

Won't 'stream accessing the same block test' almost always test the
new behavior (not waiting until WaitReadBuffers())? Having dedicated
tests for both cases would be helpful, though.

My review:

0001:

0001 LGTM.
---------------

0002:

diff --git a/src/test/modules/test_aio/t/004_read_stream.pl
b/src/test/modules/test_aio/t/004_read_stream.pl
+foreach my $method (TestAio::supported_io_methods())
+{
+    $node->adjust_conf('postgresql.conf', 'io_method', 'worker');
+    $node->start();
+    test_io_method($method, $node);
+    $node->stop();
+}

This seems wrong, we always test io_method=worker. I think we need to
replace 'worker' with the $method. Also, we can add check below to the
test_io_method function in the 004_read_stream.pl:
```
    is($node->safe_psql('postgres', 'SHOW io_method'),
        $io_method, "$io_method: io_method set correctly");
```

Other than that, 0002 LGTM.
---------------

0003:

> 0003 is a change I
> think is required to make one of the tests stable (esp on the BSDs).

0003 LGTM.
---------------

> 0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
> IO concept but with your suggested structure and my suggested styling.
> I could potentially break out more into smaller refactoring commits,
> but I don't think it's too bad the way it is.

0004:

Nitpick but I prefer something like TrackBufferHit() or
CountBufferHit() as a function name instead of ProcessBufferHit().
ProcessBufferHit() gives the impression that the function is doing a
job more than it currently does. Other than that, 0004 LGTM.
---------------

0005:

0005 LGTM. However, I am still looking into the AIO code. I wanted to
share my review so far.
---------------

--
Regards,
Nazir Bilal Yavuz
Microsoft






^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
@ 2026-03-03 19:47       ` Melanie Plageman <[email protected]>
  2026-03-03 20:07         ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  0 siblings, 2 replies; 31+ messages in thread

From: Melanie Plageman @ 2026-03-03 19:47 UTC (permalink / raw)
  To: Nazir Bilal Yavuz <[email protected]>; +Cc: Thomas Munro <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Thanks for the review!

On Thu, Feb 5, 2026 at 11:56 AM Nazir Bilal Yavuz <[email protected]> wrote:
>
> On Sat, 24 Jan 2026 at 00:04, Melanie Plageman
> <[email protected]> wrote:
>
> > - In the 004_read_stream tests, I wonder if there is a way to test
> > that we don't wait for foreign IO until WaitReadBuffers(). We have
> > tests for the stream accessing the same block, which in some cases
> > will exercise the foreign IO path. But it doesn't distinguish between
> > the old behavior -- waiting for the IO to complete when starting read
> > IO on it -- and the new behavior -- not waiting until
> > WaitReadBuffers(). That may not be possible to test, though.
>
> Won't 'stream accessing the same block test' almost always test the
> new behavior (not waiting until WaitReadBuffers())? Having dedicated
> tests for both cases would be helpful, though.

Yea, I was thinking something like testing that if session A is
blocked completing read of block 2 and session B is requesting blocks
2-4 that buffers containing blocks 3 and 4 are valid when session B is
waiting on block 2 to finish.

I started working on something but it needed some new infrastructure
to check if the buffer is valid, and I wanted to see what others
thought first.

I did finally review Andres' test patches and have included my review
feedback here as well.

"aio: Refactor tests in preparation for more tests" (v4-0001) looks
good to me as well. I included one tiny idea AI suggested to me in a
follow-on patch (v4-0002).

> diff --git a/src/test/modules/test_aio/t/004_read_stream.pl
> b/src/test/modules/test_aio/t/004_read_stream.pl
> +foreach my $method (TestAio::supported_io_methods())
> +{
> +    $node->adjust_conf('postgresql.conf', 'io_method', 'worker');
> +    $node->start();
> +    test_io_method($method, $node);
> +    $node->stop();
> +}
>
> This seems wrong, we always test io_method=worker. I think we need to
> replace 'worker' with the $method. Also, we can add check below to the
> test_io_method function in the 004_read_stream.pl:
> ```
>     is($node->safe_psql('postgres', 'SHOW io_method'),
>         $io_method, "$io_method: io_method set correctly");

Good catch. Fixed. I also found a few other small things in this patch
(v4-0003) which I put in v4-0004.

Some ideas I had that I didn't include in v4-0003 because its Andres
patch and is subjective:

For test_repeated_blocks, the first test:

    # test miss of the same block twice in a row
    $psql->query_safe(
        qq/
SELECT evict_rel('largeish');
/);
    $psql->query_safe(
        qq/
SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 2, 4, 4]);
/);
    ok(1, "$io_method: stream missing the same block repeatedly");

It says that it will miss the same block repeatedly, is that because
we won't start a read for any of the blocks until after
read_stream_get_block has returned all of them? If so, could be
clearer in the comment. Not everyone understands all the read stream
internals.

I know a lot of other tests do this, but I find it so hard to read the
test with the SQL is totally left-aligned like that -- especially with
comments interspersed. You can easily flow the queries on multiple
lines and indent it more.

For test_repeated_blocks, the second test:

    # test hit of the same block twice in a row
    $psql->query_safe(
        qq/
SELECT evict_rel('largeish');
/);
    $psql->query_safe(
        qq/
SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 1, 2, 3, 4,
5, 6, 5, 4, 3, 2, 1, 0]);
/);
    ok(1, "$io_method: stream accessing same block");

I assume that the second access of 2 is a hit because we actually did
IO for the first one (unlike in the earlier case)?

For test_inject_foreign:

In general, I am not ramped up enough on injection point stuff to know
if the actual new injection point stuff you added in test_aio.c is is
correct and optimal, but I did review the read stream additions to
test_aio.c and the tests added to 004_read_stream.pl.

For test_inject_foreign, the 3rd test:

    # Test read stream encountering two buffers that are undergoing the same
    # IO, started by another backend

I see that psql_b is requesting 3 blocks which can be combined into 1
IO, which makes it different than the 1st foreign IO test case:

    ###
    # Test read stream encountering buffers undergoing IO in another backend,
    # with the other backend's reads succeeding.
    ###

where psql_b only requests 1 but I don't really see how these are
covering different code. Maybe if the read stream one (psql_a) is
doing a combined IO it might exercise slightly different code, but
otherwise I don't get it.

> Nitpick but I prefer something like TrackBufferHit() or
> CountBufferHit() as a function name instead of ProcessBufferHit().
> ProcessBufferHit() gives the impression that the function is doing a
> job more than it currently does. Other than that, 0004 LGTM.

I changed this in attached v4.

- Melanie


Attachments:

  [text/x-patch] v4-0001-aio-Refactor-tests-in-preparation-for-more-tests.patch (10.8K, 2-v4-0001-aio-Refactor-tests-in-preparation-for-more-tests.patch)
  download | inline diff:
From 1d2f564b211a59fc6ea483fbcddd5fa788b3534c Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 9 Sep 2025 10:14:34 -0400
Subject: [PATCH v4 1/6] aio: Refactor tests in preparation for more tests

In a future commit more AIO related tests are due to be introduced. However
001_aio.pl already is fairly large.

This commit introduces a new TestAio package with helpers for writing AIO
related tests. Then it uses the new helpers to simplify the existing
001_aio.pl by iterating over all supported io_methods. This will be
particularly helpful because additional methods already have been submitted.

Additionally this commit splits out testing of initdb using a non-default
method into its own test. While that test is somewhat important, it's fairly
slow and doesn't break that often. For development velocity it's helpful for
001_aio.pl to be faster.

While particularly the latter could benefit from being its own commit, it
seems to introduce more back-and-forth than it's worth.

Author: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/
---
 src/test/modules/test_aio/meson.build     |   1 +
 src/test/modules/test_aio/t/001_aio.pl    | 140 +++++++---------------
 src/test/modules/test_aio/t/003_initdb.pl |  71 +++++++++++
 src/test/modules/test_aio/t/TestAio.pm    |  90 ++++++++++++++
 4 files changed, 203 insertions(+), 99 deletions(-)
 create mode 100644 src/test/modules/test_aio/t/003_initdb.pl
 create mode 100644 src/test/modules/test_aio/t/TestAio.pm

diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
index fefa25bc5ab..18a797f3a3b 100644
--- a/src/test/modules/test_aio/meson.build
+++ b/src/test/modules/test_aio/meson.build
@@ -32,6 +32,7 @@ tests += {
     'tests': [
       't/001_aio.pl',
       't/002_io_workers.pl',
+      't/003_initdb.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_aio/t/001_aio.pl b/src/test/modules/test_aio/t/001_aio.pl
index 5c634ec3ca9..27ee96898e0 100644
--- a/src/test/modules/test_aio/t/001_aio.pl
+++ b/src/test/modules/test_aio/t/001_aio.pl
@@ -7,126 +7,55 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+use FindBin;
+use lib $FindBin::RealBin;
 
-###
-# Test io_method=worker
-###
-my $node_worker = create_node('worker');
-$node_worker->start();
-
-test_generic('worker', $node_worker);
-SKIP:
-{
-	skip 'Injection points not supported by this build', 1
-	  unless $ENV{enable_injection_points} eq 'yes';
-	test_inject_worker('worker', $node_worker);
-}
+use TestAio;
 
-$node_worker->stop();
+my %nodes;
 
 
 ###
-# Test io_method=io_uring
+# Create and configure one instance for each io_method
 ###
 
-if (have_io_uring())
+foreach my $method (TestAio::supported_io_methods())
 {
-	my $node_uring = create_node('io_uring');
-	$node_uring->start();
-	test_generic('io_uring', $node_uring);
-	$node_uring->stop();
-}
-
-
-###
-# Test io_method=sync
-###
-
-my $node_sync = create_node('sync');
+	my $node = PostgreSQL::Test::Cluster->new($method);
 
-# just to have one test not use the default auto-tuning
+	$nodes{$method} = $node;
+	$node->init();
+	$node->append_conf('postgresql.conf', "io_method=$method");
+	TestAio::configure($node);
+}
 
-$node_sync->append_conf(
+# Just to have one test not use the default auto-tuning
+$nodes{'sync'}->append_conf(
 	'postgresql.conf', qq(
-io_max_concurrency=4
+ io_max_concurrency=4
 ));
 
-$node_sync->start();
-test_generic('sync', $node_sync);
-$node_sync->stop();
-
-done_testing();
-
 
 ###
-# Test Helpers
+# Execute the tests for each io_method
 ###
 
-sub create_node
+foreach my $method (TestAio::supported_io_methods())
 {
-	local $Test::Builder::Level = $Test::Builder::Level + 1;
-
-	my $io_method = shift;
+	my $node = $nodes{$method};
 
-	my $node = PostgreSQL::Test::Cluster->new($io_method);
-
-	# Want to test initdb for each IO method, otherwise we could just reuse
-	# the cluster.
-	#
-	# Unfortunately Cluster::init() puts PG_TEST_INITDB_EXTRA_OPTS after the
-	# options specified by ->extra, if somebody puts -c io_method=xyz in
-	# PG_TEST_INITDB_EXTRA_OPTS it would break this test. Fix that up if we
-	# detect it.
-	local $ENV{PG_TEST_INITDB_EXTRA_OPTS} = $ENV{PG_TEST_INITDB_EXTRA_OPTS};
-	if (defined $ENV{PG_TEST_INITDB_EXTRA_OPTS}
-		&& $ENV{PG_TEST_INITDB_EXTRA_OPTS} =~ m/io_method=/)
-	{
-		$ENV{PG_TEST_INITDB_EXTRA_OPTS} .= " -c io_method=$io_method";
-	}
-
-	$node->init(extra => [ '-c', "io_method=$io_method" ]);
-
-	$node->append_conf(
-		'postgresql.conf', qq(
-shared_preload_libraries=test_aio
-log_min_messages = 'DEBUG3'
-log_statement=all
-log_error_verbosity=default
-restart_after_crash=false
-temp_buffers=100
-));
+	$node->start();
+	test_io_method($method, $node);
+	$node->stop();
+}
 
-	# Even though we used -c io_method=... above, if TEMP_CONFIG sets
-	# io_method, it'd override the setting persisted at initdb time. While
-	# using (and later verifying) the setting from initdb provides some
-	# verification of having used the io_method during initdb, it's probably
-	# not worth the complication of only appending if the variable is set in
-	# in TEMP_CONFIG.
-	$node->append_conf(
-		'postgresql.conf', qq(
-io_method=$io_method
-));
+done_testing();
 
-	ok(1, "$io_method: initdb");
 
-	return $node;
-}
+###
+# Test Helpers
+###
 
-sub have_io_uring
-{
-	# To detect if io_uring is supported, we look at the error message for
-	# assigning an invalid value to an enum GUC, which lists all the valid
-	# options. We need to use -C to deal with running as administrator on
-	# windows, the superuser check is omitted if -C is used.
-	my ($stdout, $stderr) =
-	  run_command [qw(postgres -C invalid -c io_method=invalid)];
-	die "can't determine supported io_method values"
-	  unless $stderr =~ m/Available values: ([^\.]+)\./;
-	my $methods = $1;
-	note "supported io_method values are: $methods";
-
-	return ($methods =~ m/io_uring/) ? 1 : 0;
-}
 
 sub psql_like
 {
@@ -1490,8 +1419,8 @@ SELECT read_rel_block_ll('tbl_cs_fail', 3, nblocks=>1, zero_on_error=>true);),
 }
 
 
-# Run all tests that are supported for all io_methods
-sub test_generic
+# Run all tests that for the specified node / io_method
+sub test_io_method
 {
 	my $io_method = shift;
 	my $node = shift;
@@ -1526,10 +1455,23 @@ CHECKPOINT;
 	test_ignore_checksum($io_method, $node);
 	test_checksum_createdb($io_method, $node);
 
+	# generic injection tests
   SKIP:
 	{
 		skip 'Injection points not supported by this build', 1
 		  unless $ENV{enable_injection_points} eq 'yes';
 		test_inject($io_method, $node);
 	}
+
+	# worker specific injection tests
+	if ($io_method eq 'worker')
+	{
+	  SKIP:
+		{
+			skip 'Injection points not supported by this build', 1
+			  unless $ENV{enable_injection_points} eq 'yes';
+
+			test_inject_worker($io_method, $node);
+		}
+	}
 }
diff --git a/src/test/modules/test_aio/t/003_initdb.pl b/src/test/modules/test_aio/t/003_initdb.pl
new file mode 100644
index 00000000000..c03ae58d00a
--- /dev/null
+++ b/src/test/modules/test_aio/t/003_initdb.pl
@@ -0,0 +1,71 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+#
+# Test initdb for each IO method. This is done separately from 001_aio.pl, as
+# it isn't fast. This way the more commonly failing / hacked-on 001_aio.pl can
+# be iterated on more quickly.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use TestAio;
+
+
+foreach my $method (TestAio::supported_io_methods())
+{
+	test_create_node($method);
+}
+
+done_testing();
+
+
+sub test_create_node
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $io_method = shift;
+
+	my $node = PostgreSQL::Test::Cluster->new($io_method);
+
+	# Want to test initdb for each IO method, otherwise we could just reuse
+	# the cluster.
+	#
+	# Unfortunately Cluster::init() puts PG_TEST_INITDB_EXTRA_OPTS after the
+	# options specified by ->extra, if somebody puts -c io_method=xyz in
+	# PG_TEST_INITDB_EXTRA_OPTS it would break this test. Fix that up if we
+	# detect it.
+	local $ENV{PG_TEST_INITDB_EXTRA_OPTS} = $ENV{PG_TEST_INITDB_EXTRA_OPTS};
+	if (defined $ENV{PG_TEST_INITDB_EXTRA_OPTS}
+		&& $ENV{PG_TEST_INITDB_EXTRA_OPTS} =~ m/io_method=/)
+	{
+		$ENV{PG_TEST_INITDB_EXTRA_OPTS} .= " -c io_method=$io_method";
+	}
+
+	$node->init(extra => [ '-c', "io_method=$io_method" ]);
+
+	TestAio::configure($node);
+
+	# Even though we used -c io_method=... above, if TEMP_CONFIG sets
+	# io_method, it'd override the setting persisted at initdb time. While
+	# using (and later verifying) the setting from initdb provides some
+	# verification of having used the io_method during initdb, it's probably
+	# not worth the complication of only appending if the variable is set in
+	# in TEMP_CONFIG.
+	$node->append_conf(
+		'postgresql.conf', qq(
+io_method=$io_method
+));
+
+	ok(1, "$io_method: initdb");
+
+	$node->start();
+	$node->stop();
+	ok(1, "$io_method: start & stop");
+
+	return $node;
+}
diff --git a/src/test/modules/test_aio/t/TestAio.pm b/src/test/modules/test_aio/t/TestAio.pm
new file mode 100644
index 00000000000..5bc80a9b130
--- /dev/null
+++ b/src/test/modules/test_aio/t/TestAio.pm
@@ -0,0 +1,90 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+=pod
+
+=head1 NAME
+
+TestAio - helpers for writing AIO related tests
+
+=cut
+
+package TestAio;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+
+=pod
+
+=head1 METHODS
+
+=over
+
+=item TestAio::supported_io_methods()
+
+Return an array of all the supported values for the io_method GUC
+
+=cut
+
+sub supported_io_methods()
+{
+	my @io_methods = ('worker');
+
+	push(@io_methods, "io_uring") if have_io_uring();
+
+	# Return sync last, as it will least commonly fail
+	push(@io_methods, "sync");
+
+	return @io_methods;
+}
+
+
+=item TestAio::configure()
+
+Prepare a cluster for AIO test
+
+=cut
+
+sub configure
+{
+	my $node = shift;
+
+	$node->append_conf(
+		'postgresql.conf', qq(
+shared_preload_libraries=test_aio
+log_min_messages = 'DEBUG3'
+log_statement=all
+log_error_verbosity=default
+restart_after_crash=false
+temp_buffers=100
+));
+
+}
+
+
+=pod
+
+=item TestAio::have_io_uring()
+
+Return if io_uring is supported
+
+=cut
+
+sub have_io_uring
+{
+	# To detect if io_uring is supported, we look at the error message for
+	# assigning an invalid value to an enum GUC, which lists all the valid
+	# options. We need to use -C to deal with running as administrator on
+	# windows, the superuser check is omitted if -C is used.
+	my ($stdout, $stderr) =
+	  run_command [qw(postgres -C invalid -c io_method=invalid)];
+	die "can't determine supported io_method values"
+	  unless $stderr =~ m/Available values: ([^\.]+)\./;
+	my $methods = $1;
+	note "supported io_method values are: $methods";
+
+	return ($methods =~ m/io_uring/) ? 1 : 0;
+}
+
+1;
-- 
2.43.0



  [text/x-patch] v4-0002-small-optimization-for-test-refactor.patch (1.1K, 3-v4-0002-small-optimization-for-test-refactor.patch)
  download | inline diff:
From a4c32be0224bb18bc55e98d7789135258824463d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 3 Mar 2026 12:27:10 -0500
Subject: [PATCH v4 2/6] small optimization for test refactor

---
 src/test/modules/test_aio/t/001_aio.pl | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/test/modules/test_aio/t/001_aio.pl b/src/test/modules/test_aio/t/001_aio.pl
index 27ee96898e0..e18b2a2b8ae 100644
--- a/src/test/modules/test_aio/t/001_aio.pl
+++ b/src/test/modules/test_aio/t/001_aio.pl
@@ -12,6 +12,7 @@ use lib $FindBin::RealBin;
 
 use TestAio;
 
+my @methods = TestAio::supported_io_methods();
 my %nodes;
 
 
@@ -19,7 +20,7 @@ my %nodes;
 # Create and configure one instance for each io_method
 ###
 
-foreach my $method (TestAio::supported_io_methods())
+foreach my $method (@methods)
 {
 	my $node = PostgreSQL::Test::Cluster->new($method);
 
@@ -40,7 +41,7 @@ $nodes{'sync'}->append_conf(
 # Execute the tests for each io_method
 ###
 
-foreach my $method (TestAio::supported_io_methods())
+foreach my $method (@methods)
 {
 	my $node = $nodes{$method};
 
-- 
2.43.0



  [text/x-patch] v4-0003-test_aio-Add-read_stream-test-infrastructure-test.patch (22.9K, 4-v4-0003-test_aio-Add-read_stream-test-infrastructure-test.patch)
  download | inline diff:
From afc102665a6b2989f557c58bb8ae3d03eb3192cc Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Wed, 10 Sep 2025 14:00:02 -0400
Subject: [PATCH v4 3/6] test_aio: Add read_stream test infrastructure & tests

Author: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/
---
 src/test/modules/test_aio/meson.build         |   1 +
 .../modules/test_aio/t/004_read_stream.pl     | 282 ++++++++++++++
 src/test/modules/test_aio/test_aio--1.0.sql   |  26 +-
 src/test/modules/test_aio/test_aio.c          | 344 +++++++++++++++---
 src/tools/pgindent/typedefs.list              |   1 +
 5 files changed, 602 insertions(+), 52 deletions(-)
 create mode 100644 src/test/modules/test_aio/t/004_read_stream.pl

diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
index 18a797f3a3b..909f81d96c1 100644
--- a/src/test/modules/test_aio/meson.build
+++ b/src/test/modules/test_aio/meson.build
@@ -33,6 +33,7 @@ tests += {
       't/001_aio.pl',
       't/002_io_workers.pl',
       't/003_initdb.pl',
+      't/004_read_stream.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_aio/t/004_read_stream.pl b/src/test/modules/test_aio/t/004_read_stream.pl
new file mode 100644
index 00000000000..89cfabbb1d3
--- /dev/null
+++ b/src/test/modules/test_aio/t/004_read_stream.pl
@@ -0,0 +1,282 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use TestAio;
+
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+$node->init();
+
+$node->append_conf(
+	'postgresql.conf', qq(
+shared_preload_libraries=test_aio
+log_min_messages = 'DEBUG3'
+log_statement=all
+log_error_verbosity=default
+restart_after_crash=false
+temp_buffers=100
+max_connections=8
+io_method=worker
+));
+
+$node->start();
+test_setup($node);
+$node->stop();
+
+
+foreach my $method (TestAio::supported_io_methods())
+{
+	$node->adjust_conf('postgresql.conf', 'io_method', 'worker');
+	$node->start();
+	test_io_method($method, $node);
+	$node->stop();
+}
+
+done_testing();
+
+
+sub test_setup
+{
+	my $node = shift;
+
+	$node->safe_psql(
+		'postgres', qq(
+CREATE EXTENSION test_aio;
+
+CREATE TABLE largeish(k int not null) WITH (FILLFACTOR=10);
+INSERT INTO largeish(k) SELECT generate_series(1, 10000);
+));
+	ok(1, "setup");
+}
+
+
+sub test_repeated_blocks
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	my $psql = $node->background_psql('postgres', on_error_stop => 0);
+
+	# Preventing larger reads makes testing easier
+	$psql->query_safe(
+		qq/
+SET io_combine_limit = 1;
+/);
+
+	# test miss of the same block twice in a row
+	$psql->query_safe(
+		qq/
+SELECT evict_rel('largeish');
+/);
+	$psql->query_safe(
+		qq/
+SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 2, 4, 4]);
+/);
+	ok(1, "$io_method: stream missing the same block repeatedly");
+
+	$psql->query_safe(
+		qq/
+SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 2, 4, 4]);
+/);
+	ok(1, "$io_method: stream hitting the same block repeatedly");
+
+	# test hit of the same block twice in a row
+	$psql->query_safe(
+		qq/
+SELECT evict_rel('largeish');
+/);
+	$psql->query_safe(
+		qq/
+SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0]);
+/);
+	ok(1, "$io_method: stream accessing same block");
+
+	$psql->quit();
+}
+
+
+sub test_inject_foreign
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	my $psql_a = $node->background_psql('postgres', on_error_stop => 0);
+	my $psql_b = $node->background_psql('postgres', on_error_stop => 0);
+
+	my $pid_a = $psql_a->query_safe(qq/SELECT pg_backend_pid();/);
+
+
+	###
+	# Test read stream encountering buffers undergoing IO in another backend,
+	# with the other backend's reads succeeding.
+	###
+	$psql_a->query_safe(
+		qq/
+SELECT evict_rel('largeish');
+/);
+
+	$psql_b->query_safe(
+		qq/
+SELECT inj_io_completion_wait(pid=>pg_backend_pid(), relfilenode=>pg_relation_filenode('largeish'));
+/);
+
+	$psql_b->{stdin} .= qq/
+SELECT read_rel_block_ll('largeish', blockno=>5, nblocks=>1);
+/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until(
+		'postgres', qq/
+SELECT wait_event FROM pg_stat_activity WHERE wait_event = 'completion_wait';
+/,
+		'completion_wait');
+
+	$psql_a->{stdin} .= qq/
+SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 5, 7]);
+/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres', qq/SELECT inj_io_completion_continue()/);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,5,7\}/);
+
+	ok(1,
+		qq/$io_method: read stream encounters succeeding IO by another backend/
+	);
+
+
+	###
+	# Test read stream encountering buffers undergoing IO in another backend,
+	# with the other backend's reads failing.
+	###
+	$psql_a->query_safe(
+		qq/
+SELECT evict_rel('largeish');
+/);
+
+	$psql_b->query_safe(
+		qq/
+SELECT inj_io_completion_wait(pid=>pg_backend_pid(), relfilenode=>pg_relation_filenode('largeish'));
+/);
+
+	$psql_b->query_safe(
+		qq/
+SELECT inj_io_short_read_attach(-errno_from_string('EIO'), pid=>pg_backend_pid(), relfilenode=>pg_relation_filenode('largeish'));
+/);
+
+	$psql_b->{stdin} .= qq/
+SELECT read_rel_block_ll('largeish', blockno=>5, nblocks=>1);
+/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until(
+		'postgres', qq/
+SELECT wait_event FROM pg_stat_activity WHERE wait_event = 'completion_wait';
+/,
+		'completion_wait');
+
+	$psql_a->{stdin} .= qq/
+SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 5, 7]);
+/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres', qq/SELECT inj_io_completion_continue()/);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,5,7\}/);
+
+	$psql_b->{run}->pump_nb();
+	like(
+		$psql_b->{stderr},
+		qr/.*ERROR.*could not read blocks 5..5.*$/,
+		"$io_method: injected error occurred");
+	$psql_b->{stderr} = '';
+	$psql_b->query_safe(qq/SELECT inj_io_short_read_detach();/);
+
+
+	ok(1,
+		qq/$io_method: read stream encounters failing IO by another backend/);
+
+
+	###
+	# Test read stream encountering two buffers that are undergoing the same
+	# IO, started by another backend
+	###
+	$psql_a->query_safe(
+		qq/
+SELECT evict_rel('largeish');
+/);
+
+	$psql_b->query_safe(
+		qq/
+SELECT inj_io_completion_wait(pid=>pg_backend_pid(), relfilenode=>pg_relation_filenode('largeish'));
+/);
+
+	$psql_b->{stdin} .= qq/
+SELECT read_rel_block_ll('largeish', blockno=>2, nblocks=>3);
+/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until(
+		'postgres', qq/
+SELECT wait_event FROM pg_stat_activity WHERE wait_event = 'completion_wait';
+/,
+		'completion_wait');
+
+	$psql_a->{stdin} .= qq/
+SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 4]);
+/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres', qq/SELECT inj_io_completion_continue()/);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,4\}/);
+
+	ok(1, qq/$io_method: read stream encounters two buffer read in one IO/);
+
+
+	$psql_a->quit();
+	$psql_b->quit();
+}
+
+
+sub test_io_method
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	test_repeated_blocks($io_method, $node);
+
+  SKIP:
+	{
+		skip 'Injection points not supported by this build', 1
+		  unless $ENV{enable_injection_points} eq 'yes';
+		test_inject_foreign($io_method, $node);
+	}
+}
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
index e495481c41e..da7cc03829a 100644
--- a/src/test/modules/test_aio/test_aio--1.0.sql
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -33,6 +33,10 @@ CREATE FUNCTION read_rel_block_ll(
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
+CREATE FUNCTION evict_rel(rel regclass)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
 CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
@@ -41,7 +45,7 @@ CREATE FUNCTION buffer_create_toy(rel regclass, blockno int4)
 RETURNS pg_catalog.int4 STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
-CREATE FUNCTION buffer_call_start_io(buffer int, for_input bool, nowait bool)
+CREATE FUNCTION buffer_call_start_io(buffer int, for_input bool, nowait bool, assign_io bool DEFAULT false)
 RETURNS pg_catalog.bool STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
@@ -50,6 +54,14 @@ RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
 
+/*
+ * Read stream related functions
+ */
+CREATE FUNCTION read_stream_for_blocks(rel regclass, blocks int4[], OUT blockoff int4, OUT blocknum int4, OUT buf int4)
+RETURNS SETOF record STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
 
 /*
  * Handle related functions
@@ -91,8 +103,16 @@ AS 'MODULE_PATHNAME' LANGUAGE C;
 /*
  * Injection point related functions
  */
-CREATE FUNCTION inj_io_short_read_attach(result int)
-RETURNS pg_catalog.void STRICT
+CREATE FUNCTION inj_io_completion_wait(pid int DEFAULT NULL, relfilenode oid DEFAULT 0)
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_completion_continue()
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_attach(result int, pid int DEFAULT NULL, relfilenode oid DEFAULT 0)
+RETURNS pg_catalog.void
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
 CREATE FUNCTION inj_io_short_read_detach()
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
index b1aa8af9ec0..911a7102a34 100644
--- a/src/test/modules/test_aio/test_aio.c
+++ b/src/test/modules/test_aio/test_aio.c
@@ -20,16 +20,23 @@
 
 #include "access/relation.h"
 #include "fmgr.h"
+#include "funcapi.h"
 #include "storage/aio.h"
 #include "storage/aio_internal.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/checksum.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/procnumber.h"
+#include "storage/read_stream.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/injection_point.h"
 #include "utils/rel.h"
+#include "utils/wait_event.h"
 
 
 PG_MODULE_MAGIC;
@@ -37,13 +44,30 @@ PG_MODULE_MAGIC;
 
 typedef struct InjIoErrorState
 {
+	ConditionVariable cv;
+
 	bool		enabled_short_read;
 	bool		enabled_reopen;
 
+	bool		enabled_completion_wait;
+	Oid			completion_wait_relfilenode;
+	pid_t		completion_wait_pid;
+	uint32		completion_wait_event;
+
 	bool		short_read_result_set;
+	Oid			short_read_relfilenode;
+	pid_t		short_read_pid;
 	int			short_read_result;
 } InjIoErrorState;
 
+typedef struct BlocksReadStreamData
+{
+	int			nblocks;
+	int			curblock;
+	uint32	   *blocks;
+} BlocksReadStreamData;
+
+
 static InjIoErrorState *inj_io_error_state;
 
 /* Shared memory init callbacks */
@@ -85,10 +109,13 @@ test_aio_shmem_startup(void)
 		inj_io_error_state->enabled_short_read = false;
 		inj_io_error_state->enabled_reopen = false;
 
+		ConditionVariableInit(&inj_io_error_state->cv);
+		inj_io_error_state->completion_wait_event = WaitEventInjectionPointNew("completion_wait");
+
 #ifdef USE_INJECTION_POINTS
 		InjectionPointAttach("aio-process-completion-before-shared",
 							 "test_aio",
-							 "inj_io_short_read",
+							 "inj_io_completion_hook",
 							 NULL,
 							 0);
 		InjectionPointLoad("aio-process-completion-before-shared");
@@ -384,7 +411,7 @@ read_rel_block_ll(PG_FUNCTION_ARGS)
 	if (nblocks <= 0 || nblocks > PG_IOV_MAX)
 		elog(ERROR, "nblocks is out of range");
 
-	rel = relation_open(relid, AccessExclusiveLock);
+	rel = relation_open(relid, AccessShareLock);
 
 	for (int i = 0; i < nblocks; i++)
 	{
@@ -458,6 +485,27 @@ read_rel_block_ll(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+PG_FUNCTION_INFO_V1(evict_rel);
+Datum
+evict_rel(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	int32		buffers_evicted,
+				buffers_flushed,
+				buffers_skipped;
+
+	rel = relation_open(relid, AccessExclusiveLock);
+
+	EvictRelUnpinnedBuffers(rel, &buffers_evicted, &buffers_flushed,
+							&buffers_skipped);
+
+	relation_close(rel, AccessExclusiveLock);
+
+
+	PG_RETURN_VOID();
+}
+
 PG_FUNCTION_INFO_V1(invalidate_rel_block);
 Datum
 invalidate_rel_block(PG_FUNCTION_ARGS)
@@ -610,6 +658,86 @@ buffer_call_terminate_io(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+
+static BlockNumber
+read_stream_for_blocks_cb(ReadStream *stream,
+						  void *callback_private_data,
+						  void *per_buffer_data)
+{
+	BlocksReadStreamData *stream_data = callback_private_data;
+
+	if (stream_data->curblock >= stream_data->nblocks)
+		return InvalidBlockNumber;
+	return stream_data->blocks[stream_data->curblock++];
+}
+
+PG_FUNCTION_INFO_V1(read_stream_for_blocks);
+Datum
+read_stream_for_blocks(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	ArrayType  *blocksarray = PG_GETARG_ARRAYTYPE_P(1);
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Relation	rel;
+	BlocksReadStreamData stream_data;
+	ReadStream *stream;
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	/*
+	 * We expect the input to be an N-element int4 array; verify that. We
+	 * don't need to use deconstruct_array() since the array data is just
+	 * going to look like a C array of N int4 values.
+	 */
+	if (ARR_NDIM(blocksarray) != 1 ||
+		ARR_HASNULL(blocksarray) ||
+		ARR_ELEMTYPE(blocksarray) != INT4OID)
+		elog(ERROR, "expected 1 dimensional int4 array");
+
+	stream_data.curblock = 0;
+	stream_data.nblocks = ARR_DIMS(blocksarray)[0];
+	stream_data.blocks = (uint32 *) ARR_DATA_PTR(blocksarray);
+
+	rel = relation_open(relid, AccessShareLock);
+
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										NULL,
+										rel,
+										MAIN_FORKNUM,
+										read_stream_for_blocks_cb,
+										&stream_data,
+										0);
+
+	for (int i = 0; i < stream_data.nblocks; i++)
+	{
+		Buffer		buf = read_stream_next_buffer(stream, NULL);
+		Datum		values[3] = {0};
+		bool		nulls[3] = {0};
+
+		if (!BufferIsValid(buf))
+			elog(ERROR, "read_stream_next_buffer() call %d is unexpectedly invalid", i);
+
+		values[0] = Int32GetDatum(i);
+		values[1] = UInt32GetDatum(stream_data.blocks[i]);
+		values[2] = UInt32GetDatum(buf);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+
+		ReleaseBuffer(buf);
+	}
+
+	if (read_stream_next_buffer(stream, NULL) != InvalidBuffer)
+		elog(ERROR, "read_stream_next_buffer() call %d is unexpectedly valid",
+			 stream_data.nblocks + 1);
+
+	read_stream_end(stream);
+
+	relation_close(rel, NoLock);
+
+	return (Datum) 0;
+}
+
+
 PG_FUNCTION_INFO_V1(handle_get);
 Datum
 handle_get(PG_FUNCTION_ARGS)
@@ -680,15 +808,98 @@ batch_end(PG_FUNCTION_ARGS)
 }
 
 #ifdef USE_INJECTION_POINTS
-extern PGDLLEXPORT void inj_io_short_read(const char *name,
-										  const void *private_data,
-										  void *arg);
+extern PGDLLEXPORT void inj_io_completion_hook(const char *name,
+											   const void *private_data,
+											   void *arg);
 extern PGDLLEXPORT void inj_io_reopen(const char *name,
 									  const void *private_data,
 									  void *arg);
 
-void
-inj_io_short_read(const char *name, const void *private_data, void *arg)
+static bool
+inj_io_short_read_matches(PgAioHandle *ioh)
+{
+	PGPROC	   *owner_proc;
+	int32		owner_pid;
+	PgAioTargetData *td;
+
+	if (!inj_io_error_state->enabled_short_read)
+		return false;
+
+	if (!inj_io_error_state->short_read_result_set)
+		return false;
+
+	owner_proc = GetPGProcByNumber(pgaio_io_get_owner(ioh));
+	owner_pid = owner_proc->pid;
+
+	if (inj_io_error_state->short_read_pid != 0 &&
+		inj_io_error_state->short_read_pid != owner_pid)
+		return false;
+
+	td = pgaio_io_get_target_data(ioh);
+
+	if (inj_io_error_state->short_read_relfilenode != InvalidOid &&
+		td->smgr.rlocator.relNumber != inj_io_error_state->short_read_relfilenode)
+		return false;
+
+	/*
+	 * Only shorten reads that are actually longer than the target size,
+	 * otherwise we can trigger over-reads.
+	 */
+	if (inj_io_error_state->short_read_result >= ioh->result)
+		return false;
+
+	return true;
+}
+
+static bool
+inj_io_completion_wait_matches(PgAioHandle *ioh)
+{
+	PGPROC	   *owner_proc;
+	int32		owner_pid;
+	PgAioTargetData *td;
+
+	if (!inj_io_error_state->enabled_completion_wait)
+		return false;
+
+	owner_proc = GetPGProcByNumber(pgaio_io_get_owner(ioh));
+	owner_pid = owner_proc->pid;
+
+	if (inj_io_error_state->completion_wait_pid != owner_pid)
+		return false;
+
+	td = pgaio_io_get_target_data(ioh);
+
+	if (inj_io_error_state->completion_wait_relfilenode != InvalidOid &&
+		td->smgr.rlocator.relNumber != inj_io_error_state->completion_wait_relfilenode)
+		return false;
+
+	return true;
+}
+
+static void
+inj_io_completion_wait_hook(const char *name, const void *private_data, void *arg)
+{
+	PgAioHandle *ioh = (PgAioHandle *) arg;
+
+	if (!inj_io_completion_wait_matches(ioh))
+		return;
+
+	ConditionVariablePrepareToSleep(&inj_io_error_state->cv);
+
+	while (true)
+	{
+		if (!inj_io_completion_wait_matches(ioh))
+			break;
+
+		ConditionVariableSleep(&inj_io_error_state->cv,
+							   inj_io_error_state->completion_wait_event);
+	}
+
+	ConditionVariableCancelSleep();
+}
+
+static void
+inj_io_short_read_hook(const char *name, const void *private_data, void *arg)
 {
 	PgAioHandle *ioh = (PgAioHandle *) arg;
 
@@ -697,58 +908,56 @@ inj_io_short_read(const char *name, const void *private_data, void *arg)
 				   inj_io_error_state->enabled_reopen),
 			errhidestmt(true), errhidecontext(true));
 
-	if (inj_io_error_state->enabled_short_read)
+	if (inj_io_short_read_matches(ioh))
 	{
+		struct iovec *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+		int32		old_result = ioh->result;
+		int32		new_result = inj_io_error_state->short_read_result;
+		int32		processed = 0;
+
+		ereport(LOG,
+				errmsg("short read inject point, changing result from %d to %d",
+					   old_result, new_result),
+				errhidestmt(true), errhidecontext(true));
+
 		/*
-		 * Only shorten reads that are actually longer than the target size,
-		 * otherwise we can trigger over-reads.
+		 * The underlying IO actually completed OK, and thus the "invalid"
+		 * portion of the IOV actually contains valid data. That can hide a
+		 * lot of problems, e.g. if we were to wrongly mark a buffer, that
+		 * wasn't read according to the shortened-read, IO as valid, the
+		 * contents would look valid and we might miss a bug.
+		 *
+		 * To avoid that, iterate through the IOV and zero out the "failed"
+		 * portion of the IO.
 		 */
-		if (inj_io_error_state->short_read_result_set
-			&& ioh->op == PGAIO_OP_READV
-			&& inj_io_error_state->short_read_result <= ioh->result)
+		for (int i = 0; i < ioh->op_data.read.iov_length; i++)
 		{
-			struct iovec *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
-			int32		old_result = ioh->result;
-			int32		new_result = inj_io_error_state->short_read_result;
-			int32		processed = 0;
-
-			ereport(LOG,
-					errmsg("short read inject point, changing result from %d to %d",
-						   old_result, new_result),
-					errhidestmt(true), errhidecontext(true));
-
-			/*
-			 * The underlying IO actually completed OK, and thus the "invalid"
-			 * portion of the IOV actually contains valid data. That can hide
-			 * a lot of problems, e.g. if we were to wrongly mark a buffer,
-			 * that wasn't read according to the shortened-read, IO as valid,
-			 * the contents would look valid and we might miss a bug.
-			 *
-			 * To avoid that, iterate through the IOV and zero out the
-			 * "failed" portion of the IO.
-			 */
-			for (int i = 0; i < ioh->op_data.read.iov_length; i++)
+			if (processed + iov[i].iov_len <= new_result)
+				processed += iov[i].iov_len;
+			else if (processed <= new_result)
 			{
-				if (processed + iov[i].iov_len <= new_result)
-					processed += iov[i].iov_len;
-				else if (processed <= new_result)
-				{
-					uint32		ok_part = new_result - processed;
-
-					memset((char *) iov[i].iov_base + ok_part, 0, iov[i].iov_len - ok_part);
-					processed += iov[i].iov_len;
-				}
-				else
-				{
-					memset((char *) iov[i].iov_base, 0, iov[i].iov_len);
-				}
-			}
+				uint32		ok_part = new_result - processed;
 
-			ioh->result = new_result;
+				memset((char *) iov[i].iov_base + ok_part, 0, iov[i].iov_len - ok_part);
+				processed += iov[i].iov_len;
+			}
+			else
+			{
+				memset((char *) iov[i].iov_base, 0, iov[i].iov_len);
+			}
 		}
+
+		ioh->result = new_result;
 	}
 }
 
+void
+inj_io_completion_hook(const char *name, const void *private_data, void *arg)
+{
+	inj_io_completion_wait_hook(name, private_data, arg);
+	inj_io_short_read_hook(name, private_data, arg);
+}
+
 void
 inj_io_reopen(const char *name, const void *private_data, void *arg)
 {
@@ -762,6 +971,39 @@ inj_io_reopen(const char *name, const void *private_data, void *arg)
 }
 #endif
 
+PG_FUNCTION_INFO_V1(inj_io_completion_wait);
+Datum
+inj_io_completion_wait(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+	inj_io_error_state->enabled_completion_wait = true;
+	inj_io_error_state->completion_wait_pid =
+		PG_ARGISNULL(0) ? 0 : PG_GETARG_INT32(0);
+	inj_io_error_state->completion_wait_relfilenode =
+		PG_ARGISNULL(1) ? InvalidOid : PG_GETARG_OID(1);
+#else
+	elog(ERROR, "injection points not supported");
+#endif
+
+	PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_completion_continue);
+Datum
+inj_io_completion_continue(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+	inj_io_error_state->enabled_completion_wait = false;
+	inj_io_error_state->completion_wait_pid = 0;
+	inj_io_error_state->completion_wait_relfilenode = InvalidOid;
+	ConditionVariableBroadcast(&inj_io_error_state->cv);
+#else
+	elog(ERROR, "injection points not supported");
+#endif
+
+	PG_RETURN_VOID();
+}
+
 PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
 Datum
 inj_io_short_read_attach(PG_FUNCTION_ARGS)
@@ -771,6 +1013,10 @@ inj_io_short_read_attach(PG_FUNCTION_ARGS)
 	inj_io_error_state->short_read_result_set = !PG_ARGISNULL(0);
 	if (inj_io_error_state->short_read_result_set)
 		inj_io_error_state->short_read_result = PG_GETARG_INT32(0);
+	inj_io_error_state->short_read_pid =
+		PG_ARGISNULL(1) ? 0 : PG_GETARG_INT32(1);
+	inj_io_error_state->short_read_relfilenode =
+		PG_ARGISNULL(2) ? 0 : PG_GETARG_OID(2);
 #else
 	elog(ERROR, "injection points not supported");
 #endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 77e3c04144e..668faaa5615 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -303,6 +303,7 @@ BlockSampler
 BlockSamplerData
 BlockedProcData
 BlockedProcsData
+BlocksReadStreamData
 BlocktableEntry
 BloomBuildState
 BloomFilter
-- 
2.43.0



  [text/x-patch] v4-0004-test-fixes.patch (2.5K, 5-v4-0004-test-fixes.patch)
  download | inline diff:
From d1eb014b043d71702bff6d1ba11e90c1e7f0c17a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 3 Mar 2026 13:27:51 -0500
Subject: [PATCH v4 4/6] test fixes

The pump_until change is needed for the test to work reliably on BSD.
---
 src/test/modules/test_aio/t/004_read_stream.pl | 15 ++++++++-------
 src/test/modules/test_aio/test_aio--1.0.sql    |  2 +-
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/src/test/modules/test_aio/t/004_read_stream.pl b/src/test/modules/test_aio/t/004_read_stream.pl
index 89cfabbb1d3..f3fa018dd1c 100644
--- a/src/test/modules/test_aio/t/004_read_stream.pl
+++ b/src/test/modules/test_aio/t/004_read_stream.pl
@@ -35,7 +35,7 @@ $node->stop();
 
 foreach my $method (TestAio::supported_io_methods())
 {
-	$node->adjust_conf('postgresql.conf', 'io_method', 'worker');
+	$node->adjust_conf('postgresql.conf', 'io_method', $method);
 	$node->start();
 	test_io_method($method, $node);
 	$node->stop();
@@ -205,15 +205,13 @@ SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 5
 		$psql_a->{run}, $psql_a->{timeout},
 		\$psql_a->{stdout}, qr/\{0,2,5,7\}/);
 
-	$psql_b->{run}->pump_nb();
-	like(
-		$psql_b->{stderr},
-		qr/.*ERROR.*could not read blocks 5..5.*$/,
-		"$io_method: injected error occurred");
+	pump_until(
+		$psql_b->{run}, $psql_b->{timeout},
+		\$psql_b->{stderr}, qr/ERROR.*could not read blocks 5\.\.5/);
+	ok(1, "$io_method: injected error occurred");
 	$psql_b->{stderr} = '';
 	$psql_b->query_safe(qq/SELECT inj_io_short_read_detach();/);
 
-
 	ok(1,
 		qq/$io_method: read stream encounters failing IO by another backend/);
 
@@ -271,6 +269,9 @@ sub test_io_method
 	my $io_method = shift;
 	my $node = shift;
 
+	is($node->safe_psql('postgres', 'SHOW io_method'),
+		$io_method, "$io_method: io_method set correctly");
+
 	test_repeated_blocks($io_method, $node);
 
   SKIP:
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
index da7cc03829a..1cc4734a746 100644
--- a/src/test/modules/test_aio/test_aio--1.0.sql
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -45,7 +45,7 @@ CREATE FUNCTION buffer_create_toy(rel regclass, blockno int4)
 RETURNS pg_catalog.int4 STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
-CREATE FUNCTION buffer_call_start_io(buffer int, for_input bool, nowait bool, assign_io bool DEFAULT false)
+CREATE FUNCTION buffer_call_start_io(buffer int, for_input bool, nowait bool)
 RETURNS pg_catalog.bool STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
-- 
2.43.0



  [text/x-patch] v4-0005-Make-buffer-hit-helper.patch (5.9K, 6-v4-0005-Make-buffer-hit-helper.patch)
  download | inline diff:
From a893d82fe0accd77c807ca3b791713954e319a2c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 23 Jan 2026 13:54:02 -0500
Subject: [PATCH v4 5/6] Make buffer hit helper

Already two places count buffer hits, requiring quite a few lines of
code since we do accounting in so many places. Future commits will add
more locations, so refactor into a helper.

Reviewed-by: Nazir Bilal Yavuz <[email protected]>
---
 src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++--------------
 1 file changed, 56 insertions(+), 55 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d1babaff023..a749971ba7e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -639,6 +639,10 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 									  bool *foundPtr, IOContext io_context);
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
+
+static pg_attribute_always_inline void CountBufferHit(BufferAccessStrategy strategy,
+													  Relation rel, char persistence, SMgrRelation smgr,
+													  ForkNumber forknum, BlockNumber blocknum);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
@@ -1217,8 +1221,6 @@ PinBufferForBlock(Relation rel,
 				  bool *foundPtr)
 {
 	BufferDesc *bufHdr;
-	IOContext	io_context;
-	IOObject	io_object;
 
 	Assert(blockNum != P_NEW);
 
@@ -1227,17 +1229,6 @@ PinBufferForBlock(Relation rel,
 			persistence == RELPERSISTENCE_PERMANENT ||
 			persistence == RELPERSISTENCE_UNLOGGED));
 
-	if (persistence == RELPERSISTENCE_TEMP)
-	{
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-	}
-	else
-	{
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-	}
-
 	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
 									   smgr->smgr_rlocator.locator.spcOid,
 									   smgr->smgr_rlocator.locator.dbOid,
@@ -1245,18 +1236,11 @@ PinBufferForBlock(Relation rel,
 									   smgr->smgr_rlocator.backend);
 
 	if (persistence == RELPERSISTENCE_TEMP)
-	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
-		if (*foundPtr)
-			pgBufferUsage.local_blks_hit++;
-	}
 	else
-	{
 		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
-							 strategy, foundPtr, io_context);
-		if (*foundPtr)
-			pgBufferUsage.shared_blks_hit++;
-	}
+							 strategy, foundPtr, IOContextForStrategy(strategy));
+
 	if (rel)
 	{
 		/*
@@ -1265,22 +1249,10 @@ PinBufferForBlock(Relation rel,
 		 * zeroed instead), the per-relation stats always count them.
 		 */
 		pgstat_count_buffer_read(rel);
-		if (*foundPtr)
-			pgstat_count_buffer_hit(rel);
 	}
-	if (*foundPtr)
-	{
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  true);
-	}
+	if (*foundPtr)
+		CountBufferHit(strategy, rel, persistence, smgr, forkNum, blockNum);
 
 	return BufferDescriptorGetBuffer(bufHdr);
 }
@@ -1686,6 +1658,51 @@ ReadBuffersCanStartIO(Buffer buffer, bool nowait)
 	return ReadBuffersCanStartIOOnce(buffer, nowait);
 }
 
+/*
+ * We track various stats related to buffer hits. Because this is done in a
+ * few separate places, this helper exists for convenience.
+ */
+static pg_attribute_always_inline void
+CountBufferHit(BufferAccessStrategy strategy,
+			   Relation rel, char persistence, SMgrRelation smgr,
+			   ForkNumber forknum, BlockNumber blocknum)
+{
+	IOContext	io_context;
+	IOObject	io_object;
+
+	if (persistence == RELPERSISTENCE_TEMP)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum,
+									  blocknum,
+									  smgr->smgr_rlocator.locator.spcOid,
+									  smgr->smgr_rlocator.locator.dbOid,
+									  smgr->smgr_rlocator.locator.relNumber,
+									  smgr->smgr_rlocator.backend,
+									  true);
+
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_hit += 1;
+	else
+		pgBufferUsage.shared_blks_hit += 1;
+
+	if (rel)
+		pgstat_count_buffer_hit(rel);
+
+	pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
+
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageHit;
+}
+
 /*
  * Helper for WaitReadBuffers() that processes the results of a readv
  * operation, raising an error if necessary.
@@ -1981,25 +1998,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		 * must have started out as a miss in PinBufferForBlock(). The other
 		 * backend will track this as a 'read'.
 		 */
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
-										  operation->smgr->smgr_rlocator.locator.spcOid,
-										  operation->smgr->smgr_rlocator.locator.dbOid,
-										  operation->smgr->smgr_rlocator.locator.relNumber,
-										  operation->smgr->smgr_rlocator.backend,
-										  true);
-
-		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_hit += 1;
-		else
-			pgBufferUsage.shared_blks_hit += 1;
-
-		if (operation->rel)
-			pgstat_count_buffer_hit(operation->rel);
-
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
-
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		CountBufferHit(operation->strategy, operation->rel, persistence,
+					   operation->smgr, forknum,
+					   blocknum + operation->nblocks_done);
 	}
 	else
 	{
-- 
2.43.0



  [text/x-patch] v4-0006-Don-t-wait-for-already-in-progress-IO.patch (20.9K, 7-v4-0006-Don-t-wait-for-already-in-progress-IO.patch)
  download | inline diff:
From d6a2d6d3316f33f5a7dfdd1a7084ea230d26ae3b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 23 Jan 2026 14:00:31 -0500
Subject: [PATCH v4 6/6] Don't wait for already in-progress IO
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When a backend attempts to start a read on a buffer and finds that I/O
is already in progress, it previously waited for that I/O to complete
before initiating reads for any other buffers. Although the backend must
still wait for the I/O to finish when later acquiring the buffer, it
should not need to wait at read start time. Other buffers may be
available for I/O, and in some workloads this waiting significantly
reduces concurrency.

For example, index scans may repeatedly request the same heap block. If
the backend waits each time it encounters an in-progress read, the
access pattern effectively degenerates into synchronous I/O. By
introducing the concept of foreign I/O operations, a backend can record
the buffer’s wait reference and defer waiting until WaitReadBuffers()
when it actually acquires the buffer.

In rare cases, a backend may still need to wait when starting a read if
it encounters a buffer after another backend has set BM_IO_IN_PROGRESS
but before the buffer descriptor’s wait reference has been set. Such
windows should be brief and uncommon.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
---
 src/backend/storage/buffer/bufmgr.c | 485 ++++++++++++++++++----------
 src/include/storage/bufmgr.h        |   1 +
 src/tools/pgindent/typedefs.list    |   1 +
 3 files changed, 324 insertions(+), 163 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a749971ba7e..f8205c3b845 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -170,6 +170,21 @@ typedef struct SMgrSortArray
 	SMgrRelation srel;
 } SMgrSortArray;
 
+
+/*
+ * In AsyncReadBuffers(), when preparing a buffer for reading and setting
+ * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may
+ * already contain the desired block. AsyncReadBuffers() must distinguish
+ * between these cases (and the case where it should initiate I/O) so it can
+ * mark an in-progress buffer as foreign I/O rather than waiting on it.
+ */
+typedef enum PrepareReadBuffer_Status
+{
+	READ_BUFFER_ALREADY_DONE,
+	READ_BUFFER_IN_PROGRESS,
+	READ_BUFFER_READY_FOR_IO,
+} PrepareReadBuffer_Status;
+
 /* GUC variables */
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
@@ -1619,45 +1634,6 @@ CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete)
 #endif
 }
 
-/* helper for ReadBuffersCanStartIO(), to avoid repetition */
-static inline bool
-ReadBuffersCanStartIOOnce(Buffer buffer, bool nowait)
-{
-	if (BufferIsLocal(buffer))
-		return StartLocalBufferIO(GetLocalBufferDescriptor(-buffer - 1),
-								  true, nowait);
-	else
-		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
-}
-
-/*
- * Helper for AsyncReadBuffers that tries to get the buffer ready for IO.
- */
-static inline bool
-ReadBuffersCanStartIO(Buffer buffer, bool nowait)
-{
-	/*
-	 * If this backend currently has staged IO, we need to submit the pending
-	 * IO before waiting for the right to issue IO, to avoid the potential for
-	 * deadlocks (and, more commonly, unnecessary delays for other backends).
-	 */
-	if (!nowait && pgaio_have_staged())
-	{
-		if (ReadBuffersCanStartIOOnce(buffer, true))
-			return true;
-
-		/*
-		 * Unfortunately StartBufferIO() returning false doesn't allow to
-		 * distinguish between the buffer already being valid and IO already
-		 * being in progress. Since IO already being in progress is quite
-		 * rare, this approach seems fine.
-		 */
-		pgaio_submit_staged();
-	}
-
-	return ReadBuffersCanStartIOOnce(buffer, nowait);
-}
-
 /*
  * We track various stats related to buffer hits. Because this is done in a
  * few separate places, this helper exists for convenience.
@@ -1807,7 +1783,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			 *
 			 * we first check if we already know the IO is complete.
 			 */
-			if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
+			if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
 				!pgaio_wref_check_done(&operation->io_wref))
 			{
 				instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
@@ -1826,11 +1802,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				Assert(pgaio_wref_check_done(&operation->io_wref));
 			}
 
-			/*
-			 * We now are sure the IO completed. Check the results. This
-			 * includes reporting on errors if there were any.
-			 */
-			ProcessReadBuffersResult(operation);
+			if (unlikely(operation->foreign_io))
+			{
+				Buffer		buffer = operation->buffers[operation->nblocks_done];
+				BufferDesc *desc = BufferIsLocal(buffer) ?
+					GetLocalBufferDescriptor(-buffer - 1) :
+					GetBufferDescriptor(buffer - 1);
+				uint32		buf_state = pg_atomic_read_u64(&desc->state);
+
+				if (buf_state & BM_VALID)
+				{
+					operation->nblocks_done += 1;
+					Assert(operation->nblocks_done <= operation->nblocks);
+
+					CountBufferHit(operation->strategy,
+								   operation->rel, operation->persistence,
+								   operation->smgr, operation->forknum,
+								   operation->blocknum + operation->nblocks_done);
+				}
+			}
+			else
+			{
+				/*
+				 * We now are sure the IO completed. Check the results. This
+				 * includes reporting on errors if there were any.
+				 */
+				ProcessReadBuffersResult(operation);
+			}
 		}
 
 		/*
@@ -1861,6 +1859,163 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	/* NB: READ_DONE tracepoint was already executed in completion callback */
 }
 
+/*
+ * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
+ * avoid an external function call.
+ */
+static PrepareReadBuffer_Status
+PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
+							Buffer buffer)
+{
+	BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
+	uint64		buf_state = pg_atomic_read_u64(&desc->state);
+
+	/* Already valid, no work to do */
+	if (buf_state & BM_VALID)
+	{
+		pgaio_wref_clear(&operation->io_wref);
+		return READ_BUFFER_ALREADY_DONE;
+	}
+
+	pgaio_submit_staged();
+
+	if (pgaio_wref_valid(&desc->io_wref))
+	{
+		operation->io_wref = desc->io_wref;
+		operation->foreign_io = true;
+		return READ_BUFFER_IN_PROGRESS;
+	}
+
+	/*
+	 * While it is possible for a buffer to have been prepared for IO but not
+	 * yet had its wait reference set, there's no way for us to know that for
+	 * temporary buffers. Thus, we'll prepare for own IO on this buffer.
+	 */
+	return READ_BUFFER_READY_FOR_IO;
+}
+
+/*
+ * Try to start IO on the first buffer in a new run of blocks. If AIO is in
+ * progress, be it in this backend or another backend, we just associate the
+ * wait reference with the operation and wait in WaitReadBuffers(). This turns
+ * out to be important for performance in two workloads:
+ *
+ * 1) A read stream that has to read the same block multiple times within the
+ *    readahead distance. This can happen e.g. for the table accesses of an
+ *    index scan.
+ *
+ * 2) Concurrent scans by multiple backends on the same relation.
+ *
+ * If we were to synchronously wait for the in-progress IO, we'd not be able
+ * to keep enough I/O in flight.
+ *
+ * If we do find there is ongoing I/O for the buffer, we set up a 1-block
+ * ReadBuffersOperation that WaitReadBuffers then can wait on.
+ *
+ * It's possible that another backend has started IO on the buffer but not yet
+ * set its wait reference. In this case, we have no choice but to wait for
+ * either the wait reference to be valid or the IO to be done.
+ */
+static PrepareReadBuffer_Status
+PrepareNewReadBufferIO(ReadBuffersOperation *operation,
+					   Buffer buffer)
+{
+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+		return PrepareNewLocalReadBufferIO(operation, buffer);
+
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	desc = GetBufferDescriptor(buffer - 1);
+
+	for (;;)
+	{
+		buf_state = LockBufHdr(desc);
+
+		/* Already valid, no work to do */
+		if (buf_state & BM_VALID)
+		{
+			UnlockBufHdr(desc);
+			pgaio_wref_clear(&operation->io_wref);
+			return READ_BUFFER_ALREADY_DONE;
+		}
+
+		if (buf_state & BM_IO_IN_PROGRESS)
+		{
+			/* Join existing read */
+			if (pgaio_wref_valid(&desc->io_wref))
+			{
+				operation->io_wref = desc->io_wref;
+				operation->foreign_io = true;
+				UnlockBufHdr(desc);
+				return READ_BUFFER_IN_PROGRESS;
+			}
+
+			/*
+			 * If the wait ref is not valid but the IO is in progress, someone
+			 * else started IO but hasn't set the wait ref yet. We have no
+			 * choice but to wait until the IO completes.
+			 */
+			UnlockBufHdr(desc);
+			pgaio_submit_staged();
+			WaitIO(desc);
+			continue;
+		}
+
+		/*
+		 * No IO in progress and not already valid; We will start IO. It's
+		 * possible that the IO was in progress and never became valid because
+		 * the IO errored out. We'll do the IO ourselves.
+		 */
+		UnlockBufHdrExt(desc, buf_state, BM_IO_IN_PROGRESS, 0, 0);
+		ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+									  BufferDescriptorGetBuffer(desc));
+
+		return READ_BUFFER_READY_FOR_IO;
+	}
+}
+
+
+/*
+ * When building a new IO from multiple buffers, we won't include buffers
+ * that are already valid or already in progress. This function should only be
+ * used for additional adjacent buffers following the head buffer in a new IO.
+ *
+ * Returns true if the buffer was successfully prepared for IO and false if it
+ * is rejected and the read IO should not include this buffer.
+*/
+static bool
+PrepareAdditionalReadBuffer(Buffer buffer)
+{
+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+	{
+		desc = GetLocalBufferDescriptor(-buffer - 1);
+		buf_state = pg_atomic_read_u64(&desc->state);
+		/* Local buffers don't use BM_IO_IN_PROGRESS */
+		if (buf_state & BM_VALID || pgaio_wref_valid(&desc->io_wref))
+			return false;
+	}
+	else
+	{
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		desc = GetBufferDescriptor(buffer - 1);
+		buf_state = LockBufHdr(desc);
+		if (buf_state & (BM_VALID | BM_IO_IN_PROGRESS))
+		{
+			UnlockBufHdr(desc);
+			return false;
+		}
+		UnlockBufHdrExt(desc, buf_state, BM_IO_IN_PROGRESS, 0, 0);
+		ResourceOwnerRememberBufferIO(CurrentResourceOwner, buffer);
+	}
+
+	return true;
+}
+
 /*
  * Initiate IO for the ReadBuffersOperation
  *
@@ -1894,7 +2049,75 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	void	   *io_pages[MAX_IO_COMBINE_LIMIT];
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		did_start_io;
+	instr_time	io_start;
+	PrepareReadBuffer_Status status;
+
+	/*
+	 * We must get an IO handle before StartNewBufferReadIO(), as
+	 * pgaio_io_acquire() might block, which we don't want after setting
+	 * IO_IN_PROGRESS. If we don't need to do the IO, we'll release the
+	 * handle.
+	 *
+	 * If we need to wait for IO before we can get a handle, submit
+	 * already-staged IO first, so that other backends don't need to wait.
+	 * There wouldn't be a deadlock risk, as pgaio_io_acquire() just needs to
+	 * wait for already submitted IO, which doesn't require additional locks,
+	 * but it could still cause undesirable waits.
+	 *
+	 * A secondary benefit is that this would allow us to measure the time in
+	 * pgaio_io_acquire() without causing undue timer overhead in the common,
+	 * non-blocking, case.  However, currently the pgstats infrastructure
+	 * doesn't really allow that, as it a) asserts that an operation can't
+	 * have time without operations b) doesn't have an API to report
+	 * "accumulated" time.
+	 */
+	ioh = pgaio_io_acquire_nb(CurrentResourceOwner, &operation->io_return);
+	if (unlikely(!ioh))
+	{
+		pgaio_submit_staged();
+		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
+	}
+
+	operation->foreign_io = false;
+
+	/* Check if we can start IO on the first to-be-read buffer */
+	if ((status = PrepareNewReadBufferIO(operation, buffers[nblocks_done])) <
+		READ_BUFFER_READY_FOR_IO)
+	{
+		pgaio_io_release(ioh);
+		*nblocks_progress = 1;
+		if (status == READ_BUFFER_ALREADY_DONE)
+		{
+			/*
+			 * Someone else has already completed this block, we're done.
+			 *
+			 * When IO is necessary, ->nblocks_done is updated in
+			 * ProcessReadBuffersResult(), but that is not called if no IO is
+			 * necessary. Thus update here.
+			 */
+			operation->nblocks_done += 1;
+			Assert(operation->nblocks_done <= operation->nblocks);
+
+			/*
+			 * Report and track this as a 'hit' for this backend, even though
+			 * it must have started out as a miss in PinBufferForBlock(). The
+			 * other backend will track this as a 'read'.
+			 */
+			CountBufferHit(operation->strategy,
+						   operation->rel, operation->persistence,
+						   operation->smgr, operation->forknum,
+						   operation->blocknum + operation->nblocks_done);
+			return false;
+		}
+
+		/* The IO is already in-progress */
+		Assert(status == READ_BUFFER_IN_PROGRESS);
+		CheckReadBuffersOperation(operation, false);
+		return true;
+	}
+
+	/* We can read in at least the head buffer . */
+	Assert(status == READ_BUFFER_READY_FOR_IO);
 
 	/*
 	 * When this IO is executed synchronously, either because the caller will
@@ -1945,138 +2168,74 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	 */
 	pgstat_prepare_report_checksum_failure(operation->smgr->smgr_rlocator.locator.dbOid);
 
-	/*
-	 * Get IO handle before ReadBuffersCanStartIO(), as pgaio_io_acquire()
-	 * might block, which we don't want after setting IO_IN_PROGRESS.
-	 *
-	 * If we need to wait for IO before we can get a handle, submit
-	 * already-staged IO first, so that other backends don't need to wait.
-	 * There wouldn't be a deadlock risk, as pgaio_io_acquire() just needs to
-	 * wait for already submitted IO, which doesn't require additional locks,
-	 * but it could still cause undesirable waits.
-	 *
-	 * A secondary benefit is that this would allow us to measure the time in
-	 * pgaio_io_acquire() without causing undue timer overhead in the common,
-	 * non-blocking, case.  However, currently the pgstats infrastructure
-	 * doesn't really allow that, as it a) asserts that an operation can't
-	 * have time without operations b) doesn't have an API to report
-	 * "accumulated" time.
-	 */
-	ioh = pgaio_io_acquire_nb(CurrentResourceOwner, &operation->io_return);
-	if (unlikely(!ioh))
-	{
-		pgaio_submit_staged();
-
-		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
-	}
+	Assert(io_buffers[0] == buffers[nblocks_done]);
+	io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
+	io_buffers_len = 1;
 
 	/*
-	 * Check if we can start IO on the first to-be-read buffer.
-	 *
-	 * If an I/O is already in progress in another backend, we want to wait
-	 * for the outcome: either done, or something went wrong and we will
-	 * retry.
+	 * How many neighboring-on-disk blocks can we scatter-read into other
+	 * buffers at the same time?  In this case we don't wait if we see an I/O
+	 * already in progress.  We already set BM_IO_IN_PROGRESS for the head
+	 * block, so we should get on with that I/O as soon as possible.
 	 */
-	if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
+	for (int i = nblocks_done + 1; i < operation->nblocks; i++)
 	{
-		/*
-		 * Someone else has already completed this block, we're done.
-		 *
-		 * When IO is necessary, ->nblocks_done is updated in
-		 * ProcessReadBuffersResult(), but that is not called if no IO is
-		 * necessary. Thus update here.
-		 */
-		operation->nblocks_done += 1;
-		*nblocks_progress = 1;
-
-		pgaio_io_release(ioh);
-		pgaio_wref_clear(&operation->io_wref);
-		did_start_io = false;
+		if (!PrepareAdditionalReadBuffer(buffers[i]))
+			break;
+		/* Must be consecutive block numbers. */
+		Assert(BufferGetBlockNumber(buffers[i - 1]) ==
+			   BufferGetBlockNumber(buffers[i]) - 1);
+		Assert(io_buffers[io_buffers_len] == buffers[i]);
 
-		/*
-		 * Report and track this as a 'hit' for this backend, even though it
-		 * must have started out as a miss in PinBufferForBlock(). The other
-		 * backend will track this as a 'read'.
-		 */
-		CountBufferHit(operation->strategy, operation->rel, persistence,
-					   operation->smgr, forknum,
-					   blocknum + operation->nblocks_done);
+		io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
 	}
-	else
-	{
-		instr_time	io_start;
-
-		/* We found a buffer that we need to read in. */
-		Assert(io_buffers[0] == buffers[nblocks_done]);
-		io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
-		io_buffers_len = 1;
-
-		/*
-		 * How many neighboring-on-disk blocks can we scatter-read into other
-		 * buffers at the same time?  In this case we don't wait if we see an
-		 * I/O already in progress.  We already set BM_IO_IN_PROGRESS for the
-		 * head block, so we should get on with that I/O as soon as possible.
-		 */
-		for (int i = nblocks_done + 1; i < operation->nblocks; i++)
-		{
-			if (!ReadBuffersCanStartIO(buffers[i], true))
-				break;
-			/* Must be consecutive block numbers. */
-			Assert(BufferGetBlockNumber(buffers[i - 1]) ==
-				   BufferGetBlockNumber(buffers[i]) - 1);
-			Assert(io_buffers[io_buffers_len] == buffers[i]);
 
-			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
-		}
+	/* get a reference to wait for in WaitReadBuffers() */
+	pgaio_io_get_wref(ioh, &operation->io_wref);
 
-		/* get a reference to wait for in WaitReadBuffers() */
-		pgaio_io_get_wref(ioh, &operation->io_wref);
+	/* provide the list of buffers to the completion callbacks */
+	pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
 
-		/* provide the list of buffers to the completion callbacks */
-		pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
+	pgaio_io_register_callbacks(ioh,
+								persistence == RELPERSISTENCE_TEMP ?
+								PGAIO_HCB_LOCAL_BUFFER_READV :
+								PGAIO_HCB_SHARED_BUFFER_READV,
+								flags);
 
-		pgaio_io_register_callbacks(ioh,
-									persistence == RELPERSISTENCE_TEMP ?
-									PGAIO_HCB_LOCAL_BUFFER_READV :
-									PGAIO_HCB_SHARED_BUFFER_READV,
-									flags);
+	pgaio_io_set_flag(ioh, ioh_flags);
 
-		pgaio_io_set_flag(ioh, ioh_flags);
+	/* ---
+	* Even though we're trying to issue IO asynchronously, track the time
+	* in smgrstartreadv():
+	* - if io_method == IOMETHOD_SYNC, we will always perform the IO
+	*   immediately
+	* - the io method might not support the IO (e.g. worker IO for a temp
+	*   table)
+	* ---
+	*/
+	io_start = pgstat_prepare_io_time(track_io_timing);
+	smgrstartreadv(ioh, operation->smgr, forknum,
+				   blocknum + nblocks_done,
+				   io_pages, io_buffers_len);
+	pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
+							io_start, 1, io_buffers_len * BLCKSZ);
 
-		/* ---
-		 * Even though we're trying to issue IO asynchronously, track the time
-		 * in smgrstartreadv():
-		 * - if io_method == IOMETHOD_SYNC, we will always perform the IO
-		 *   immediately
-		 * - the io method might not support the IO (e.g. worker IO for a temp
-		 *   table)
-		 * ---
-		 */
-		io_start = pgstat_prepare_io_time(track_io_timing);
-		smgrstartreadv(ioh, operation->smgr, forknum,
-					   blocknum + nblocks_done,
-					   io_pages, io_buffers_len);
-		pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
-								io_start, 1, io_buffers_len * BLCKSZ);
-
-		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_read += io_buffers_len;
-		else
-			pgBufferUsage.shared_blks_read += io_buffers_len;
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_read += io_buffers_len;
+	else
+		pgBufferUsage.shared_blks_read += io_buffers_len;
 
-		/*
-		 * Track vacuum cost when issuing IO, not after waiting for it.
-		 * Otherwise we could end up issuing a lot of IO in a short timespan,
-		 * despite a low cost limit.
-		 */
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	/*
+	 * Track vacuum cost when issuing IO, not after waiting for it. Otherwise
+	 * we could end up issuing a lot of IO in a short timespan, despite a low
+	 * cost limit.
+	 */
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
 
-		*nblocks_progress = io_buffers_len;
-		did_start_io = true;
-	}
+	*nblocks_progress = io_buffers_len;
 
-	return did_start_io;
+	return true;
 }
 
 /*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a40adf6b2a8..1358fc7fa64 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -147,6 +147,7 @@ struct ReadBuffersOperation
 	int			flags;
 	int16		nblocks;
 	int16		nblocks_done;
+	bool		foreign_io;
 	PgAioWaitRef io_wref;
 	PgAioReturn io_return;
 };
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 668faaa5615..a656bbf9110 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2345,6 +2345,7 @@ PredicateLockData
 PredicateLockTargetType
 PrefetchBufferResult
 PrepParallelRestorePtrType
+PrepareReadBuffer_Status
 PrepareStmt
 PreparedStatement
 PresortedKeyData
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-03 20:07         ` Melanie Plageman <[email protected]>
  1 sibling, 0 replies; 31+ messages in thread

From: Melanie Plageman @ 2026-03-03 20:07 UTC (permalink / raw)
  To: Nazir Bilal Yavuz <[email protected]>; +Cc: Thomas Munro <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Tue, Mar 3, 2026 at 2:47 PM Melanie Plageman
<[email protected]> wrote:
>
> Some ideas I had that I didn't include in v4-0003 because its Andres
> patch and is subjective:

I was just looking at another patch and realized test_read_stream.c
exists. I wonder if any of the code this patch set adds to test_aio.c
should be there? On the one hand the foreign IO test is testing AIO
behavior and not really read stream behavior even though it invokes
the read stream. So maybe it doesn't really belong in
0004_read_stream.pl?  The repeated blocks test is more of a read
stream test. Anyway, just a thought I had.

- Melanie





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-06 13:18         ` Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  1 sibling, 1 reply; 31+ messages in thread

From: Nazir Bilal Yavuz @ 2026-03-06 13:18 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Thomas Munro <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

On Tue, 3 Mar 2026 at 22:47, Melanie Plageman <[email protected]> wrote:
>
> On Thu, Feb 5, 2026 at 11:56 AM Nazir Bilal Yavuz <[email protected]> wrote:
> >
>
> I did finally review Andres' test patches and have included my review
> feedback here as well.
>
> "aio: Refactor tests in preparation for more tests" (v4-0001) looks
> good to me as well. I included one tiny idea AI suggested to me in a
> follow-on patch (v4-0002).

This makes sense.

> > diff --git a/src/test/modules/test_aio/t/004_read_stream.pl
> > b/src/test/modules/test_aio/t/004_read_stream.pl
> > +foreach my $method (TestAio::supported_io_methods())
> > +{
> > +    $node->adjust_conf('postgresql.conf', 'io_method', 'worker');
> > +    $node->start();
> > +    test_io_method($method, $node);
> > +    $node->stop();
> > +}
> >
> > This seems wrong, we always test io_method=worker. I think we need to
> > replace 'worker' with the $method. Also, we can add check below to the
> > test_io_method function in the 004_read_stream.pl:
> > ```
> >     is($node->safe_psql('postgres', 'SHOW io_method'),
> >         $io_method, "$io_method: io_method set correctly");
>
> Good catch. Fixed. I also found a few other small things in this patch
> (v4-0003) which I put in v4-0004.

These look good.

> Some ideas I had that I didn't include in v4-0003 because its Andres
> patch and is subjective:
>
> For test_repeated_blocks, the first test:
>
>     # test miss of the same block twice in a row
>     $psql->query_safe(
>         qq/
> SELECT evict_rel('largeish');
> /);
>     $psql->query_safe(
>         qq/
> SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 2, 4, 4]);
> /);
>     ok(1, "$io_method: stream missing the same block repeatedly");
>
> It says that it will miss the same block repeatedly, is that because
> we won't start a read for any of the blocks until after
> read_stream_get_block has returned all of them? If so, could be
> clearer in the comment. Not everyone understands all the read stream
> internals.

I think we start a read of blocks because we hit stream->distance but
it doesn't affect any consecutive same block numbers. What I
understood is:

Since io_combine_limit is 1, there won't be any IO combining.

0th block (0), miss, distance is 1; StartReadBuffersImpl() and
WaitReadBuffers() are called for 0th block.
1th block (2), miss, distance is 2, StartReadBuffersImpl() is called.
2th block (2), miss, distance is 2, StartReadBuffersImpl() and
WaitReadBuffers() are called 1th block.
3th block (4), miss, distance is 4, StartReadBuffersImpl() is called.
4th block (4), miss, distance is 4, StartReadBuffersImpl() and
WaitReadBuffers() are called 2, 3 and 4th blocks.

> I know a lot of other tests do this, but I find it so hard to read the
> test with the SQL is totally left-aligned like that -- especially with
> comments interspersed. You can easily flow the queries on multiple
> lines and indent it more.

I agree with you.

> For test_repeated_blocks, the second test:
>
>     # test hit of the same block twice in a row
>     $psql->query_safe(
>         qq/
> SELECT evict_rel('largeish');
> /);
>     $psql->query_safe(
>         qq/
> SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 1, 2, 3, 4,
> 5, 6, 5, 4, 3, 2, 1, 0]);
> /);
>     ok(1, "$io_method: stream accessing same block");
>
> I assume that the second access of 2 is a hit because we actually did
> IO for the first one (unlike in the earlier case)?

I think so but to clarify, all second access of [2, 1, 0] blocks are hit; right?

> For test_inject_foreign:
>
> In general, I am not ramped up enough on injection point stuff to know
> if the actual new injection point stuff you added in test_aio.c is is
> correct and optimal, but I did review the read stream additions to
> test_aio.c and the tests added to 004_read_stream.pl.
>
> For test_inject_foreign, the 3rd test:
>
>     # Test read stream encountering two buffers that are undergoing the same
>     # IO, started by another backend
>
> I see that psql_b is requesting 3 blocks which can be combined into 1
> IO, which makes it different than the 1st foreign IO test case:
>
>     ###
>     # Test read stream encountering buffers undergoing IO in another backend,
>     # with the other backend's reads succeeding.
>     ###
>
> where psql_b only requests 1 but I don't really see how these are
> covering different code. Maybe if the read stream one (psql_a) is
> doing a combined IO it might exercise slightly different code, but
> otherwise I don't get it.

I think the main difference is that:

>     ###
>     # Test read stream encountering buffers undergoing IO in another backend,
>     # with the other backend's reads succeeding.
>     ###

SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish',
ARRAY[0, 2, 5, 7]);

We need to join waiting block number 5 and then start another IO for
block number 7.

>     # Test read stream encountering two buffers that are undergoing the same
>     # IO, started by another backend

SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish',
ARRAY[0, 2, 4]);

We need to join waiting block number 2 but after waiting for an IO, IO
for block number 4 should be already completed too. We don't need to
start IO like the other case.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
@ 2026-03-16 21:45           ` Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Melanie Plageman @ 2026-03-16 21:45 UTC (permalink / raw)
  To: Nazir Bilal Yavuz <[email protected]>; +Cc: Thomas Munro <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Thanks for the continued review!

Attached v5 adds some comments to the tests, fixes a few nits in the
actual code, and adds a commit to fix what I think is an existing
off-by-one error in TRACE_POSTGRESQL_BUFFER_READ_DONE.

On Fri, Mar 6, 2026 at 8:18 AM Nazir Bilal Yavuz <[email protected]> wrote:
>
> > For test_repeated_blocks, the first test:
> >
> >     # test miss of the same block twice in a row
> >     $psql->query_safe(
> >         qq/
> > SELECT evict_rel('largeish');
> > /);
> >     $psql->query_safe(
> >         qq/
> > SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 2, 2, 4, 4]);
> > /);
> >     ok(1, "$io_method: stream missing the same block repeatedly");
> >
> > It says that it will miss the same block repeatedly, is that because
> > we won't start a read for any of the blocks until after
> > read_stream_get_block has returned all of them? If so, could be
> > clearer in the comment. Not everyone understands all the read stream
> > internals.
>
> I think we start a read of blocks because we hit stream->distance but
> it doesn't affect any consecutive same block numbers. What I
> understood is:
>
> Since io_combine_limit is 1, there won't be any IO combining.
>
> 0th block (0), miss, distance is 1; StartReadBuffersImpl() and
> WaitReadBuffers() are called for 0th block.
> 1th block (2), miss, distance is 2, StartReadBuffersImpl() is called.
> 2th block (2), miss, distance is 2, StartReadBuffersImpl() and
> WaitReadBuffers() are called 1th block.
> 3th block (4), miss, distance is 4, StartReadBuffersImpl() is called.
> 4th block (4), miss, distance is 4, StartReadBuffersImpl() and
> WaitReadBuffers() are called 2, 3 and 4th blocks.

Makes sense. I've tried to add a comment to this effect.

> > I know a lot of other tests do this, but I find it so hard to read the
> > test with the SQL is totally left-aligned like that -- especially with
> > comments interspersed. You can easily flow the queries on multiple
> > lines and indent it more.
>
> I agree with you.

I did reflow the SQL. It does mean there will be a bunch of extra
whitespace sent to the server. Other tests do this, though. I wonder
how much it affects performance...

> > For test_repeated_blocks, the second test:
> >
> >     # test hit of the same block twice in a row
> >     $psql->query_safe(
> >         qq/
> > SELECT evict_rel('largeish');
> > /);
> >     $psql->query_safe(
> >         qq/
> > SELECT * FROM read_stream_for_blocks('largeish', ARRAY[0, 1, 2, 3, 4,
> > 5, 6, 5, 4, 3, 2, 1, 0]);
> > /);
> >     ok(1, "$io_method: stream accessing same block");
> >
> > I assume that the second access of 2 is a hit because we actually did
> > IO for the first one (unlike in the earlier case)?
>
> I think so but to clarify, all second access of [2, 1, 0] blocks are hit; right?

Yes. I tried expanding the comment to elaborate, but it just came out
awkward, so I left it the way it is.

> > For test_inject_foreign, the 3rd test:
> >
> >     # Test read stream encountering two buffers that are undergoing the same
> >     # IO, started by another backend
> >
> > I see that psql_b is requesting 3 blocks which can be combined into 1
> > IO, which makes it different than the 1st foreign IO test case:
> >
> >     ###
> >     # Test read stream encountering buffers undergoing IO in another backend,
> >     # with the other backend's reads succeeding.
> >     ###
> >
> > where psql_b only requests 1 but I don't really see how these are
> > covering different code. Maybe if the read stream one (psql_a) is
> > doing a combined IO it might exercise slightly different code, but
> > otherwise I don't get it.
>
> I think the main difference is that:
>
> >     ###
> >     # Test read stream encountering buffers undergoing IO in another backend,
> >     # with the other backend's reads succeeding.
> >     ###
>
> SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish',
> ARRAY[0, 2, 5, 7]);
>
> We need to join waiting block number 5 and then start another IO for
> block number 7.
>
> >     # Test read stream encountering two buffers that are undergoing the same
> >     # IO, started by another backend
>
> SELECT array_agg(blocknum) FROM read_stream_for_blocks('largeish',
> ARRAY[0, 2, 4]);
>
> We need to join waiting block number 2 but after waiting for an IO, IO
> for block number 4 should be already completed too. We don't need to
> start IO like the other case.

Ah, makes sense. Thanks!

- Melanie


Attachments:

  [text/x-patch] v5-0001-aio-Refactor-tests-in-preparation-for-more-tests.patch (10.8K, 2-v5-0001-aio-Refactor-tests-in-preparation-for-more-tests.patch)
  download | inline diff:
From fd1cb5a7d0e04ed70f387ed2c66670e3eff4f049 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 9 Sep 2025 10:14:34 -0400
Subject: [PATCH v5 1/5] aio: Refactor tests in preparation for more tests

In a future commit more AIO related tests are due to be introduced. However
001_aio.pl already is fairly large.

This commit introduces a new TestAio package with helpers for writing AIO
related tests. Then it uses the new helpers to simplify the existing
001_aio.pl by iterating over all supported io_methods. This will be
particularly helpful because additional methods already have been submitted.

Additionally this commit splits out testing of initdb using a non-default
method into its own test. While that test is somewhat important, it's fairly
slow and doesn't break that often. For development velocity it's helpful for
001_aio.pl to be faster.

While particularly the latter could benefit from being its own commit, it
seems to introduce more back-and-forth than it's worth.

Author: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/
---
 src/test/modules/test_aio/meson.build     |   1 +
 src/test/modules/test_aio/t/001_aio.pl    | 141 +++++++---------------
 src/test/modules/test_aio/t/003_initdb.pl |  71 +++++++++++
 src/test/modules/test_aio/t/TestAio.pm    |  90 ++++++++++++++
 4 files changed, 204 insertions(+), 99 deletions(-)
 create mode 100644 src/test/modules/test_aio/t/003_initdb.pl
 create mode 100644 src/test/modules/test_aio/t/TestAio.pm

diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
index fefa25bc5ab..18a797f3a3b 100644
--- a/src/test/modules/test_aio/meson.build
+++ b/src/test/modules/test_aio/meson.build
@@ -32,6 +32,7 @@ tests += {
     'tests': [
       't/001_aio.pl',
       't/002_io_workers.pl',
+      't/003_initdb.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_aio/t/001_aio.pl b/src/test/modules/test_aio/t/001_aio.pl
index 5c634ec3ca9..e18b2a2b8ae 100644
--- a/src/test/modules/test_aio/t/001_aio.pl
+++ b/src/test/modules/test_aio/t/001_aio.pl
@@ -7,126 +7,56 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+use FindBin;
+use lib $FindBin::RealBin;
 
-###
-# Test io_method=worker
-###
-my $node_worker = create_node('worker');
-$node_worker->start();
-
-test_generic('worker', $node_worker);
-SKIP:
-{
-	skip 'Injection points not supported by this build', 1
-	  unless $ENV{enable_injection_points} eq 'yes';
-	test_inject_worker('worker', $node_worker);
-}
+use TestAio;
 
-$node_worker->stop();
+my @methods = TestAio::supported_io_methods();
+my %nodes;
 
 
 ###
-# Test io_method=io_uring
+# Create and configure one instance for each io_method
 ###
 
-if (have_io_uring())
+foreach my $method (@methods)
 {
-	my $node_uring = create_node('io_uring');
-	$node_uring->start();
-	test_generic('io_uring', $node_uring);
-	$node_uring->stop();
-}
-
-
-###
-# Test io_method=sync
-###
-
-my $node_sync = create_node('sync');
+	my $node = PostgreSQL::Test::Cluster->new($method);
 
-# just to have one test not use the default auto-tuning
+	$nodes{$method} = $node;
+	$node->init();
+	$node->append_conf('postgresql.conf', "io_method=$method");
+	TestAio::configure($node);
+}
 
-$node_sync->append_conf(
+# Just to have one test not use the default auto-tuning
+$nodes{'sync'}->append_conf(
 	'postgresql.conf', qq(
-io_max_concurrency=4
+ io_max_concurrency=4
 ));
 
-$node_sync->start();
-test_generic('sync', $node_sync);
-$node_sync->stop();
-
-done_testing();
-
 
 ###
-# Test Helpers
+# Execute the tests for each io_method
 ###
 
-sub create_node
+foreach my $method (@methods)
 {
-	local $Test::Builder::Level = $Test::Builder::Level + 1;
-
-	my $io_method = shift;
+	my $node = $nodes{$method};
 
-	my $node = PostgreSQL::Test::Cluster->new($io_method);
-
-	# Want to test initdb for each IO method, otherwise we could just reuse
-	# the cluster.
-	#
-	# Unfortunately Cluster::init() puts PG_TEST_INITDB_EXTRA_OPTS after the
-	# options specified by ->extra, if somebody puts -c io_method=xyz in
-	# PG_TEST_INITDB_EXTRA_OPTS it would break this test. Fix that up if we
-	# detect it.
-	local $ENV{PG_TEST_INITDB_EXTRA_OPTS} = $ENV{PG_TEST_INITDB_EXTRA_OPTS};
-	if (defined $ENV{PG_TEST_INITDB_EXTRA_OPTS}
-		&& $ENV{PG_TEST_INITDB_EXTRA_OPTS} =~ m/io_method=/)
-	{
-		$ENV{PG_TEST_INITDB_EXTRA_OPTS} .= " -c io_method=$io_method";
-	}
-
-	$node->init(extra => [ '-c', "io_method=$io_method" ]);
-
-	$node->append_conf(
-		'postgresql.conf', qq(
-shared_preload_libraries=test_aio
-log_min_messages = 'DEBUG3'
-log_statement=all
-log_error_verbosity=default
-restart_after_crash=false
-temp_buffers=100
-));
+	$node->start();
+	test_io_method($method, $node);
+	$node->stop();
+}
 
-	# Even though we used -c io_method=... above, if TEMP_CONFIG sets
-	# io_method, it'd override the setting persisted at initdb time. While
-	# using (and later verifying) the setting from initdb provides some
-	# verification of having used the io_method during initdb, it's probably
-	# not worth the complication of only appending if the variable is set in
-	# in TEMP_CONFIG.
-	$node->append_conf(
-		'postgresql.conf', qq(
-io_method=$io_method
-));
+done_testing();
 
-	ok(1, "$io_method: initdb");
 
-	return $node;
-}
+###
+# Test Helpers
+###
 
-sub have_io_uring
-{
-	# To detect if io_uring is supported, we look at the error message for
-	# assigning an invalid value to an enum GUC, which lists all the valid
-	# options. We need to use -C to deal with running as administrator on
-	# windows, the superuser check is omitted if -C is used.
-	my ($stdout, $stderr) =
-	  run_command [qw(postgres -C invalid -c io_method=invalid)];
-	die "can't determine supported io_method values"
-	  unless $stderr =~ m/Available values: ([^\.]+)\./;
-	my $methods = $1;
-	note "supported io_method values are: $methods";
-
-	return ($methods =~ m/io_uring/) ? 1 : 0;
-}
 
 sub psql_like
 {
@@ -1490,8 +1420,8 @@ SELECT read_rel_block_ll('tbl_cs_fail', 3, nblocks=>1, zero_on_error=>true);),
 }
 
 
-# Run all tests that are supported for all io_methods
-sub test_generic
+# Run all tests that for the specified node / io_method
+sub test_io_method
 {
 	my $io_method = shift;
 	my $node = shift;
@@ -1526,10 +1456,23 @@ CHECKPOINT;
 	test_ignore_checksum($io_method, $node);
 	test_checksum_createdb($io_method, $node);
 
+	# generic injection tests
   SKIP:
 	{
 		skip 'Injection points not supported by this build', 1
 		  unless $ENV{enable_injection_points} eq 'yes';
 		test_inject($io_method, $node);
 	}
+
+	# worker specific injection tests
+	if ($io_method eq 'worker')
+	{
+	  SKIP:
+		{
+			skip 'Injection points not supported by this build', 1
+			  unless $ENV{enable_injection_points} eq 'yes';
+
+			test_inject_worker($io_method, $node);
+		}
+	}
 }
diff --git a/src/test/modules/test_aio/t/003_initdb.pl b/src/test/modules/test_aio/t/003_initdb.pl
new file mode 100644
index 00000000000..c03ae58d00a
--- /dev/null
+++ b/src/test/modules/test_aio/t/003_initdb.pl
@@ -0,0 +1,71 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+#
+# Test initdb for each IO method. This is done separately from 001_aio.pl, as
+# it isn't fast. This way the more commonly failing / hacked-on 001_aio.pl can
+# be iterated on more quickly.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use TestAio;
+
+
+foreach my $method (TestAio::supported_io_methods())
+{
+	test_create_node($method);
+}
+
+done_testing();
+
+
+sub test_create_node
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $io_method = shift;
+
+	my $node = PostgreSQL::Test::Cluster->new($io_method);
+
+	# Want to test initdb for each IO method, otherwise we could just reuse
+	# the cluster.
+	#
+	# Unfortunately Cluster::init() puts PG_TEST_INITDB_EXTRA_OPTS after the
+	# options specified by ->extra, if somebody puts -c io_method=xyz in
+	# PG_TEST_INITDB_EXTRA_OPTS it would break this test. Fix that up if we
+	# detect it.
+	local $ENV{PG_TEST_INITDB_EXTRA_OPTS} = $ENV{PG_TEST_INITDB_EXTRA_OPTS};
+	if (defined $ENV{PG_TEST_INITDB_EXTRA_OPTS}
+		&& $ENV{PG_TEST_INITDB_EXTRA_OPTS} =~ m/io_method=/)
+	{
+		$ENV{PG_TEST_INITDB_EXTRA_OPTS} .= " -c io_method=$io_method";
+	}
+
+	$node->init(extra => [ '-c', "io_method=$io_method" ]);
+
+	TestAio::configure($node);
+
+	# Even though we used -c io_method=... above, if TEMP_CONFIG sets
+	# io_method, it'd override the setting persisted at initdb time. While
+	# using (and later verifying) the setting from initdb provides some
+	# verification of having used the io_method during initdb, it's probably
+	# not worth the complication of only appending if the variable is set in
+	# in TEMP_CONFIG.
+	$node->append_conf(
+		'postgresql.conf', qq(
+io_method=$io_method
+));
+
+	ok(1, "$io_method: initdb");
+
+	$node->start();
+	$node->stop();
+	ok(1, "$io_method: start & stop");
+
+	return $node;
+}
diff --git a/src/test/modules/test_aio/t/TestAio.pm b/src/test/modules/test_aio/t/TestAio.pm
new file mode 100644
index 00000000000..5bc80a9b130
--- /dev/null
+++ b/src/test/modules/test_aio/t/TestAio.pm
@@ -0,0 +1,90 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+=pod
+
+=head1 NAME
+
+TestAio - helpers for writing AIO related tests
+
+=cut
+
+package TestAio;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+
+=pod
+
+=head1 METHODS
+
+=over
+
+=item TestAio::supported_io_methods()
+
+Return an array of all the supported values for the io_method GUC
+
+=cut
+
+sub supported_io_methods()
+{
+	my @io_methods = ('worker');
+
+	push(@io_methods, "io_uring") if have_io_uring();
+
+	# Return sync last, as it will least commonly fail
+	push(@io_methods, "sync");
+
+	return @io_methods;
+}
+
+
+=item TestAio::configure()
+
+Prepare a cluster for AIO test
+
+=cut
+
+sub configure
+{
+	my $node = shift;
+
+	$node->append_conf(
+		'postgresql.conf', qq(
+shared_preload_libraries=test_aio
+log_min_messages = 'DEBUG3'
+log_statement=all
+log_error_verbosity=default
+restart_after_crash=false
+temp_buffers=100
+));
+
+}
+
+
+=pod
+
+=item TestAio::have_io_uring()
+
+Return if io_uring is supported
+
+=cut
+
+sub have_io_uring
+{
+	# To detect if io_uring is supported, we look at the error message for
+	# assigning an invalid value to an enum GUC, which lists all the valid
+	# options. We need to use -C to deal with running as administrator on
+	# windows, the superuser check is omitted if -C is used.
+	my ($stdout, $stderr) =
+	  run_command [qw(postgres -C invalid -c io_method=invalid)];
+	die "can't determine supported io_method values"
+	  unless $stderr =~ m/Available values: ([^\.]+)\./;
+	my $methods = $1;
+	note "supported io_method values are: $methods";
+
+	return ($methods =~ m/io_uring/) ? 1 : 0;
+}
+
+1;
-- 
2.43.0



  [text/x-patch] v5-0002-test_aio-Add-read_stream-test-infrastructure-test.patch (23.1K, 3-v5-0002-test_aio-Add-read_stream-test-infrastructure-test.patch)
  download | inline diff:
From 43e0c31a5c262686243ee2ed5617954b5361a3ba Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Wed, 10 Sep 2025 14:00:02 -0400
Subject: [PATCH v5 2/5] test_aio: Add read_stream test infrastructure & tests

Author: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
---
 src/test/modules/test_aio/meson.build         |   1 +
 .../modules/test_aio/t/004_read_stream.pl     | 261 +++++++++++++
 src/test/modules/test_aio/test_aio--1.0.sql   |  24 +-
 src/test/modules/test_aio/test_aio.c          | 346 +++++++++++++++---
 src/tools/pgindent/typedefs.list              |   1 +
 5 files changed, 582 insertions(+), 51 deletions(-)
 create mode 100644 src/test/modules/test_aio/t/004_read_stream.pl

diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
index 18a797f3a3b..909f81d96c1 100644
--- a/src/test/modules/test_aio/meson.build
+++ b/src/test/modules/test_aio/meson.build
@@ -33,6 +33,7 @@ tests += {
       't/001_aio.pl',
       't/002_io_workers.pl',
       't/003_initdb.pl',
+      't/004_read_stream.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_aio/t/004_read_stream.pl b/src/test/modules/test_aio/t/004_read_stream.pl
new file mode 100644
index 00000000000..755d6dfc030
--- /dev/null
+++ b/src/test/modules/test_aio/t/004_read_stream.pl
@@ -0,0 +1,261 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use TestAio;
+
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+$node->init();
+
+$node->append_conf(
+	'postgresql.conf', qq(
+shared_preload_libraries=test_aio
+log_min_messages = 'DEBUG3'
+log_statement=all
+log_error_verbosity=default
+restart_after_crash=false
+temp_buffers=100
+max_connections=8
+io_method=worker
+));
+
+$node->start();
+test_setup($node);
+$node->stop();
+
+
+foreach my $method (TestAio::supported_io_methods())
+{
+	$node->adjust_conf('postgresql.conf', 'io_method', $method);
+	$node->start();
+	test_io_method($method, $node);
+	$node->stop();
+}
+
+done_testing();
+
+
+sub test_setup
+{
+	my $node = shift;
+
+	$node->safe_psql(
+		'postgres', qq(
+CREATE EXTENSION test_aio;
+
+CREATE TABLE largeish(k int not null) WITH (FILLFACTOR=10);
+INSERT INTO largeish(k) SELECT generate_series(1, 10000);
+));
+	ok(1, "setup");
+}
+
+
+sub test_repeated_blocks
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	my $psql = $node->background_psql('postgres', on_error_stop => 0);
+
+	# Preventing larger reads makes testing easier
+	$psql->query_safe(
+		qq/ SET io_combine_limit = 1; /);
+
+	# test miss of the same block twice in a row
+	$psql->query_safe(
+		qq/ SELECT evict_rel('largeish'); /);
+
+	# block 0 grows the distance enough that the stream will look ahead and try
+	# to start a pending read for block 2 (and later block 4) twice before
+	# returning any buffers.
+	$psql->query_safe(
+		qq/ SELECT * FROM read_stream_for_blocks('largeish',
+			ARRAY[0, 2, 2, 4, 4]); /);
+
+	ok(1, "$io_method: stream missing the same block repeatedly");
+
+	$psql->query_safe(
+		qq/ SELECT * FROM read_stream_for_blocks('largeish',
+			ARRAY[0, 2, 2, 4, 4]); /);
+	ok(1, "$io_method: stream hitting the same block repeatedly");
+
+	# test hit of the same block twice in a row
+	$psql->query_safe(
+		qq/ SELECT evict_rel('largeish'); /);
+	$psql->query_safe(
+		qq/ SELECT * FROM read_stream_for_blocks('largeish',
+			ARRAY[0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0]); /);
+	ok(1, "$io_method: stream accessing same block");
+
+	$psql->quit();
+}
+
+
+sub test_inject_foreign
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	my $psql_a = $node->background_psql('postgres', on_error_stop => 0);
+	my $psql_b = $node->background_psql('postgres', on_error_stop => 0);
+
+	my $pid_a = $psql_a->query_safe(qq/SELECT pg_backend_pid();/);
+
+
+	###
+	# Test read stream encountering buffers undergoing IO in another backend,
+	# with the other backend's reads succeeding.
+	###
+	$psql_a->query_safe(
+		qq/ SELECT evict_rel('largeish'); /);
+
+	$psql_b->query_safe(
+		qq/ SELECT inj_io_completion_wait(pid=>pg_backend_pid(),
+			relfilenode=>pg_relation_filenode('largeish')); /);
+
+	$psql_b->{stdin} .= qq/ SELECT read_rel_block_ll('largeish',
+		blockno=>5, nblocks=>1);\n/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until(
+		'postgres', qq/ SELECT wait_event FROM pg_stat_activity
+			WHERE wait_event = 'completion_wait'; /,
+		'completion_wait');
+
+	# Block 5 is undergoing IO in session b, so session a will move on to start
+	# a new IO for block 7.
+	$psql_a->{stdin} .= qq/ SELECT array_agg(blocknum) FROM
+		read_stream_for_blocks('largeish', ARRAY[0, 2, 5, 7]);\n/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres', qq/SELECT inj_io_completion_continue()/);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,5,7\}/);
+
+	ok(1, qq/$io_method: read stream encounters succeeding IO by another backend/);
+
+	###
+	# Test read stream encountering buffers undergoing IO in another backend,
+	# with the other backend's reads failing.
+	###
+	$psql_a->query_safe(
+		qq/ SELECT evict_rel('largeish'); /);
+
+	$psql_b->query_safe(
+		qq/ SELECT inj_io_completion_wait(pid=>pg_backend_pid(),
+			relfilenode=>pg_relation_filenode('largeish')); /);
+
+	$psql_b->query_safe(
+		qq/ SELECT inj_io_short_read_attach(-errno_from_string('EIO'),
+			pid=>pg_backend_pid(),
+			relfilenode=>pg_relation_filenode('largeish')); /);
+
+	$psql_b->{stdin} .= qq/ SELECT read_rel_block_ll('largeish',
+		blockno=>5, nblocks=>1);\n/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq/ SELECT wait_event FROM pg_stat_activity
+			WHERE wait_event = 'completion_wait'; /,
+		'completion_wait');
+
+	$psql_a->{stdin} .= qq/ SELECT array_agg(blocknum) FROM
+		read_stream_for_blocks('largeish', ARRAY[0, 2, 5, 7]);\n/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres', qq/SELECT inj_io_completion_continue()/);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,5,7\}/);
+
+	pump_until(
+		$psql_b->{run}, $psql_b->{timeout},
+		\$psql_b->{stderr}, qr/ERROR.*could not read blocks 5\.\.5/);
+	ok(1, "$io_method: injected error occurred");
+	$psql_b->{stderr} = '';
+	$psql_b->query_safe(qq/SELECT inj_io_short_read_detach();/);
+
+	ok(1,
+		qq/$io_method: read stream encounters failing IO by another backend/);
+
+
+	###
+	# Test read stream encountering two buffers that are undergoing the same
+	# IO, started by another backend.
+	###
+	$psql_a->query_safe(
+		qq/ SELECT evict_rel('largeish'); /);
+
+	$psql_b->query_safe(
+		qq/ SELECT inj_io_completion_wait(pid=>pg_backend_pid(),
+			relfilenode=>pg_relation_filenode('largeish')); /);
+
+	$psql_b->{stdin} .= qq/ SELECT read_rel_block_ll('largeish',
+		blockno=>2, nblocks=>3);\n/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq/ SELECT wait_event FROM pg_stat_activity
+			WHERE wait_event = 'completion_wait'; /,
+		'completion_wait');
+
+	# Blocks 2 and 4 are undergoing IO initiated by session a
+	$psql_a->{stdin} .= qq/ SELECT array_agg(blocknum) FROM
+		read_stream_for_blocks('largeish', ARRAY[0, 2, 4]);\n/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres',
+		qq/ SELECT inj_io_completion_continue() /);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,4\}/);
+
+	ok(1, qq/$io_method: read stream encounters two buffer read in one IO/);
+
+	$psql_a->quit();
+	$psql_b->quit();
+}
+
+
+sub test_io_method
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	is($node->safe_psql('postgres', 'SHOW io_method'),
+		$io_method, "$io_method: io_method set correctly");
+
+	test_repeated_blocks($io_method, $node);
+
+  SKIP:
+	{
+		skip 'Injection points not supported by this build', 1
+		  unless $ENV{enable_injection_points} eq 'yes';
+		test_inject_foreign($io_method, $node);
+	}
+}
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
index e495481c41e..1cc4734a746 100644
--- a/src/test/modules/test_aio/test_aio--1.0.sql
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -33,6 +33,10 @@ CREATE FUNCTION read_rel_block_ll(
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
+CREATE FUNCTION evict_rel(rel regclass)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
 CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
@@ -50,6 +54,14 @@ RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
 
+/*
+ * Read stream related functions
+ */
+CREATE FUNCTION read_stream_for_blocks(rel regclass, blocks int4[], OUT blockoff int4, OUT blocknum int4, OUT buf int4)
+RETURNS SETOF record STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
 
 /*
  * Handle related functions
@@ -91,8 +103,16 @@ AS 'MODULE_PATHNAME' LANGUAGE C;
 /*
  * Injection point related functions
  */
-CREATE FUNCTION inj_io_short_read_attach(result int)
-RETURNS pg_catalog.void STRICT
+CREATE FUNCTION inj_io_completion_wait(pid int DEFAULT NULL, relfilenode oid DEFAULT 0)
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_completion_continue()
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_attach(result int, pid int DEFAULT NULL, relfilenode oid DEFAULT 0)
+RETURNS pg_catalog.void
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
 CREATE FUNCTION inj_io_short_read_detach()
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
index b1aa8af9ec0..061f0c9f92a 100644
--- a/src/test/modules/test_aio/test_aio.c
+++ b/src/test/modules/test_aio/test_aio.c
@@ -19,17 +19,26 @@
 #include "postgres.h"
 
 #include "access/relation.h"
+#include "catalog/pg_type.h"
 #include "fmgr.h"
+#include "funcapi.h"
 #include "storage/aio.h"
 #include "storage/aio_internal.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/checksum.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/procnumber.h"
+#include "storage/read_stream.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/injection_point.h"
 #include "utils/rel.h"
+#include "utils/tuplestore.h"
+#include "utils/wait_event.h"
 
 
 PG_MODULE_MAGIC;
@@ -37,13 +46,30 @@ PG_MODULE_MAGIC;
 
 typedef struct InjIoErrorState
 {
+	ConditionVariable cv;
+
 	bool		enabled_short_read;
 	bool		enabled_reopen;
 
+	bool		enabled_completion_wait;
+	Oid			completion_wait_relfilenode;
+	pid_t		completion_wait_pid;
+	uint32		completion_wait_event;
+
 	bool		short_read_result_set;
+	Oid			short_read_relfilenode;
+	pid_t		short_read_pid;
 	int			short_read_result;
 } InjIoErrorState;
 
+typedef struct BlocksReadStreamData
+{
+	int			nblocks;
+	int			curblock;
+	uint32	   *blocks;
+} BlocksReadStreamData;
+
+
 static InjIoErrorState *inj_io_error_state;
 
 /* Shared memory init callbacks */
@@ -85,10 +111,13 @@ test_aio_shmem_startup(void)
 		inj_io_error_state->enabled_short_read = false;
 		inj_io_error_state->enabled_reopen = false;
 
+		ConditionVariableInit(&inj_io_error_state->cv);
+		inj_io_error_state->completion_wait_event = WaitEventInjectionPointNew("completion_wait");
+
 #ifdef USE_INJECTION_POINTS
 		InjectionPointAttach("aio-process-completion-before-shared",
 							 "test_aio",
-							 "inj_io_short_read",
+							 "inj_io_completion_hook",
 							 NULL,
 							 0);
 		InjectionPointLoad("aio-process-completion-before-shared");
@@ -384,7 +413,7 @@ read_rel_block_ll(PG_FUNCTION_ARGS)
 	if (nblocks <= 0 || nblocks > PG_IOV_MAX)
 		elog(ERROR, "nblocks is out of range");
 
-	rel = relation_open(relid, AccessExclusiveLock);
+	rel = relation_open(relid, AccessShareLock);
 
 	for (int i = 0; i < nblocks; i++)
 	{
@@ -458,6 +487,27 @@ read_rel_block_ll(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+PG_FUNCTION_INFO_V1(evict_rel);
+Datum
+evict_rel(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	int32		buffers_evicted,
+				buffers_flushed,
+				buffers_skipped;
+
+	rel = relation_open(relid, AccessExclusiveLock);
+
+	EvictRelUnpinnedBuffers(rel, &buffers_evicted, &buffers_flushed,
+							&buffers_skipped);
+
+	relation_close(rel, AccessExclusiveLock);
+
+
+	PG_RETURN_VOID();
+}
+
 PG_FUNCTION_INFO_V1(invalidate_rel_block);
 Datum
 invalidate_rel_block(PG_FUNCTION_ARGS)
@@ -610,6 +660,86 @@ buffer_call_terminate_io(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+
+static BlockNumber
+read_stream_for_blocks_cb(ReadStream *stream,
+						  void *callback_private_data,
+						  void *per_buffer_data)
+{
+	BlocksReadStreamData *stream_data = callback_private_data;
+
+	if (stream_data->curblock >= stream_data->nblocks)
+		return InvalidBlockNumber;
+	return stream_data->blocks[stream_data->curblock++];
+}
+
+PG_FUNCTION_INFO_V1(read_stream_for_blocks);
+Datum
+read_stream_for_blocks(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	ArrayType  *blocksarray = PG_GETARG_ARRAYTYPE_P(1);
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Relation	rel;
+	BlocksReadStreamData stream_data;
+	ReadStream *stream;
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	/*
+	 * We expect the input to be an N-element int4 array; verify that. We
+	 * don't need to use deconstruct_array() since the array data is just
+	 * going to look like a C array of N int4 values.
+	 */
+	if (ARR_NDIM(blocksarray) != 1 ||
+		ARR_HASNULL(blocksarray) ||
+		ARR_ELEMTYPE(blocksarray) != INT4OID)
+		elog(ERROR, "expected 1 dimensional int4 array");
+
+	stream_data.curblock = 0;
+	stream_data.nblocks = ARR_DIMS(blocksarray)[0];
+	stream_data.blocks = (uint32 *) ARR_DATA_PTR(blocksarray);
+
+	rel = relation_open(relid, AccessShareLock);
+
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										NULL,
+										rel,
+										MAIN_FORKNUM,
+										read_stream_for_blocks_cb,
+										&stream_data,
+										0);
+
+	for (int i = 0; i < stream_data.nblocks; i++)
+	{
+		Buffer		buf = read_stream_next_buffer(stream, NULL);
+		Datum		values[3] = {0};
+		bool		nulls[3] = {0};
+
+		if (!BufferIsValid(buf))
+			elog(ERROR, "read_stream_next_buffer() call %d is unexpectedly invalid", i);
+
+		values[0] = Int32GetDatum(i);
+		values[1] = UInt32GetDatum(stream_data.blocks[i]);
+		values[2] = UInt32GetDatum(buf);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+
+		ReleaseBuffer(buf);
+	}
+
+	if (read_stream_next_buffer(stream, NULL) != InvalidBuffer)
+		elog(ERROR, "read_stream_next_buffer() call %d is unexpectedly valid",
+			 stream_data.nblocks + 1);
+
+	read_stream_end(stream);
+
+	relation_close(rel, NoLock);
+
+	return (Datum) 0;
+}
+
+
 PG_FUNCTION_INFO_V1(handle_get);
 Datum
 handle_get(PG_FUNCTION_ARGS)
@@ -680,15 +810,98 @@ batch_end(PG_FUNCTION_ARGS)
 }
 
 #ifdef USE_INJECTION_POINTS
-extern PGDLLEXPORT void inj_io_short_read(const char *name,
-										  const void *private_data,
-										  void *arg);
+extern PGDLLEXPORT void inj_io_completion_hook(const char *name,
+											   const void *private_data,
+											   void *arg);
 extern PGDLLEXPORT void inj_io_reopen(const char *name,
 									  const void *private_data,
 									  void *arg);
 
-void
-inj_io_short_read(const char *name, const void *private_data, void *arg)
+static bool
+inj_io_short_read_matches(PgAioHandle *ioh)
+{
+	PGPROC	   *owner_proc;
+	int32		owner_pid;
+	PgAioTargetData *td;
+
+	if (!inj_io_error_state->enabled_short_read)
+		return false;
+
+	if (!inj_io_error_state->short_read_result_set)
+		return false;
+
+	owner_proc = GetPGProcByNumber(pgaio_io_get_owner(ioh));
+	owner_pid = owner_proc->pid;
+
+	if (inj_io_error_state->short_read_pid != 0 &&
+		inj_io_error_state->short_read_pid != owner_pid)
+		return false;
+
+	td = pgaio_io_get_target_data(ioh);
+
+	if (inj_io_error_state->short_read_relfilenode != InvalidOid &&
+		td->smgr.rlocator.relNumber != inj_io_error_state->short_read_relfilenode)
+		return false;
+
+	/*
+	 * Only shorten reads that are actually longer than the target size,
+	 * otherwise we can trigger over-reads.
+	 */
+	if (inj_io_error_state->short_read_result >= ioh->result)
+		return false;
+
+	return true;
+}
+
+static bool
+inj_io_completion_wait_matches(PgAioHandle *ioh)
+{
+	PGPROC	   *owner_proc;
+	int32		owner_pid;
+	PgAioTargetData *td;
+
+	if (!inj_io_error_state->enabled_completion_wait)
+		return false;
+
+	owner_proc = GetPGProcByNumber(pgaio_io_get_owner(ioh));
+	owner_pid = owner_proc->pid;
+
+	if (inj_io_error_state->completion_wait_pid != owner_pid)
+		return false;
+
+	td = pgaio_io_get_target_data(ioh);
+
+	if (inj_io_error_state->completion_wait_relfilenode != InvalidOid &&
+		td->smgr.rlocator.relNumber != inj_io_error_state->completion_wait_relfilenode)
+		return false;
+
+	return true;
+}
+
+static void
+inj_io_completion_wait_hook(const char *name, const void *private_data, void *arg)
+{
+	PgAioHandle *ioh = (PgAioHandle *) arg;
+
+	if (!inj_io_completion_wait_matches(ioh))
+		return;
+
+	ConditionVariablePrepareToSleep(&inj_io_error_state->cv);
+
+	while (true)
+	{
+		if (!inj_io_completion_wait_matches(ioh))
+			break;
+
+		ConditionVariableSleep(&inj_io_error_state->cv,
+							   inj_io_error_state->completion_wait_event);
+	}
+
+	ConditionVariableCancelSleep();
+}
+
+static void
+inj_io_short_read_hook(const char *name, const void *private_data, void *arg)
 {
 	PgAioHandle *ioh = (PgAioHandle *) arg;
 
@@ -697,58 +910,56 @@ inj_io_short_read(const char *name, const void *private_data, void *arg)
 				   inj_io_error_state->enabled_reopen),
 			errhidestmt(true), errhidecontext(true));
 
-	if (inj_io_error_state->enabled_short_read)
+	if (inj_io_short_read_matches(ioh))
 	{
+		struct iovec *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+		int32		old_result = ioh->result;
+		int32		new_result = inj_io_error_state->short_read_result;
+		int32		processed = 0;
+
+		ereport(LOG,
+				errmsg("short read inject point, changing result from %d to %d",
+					   old_result, new_result),
+				errhidestmt(true), errhidecontext(true));
+
 		/*
-		 * Only shorten reads that are actually longer than the target size,
-		 * otherwise we can trigger over-reads.
+		 * The underlying IO actually completed OK, and thus the "invalid"
+		 * portion of the IOV actually contains valid data. That can hide a
+		 * lot of problems, e.g. if we were to wrongly mark a buffer, that
+		 * wasn't read according to the shortened-read, IO as valid, the
+		 * contents would look valid and we might miss a bug.
+		 *
+		 * To avoid that, iterate through the IOV and zero out the "failed"
+		 * portion of the IO.
 		 */
-		if (inj_io_error_state->short_read_result_set
-			&& ioh->op == PGAIO_OP_READV
-			&& inj_io_error_state->short_read_result <= ioh->result)
+		for (int i = 0; i < ioh->op_data.read.iov_length; i++)
 		{
-			struct iovec *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
-			int32		old_result = ioh->result;
-			int32		new_result = inj_io_error_state->short_read_result;
-			int32		processed = 0;
-
-			ereport(LOG,
-					errmsg("short read inject point, changing result from %d to %d",
-						   old_result, new_result),
-					errhidestmt(true), errhidecontext(true));
-
-			/*
-			 * The underlying IO actually completed OK, and thus the "invalid"
-			 * portion of the IOV actually contains valid data. That can hide
-			 * a lot of problems, e.g. if we were to wrongly mark a buffer,
-			 * that wasn't read according to the shortened-read, IO as valid,
-			 * the contents would look valid and we might miss a bug.
-			 *
-			 * To avoid that, iterate through the IOV and zero out the
-			 * "failed" portion of the IO.
-			 */
-			for (int i = 0; i < ioh->op_data.read.iov_length; i++)
+			if (processed + iov[i].iov_len <= new_result)
+				processed += iov[i].iov_len;
+			else if (processed <= new_result)
 			{
-				if (processed + iov[i].iov_len <= new_result)
-					processed += iov[i].iov_len;
-				else if (processed <= new_result)
-				{
-					uint32		ok_part = new_result - processed;
-
-					memset((char *) iov[i].iov_base + ok_part, 0, iov[i].iov_len - ok_part);
-					processed += iov[i].iov_len;
-				}
-				else
-				{
-					memset((char *) iov[i].iov_base, 0, iov[i].iov_len);
-				}
-			}
+				uint32		ok_part = new_result - processed;
 
-			ioh->result = new_result;
+				memset((char *) iov[i].iov_base + ok_part, 0, iov[i].iov_len - ok_part);
+				processed += iov[i].iov_len;
+			}
+			else
+			{
+				memset((char *) iov[i].iov_base, 0, iov[i].iov_len);
+			}
 		}
+
+		ioh->result = new_result;
 	}
 }
 
+void
+inj_io_completion_hook(const char *name, const void *private_data, void *arg)
+{
+	inj_io_completion_wait_hook(name, private_data, arg);
+	inj_io_short_read_hook(name, private_data, arg);
+}
+
 void
 inj_io_reopen(const char *name, const void *private_data, void *arg)
 {
@@ -762,6 +973,39 @@ inj_io_reopen(const char *name, const void *private_data, void *arg)
 }
 #endif
 
+PG_FUNCTION_INFO_V1(inj_io_completion_wait);
+Datum
+inj_io_completion_wait(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+	inj_io_error_state->enabled_completion_wait = true;
+	inj_io_error_state->completion_wait_pid =
+		PG_ARGISNULL(0) ? 0 : PG_GETARG_INT32(0);
+	inj_io_error_state->completion_wait_relfilenode =
+		PG_ARGISNULL(1) ? InvalidOid : PG_GETARG_OID(1);
+#else
+	elog(ERROR, "injection points not supported");
+#endif
+
+	PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_completion_continue);
+Datum
+inj_io_completion_continue(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+	inj_io_error_state->enabled_completion_wait = false;
+	inj_io_error_state->completion_wait_pid = 0;
+	inj_io_error_state->completion_wait_relfilenode = InvalidOid;
+	ConditionVariableBroadcast(&inj_io_error_state->cv);
+#else
+	elog(ERROR, "injection points not supported");
+#endif
+
+	PG_RETURN_VOID();
+}
+
 PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
 Datum
 inj_io_short_read_attach(PG_FUNCTION_ARGS)
@@ -771,6 +1015,10 @@ inj_io_short_read_attach(PG_FUNCTION_ARGS)
 	inj_io_error_state->short_read_result_set = !PG_ARGISNULL(0);
 	if (inj_io_error_state->short_read_result_set)
 		inj_io_error_state->short_read_result = PG_GETARG_INT32(0);
+	inj_io_error_state->short_read_pid =
+		PG_ARGISNULL(1) ? 0 : PG_GETARG_INT32(1);
+	inj_io_error_state->short_read_relfilenode =
+		PG_ARGISNULL(2) ? 0 : PG_GETARG_OID(2);
 #else
 	elog(ERROR, "injection points not supported");
 #endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 52f8603a7be..9036fef129b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -305,6 +305,7 @@ BlockSampler
 BlockSamplerData
 BlockedProcData
 BlockedProcsData
+BlocksReadStreamData
 BlocktableEntry
 BloomBuildState
 BloomFilter
-- 
2.43.0



  [text/x-patch] v5-0003-Fix-off-by-one-error-in-read-IO-tracing.patch (1.1K, 4-v5-0003-Fix-off-by-one-error-in-read-IO-tracing.patch)
  download | inline diff:
From 17e85c3deaf8b88145cf4a09763ae17f4f9bd274 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Mon, 16 Mar 2026 16:50:56 -0400
Subject: [PATCH v5 3/5] Fix off-by-one error in read IO tracing

---
 src/backend/storage/buffer/bufmgr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 00bc609529a..0723d4f3dd8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		 * must have started out as a miss in PinBufferForBlock(). The other
 		 * backend will track this as a 'read'.
 		 */
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
 										  operation->smgr->smgr_rlocator.locator.spcOid,
 										  operation->smgr->smgr_rlocator.locator.dbOid,
 										  operation->smgr->smgr_rlocator.locator.relNumber,
-- 
2.43.0



  [text/x-patch] v5-0004-Make-buffer-hit-helper.patch (6.0K, 5-v5-0004-Make-buffer-hit-helper.patch)
  download | inline diff:
From 875a678a953865e6596c779f468c6649d6006d59 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 23 Jan 2026 13:54:02 -0500
Subject: [PATCH v5 4/5] Make buffer hit helper

Already two places count buffer hits, requiring quite a few lines of
code since we do accounting in so many places. Future commits will add
more locations, so refactor into a helper.

Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
---
 src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++--------------
 1 file changed, 56 insertions(+), 55 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0723d4f3dd8..399004c2e44 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -648,6 +648,10 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 									  bool *foundPtr, IOContext io_context);
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
+
+static pg_attribute_always_inline void CountBufferHit(BufferAccessStrategy strategy,
+													  Relation rel, char persistence, SMgrRelation smgr,
+													  ForkNumber forknum, BlockNumber blocknum);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
@@ -1226,8 +1230,6 @@ PinBufferForBlock(Relation rel,
 				  bool *foundPtr)
 {
 	BufferDesc *bufHdr;
-	IOContext	io_context;
-	IOObject	io_object;
 
 	Assert(blockNum != P_NEW);
 
@@ -1236,17 +1238,6 @@ PinBufferForBlock(Relation rel,
 			persistence == RELPERSISTENCE_PERMANENT ||
 			persistence == RELPERSISTENCE_UNLOGGED));
 
-	if (persistence == RELPERSISTENCE_TEMP)
-	{
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-	}
-	else
-	{
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-	}
-
 	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
 									   smgr->smgr_rlocator.locator.spcOid,
 									   smgr->smgr_rlocator.locator.dbOid,
@@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
 									   smgr->smgr_rlocator.backend);
 
 	if (persistence == RELPERSISTENCE_TEMP)
-	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
-		if (*foundPtr)
-			pgBufferUsage.local_blks_hit++;
-	}
 	else
-	{
 		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
-							 strategy, foundPtr, io_context);
-		if (*foundPtr)
-			pgBufferUsage.shared_blks_hit++;
-	}
+							 strategy, foundPtr, IOContextForStrategy(strategy));
+
 	if (rel)
 	{
 		/*
@@ -1274,22 +1258,10 @@ PinBufferForBlock(Relation rel,
 		 * zeroed instead), the per-relation stats always count them.
 		 */
 		pgstat_count_buffer_read(rel);
-		if (*foundPtr)
-			pgstat_count_buffer_hit(rel);
 	}
-	if (*foundPtr)
-	{
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  true);
-	}
+	if (*foundPtr)
+		CountBufferHit(strategy, rel, persistence, smgr, forkNum, blockNum);
 
 	return BufferDescriptorGetBuffer(bufHdr);
 }
@@ -1695,6 +1667,51 @@ ReadBuffersCanStartIO(Buffer buffer, bool nowait)
 	return ReadBuffersCanStartIOOnce(buffer, nowait);
 }
 
+/*
+ * We track various stats related to buffer hits. Because this is done in a
+ * few separate places, this helper exists for convenience.
+ */
+static pg_attribute_always_inline void
+CountBufferHit(BufferAccessStrategy strategy,
+			   Relation rel, char persistence, SMgrRelation smgr,
+			   ForkNumber forknum, BlockNumber blocknum)
+{
+	IOContext	io_context;
+	IOObject	io_object;
+
+	if (persistence == RELPERSISTENCE_TEMP)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum,
+									  blocknum,
+									  smgr->smgr_rlocator.locator.spcOid,
+									  smgr->smgr_rlocator.locator.dbOid,
+									  smgr->smgr_rlocator.locator.relNumber,
+									  smgr->smgr_rlocator.backend,
+									  true);
+
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_hit += 1;
+	else
+		pgBufferUsage.shared_blks_hit += 1;
+
+	if (rel)
+		pgstat_count_buffer_hit(rel);
+
+	pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
+
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageHit;
+}
+
 /*
  * Helper for WaitReadBuffers() that processes the results of a readv
  * operation, raising an error if necessary.
@@ -1990,25 +2007,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		 * must have started out as a miss in PinBufferForBlock(). The other
 		 * backend will track this as a 'read'.
 		 */
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
-										  operation->smgr->smgr_rlocator.locator.spcOid,
-										  operation->smgr->smgr_rlocator.locator.dbOid,
-										  operation->smgr->smgr_rlocator.locator.relNumber,
-										  operation->smgr->smgr_rlocator.backend,
-										  true);
-
-		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_hit += 1;
-		else
-			pgBufferUsage.shared_blks_hit += 1;
-
-		if (operation->rel)
-			pgstat_count_buffer_hit(operation->rel);
-
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
-
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		CountBufferHit(operation->strategy, operation->rel, persistence,
+					   operation->smgr, forknum,
+					   blocknum + operation->nblocks_done - 1);
 	}
 	else
 	{
-- 
2.43.0



  [text/x-patch] v5-0005-Don-t-wait-for-already-in-progress-IO.patch (21.8K, 6-v5-0005-Don-t-wait-for-already-in-progress-IO.patch)
  download | inline diff:
From 4d737fa14f333abc4ee6ade8cb0340530695e887 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 23 Jan 2026 14:00:31 -0500
Subject: [PATCH v5 5/5] Don't wait for already in-progress IO
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When a backend attempts to start a read on a buffer and finds that I/O
is already in progress, it previously waited for that I/O to complete
before initiating reads for any other buffers. Although the backend must
still wait for the I/O to finish when later acquiring the buffer, it
should not need to wait at read start time. Other buffers may be
available for I/O, and in some workloads this waiting significantly
reduces concurrency.

For example, index scans may repeatedly request the same heap block. If
the backend waits each time it encounters an in-progress read, the
access pattern effectively degenerates into synchronous I/O. By
introducing the concept of foreign I/O operations, a backend can record
the buffer’s wait reference and defer waiting until WaitReadBuffers()
when it actually acquires the buffer.

In rare cases, a backend may still need to wait when starting a read if
it encounters a buffer after another backend has set BM_IO_IN_PROGRESS
but before the buffer descriptor’s wait reference has been set. Such
windows should be brief and uncommon.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
---
 src/backend/storage/buffer/bufmgr.c | 491 ++++++++++++++++++----------
 src/include/storage/bufmgr.h        |   2 +
 src/tools/pgindent/typedefs.list    |   1 +
 3 files changed, 330 insertions(+), 164 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 399004c2e44..20c36ccead0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -185,6 +185,21 @@ typedef struct SMgrSortArray
 	SMgrRelation srel;
 } SMgrSortArray;
 
+
+/*
+ * In AsyncReadBuffers(), when preparing a buffer for reading and setting
+ * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may
+ * already contain the desired block. AsyncReadBuffers() must distinguish
+ * between these cases (and the case where it should initiate I/O) so it can
+ * mark an in-progress buffer as foreign I/O rather than waiting on it.
+ */
+typedef enum PrepareReadBuffer_Status
+{
+	READ_BUFFER_ALREADY_DONE,
+	READ_BUFFER_IN_PROGRESS,
+	READ_BUFFER_READY_FOR_IO,
+} PrepareReadBuffer_Status;
+
 /* GUC variables */
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
@@ -1628,45 +1643,6 @@ CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete)
 #endif
 }
 
-/* helper for ReadBuffersCanStartIO(), to avoid repetition */
-static inline bool
-ReadBuffersCanStartIOOnce(Buffer buffer, bool nowait)
-{
-	if (BufferIsLocal(buffer))
-		return StartLocalBufferIO(GetLocalBufferDescriptor(-buffer - 1),
-								  true, nowait);
-	else
-		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
-}
-
-/*
- * Helper for AsyncReadBuffers that tries to get the buffer ready for IO.
- */
-static inline bool
-ReadBuffersCanStartIO(Buffer buffer, bool nowait)
-{
-	/*
-	 * If this backend currently has staged IO, we need to submit the pending
-	 * IO before waiting for the right to issue IO, to avoid the potential for
-	 * deadlocks (and, more commonly, unnecessary delays for other backends).
-	 */
-	if (!nowait && pgaio_have_staged())
-	{
-		if (ReadBuffersCanStartIOOnce(buffer, true))
-			return true;
-
-		/*
-		 * Unfortunately StartBufferIO() returning false doesn't allow to
-		 * distinguish between the buffer already being valid and IO already
-		 * being in progress. Since IO already being in progress is quite
-		 * rare, this approach seems fine.
-		 */
-		pgaio_submit_staged();
-	}
-
-	return ReadBuffersCanStartIOOnce(buffer, nowait);
-}
-
 /*
  * We track various stats related to buffer hits. Because this is done in a
  * few separate places, this helper exists for convenience.
@@ -1815,8 +1791,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			 * b) reports some time as waiting, even if we never waited
 			 *
 			 * we first check if we already know the IO is complete.
+			 *
+			 * Note that operation->io_return is uninitialized for foreign IO,
+			 * so we cannot count that wait time.
 			 */
-			if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
+			if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
 				!pgaio_wref_check_done(&operation->io_wref))
 			{
 				instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
@@ -1835,11 +1814,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				Assert(pgaio_wref_check_done(&operation->io_wref));
 			}
 
-			/*
-			 * We now are sure the IO completed. Check the results. This
-			 * includes reporting on errors if there were any.
-			 */
-			ProcessReadBuffersResult(operation);
+			if (unlikely(operation->foreign_io))
+			{
+				Buffer		buffer = operation->buffers[operation->nblocks_done];
+				BufferDesc *desc = BufferIsLocal(buffer) ?
+					GetLocalBufferDescriptor(-buffer - 1) :
+					GetBufferDescriptor(buffer - 1);
+				uint32		buf_state = pg_atomic_read_u64(&desc->state);
+
+				if (buf_state & BM_VALID)
+				{
+					operation->nblocks_done += 1;
+					Assert(operation->nblocks_done <= operation->nblocks);
+
+					CountBufferHit(operation->strategy,
+								   operation->rel, operation->persistence,
+								   operation->smgr, operation->forknum,
+								   operation->blocknum + operation->nblocks_done - 1);
+				}
+			}
+			else
+			{
+				/*
+				 * We now are sure the IO completed. Check the results. This
+				 * includes reporting on errors if there were any.
+				 */
+				ProcessReadBuffersResult(operation);
+			}
 		}
 
 		/*
@@ -1870,6 +1871,163 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	/* NB: READ_DONE tracepoint was already executed in completion callback */
 }
 
+/*
+ * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
+ * avoid an external function call.
+ */
+static PrepareReadBuffer_Status
+PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
+							Buffer buffer)
+{
+	BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
+	uint64		buf_state = pg_atomic_read_u64(&desc->state);
+
+	/* Already valid, no work to do */
+	if (buf_state & BM_VALID)
+	{
+		pgaio_wref_clear(&operation->io_wref);
+		return READ_BUFFER_ALREADY_DONE;
+	}
+
+	pgaio_submit_staged();
+
+	if (pgaio_wref_valid(&desc->io_wref))
+	{
+		operation->io_wref = desc->io_wref;
+		operation->foreign_io = true;
+		return READ_BUFFER_IN_PROGRESS;
+	}
+
+	/*
+	 * While it is possible for a buffer to have been prepared for IO but not
+	 * yet had its wait reference set, there's no way for us to know that for
+	 * temporary buffers. Thus, we'll prepare for own IO on this buffer.
+	 */
+	return READ_BUFFER_READY_FOR_IO;
+}
+
+/*
+ * Try to start IO on the first buffer in a new run of blocks. If AIO is in
+ * progress, be it in this backend or another backend, we just associate the
+ * wait reference with the operation and wait in WaitReadBuffers(). This turns
+ * out to be important for performance in two workloads:
+ *
+ * 1) A read stream that has to read the same block multiple times within the
+ *    readahead distance. This can happen e.g. for the table accesses of an
+ *    index scan.
+ *
+ * 2) Concurrent scans by multiple backends on the same relation.
+ *
+ * If we were to synchronously wait for the in-progress IO, we'd not be able
+ * to keep enough I/O in flight.
+ *
+ * If we do find there is ongoing I/O for the buffer, we set up a 1-block
+ * ReadBuffersOperation that WaitReadBuffers then can wait on.
+ *
+ * It's possible that another backend has started IO on the buffer but not yet
+ * set its wait reference. In this case, we have no choice but to wait for
+ * either the wait reference to be valid or the IO to be done.
+ */
+static PrepareReadBuffer_Status
+PrepareNewReadBufferIO(ReadBuffersOperation *operation,
+					   Buffer buffer)
+{
+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+		return PrepareNewLocalReadBufferIO(operation, buffer);
+
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	desc = GetBufferDescriptor(buffer - 1);
+
+	for (;;)
+	{
+		buf_state = LockBufHdr(desc);
+
+		/* Already valid, no work to do */
+		if (buf_state & BM_VALID)
+		{
+			UnlockBufHdr(desc);
+			pgaio_wref_clear(&operation->io_wref);
+			return READ_BUFFER_ALREADY_DONE;
+		}
+
+		if (buf_state & BM_IO_IN_PROGRESS)
+		{
+			/* Join existing read */
+			if (pgaio_wref_valid(&desc->io_wref))
+			{
+				operation->io_wref = desc->io_wref;
+				operation->foreign_io = true;
+				UnlockBufHdr(desc);
+				return READ_BUFFER_IN_PROGRESS;
+			}
+
+			/*
+			 * If the wait ref is not valid but the IO is in progress, someone
+			 * else started IO but hasn't set the wait ref yet. We have no
+			 * choice but to wait until the IO completes.
+			 */
+			UnlockBufHdr(desc);
+			pgaio_submit_staged();
+			WaitIO(desc);
+			continue;
+		}
+
+		/*
+		 * No IO in progress and not already valid; We will start IO. It's
+		 * possible that the IO was in progress and never became valid because
+		 * the IO errored out. We'll do the IO ourselves.
+		 */
+		UnlockBufHdrExt(desc, buf_state, BM_IO_IN_PROGRESS, 0, 0);
+		ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+									  BufferDescriptorGetBuffer(desc));
+
+		return READ_BUFFER_READY_FOR_IO;
+	}
+}
+
+
+/*
+ * When building a new IO from multiple buffers, we won't include buffers
+ * that are already valid or already in progress. This function should only be
+ * used for additional adjacent buffers following the head buffer in a new IO.
+ *
+ * Returns true if the buffer was successfully prepared for IO and false if it
+ * is rejected and the read IO should not include this buffer.
+ */
+static bool
+PrepareAdditionalReadBuffer(Buffer buffer)
+{
+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+	{
+		desc = GetLocalBufferDescriptor(-buffer - 1);
+		buf_state = pg_atomic_read_u64(&desc->state);
+		/* Local buffers don't use BM_IO_IN_PROGRESS */
+		if (buf_state & BM_VALID || pgaio_wref_valid(&desc->io_wref))
+			return false;
+	}
+	else
+	{
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		desc = GetBufferDescriptor(buffer - 1);
+		buf_state = LockBufHdr(desc);
+		if (buf_state & (BM_VALID | BM_IO_IN_PROGRESS))
+		{
+			UnlockBufHdr(desc);
+			return false;
+		}
+		UnlockBufHdrExt(desc, buf_state, BM_IO_IN_PROGRESS, 0, 0);
+		ResourceOwnerRememberBufferIO(CurrentResourceOwner, buffer);
+	}
+
+	return true;
+}
+
 /*
  * Initiate IO for the ReadBuffersOperation
  *
@@ -1885,7 +2043,8 @@ WaitReadBuffers(ReadBuffersOperation *operation)
  * affected by the call. If the first buffer is valid, *nblocks_progress is
  * set to 1 and operation->nblocks_done is incremented.
  *
- * Returns true if IO was initiated, false if no IO was necessary.
+ * Returns true if IO was initiated or is already in progress (foreign IO),
+ * false if the buffer was already valid.
  */
 static bool
 AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
@@ -1903,7 +2062,75 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	void	   *io_pages[MAX_IO_COMBINE_LIMIT];
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		did_start_io;
+	instr_time	io_start;
+	PrepareReadBuffer_Status status;
+
+	/*
+	 * We must get an IO handle before PrepareNewReadBufferIO(), as
+	 * pgaio_io_acquire() might block, which we don't want after setting
+	 * IO_IN_PROGRESS. If we don't need to do the IO, we'll release the
+	 * handle.
+	 *
+	 * If we need to wait for IO before we can get a handle, submit
+	 * already-staged IO first, so that other backends don't need to wait.
+	 * There wouldn't be a deadlock risk, as pgaio_io_acquire() just needs to
+	 * wait for already submitted IO, which doesn't require additional locks,
+	 * but it could still cause undesirable waits.
+	 *
+	 * A secondary benefit is that this would allow us to measure the time in
+	 * pgaio_io_acquire() without causing undue timer overhead in the common,
+	 * non-blocking, case.  However, currently the pgstats infrastructure
+	 * doesn't really allow that, as it a) asserts that an operation can't
+	 * have time without operations b) doesn't have an API to report
+	 * "accumulated" time.
+	 */
+	ioh = pgaio_io_acquire_nb(CurrentResourceOwner, &operation->io_return);
+	if (unlikely(!ioh))
+	{
+		pgaio_submit_staged();
+		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
+	}
+
+	operation->foreign_io = false;
+
+	/* Check if we can start IO on the first to-be-read buffer */
+	if ((status = PrepareNewReadBufferIO(operation, buffers[nblocks_done])) <
+		READ_BUFFER_READY_FOR_IO)
+	{
+		pgaio_io_release(ioh);
+		*nblocks_progress = 1;
+		if (status == READ_BUFFER_ALREADY_DONE)
+		{
+			/*
+			 * Someone else has already completed this block, we're done.
+			 *
+			 * When IO is necessary, ->nblocks_done is updated in
+			 * ProcessReadBuffersResult(), but that is not called if no IO is
+			 * necessary. Thus update here.
+			 */
+			operation->nblocks_done += 1;
+			Assert(operation->nblocks_done <= operation->nblocks);
+
+			/*
+			 * Report and track this as a 'hit' for this backend, even though
+			 * it must have started out as a miss in PinBufferForBlock(). The
+			 * other backend will track this as a 'read'.
+			 */
+			CountBufferHit(operation->strategy,
+						   operation->rel, operation->persistence,
+						   operation->smgr, operation->forknum,
+						   operation->blocknum + operation->nblocks_done - 1);
+			return false;
+		}
+
+		/* The IO is already in-progress */
+		Assert(status == READ_BUFFER_IN_PROGRESS);
+		CheckReadBuffersOperation(operation, false);
+		return true;
+	}
+
+	/* We can read in at least the head buffer */
+	Assert(status == READ_BUFFER_READY_FOR_IO);
 
 	/*
 	 * When this IO is executed synchronously, either because the caller will
@@ -1954,138 +2181,74 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	 */
 	pgstat_prepare_report_checksum_failure(operation->smgr->smgr_rlocator.locator.dbOid);
 
-	/*
-	 * Get IO handle before ReadBuffersCanStartIO(), as pgaio_io_acquire()
-	 * might block, which we don't want after setting IO_IN_PROGRESS.
-	 *
-	 * If we need to wait for IO before we can get a handle, submit
-	 * already-staged IO first, so that other backends don't need to wait.
-	 * There wouldn't be a deadlock risk, as pgaio_io_acquire() just needs to
-	 * wait for already submitted IO, which doesn't require additional locks,
-	 * but it could still cause undesirable waits.
-	 *
-	 * A secondary benefit is that this would allow us to measure the time in
-	 * pgaio_io_acquire() without causing undue timer overhead in the common,
-	 * non-blocking, case.  However, currently the pgstats infrastructure
-	 * doesn't really allow that, as it a) asserts that an operation can't
-	 * have time without operations b) doesn't have an API to report
-	 * "accumulated" time.
-	 */
-	ioh = pgaio_io_acquire_nb(CurrentResourceOwner, &operation->io_return);
-	if (unlikely(!ioh))
-	{
-		pgaio_submit_staged();
-
-		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
-	}
+	Assert(io_buffers[0] == buffers[nblocks_done]);
+	io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
+	io_buffers_len = 1;
 
 	/*
-	 * Check if we can start IO on the first to-be-read buffer.
-	 *
-	 * If an I/O is already in progress in another backend, we want to wait
-	 * for the outcome: either done, or something went wrong and we will
-	 * retry.
+	 * How many neighboring-on-disk blocks can we scatter-read into other
+	 * buffers at the same time?  In this case we don't wait if we see an I/O
+	 * already in progress.  We already set BM_IO_IN_PROGRESS for the head
+	 * block, so we should get on with that I/O as soon as possible.
 	 */
-	if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
+	for (int i = nblocks_done + 1; i < operation->nblocks; i++)
 	{
-		/*
-		 * Someone else has already completed this block, we're done.
-		 *
-		 * When IO is necessary, ->nblocks_done is updated in
-		 * ProcessReadBuffersResult(), but that is not called if no IO is
-		 * necessary. Thus update here.
-		 */
-		operation->nblocks_done += 1;
-		*nblocks_progress = 1;
-
-		pgaio_io_release(ioh);
-		pgaio_wref_clear(&operation->io_wref);
-		did_start_io = false;
+		if (!PrepareAdditionalReadBuffer(buffers[i]))
+			break;
+		/* Must be consecutive block numbers. */
+		Assert(BufferGetBlockNumber(buffers[i - 1]) ==
+			   BufferGetBlockNumber(buffers[i]) - 1);
+		Assert(io_buffers[io_buffers_len] == buffers[i]);
 
-		/*
-		 * Report and track this as a 'hit' for this backend, even though it
-		 * must have started out as a miss in PinBufferForBlock(). The other
-		 * backend will track this as a 'read'.
-		 */
-		CountBufferHit(operation->strategy, operation->rel, persistence,
-					   operation->smgr, forknum,
-					   blocknum + operation->nblocks_done - 1);
+		io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
 	}
-	else
-	{
-		instr_time	io_start;
-
-		/* We found a buffer that we need to read in. */
-		Assert(io_buffers[0] == buffers[nblocks_done]);
-		io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
-		io_buffers_len = 1;
-
-		/*
-		 * How many neighboring-on-disk blocks can we scatter-read into other
-		 * buffers at the same time?  In this case we don't wait if we see an
-		 * I/O already in progress.  We already set BM_IO_IN_PROGRESS for the
-		 * head block, so we should get on with that I/O as soon as possible.
-		 */
-		for (int i = nblocks_done + 1; i < operation->nblocks; i++)
-		{
-			if (!ReadBuffersCanStartIO(buffers[i], true))
-				break;
-			/* Must be consecutive block numbers. */
-			Assert(BufferGetBlockNumber(buffers[i - 1]) ==
-				   BufferGetBlockNumber(buffers[i]) - 1);
-			Assert(io_buffers[io_buffers_len] == buffers[i]);
 
-			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
-		}
+	/* get a reference to wait for in WaitReadBuffers() */
+	pgaio_io_get_wref(ioh, &operation->io_wref);
 
-		/* get a reference to wait for in WaitReadBuffers() */
-		pgaio_io_get_wref(ioh, &operation->io_wref);
+	/* provide the list of buffers to the completion callbacks */
+	pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
 
-		/* provide the list of buffers to the completion callbacks */
-		pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
+	pgaio_io_register_callbacks(ioh,
+								persistence == RELPERSISTENCE_TEMP ?
+								PGAIO_HCB_LOCAL_BUFFER_READV :
+								PGAIO_HCB_SHARED_BUFFER_READV,
+								flags);
 
-		pgaio_io_register_callbacks(ioh,
-									persistence == RELPERSISTENCE_TEMP ?
-									PGAIO_HCB_LOCAL_BUFFER_READV :
-									PGAIO_HCB_SHARED_BUFFER_READV,
-									flags);
+	pgaio_io_set_flag(ioh, ioh_flags);
 
-		pgaio_io_set_flag(ioh, ioh_flags);
+	/* ---
+	 * Even though we're trying to issue IO asynchronously, track the time
+	 * in smgrstartreadv():
+	 * - if io_method == IOMETHOD_SYNC, we will always perform the IO
+	 *   immediately
+	 * - the io method might not support the IO (e.g. worker IO for a temp
+	 *   table)
+	 * ---
+	 */
+	io_start = pgstat_prepare_io_time(track_io_timing);
+	smgrstartreadv(ioh, operation->smgr, forknum,
+				   blocknum + nblocks_done,
+				   io_pages, io_buffers_len);
+	pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
+							io_start, 1, io_buffers_len * BLCKSZ);
 
-		/* ---
-		 * Even though we're trying to issue IO asynchronously, track the time
-		 * in smgrstartreadv():
-		 * - if io_method == IOMETHOD_SYNC, we will always perform the IO
-		 *   immediately
-		 * - the io method might not support the IO (e.g. worker IO for a temp
-		 *   table)
-		 * ---
-		 */
-		io_start = pgstat_prepare_io_time(track_io_timing);
-		smgrstartreadv(ioh, operation->smgr, forknum,
-					   blocknum + nblocks_done,
-					   io_pages, io_buffers_len);
-		pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
-								io_start, 1, io_buffers_len * BLCKSZ);
-
-		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_read += io_buffers_len;
-		else
-			pgBufferUsage.shared_blks_read += io_buffers_len;
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_read += io_buffers_len;
+	else
+		pgBufferUsage.shared_blks_read += io_buffers_len;
 
-		/*
-		 * Track vacuum cost when issuing IO, not after waiting for it.
-		 * Otherwise we could end up issuing a lot of IO in a short timespan,
-		 * despite a low cost limit.
-		 */
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	/*
+	 * Track vacuum cost when issuing IO, not after waiting for it. Otherwise
+	 * we could end up issuing a lot of IO in a short timespan, despite a low
+	 * cost limit.
+	 */
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
 
-		*nblocks_progress = io_buffers_len;
-		did_start_io = true;
-	}
+	*nblocks_progress = io_buffers_len;
 
-	return did_start_io;
+	return true;
 }
 
 /*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4017896f951..f85a9acc6ac 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -147,6 +147,8 @@ struct ReadBuffersOperation
 	int			flags;
 	int16		nblocks;
 	int16		nblocks_done;
+	/* true if waiting on another backend's IO */
+	bool		foreign_io;
 	PgAioWaitRef io_wref;
 	PgAioReturn io_return;
 };
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9036fef129b..92230994633 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2363,6 +2363,7 @@ PredicateLockData
 PredicateLockTargetType
 PrefetchBufferResult
 PrepParallelRestorePtrType
+PrepareReadBuffer_Status
 PrepareStmt
 PreparedStatement
 PresortedKeyData
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-17 17:26             ` Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Andres Freund @ 2026-03-17 17:26 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

> Attached v5 adds some comments to the tests, fixes a few nits in the
> actual code, and adds a commit to fix what I think is an existing
> off-by-one error in TRACE_POSTGRESQL_BUFFER_READ_DONE.


> Subject: [PATCH v5 3/5] Fix off-by-one error in read IO tracing
>
> ---
>  src/backend/storage/buffer/bufmgr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 00bc609529a..0723d4f3dd8 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
>  		 * must have started out as a miss in PinBufferForBlock(). The other
>  		 * backend will track this as a 'read'.
>  		 */
> -		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
> +		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
>  										  operation->smgr->smgr_rlocator.locator.spcOid,
>  										  operation->smgr->smgr_rlocator.locator.dbOid,
>  										  operation->smgr->smgr_rlocator.locator.relNumber,
> --

Ah, the issue is that we already incremented nblocks_done, right?  Maybe it'd
be easier to understand if we stashed blocknum + nblocks_done into a local
var, and use it in in both branches of if (!ReadBuffersCanStartIO())?

This probably needs to be backpatched...



> Subject: [PATCH v5 4/5] Make buffer hit helper
>
> Already two places count buffer hits, requiring quite a few lines of
> code since we do accounting in so many places. Future commits will add
> more locations, so refactor into a helper.
>
> Reviewed-by: Nazir Bilal Yavuz <[email protected]>
> Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
> ---
>  src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++--------------
>  1 file changed, 56 insertions(+), 55 deletions(-)
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 0723d4f3dd8..399004c2e44 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -648,6 +648,10 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
>  									  bool *foundPtr, IOContext io_context);
>  static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
>  static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
> +
> +static pg_attribute_always_inline void CountBufferHit(BufferAccessStrategy strategy,
> +													  Relation rel, char persistence, SMgrRelation smgr,
> +													  ForkNumber forknum, BlockNumber blocknum);
>  static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
>  static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
>  								IOObject io_object, IOContext io_context);
> @@ -1226,8 +1230,6 @@ PinBufferForBlock(Relation rel,
>  				  bool *foundPtr)
>  {
>  	BufferDesc *bufHdr;
> -	IOContext	io_context;
> -	IOObject	io_object;
>
>  	Assert(blockNum != P_NEW);
>
> @@ -1236,17 +1238,6 @@ PinBufferForBlock(Relation rel,
>  			persistence == RELPERSISTENCE_PERMANENT ||
>  			persistence == RELPERSISTENCE_UNLOGGED));
>
> -	if (persistence == RELPERSISTENCE_TEMP)
> -	{
> -		io_context = IOCONTEXT_NORMAL;
> -		io_object = IOOBJECT_TEMP_RELATION;
> -	}
> -	else
> -	{
> -		io_context = IOContextForStrategy(strategy);
> -		io_object = IOOBJECT_RELATION;
> -	}
> -

I'm mildly worried that this will lead to a bit worse code generation, the
compiler might have a harder time figuring out that io_context/io_object
doesn't change across multiple PinBufferForBlock calls. Although it already
might be unable to do so (we don't mark IOContextForStrategy as
pure [1]).

I kinda wonder if, for StartReadBuffersImpl(), we should go the opposite
direction, and explicitly look up IOContextForStrategy(strategy) *before* the
actual_nblocks loop to make sure the compiler doesn't inject external function
calls (which will in all likelihood require register spilling etc).

I don't think that necessarily has to conflict with the goal of this patch -
most of the the deduplicated stuff isn't io_context, so the helper will be
beneficial even if have to pull out the io_context/io_object determination to
the callsites.


>  	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
>  									   smgr->smgr_rlocator.locator.spcOid,
>  									   smgr->smgr_rlocator.locator.dbOid,
> @@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
>  									   smgr->smgr_rlocator.backend);
>
>  	if (persistence == RELPERSISTENCE_TEMP)
> -	{
>  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
> -		if (*foundPtr)
> -			pgBufferUsage.local_blks_hit++;
> -	}
>  	else
> -	{
>  		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
> -							 strategy, foundPtr, io_context);
> -		if (*foundPtr)
> -			pgBufferUsage.shared_blks_hit++;
> -	}
> +							 strategy, foundPtr, IOContextForStrategy(strategy));
> +
>  	if (rel)
>  	{
>  		/*

And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP
branch in CountBufferHit(), I suspect the compiler may not be able to optimize
it away.

At the very least I'd invert the call to CountBufferHit() and the
pgstat_count_buffer_read(), as the latter will probably prevent most
optimizations (due to the compiler not being able to prove that
(rel)->pgstat_info->counts.blocks_fetched is a different memory location as
*foundPtr).



> +/*
> + * We track various stats related to buffer hits. Because this is done in a
> + * few separate places, this helper exists for convenience.
> + */
> +static pg_attribute_always_inline void
> +CountBufferHit(BufferAccessStrategy strategy,
> +			   Relation rel, char persistence, SMgrRelation smgr,
> +			   ForkNumber forknum, BlockNumber blocknum)
> +{
> +	IOContext	io_context;
> +	IOObject	io_object;
> +
> +	if (persistence == RELPERSISTENCE_TEMP)
> +	{
> +		io_context = IOCONTEXT_NORMAL;
> +		io_object = IOOBJECT_TEMP_RELATION;
> +	}
> +	else
> +	{
> +		io_context = IOContextForStrategy(strategy);
> +		io_object = IOOBJECT_RELATION;
> +	}
> +
> +	TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum,
> +									  blocknum,
> +									  smgr->smgr_rlocator.locator.spcOid,
> +									  smgr->smgr_rlocator.locator.dbOid,
> +									  smgr->smgr_rlocator.locator.relNumber,
> +									  smgr->smgr_rlocator.backend,
> +									  true);
> +
> +	if (persistence == RELPERSISTENCE_TEMP)
> +		pgBufferUsage.local_blks_hit += 1;
> +	else
> +		pgBufferUsage.shared_blks_hit += 1;
> +
> +	if (rel)
> +		pgstat_count_buffer_hit(rel);
> +
> +	pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
> +
> +	if (VacuumCostActive)
> +		VacuumCostBalance += VacuumCostPageHit;
> +}

I don't think "Count*" is a great name for something that does tracepoints and
vacuum cost balance accounting, the latter actually changes behavior of the
program due to the sleeps it injects.

The first alternative I have is AccountForBufferHit(), not great, but still
seems a bit better.



> From 4d737fa14f333abc4ee6ade8cb0340530695e887 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Fri, 23 Jan 2026 14:00:31 -0500
> Subject: [PATCH v5 5/5] Don't wait for already in-progress IO
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> When a backend attempts to start a read on a buffer and finds that I/O
> is already in progress, it previously waited for that I/O to complete
> before initiating reads for any other buffers. Although the backend must
> still wait for the I/O to finish when later acquiring the buffer, it
> should not need to wait at read start time. Other buffers may be
> available for I/O, and in some workloads this waiting significantly
> reduces concurrency.
>
> For example, index scans may repeatedly request the same heap block. If
> the backend waits each time it encounters an in-progress read, the
> access pattern effectively degenerates into synchronous I/O. By
> introducing the concept of foreign I/O operations, a backend can record
> the buffer’s wait reference and defer waiting until WaitReadBuffers()
> when it actually acquires the buffer.
>
> In rare cases, a backend may still need to wait when starting a read if
> it encounters a buffer after another backend has set BM_IO_IN_PROGRESS
> but before the buffer descriptor’s wait reference has been set. Such
> windows should be brief and uncommon.
>
> Author: Melanie Plageman <[email protected]>
> Reviewed-by: Andres Freund <[email protected]>
> Reviewed-by: Nazir Bilal Yavuz <[email protected]>
> Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv

> +/*
> + * In AsyncReadBuffers(), when preparing a buffer for reading and setting
> + * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may
> + * already contain the desired block. AsyncReadBuffers() must distinguish
> + * between these cases (and the case where it should initiate I/O) so it can
> + * mark an in-progress buffer as foreign I/O rather than waiting on it.
> + */
> +typedef enum PrepareReadBuffer_Status
> +{
> +	READ_BUFFER_ALREADY_DONE,
> +	READ_BUFFER_IN_PROGRESS,
> +	READ_BUFFER_READY_FOR_IO,
> +} PrepareReadBuffer_Status;

I don't personally like mixing underscore and camel case naming within one
name.

I wonder if might be worth splitting this up in a refactoring and a
"behavioural change" commit. Might be too complicated.

Candidates for a split seem to be:
- Moving pgaio_io_acquire_nb() to earlier
- Introduce PrepareNewReadBufferIO/PrepareAdditionalReadBuffer without support
for READ_BUFFER_IN_PROGRESS
- introduce READ_BUFFER_IN_PROGRESS


>  /*
>   * We track various stats related to buffer hits. Because this is done in a
>   * few separate places, this helper exists for convenience.
> @@ -1815,8 +1791,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
>  			 * b) reports some time as waiting, even if we never waited
>  			 *
>  			 * we first check if we already know the IO is complete.
> +			 *
> +			 * Note that operation->io_return is uninitialized for foreign IO,
> +			 * so we cannot count that wait time.
>  			 */

I'm confused - your comment says we can't count wait time with a foreign IO,
but then oes on to count foreign IO time?  The lack of io_return just means we
can't do  the cheaper pre-check for PGAIO_RS_UNKNOWN, no?


> -			if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
> +			if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
>  				!pgaio_wref_check_done(&operation->io_wref))
>  			{
>  				instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
> @@ -1835,11 +1814,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
>  				Assert(pgaio_wref_check_done(&operation->io_wref));
>  			}
>
> -			/*
> -			 * We now are sure the IO completed. Check the results. This
> -			 * includes reporting on errors if there were any.
> -			 */
> -			ProcessReadBuffersResult(operation);
> +			if (unlikely(operation->foreign_io))
> +			{
> +				Buffer		buffer = operation->buffers[operation->nblocks_done];
> +				BufferDesc *desc = BufferIsLocal(buffer) ?
> +					GetLocalBufferDescriptor(-buffer - 1) :
> +					GetBufferDescriptor(buffer - 1);
> +				uint32		buf_state = pg_atomic_read_u64(&desc->state);
> +
> +				if (buf_state & BM_VALID)
> +				{
> +					operation->nblocks_done += 1;
> +					Assert(operation->nblocks_done <= operation->nblocks);
> +
> +					CountBufferHit(operation->strategy,
> +								   operation->rel, operation->persistence,
> +								   operation->smgr, operation->forknum,
> +								   operation->blocknum + operation->nblocks_done - 1);

Probably worth including a comment explaining why we count this as a hit. IIRC
earlier versions had such a comment.


> +/*
> + * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
> + * avoid an external function call.
> + */
> +static PrepareReadBuffer_Status
> +PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
> +							Buffer buffer)

Hm, seems the test in 0002 should be extended to cover the the temp table case.



> +{
> +	BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
> +	uint64		buf_state = pg_atomic_read_u64(&desc->state);
> +
> +	/* Already valid, no work to do */
> +	if (buf_state & BM_VALID)
> +	{
> +		pgaio_wref_clear(&operation->io_wref);
> +		return READ_BUFFER_ALREADY_DONE;
> +	}

Is this reachable for local buffers?


> +	pgaio_submit_staged();
> +
> +	if (pgaio_wref_valid(&desc->io_wref))
> +	{
> +		operation->io_wref = desc->io_wref;
> +		operation->foreign_io = true;
> +		return READ_BUFFER_IN_PROGRESS;
> +	}
> +
> +	/*
> +	 * While it is possible for a buffer to have been prepared for IO but not
> +	 * yet had its wait reference set, there's no way for us to know that for
> +	 * temporary buffers. Thus, we'll prepare for own IO on this buffer.
> +	 */
> +	return READ_BUFFER_READY_FOR_IO;

Is that actually possible? And would it be ok to just do start IO in that
case?


> +/*
> + * Try to start IO on the first buffer in a new run of blocks. If AIO is in
> + * progress, be it in this backend or another backend, we just associate the
> + * wait reference with the operation and wait in WaitReadBuffers(). This turns
> + * out to be important for performance in two workloads:
> + *
> + * 1) A read stream that has to read the same block multiple times within the
> + *    readahead distance. This can happen e.g. for the table accesses of an
> + *    index scan.
> + *
> + * 2) Concurrent scans by multiple backends on the same relation.
> + *
> + * If we were to synchronously wait for the in-progress IO, we'd not be able
> + * to keep enough I/O in flight.
> + *
> + * If we do find there is ongoing I/O for the buffer, we set up a 1-block
> + * ReadBuffersOperation that WaitReadBuffers then can wait on.
> + *
> + * It's possible that another backend has started IO on the buffer but not yet
> + * set its wait reference. In this case, we have no choice but to wait for
> + * either the wait reference to be valid or the IO to be done.
> + */
> +static PrepareReadBuffer_Status
> +PrepareNewReadBufferIO(ReadBuffersOperation *operation,
> +					   Buffer buffer)
> +{

I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" &
"Continue"? Or "First" & "Additional"?  Or ...


> +	uint64		buf_state;
> +	BufferDesc *desc;
> +
> +	if (BufferIsLocal(buffer))
> +		return PrepareNewLocalReadBufferIO(operation, buffer);
> +
> +	ResourceOwnerEnlarge(CurrentResourceOwner);
> +	desc = GetBufferDescriptor(buffer - 1);
> +
> +	for (;;)
> +	{
> +		buf_state = LockBufHdr(desc);

Perhaps worth adding an
   Assert(buf_state & BM_TAG_VALID)?


> +		/* Already valid, no work to do */
> +		if (buf_state & BM_VALID)
> +		{
> +			UnlockBufHdr(desc);
> +			pgaio_wref_clear(&operation->io_wref);

Perhaps we should clear &operation->io_wref once at the start? Because right
now it'll be cleared if BM_VALID and it'll be set if BM_IO_IN_PROGRESS &&
&desc->io_wref, but it won't be touched when in BM_IO_IN_PROGRESS without a
wref set.  It seems either we should just touch &operation->io_wref if
  BM_IO_IN_PROGRESS && pgaio_wref_valid(&desc->io_wref)
or we should reliably do it.



> +			return READ_BUFFER_ALREADY_DONE;
> +		}
> +
> +		if (buf_state & BM_IO_IN_PROGRESS)
> +		{
> +			/* Join existing read */
> +			if (pgaio_wref_valid(&desc->io_wref))
> +			{
> +				operation->io_wref = desc->io_wref;
> +				operation->foreign_io = true;
> +				UnlockBufHdr(desc);
> +				return READ_BUFFER_IN_PROGRESS;
> +			}
> +
> +			/*
> +			 * If the wait ref is not valid but the IO is in progress, someone
> +			 * else started IO but hasn't set the wait ref yet. We have no
> +			 * choice but to wait until the IO completes.
> +			 */
> +			UnlockBufHdr(desc);
> +			pgaio_submit_staged();
> +			WaitIO(desc);
> +			continue;

Before this commit there was an explanation for this submit:

-    /*
-     * If this backend currently has staged IO, we need to submit the pending
-     * IO before waiting for the right to issue IO, to avoid the potential for
-     * deadlocks (and, more commonly, unnecessary delays for other backends).
-     */

Seems that vanished?



> +/*
> + * When building a new IO from multiple buffers, we won't include buffers
> + * that are already valid or already in progress. This function should only be
> + * used for additional adjacent buffers following the head buffer in a new IO.
> + *
> + * Returns true if the buffer was successfully prepared for IO and false if it
> + * is rejected and the read IO should not include this buffer.
> + */
> +static bool
> +PrepareAdditionalReadBuffer(Buffer buffer)

I think it'd be good to mention that this may never wait for IO or such to
avoid deadlocks.



> +	/* Check if we can start IO on the first to-be-read buffer */
> +	if ((status = PrepareNewReadBufferIO(operation, buffers[nblocks_done])) <
> +		READ_BUFFER_READY_FOR_IO)
> +	{

I don't love this < bit. For one there's no mention in
PrepareReadBuffer_Status mentioning that the numerical order is important. Any
reason to not just test != READ_BUFFER_READY_FOR_IO?

The assignment inside the if also looks somewhat awkward. For while() loops
there's often not really a better way to write it, but here you could just as
well do the status assignment in a line before.


> +		pgaio_io_release(ioh);
> +		*nblocks_progress = 1;
> +		if (status == READ_BUFFER_ALREADY_DONE)
> +		{
> +			/*
> +			 * Someone else has already completed this block, we're done.
> +			 *
> +			 * When IO is necessary, ->nblocks_done is updated in
> +			 * ProcessReadBuffersResult(), but that is not called if no IO is
> +			 * necessary. Thus update here.
> +			 */
> +			operation->nblocks_done += 1;
> +			Assert(operation->nblocks_done <= operation->nblocks);
> +
> +			/*
> +			 * Report and track this as a 'hit' for this backend, even though
> +			 * it must have started out as a miss in PinBufferForBlock(). The
> +			 * other backend will track this as a 'read'.
> +			 */
> +			CountBufferHit(operation->strategy,
> +						   operation->rel, operation->persistence,
> +						   operation->smgr, operation->forknum,
> +						   operation->blocknum + operation->nblocks_done - 1);
> +			return false;
> +		}
> +
> +		/* The IO is already in-progress */
> +		Assert(status == READ_BUFFER_IN_PROGRESS);
> +		CheckReadBuffersOperation(operation, false);

I was about to suggest that there should be a CheckReadBuffersOperation() for
both returns here, but there already are CheckReadBuffersOperation after calls
to AsyncReadBuffers(), so I think this CheckReadBuffersOperation could just be
removed.



>  	/*
> -	 * Check if we can start IO on the first to-be-read buffer.
> -	 *
> -	 * If an I/O is already in progress in another backend, we want to wait
> -	 * for the outcome: either done, or something went wrong and we will
> -	 * retry.
> +	 * How many neighboring-on-disk blocks can we scatter-read into other
> +	 * buffers at the same time?  In this case we don't wait if we see an I/O
> +	 * already in progress.  We already set BM_IO_IN_PROGRESS for the head
> +	 * block, so we should get on with that I/O as soon as possible.
>  	 */
> -	if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
> +	for (int i = nblocks_done + 1; i < operation->nblocks; i++)
>  	{
> -		/*
> -		 * Someone else has already completed this block, we're done.
> -		 *
> -		 * When IO is necessary, ->nblocks_done is updated in
> -		 * ProcessReadBuffersResult(), but that is not called if no IO is
> -		 * necessary. Thus update here.
> -		 */
> -		operation->nblocks_done += 1;
> -		*nblocks_progress = 1;
> -
> -		pgaio_io_release(ioh);
> -		pgaio_wref_clear(&operation->io_wref);
> -		did_start_io = false;
> +		if (!PrepareAdditionalReadBuffer(buffers[i]))
> +			break;
> +		/* Must be consecutive block numbers. */
> +		Assert(BufferGetBlockNumber(buffers[i - 1]) ==
> +			   BufferGetBlockNumber(buffers[i]) - 1);

Seems this assert could stand to be before the PrepareAdditionalReadBuffer(),
as it better hold true before we try to BM_IO_IN_PROGRESS?

I realize this is old, but since you're whacking this around...



> diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
> index 4017896f951..f85a9acc6ac 100644
> --- a/src/include/storage/bufmgr.h
> +++ b/src/include/storage/bufmgr.h
> @@ -147,6 +147,8 @@ struct ReadBuffersOperation
>  	int			flags;
>  	int16		nblocks;
>  	int16		nblocks_done;
> +	/* true if waiting on another backend's IO */
> +	bool		foreign_io;
>  	PgAioWaitRef io_wref;
>  	PgAioReturn io_return;
>  };

This adds an alignment-padding hole between nblocks_done and io_wref.  Read
stream can allocate quite a few of these, so it's probably worth avoiding?

There's a padding hole between persistence and forknum, but that seems a bit
ugly to utilize. A bit less ugly if we swapped forknum and persistence.

Or we could make 'flags' a uint8/16 (flags should imo always be unsigned, and
there are just four flag bits right now).

But perhaps it's also not worth worrying about right now.


[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Attributes.html#index-pure

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2026-03-18 16:59               ` Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Melanie Plageman @ 2026-03-18 16:59 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Thanks for the review!

Everything you suggested that I don't elaborate on below, I've just
gone ahead and done in attached v6.

On Tue, Mar 17, 2026 at 1:26 PM Andres Freund <[email protected]> wrote:
>
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
> >                * must have started out as a miss in PinBufferForBlock(). The other
> >                * backend will track this as a 'read'.
> >                */
> > -             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
> > +             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
> >                                                                                 operation->smgr->smgr_rlocator.locator.spcOid,
> >                                                                                 operation->smgr->smgr_rlocator.locator.dbOid,
> >                                                                                 operation->smgr->smgr_rlocator.locator.relNumber,
> > --
>
> Ah, the issue is that we already incremented nblocks_done, right?  Maybe it'd
> be easier to understand if we stashed blocknum + nblocks_done into a local
> var, and use it in in both branches of if (!ReadBuffersCanStartIO())?
>
> This probably needs to be backpatched...

0003 in v6 does as you suggest. I'll backport it after a quick +1 here.

> > Subject: [PATCH v5 4/5] Make buffer hit helper
>
> > @@ -1236,17 +1238,6 @@ PinBufferForBlock(Relation rel,
> >                       persistence == RELPERSISTENCE_PERMANENT ||
> >                       persistence == RELPERSISTENCE_UNLOGGED));
> >
> > -     if (persistence == RELPERSISTENCE_TEMP)
> > -     {
> > -             io_context = IOCONTEXT_NORMAL;
> > -             io_object = IOOBJECT_TEMP_RELATION;
> > -     }
> > -     else
> > -     {
> > -             io_context = IOContextForStrategy(strategy);
> > -             io_object = IOOBJECT_RELATION;
> > -     }
> > -
>
> I'm mildly worried that this will lead to a bit worse code generation, the
> compiler might have a harder time figuring out that io_context/io_object
> doesn't change across multiple PinBufferForBlock calls. Although it already
> might be unable to do so (we don't mark IOContextForStrategy as
> pure [1]).
>
> I kinda wonder if, for StartReadBuffersImpl(), we should go the opposite
> direction, and explicitly look up IOContextForStrategy(strategy) *before* the
> actual_nblocks loop to make sure the compiler doesn't inject external function
> calls (which will in all likelihood require register spilling etc).

I added a separate patch to refactor the code to do this first (0004).

> > @@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
> >                                                                          smgr->smgr_rlocator.backend);
> >
> >       if (persistence == RELPERSISTENCE_TEMP)
>
> And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP
> branch in CountBufferHit(), I suspect the compiler may not be able to optimize
> it away.

And you think it is optimizing it away in PinBufferForBlock()?

> At the very least I'd invert the call to CountBufferHit() and the
> pgstat_count_buffer_read(), as the latter will probably prevent most
> optimizations (due to the compiler not being able to prove that
> (rel)->pgstat_info->counts.blocks_fetched is a different memory location as
> *foundPtr).

I did this. I did not check the compiled code before or after though.

> > +CountBufferHit(BufferAccessStrategy strategy,
> > +                        Relation rel, char persistence, SMgrRelation smgr,
> > +                        ForkNumber forknum, BlockNumber blocknum)
>
> I don't think "Count*" is a great name for something that does tracepoints and
> vacuum cost balance accounting, the latter actually changes behavior of the
> program due to the sleeps it injects.
>
> The first alternative I have is AccountForBufferHit(), not great, but still
> seems a bit better.

At some point, I had ProcessBufferHit(), but Bilal felt it implied the
function did more than counting. I've changed it now to
TrackBufferHit().

> > From 4d737fa14f333abc4ee6ade8cb0340530695e887 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <[email protected]>
> > Date: Fri, 23 Jan 2026 14:00:31 -0500
> > Subject: [PATCH v5 5/5] Don't wait for already in-progress IO
>
> I wonder if might be worth splitting this up in a refactoring and a
> "behavioural change" commit. Might be too complicated.
>
> Candidates for a split seem to be:
> - Moving pgaio_io_acquire_nb() to earlier
> - Introduce PrepareNewReadBufferIO/PrepareAdditionalReadBuffer without support
> for READ_BUFFER_IN_PROGRESS
> - introduce READ_BUFFER_IN_PROGRESS

I've done something like this in v6.

> > + * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
> > + * avoid an external function call.
> > + */
> > +static PrepareReadBuffer_Status
> > +PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
> > +                                                     Buffer buffer)
>
> Hm, seems the test in 0002 should be extended to cover the the temp table case.

I did this. However, I was a bit lazy in how many cases I added
because I used invalidate_rel_block(), which is pretty verbose (since
evict_rel() doesn't work yet for local buffers).

I don't think we'll be able to easily test READ_BUFFER_ALREADY_DONE
(though perhaps we aren't testing it for shared buffers either?).

> > +{
> > +     BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
> > +     uint64          buf_state = pg_atomic_read_u64(&desc->state);
> > +
> > +     /* Already valid, no work to do */
> > +     if (buf_state & BM_VALID)
> > +     {
> > +             pgaio_wref_clear(&operation->io_wref);
> > +             return READ_BUFFER_ALREADY_DONE;
> > +     }
>
> Is this reachable for local buffers?

Yes, I think this is reachable by local buffers that started the IO
already and then completed it when acquiring a new IO handle at the
top of AsyncReadBuffers().

> > +     if (pgaio_wref_valid(&desc->io_wref))
> > +     {
> > +             operation->io_wref = desc->io_wref;
> > +             operation->foreign_io = true;
> > +             return READ_BUFFER_IN_PROGRESS;
> > +     }
> > +
> > +     /*
> > +      * While it is possible for a buffer to have been prepared for IO but not
> > +      * yet had its wait reference set, there's no way for us to know that for
> > +      * temporary buffers. Thus, we'll prepare for own IO on this buffer.
> > +      */
> > +     return READ_BUFFER_READY_FOR_IO;
>
> Is that actually possible? And would it be ok to just do start IO in that
> case?

You're right, that's not possible for local buffers. For local
buffers, we "prepare for IO" by calling PrepareNewLocalReadBufferIO()
and then set the wait ref in a codepath initiated by calling
smgrstartreadv() as part of "staging" the IO. No one can observe that
buffer in between the call to PrepareNewLocalReadBufferIO() and
setting the wait reference. So, I've deleted the comment.

> > +static PrepareReadBuffer_Status
> > +PrepareNewReadBufferIO(ReadBuffersOperation *operation,
> > +                                        Buffer buffer)
> > +{
>
> I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" &
> "Continue"? Or "First" & "Additional"?  Or ...

I changed the names to PrepareHeadBufferReadIO() and
PrepareAdditionalBufferReadIO(). "Head" instead of "First" because
First felt like it implied the first buffer ever and head seems to
make it clear it is the first buffer of this new IO.

- Melanie


Attachments:

  [text/x-patch] v6-0001-aio-Refactor-tests-in-preparation-for-more-tests.patch (10.8K, 2-v6-0001-aio-Refactor-tests-in-preparation-for-more-tests.patch)
  download | inline diff:
From 53f010e7072fb5bda9a342c32fb6035da41c9c5c Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 9 Sep 2025 10:14:34 -0400
Subject: [PATCH v6 1/8] aio: Refactor tests in preparation for more tests

In a future commit more AIO related tests are due to be introduced. However
001_aio.pl already is fairly large.

This commit introduces a new TestAio package with helpers for writing AIO
related tests. Then it uses the new helpers to simplify the existing
001_aio.pl by iterating over all supported io_methods. This will be
particularly helpful because additional methods already have been submitted.

Additionally this commit splits out testing of initdb using a non-default
method into its own test. While that test is somewhat important, it's fairly
slow and doesn't break that often. For development velocity it's helpful for
001_aio.pl to be faster.

While particularly the latter could benefit from being its own commit, it
seems to introduce more back-and-forth than it's worth.

Author: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/
---
 src/test/modules/test_aio/meson.build     |   1 +
 src/test/modules/test_aio/t/001_aio.pl    | 141 +++++++---------------
 src/test/modules/test_aio/t/003_initdb.pl |  71 +++++++++++
 src/test/modules/test_aio/t/TestAio.pm    |  90 ++++++++++++++
 4 files changed, 204 insertions(+), 99 deletions(-)
 create mode 100644 src/test/modules/test_aio/t/003_initdb.pl
 create mode 100644 src/test/modules/test_aio/t/TestAio.pm

diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
index fefa25bc5ab..18a797f3a3b 100644
--- a/src/test/modules/test_aio/meson.build
+++ b/src/test/modules/test_aio/meson.build
@@ -32,6 +32,7 @@ tests += {
     'tests': [
       't/001_aio.pl',
       't/002_io_workers.pl',
+      't/003_initdb.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_aio/t/001_aio.pl b/src/test/modules/test_aio/t/001_aio.pl
index 5c634ec3ca9..e18b2a2b8ae 100644
--- a/src/test/modules/test_aio/t/001_aio.pl
+++ b/src/test/modules/test_aio/t/001_aio.pl
@@ -7,126 +7,56 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+use FindBin;
+use lib $FindBin::RealBin;
 
-###
-# Test io_method=worker
-###
-my $node_worker = create_node('worker');
-$node_worker->start();
-
-test_generic('worker', $node_worker);
-SKIP:
-{
-	skip 'Injection points not supported by this build', 1
-	  unless $ENV{enable_injection_points} eq 'yes';
-	test_inject_worker('worker', $node_worker);
-}
+use TestAio;
 
-$node_worker->stop();
+my @methods = TestAio::supported_io_methods();
+my %nodes;
 
 
 ###
-# Test io_method=io_uring
+# Create and configure one instance for each io_method
 ###
 
-if (have_io_uring())
+foreach my $method (@methods)
 {
-	my $node_uring = create_node('io_uring');
-	$node_uring->start();
-	test_generic('io_uring', $node_uring);
-	$node_uring->stop();
-}
-
-
-###
-# Test io_method=sync
-###
-
-my $node_sync = create_node('sync');
+	my $node = PostgreSQL::Test::Cluster->new($method);
 
-# just to have one test not use the default auto-tuning
+	$nodes{$method} = $node;
+	$node->init();
+	$node->append_conf('postgresql.conf', "io_method=$method");
+	TestAio::configure($node);
+}
 
-$node_sync->append_conf(
+# Just to have one test not use the default auto-tuning
+$nodes{'sync'}->append_conf(
 	'postgresql.conf', qq(
-io_max_concurrency=4
+ io_max_concurrency=4
 ));
 
-$node_sync->start();
-test_generic('sync', $node_sync);
-$node_sync->stop();
-
-done_testing();
-
 
 ###
-# Test Helpers
+# Execute the tests for each io_method
 ###
 
-sub create_node
+foreach my $method (@methods)
 {
-	local $Test::Builder::Level = $Test::Builder::Level + 1;
-
-	my $io_method = shift;
+	my $node = $nodes{$method};
 
-	my $node = PostgreSQL::Test::Cluster->new($io_method);
-
-	# Want to test initdb for each IO method, otherwise we could just reuse
-	# the cluster.
-	#
-	# Unfortunately Cluster::init() puts PG_TEST_INITDB_EXTRA_OPTS after the
-	# options specified by ->extra, if somebody puts -c io_method=xyz in
-	# PG_TEST_INITDB_EXTRA_OPTS it would break this test. Fix that up if we
-	# detect it.
-	local $ENV{PG_TEST_INITDB_EXTRA_OPTS} = $ENV{PG_TEST_INITDB_EXTRA_OPTS};
-	if (defined $ENV{PG_TEST_INITDB_EXTRA_OPTS}
-		&& $ENV{PG_TEST_INITDB_EXTRA_OPTS} =~ m/io_method=/)
-	{
-		$ENV{PG_TEST_INITDB_EXTRA_OPTS} .= " -c io_method=$io_method";
-	}
-
-	$node->init(extra => [ '-c', "io_method=$io_method" ]);
-
-	$node->append_conf(
-		'postgresql.conf', qq(
-shared_preload_libraries=test_aio
-log_min_messages = 'DEBUG3'
-log_statement=all
-log_error_verbosity=default
-restart_after_crash=false
-temp_buffers=100
-));
+	$node->start();
+	test_io_method($method, $node);
+	$node->stop();
+}
 
-	# Even though we used -c io_method=... above, if TEMP_CONFIG sets
-	# io_method, it'd override the setting persisted at initdb time. While
-	# using (and later verifying) the setting from initdb provides some
-	# verification of having used the io_method during initdb, it's probably
-	# not worth the complication of only appending if the variable is set in
-	# in TEMP_CONFIG.
-	$node->append_conf(
-		'postgresql.conf', qq(
-io_method=$io_method
-));
+done_testing();
 
-	ok(1, "$io_method: initdb");
 
-	return $node;
-}
+###
+# Test Helpers
+###
 
-sub have_io_uring
-{
-	# To detect if io_uring is supported, we look at the error message for
-	# assigning an invalid value to an enum GUC, which lists all the valid
-	# options. We need to use -C to deal with running as administrator on
-	# windows, the superuser check is omitted if -C is used.
-	my ($stdout, $stderr) =
-	  run_command [qw(postgres -C invalid -c io_method=invalid)];
-	die "can't determine supported io_method values"
-	  unless $stderr =~ m/Available values: ([^\.]+)\./;
-	my $methods = $1;
-	note "supported io_method values are: $methods";
-
-	return ($methods =~ m/io_uring/) ? 1 : 0;
-}
 
 sub psql_like
 {
@@ -1490,8 +1420,8 @@ SELECT read_rel_block_ll('tbl_cs_fail', 3, nblocks=>1, zero_on_error=>true);),
 }
 
 
-# Run all tests that are supported for all io_methods
-sub test_generic
+# Run all tests that for the specified node / io_method
+sub test_io_method
 {
 	my $io_method = shift;
 	my $node = shift;
@@ -1526,10 +1456,23 @@ CHECKPOINT;
 	test_ignore_checksum($io_method, $node);
 	test_checksum_createdb($io_method, $node);
 
+	# generic injection tests
   SKIP:
 	{
 		skip 'Injection points not supported by this build', 1
 		  unless $ENV{enable_injection_points} eq 'yes';
 		test_inject($io_method, $node);
 	}
+
+	# worker specific injection tests
+	if ($io_method eq 'worker')
+	{
+	  SKIP:
+		{
+			skip 'Injection points not supported by this build', 1
+			  unless $ENV{enable_injection_points} eq 'yes';
+
+			test_inject_worker($io_method, $node);
+		}
+	}
 }
diff --git a/src/test/modules/test_aio/t/003_initdb.pl b/src/test/modules/test_aio/t/003_initdb.pl
new file mode 100644
index 00000000000..c03ae58d00a
--- /dev/null
+++ b/src/test/modules/test_aio/t/003_initdb.pl
@@ -0,0 +1,71 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+#
+# Test initdb for each IO method. This is done separately from 001_aio.pl, as
+# it isn't fast. This way the more commonly failing / hacked-on 001_aio.pl can
+# be iterated on more quickly.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use TestAio;
+
+
+foreach my $method (TestAio::supported_io_methods())
+{
+	test_create_node($method);
+}
+
+done_testing();
+
+
+sub test_create_node
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $io_method = shift;
+
+	my $node = PostgreSQL::Test::Cluster->new($io_method);
+
+	# Want to test initdb for each IO method, otherwise we could just reuse
+	# the cluster.
+	#
+	# Unfortunately Cluster::init() puts PG_TEST_INITDB_EXTRA_OPTS after the
+	# options specified by ->extra, if somebody puts -c io_method=xyz in
+	# PG_TEST_INITDB_EXTRA_OPTS it would break this test. Fix that up if we
+	# detect it.
+	local $ENV{PG_TEST_INITDB_EXTRA_OPTS} = $ENV{PG_TEST_INITDB_EXTRA_OPTS};
+	if (defined $ENV{PG_TEST_INITDB_EXTRA_OPTS}
+		&& $ENV{PG_TEST_INITDB_EXTRA_OPTS} =~ m/io_method=/)
+	{
+		$ENV{PG_TEST_INITDB_EXTRA_OPTS} .= " -c io_method=$io_method";
+	}
+
+	$node->init(extra => [ '-c', "io_method=$io_method" ]);
+
+	TestAio::configure($node);
+
+	# Even though we used -c io_method=... above, if TEMP_CONFIG sets
+	# io_method, it'd override the setting persisted at initdb time. While
+	# using (and later verifying) the setting from initdb provides some
+	# verification of having used the io_method during initdb, it's probably
+	# not worth the complication of only appending if the variable is set in
+	# in TEMP_CONFIG.
+	$node->append_conf(
+		'postgresql.conf', qq(
+io_method=$io_method
+));
+
+	ok(1, "$io_method: initdb");
+
+	$node->start();
+	$node->stop();
+	ok(1, "$io_method: start & stop");
+
+	return $node;
+}
diff --git a/src/test/modules/test_aio/t/TestAio.pm b/src/test/modules/test_aio/t/TestAio.pm
new file mode 100644
index 00000000000..5bc80a9b130
--- /dev/null
+++ b/src/test/modules/test_aio/t/TestAio.pm
@@ -0,0 +1,90 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+=pod
+
+=head1 NAME
+
+TestAio - helpers for writing AIO related tests
+
+=cut
+
+package TestAio;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+
+=pod
+
+=head1 METHODS
+
+=over
+
+=item TestAio::supported_io_methods()
+
+Return an array of all the supported values for the io_method GUC
+
+=cut
+
+sub supported_io_methods()
+{
+	my @io_methods = ('worker');
+
+	push(@io_methods, "io_uring") if have_io_uring();
+
+	# Return sync last, as it will least commonly fail
+	push(@io_methods, "sync");
+
+	return @io_methods;
+}
+
+
+=item TestAio::configure()
+
+Prepare a cluster for AIO test
+
+=cut
+
+sub configure
+{
+	my $node = shift;
+
+	$node->append_conf(
+		'postgresql.conf', qq(
+shared_preload_libraries=test_aio
+log_min_messages = 'DEBUG3'
+log_statement=all
+log_error_verbosity=default
+restart_after_crash=false
+temp_buffers=100
+));
+
+}
+
+
+=pod
+
+=item TestAio::have_io_uring()
+
+Return if io_uring is supported
+
+=cut
+
+sub have_io_uring
+{
+	# To detect if io_uring is supported, we look at the error message for
+	# assigning an invalid value to an enum GUC, which lists all the valid
+	# options. We need to use -C to deal with running as administrator on
+	# windows, the superuser check is omitted if -C is used.
+	my ($stdout, $stderr) =
+	  run_command [qw(postgres -C invalid -c io_method=invalid)];
+	die "can't determine supported io_method values"
+	  unless $stderr =~ m/Available values: ([^\.]+)\./;
+	my $methods = $1;
+	note "supported io_method values are: $methods";
+
+	return ($methods =~ m/io_uring/) ? 1 : 0;
+}
+
+1;
-- 
2.43.0



  [text/x-patch] v6-0002-test_aio-Add-read_stream-test-infrastructure-test.patch (24.1K, 3-v6-0002-test_aio-Add-read_stream-test-infrastructure-test.patch)
  download | inline diff:
From d510b46530bd219c9767e42e02f093d3460babef Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Wed, 10 Sep 2025 14:00:02 -0400
Subject: [PATCH v6 2/8] test_aio: Add read_stream test infrastructure & tests

Author: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
---
 src/test/modules/test_aio/meson.build         |   1 +
 .../modules/test_aio/t/004_read_stream.pl     | 286 +++++++++++++++
 src/test/modules/test_aio/test_aio--1.0.sql   |  24 +-
 src/test/modules/test_aio/test_aio.c          | 346 +++++++++++++++---
 src/tools/pgindent/typedefs.list              |   1 +
 5 files changed, 607 insertions(+), 51 deletions(-)
 create mode 100644 src/test/modules/test_aio/t/004_read_stream.pl

diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
index 18a797f3a3b..909f81d96c1 100644
--- a/src/test/modules/test_aio/meson.build
+++ b/src/test/modules/test_aio/meson.build
@@ -33,6 +33,7 @@ tests += {
       't/001_aio.pl',
       't/002_io_workers.pl',
       't/003_initdb.pl',
+      't/004_read_stream.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_aio/t/004_read_stream.pl b/src/test/modules/test_aio/t/004_read_stream.pl
new file mode 100644
index 00000000000..17a68e35c1d
--- /dev/null
+++ b/src/test/modules/test_aio/t/004_read_stream.pl
@@ -0,0 +1,286 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use TestAio;
+
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+$node->init();
+
+$node->append_conf(
+	'postgresql.conf', qq(
+shared_preload_libraries=test_aio
+log_min_messages = 'DEBUG3'
+log_statement=all
+log_error_verbosity=default
+restart_after_crash=false
+temp_buffers=100
+max_connections=8
+io_method=worker
+));
+
+$node->start();
+test_setup($node);
+$node->stop();
+
+
+foreach my $method (TestAio::supported_io_methods())
+{
+	$node->adjust_conf('postgresql.conf', 'io_method', $method);
+	$node->start();
+	test_io_method($method, $node);
+	$node->stop();
+}
+
+done_testing();
+
+
+sub test_setup
+{
+	my $node = shift;
+
+	$node->safe_psql(
+		'postgres', qq(
+CREATE EXTENSION test_aio;
+
+CREATE TABLE largeish(k int not null) WITH (FILLFACTOR=10);
+INSERT INTO largeish(k) SELECT generate_series(1, 10000);
+));
+	ok(1, "setup");
+}
+
+
+sub test_repeated_blocks
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	my $psql = $node->background_psql('postgres', on_error_stop => 0);
+
+	# Preventing larger reads makes testing easier
+	$psql->query_safe(
+		qq/ SET io_combine_limit = 1; /);
+
+	# test miss of the same block twice in a row
+	$psql->query_safe(
+		qq/ SELECT evict_rel('largeish'); /);
+
+	# block 0 grows the distance enough that the stream will look ahead and try
+	# to start a pending read for block 2 (and later block 4) twice before
+	# returning any buffers.
+	$psql->query_safe(
+		qq/ SELECT * FROM read_stream_for_blocks('largeish',
+			ARRAY[0, 2, 2, 4, 4]); /);
+
+	ok(1, "$io_method: stream missing the same block repeatedly");
+
+	$psql->query_safe(
+		qq/ SELECT * FROM read_stream_for_blocks('largeish',
+			ARRAY[0, 2, 2, 4, 4]); /);
+	ok(1, "$io_method: stream hitting the same block repeatedly");
+
+	# test hit of the same block twice in a row
+	$psql->query_safe(
+		qq/ SELECT evict_rel('largeish'); /);
+	$psql->query_safe(
+		qq/ SELECT * FROM read_stream_for_blocks('largeish',
+			ARRAY[0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0]); /);
+	ok(1, "$io_method: stream accessing same block");
+
+	# Test repeated blocks with a temp table, using invalidate_rel_block()
+	# to evict individual local buffers.
+	$psql->query_safe(
+		qq/ CREATE TEMP TABLE largeish_temp(k int not null) WITH (FILLFACTOR=10);
+			INSERT INTO largeish_temp(k) SELECT generate_series(1, 200); /);
+
+	# Evict the specific blocks we'll request to force misses
+	$psql->query_safe(
+		qq/ SELECT invalidate_rel_block('largeish_temp', 0); /);
+	$psql->query_safe(
+		qq/ SELECT invalidate_rel_block('largeish_temp', 2); /);
+	$psql->query_safe(
+		qq/ SELECT invalidate_rel_block('largeish_temp', 4); /);
+
+	$psql->query_safe(
+		qq/ SELECT * FROM read_stream_for_blocks('largeish_temp',
+			ARRAY[0, 2, 2, 4, 4]); /);
+	ok(1, "$io_method: temp stream missing the same block repeatedly");
+
+	# Now the blocks are cached, so repeated access should be hits
+	$psql->query_safe(
+		qq/ SELECT * FROM read_stream_for_blocks('largeish_temp',
+			ARRAY[0, 2, 2, 4, 4]); /);
+	ok(1, "$io_method: temp stream hitting the same block repeatedly");
+
+	$psql->quit();
+}
+
+
+sub test_inject_foreign
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	my $psql_a = $node->background_psql('postgres', on_error_stop => 0);
+	my $psql_b = $node->background_psql('postgres', on_error_stop => 0);
+
+	my $pid_a = $psql_a->query_safe(qq/SELECT pg_backend_pid();/);
+
+
+	###
+	# Test read stream encountering buffers undergoing IO in another backend,
+	# with the other backend's reads succeeding.
+	###
+	$psql_a->query_safe(
+		qq/ SELECT evict_rel('largeish'); /);
+
+	$psql_b->query_safe(
+		qq/ SELECT inj_io_completion_wait(pid=>pg_backend_pid(),
+			relfilenode=>pg_relation_filenode('largeish')); /);
+
+	$psql_b->{stdin} .= qq/ SELECT read_rel_block_ll('largeish',
+		blockno=>5, nblocks=>1);\n/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until(
+		'postgres', qq/ SELECT wait_event FROM pg_stat_activity
+			WHERE wait_event = 'completion_wait'; /,
+		'completion_wait');
+
+	# Block 5 is undergoing IO in session b, so session a will move on to start
+	# a new IO for block 7.
+	$psql_a->{stdin} .= qq/ SELECT array_agg(blocknum) FROM
+		read_stream_for_blocks('largeish', ARRAY[0, 2, 5, 7]);\n/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres', qq/SELECT inj_io_completion_continue()/);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,5,7\}/);
+
+	ok(1, qq/$io_method: read stream encounters succeeding IO by another backend/);
+
+	###
+	# Test read stream encountering buffers undergoing IO in another backend,
+	# with the other backend's reads failing.
+	###
+	$psql_a->query_safe(
+		qq/ SELECT evict_rel('largeish'); /);
+
+	$psql_b->query_safe(
+		qq/ SELECT inj_io_completion_wait(pid=>pg_backend_pid(),
+			relfilenode=>pg_relation_filenode('largeish')); /);
+
+	$psql_b->query_safe(
+		qq/ SELECT inj_io_short_read_attach(-errno_from_string('EIO'),
+			pid=>pg_backend_pid(),
+			relfilenode=>pg_relation_filenode('largeish')); /);
+
+	$psql_b->{stdin} .= qq/ SELECT read_rel_block_ll('largeish',
+		blockno=>5, nblocks=>1);\n/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq/ SELECT wait_event FROM pg_stat_activity
+			WHERE wait_event = 'completion_wait'; /,
+		'completion_wait');
+
+	$psql_a->{stdin} .= qq/ SELECT array_agg(blocknum) FROM
+		read_stream_for_blocks('largeish', ARRAY[0, 2, 5, 7]);\n/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres', qq/SELECT inj_io_completion_continue()/);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,5,7\}/);
+
+	pump_until(
+		$psql_b->{run}, $psql_b->{timeout},
+		\$psql_b->{stderr}, qr/ERROR.*could not read blocks 5\.\.5/);
+	ok(1, "$io_method: injected error occurred");
+	$psql_b->{stderr} = '';
+	$psql_b->query_safe(qq/SELECT inj_io_short_read_detach();/);
+
+	ok(1,
+		qq/$io_method: read stream encounters failing IO by another backend/);
+
+
+	###
+	# Test read stream encountering two buffers that are undergoing the same
+	# IO, started by another backend.
+	###
+	$psql_a->query_safe(
+		qq/ SELECT evict_rel('largeish'); /);
+
+	$psql_b->query_safe(
+		qq/ SELECT inj_io_completion_wait(pid=>pg_backend_pid(),
+			relfilenode=>pg_relation_filenode('largeish')); /);
+
+	$psql_b->{stdin} .= qq/ SELECT read_rel_block_ll('largeish',
+		blockno=>2, nblocks=>3);\n/;
+	$psql_b->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq/ SELECT wait_event FROM pg_stat_activity
+			WHERE wait_event = 'completion_wait'; /,
+		'completion_wait');
+
+	# Blocks 2 and 4 are undergoing IO initiated by session a
+	$psql_a->{stdin} .= qq/ SELECT array_agg(blocknum) FROM
+		read_stream_for_blocks('largeish', ARRAY[0, 2, 4]);\n/;
+	$psql_a->{run}->pump_nb();
+
+	$node->poll_query_until('postgres',
+		qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid_a),
+		'AioIoCompletion');
+
+	$node->safe_psql('postgres',
+		qq/ SELECT inj_io_completion_continue() /);
+
+	pump_until(
+		$psql_a->{run}, $psql_a->{timeout},
+		\$psql_a->{stdout}, qr/\{0,2,4\}/);
+
+	ok(1, qq/$io_method: read stream encounters two buffer read in one IO/);
+
+	$psql_a->quit();
+	$psql_b->quit();
+}
+
+
+sub test_io_method
+{
+	my $io_method = shift;
+	my $node = shift;
+
+	is($node->safe_psql('postgres', 'SHOW io_method'),
+		$io_method, "$io_method: io_method set correctly");
+
+	test_repeated_blocks($io_method, $node);
+
+  SKIP:
+	{
+		skip 'Injection points not supported by this build', 1
+		  unless $ENV{enable_injection_points} eq 'yes';
+		test_inject_foreign($io_method, $node);
+	}
+}
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
index e495481c41e..1cc4734a746 100644
--- a/src/test/modules/test_aio/test_aio--1.0.sql
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -33,6 +33,10 @@ CREATE FUNCTION read_rel_block_ll(
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
+CREATE FUNCTION evict_rel(rel regclass)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
 CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
@@ -50,6 +54,14 @@ RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
 
+/*
+ * Read stream related functions
+ */
+CREATE FUNCTION read_stream_for_blocks(rel regclass, blocks int4[], OUT blockoff int4, OUT blocknum int4, OUT buf int4)
+RETURNS SETOF record STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
 
 /*
  * Handle related functions
@@ -91,8 +103,16 @@ AS 'MODULE_PATHNAME' LANGUAGE C;
 /*
  * Injection point related functions
  */
-CREATE FUNCTION inj_io_short_read_attach(result int)
-RETURNS pg_catalog.void STRICT
+CREATE FUNCTION inj_io_completion_wait(pid int DEFAULT NULL, relfilenode oid DEFAULT 0)
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_completion_continue()
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_attach(result int, pid int DEFAULT NULL, relfilenode oid DEFAULT 0)
+RETURNS pg_catalog.void
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
 CREATE FUNCTION inj_io_short_read_detach()
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
index b1aa8af9ec0..061f0c9f92a 100644
--- a/src/test/modules/test_aio/test_aio.c
+++ b/src/test/modules/test_aio/test_aio.c
@@ -19,17 +19,26 @@
 #include "postgres.h"
 
 #include "access/relation.h"
+#include "catalog/pg_type.h"
 #include "fmgr.h"
+#include "funcapi.h"
 #include "storage/aio.h"
 #include "storage/aio_internal.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/checksum.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/procnumber.h"
+#include "storage/read_stream.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/injection_point.h"
 #include "utils/rel.h"
+#include "utils/tuplestore.h"
+#include "utils/wait_event.h"
 
 
 PG_MODULE_MAGIC;
@@ -37,13 +46,30 @@ PG_MODULE_MAGIC;
 
 typedef struct InjIoErrorState
 {
+	ConditionVariable cv;
+
 	bool		enabled_short_read;
 	bool		enabled_reopen;
 
+	bool		enabled_completion_wait;
+	Oid			completion_wait_relfilenode;
+	pid_t		completion_wait_pid;
+	uint32		completion_wait_event;
+
 	bool		short_read_result_set;
+	Oid			short_read_relfilenode;
+	pid_t		short_read_pid;
 	int			short_read_result;
 } InjIoErrorState;
 
+typedef struct BlocksReadStreamData
+{
+	int			nblocks;
+	int			curblock;
+	uint32	   *blocks;
+} BlocksReadStreamData;
+
+
 static InjIoErrorState *inj_io_error_state;
 
 /* Shared memory init callbacks */
@@ -85,10 +111,13 @@ test_aio_shmem_startup(void)
 		inj_io_error_state->enabled_short_read = false;
 		inj_io_error_state->enabled_reopen = false;
 
+		ConditionVariableInit(&inj_io_error_state->cv);
+		inj_io_error_state->completion_wait_event = WaitEventInjectionPointNew("completion_wait");
+
 #ifdef USE_INJECTION_POINTS
 		InjectionPointAttach("aio-process-completion-before-shared",
 							 "test_aio",
-							 "inj_io_short_read",
+							 "inj_io_completion_hook",
 							 NULL,
 							 0);
 		InjectionPointLoad("aio-process-completion-before-shared");
@@ -384,7 +413,7 @@ read_rel_block_ll(PG_FUNCTION_ARGS)
 	if (nblocks <= 0 || nblocks > PG_IOV_MAX)
 		elog(ERROR, "nblocks is out of range");
 
-	rel = relation_open(relid, AccessExclusiveLock);
+	rel = relation_open(relid, AccessShareLock);
 
 	for (int i = 0; i < nblocks; i++)
 	{
@@ -458,6 +487,27 @@ read_rel_block_ll(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+PG_FUNCTION_INFO_V1(evict_rel);
+Datum
+evict_rel(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	int32		buffers_evicted,
+				buffers_flushed,
+				buffers_skipped;
+
+	rel = relation_open(relid, AccessExclusiveLock);
+
+	EvictRelUnpinnedBuffers(rel, &buffers_evicted, &buffers_flushed,
+							&buffers_skipped);
+
+	relation_close(rel, AccessExclusiveLock);
+
+
+	PG_RETURN_VOID();
+}
+
 PG_FUNCTION_INFO_V1(invalidate_rel_block);
 Datum
 invalidate_rel_block(PG_FUNCTION_ARGS)
@@ -610,6 +660,86 @@ buffer_call_terminate_io(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+
+static BlockNumber
+read_stream_for_blocks_cb(ReadStream *stream,
+						  void *callback_private_data,
+						  void *per_buffer_data)
+{
+	BlocksReadStreamData *stream_data = callback_private_data;
+
+	if (stream_data->curblock >= stream_data->nblocks)
+		return InvalidBlockNumber;
+	return stream_data->blocks[stream_data->curblock++];
+}
+
+PG_FUNCTION_INFO_V1(read_stream_for_blocks);
+Datum
+read_stream_for_blocks(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	ArrayType  *blocksarray = PG_GETARG_ARRAYTYPE_P(1);
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Relation	rel;
+	BlocksReadStreamData stream_data;
+	ReadStream *stream;
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	/*
+	 * We expect the input to be an N-element int4 array; verify that. We
+	 * don't need to use deconstruct_array() since the array data is just
+	 * going to look like a C array of N int4 values.
+	 */
+	if (ARR_NDIM(blocksarray) != 1 ||
+		ARR_HASNULL(blocksarray) ||
+		ARR_ELEMTYPE(blocksarray) != INT4OID)
+		elog(ERROR, "expected 1 dimensional int4 array");
+
+	stream_data.curblock = 0;
+	stream_data.nblocks = ARR_DIMS(blocksarray)[0];
+	stream_data.blocks = (uint32 *) ARR_DATA_PTR(blocksarray);
+
+	rel = relation_open(relid, AccessShareLock);
+
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										NULL,
+										rel,
+										MAIN_FORKNUM,
+										read_stream_for_blocks_cb,
+										&stream_data,
+										0);
+
+	for (int i = 0; i < stream_data.nblocks; i++)
+	{
+		Buffer		buf = read_stream_next_buffer(stream, NULL);
+		Datum		values[3] = {0};
+		bool		nulls[3] = {0};
+
+		if (!BufferIsValid(buf))
+			elog(ERROR, "read_stream_next_buffer() call %d is unexpectedly invalid", i);
+
+		values[0] = Int32GetDatum(i);
+		values[1] = UInt32GetDatum(stream_data.blocks[i]);
+		values[2] = UInt32GetDatum(buf);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+
+		ReleaseBuffer(buf);
+	}
+
+	if (read_stream_next_buffer(stream, NULL) != InvalidBuffer)
+		elog(ERROR, "read_stream_next_buffer() call %d is unexpectedly valid",
+			 stream_data.nblocks + 1);
+
+	read_stream_end(stream);
+
+	relation_close(rel, NoLock);
+
+	return (Datum) 0;
+}
+
+
 PG_FUNCTION_INFO_V1(handle_get);
 Datum
 handle_get(PG_FUNCTION_ARGS)
@@ -680,15 +810,98 @@ batch_end(PG_FUNCTION_ARGS)
 }
 
 #ifdef USE_INJECTION_POINTS
-extern PGDLLEXPORT void inj_io_short_read(const char *name,
-										  const void *private_data,
-										  void *arg);
+extern PGDLLEXPORT void inj_io_completion_hook(const char *name,
+											   const void *private_data,
+											   void *arg);
 extern PGDLLEXPORT void inj_io_reopen(const char *name,
 									  const void *private_data,
 									  void *arg);
 
-void
-inj_io_short_read(const char *name, const void *private_data, void *arg)
+static bool
+inj_io_short_read_matches(PgAioHandle *ioh)
+{
+	PGPROC	   *owner_proc;
+	int32		owner_pid;
+	PgAioTargetData *td;
+
+	if (!inj_io_error_state->enabled_short_read)
+		return false;
+
+	if (!inj_io_error_state->short_read_result_set)
+		return false;
+
+	owner_proc = GetPGProcByNumber(pgaio_io_get_owner(ioh));
+	owner_pid = owner_proc->pid;
+
+	if (inj_io_error_state->short_read_pid != 0 &&
+		inj_io_error_state->short_read_pid != owner_pid)
+		return false;
+
+	td = pgaio_io_get_target_data(ioh);
+
+	if (inj_io_error_state->short_read_relfilenode != InvalidOid &&
+		td->smgr.rlocator.relNumber != inj_io_error_state->short_read_relfilenode)
+		return false;
+
+	/*
+	 * Only shorten reads that are actually longer than the target size,
+	 * otherwise we can trigger over-reads.
+	 */
+	if (inj_io_error_state->short_read_result >= ioh->result)
+		return false;
+
+	return true;
+}
+
+static bool
+inj_io_completion_wait_matches(PgAioHandle *ioh)
+{
+	PGPROC	   *owner_proc;
+	int32		owner_pid;
+	PgAioTargetData *td;
+
+	if (!inj_io_error_state->enabled_completion_wait)
+		return false;
+
+	owner_proc = GetPGProcByNumber(pgaio_io_get_owner(ioh));
+	owner_pid = owner_proc->pid;
+
+	if (inj_io_error_state->completion_wait_pid != owner_pid)
+		return false;
+
+	td = pgaio_io_get_target_data(ioh);
+
+	if (inj_io_error_state->completion_wait_relfilenode != InvalidOid &&
+		td->smgr.rlocator.relNumber != inj_io_error_state->completion_wait_relfilenode)
+		return false;
+
+	return true;
+}
+
+static void
+inj_io_completion_wait_hook(const char *name, const void *private_data, void *arg)
+{
+	PgAioHandle *ioh = (PgAioHandle *) arg;
+
+	if (!inj_io_completion_wait_matches(ioh))
+		return;
+
+	ConditionVariablePrepareToSleep(&inj_io_error_state->cv);
+
+	while (true)
+	{
+		if (!inj_io_completion_wait_matches(ioh))
+			break;
+
+		ConditionVariableSleep(&inj_io_error_state->cv,
+							   inj_io_error_state->completion_wait_event);
+	}
+
+	ConditionVariableCancelSleep();
+}
+
+static void
+inj_io_short_read_hook(const char *name, const void *private_data, void *arg)
 {
 	PgAioHandle *ioh = (PgAioHandle *) arg;
 
@@ -697,58 +910,56 @@ inj_io_short_read(const char *name, const void *private_data, void *arg)
 				   inj_io_error_state->enabled_reopen),
 			errhidestmt(true), errhidecontext(true));
 
-	if (inj_io_error_state->enabled_short_read)
+	if (inj_io_short_read_matches(ioh))
 	{
+		struct iovec *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+		int32		old_result = ioh->result;
+		int32		new_result = inj_io_error_state->short_read_result;
+		int32		processed = 0;
+
+		ereport(LOG,
+				errmsg("short read inject point, changing result from %d to %d",
+					   old_result, new_result),
+				errhidestmt(true), errhidecontext(true));
+
 		/*
-		 * Only shorten reads that are actually longer than the target size,
-		 * otherwise we can trigger over-reads.
+		 * The underlying IO actually completed OK, and thus the "invalid"
+		 * portion of the IOV actually contains valid data. That can hide a
+		 * lot of problems, e.g. if we were to wrongly mark a buffer, that
+		 * wasn't read according to the shortened-read, IO as valid, the
+		 * contents would look valid and we might miss a bug.
+		 *
+		 * To avoid that, iterate through the IOV and zero out the "failed"
+		 * portion of the IO.
 		 */
-		if (inj_io_error_state->short_read_result_set
-			&& ioh->op == PGAIO_OP_READV
-			&& inj_io_error_state->short_read_result <= ioh->result)
+		for (int i = 0; i < ioh->op_data.read.iov_length; i++)
 		{
-			struct iovec *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
-			int32		old_result = ioh->result;
-			int32		new_result = inj_io_error_state->short_read_result;
-			int32		processed = 0;
-
-			ereport(LOG,
-					errmsg("short read inject point, changing result from %d to %d",
-						   old_result, new_result),
-					errhidestmt(true), errhidecontext(true));
-
-			/*
-			 * The underlying IO actually completed OK, and thus the "invalid"
-			 * portion of the IOV actually contains valid data. That can hide
-			 * a lot of problems, e.g. if we were to wrongly mark a buffer,
-			 * that wasn't read according to the shortened-read, IO as valid,
-			 * the contents would look valid and we might miss a bug.
-			 *
-			 * To avoid that, iterate through the IOV and zero out the
-			 * "failed" portion of the IO.
-			 */
-			for (int i = 0; i < ioh->op_data.read.iov_length; i++)
+			if (processed + iov[i].iov_len <= new_result)
+				processed += iov[i].iov_len;
+			else if (processed <= new_result)
 			{
-				if (processed + iov[i].iov_len <= new_result)
-					processed += iov[i].iov_len;
-				else if (processed <= new_result)
-				{
-					uint32		ok_part = new_result - processed;
-
-					memset((char *) iov[i].iov_base + ok_part, 0, iov[i].iov_len - ok_part);
-					processed += iov[i].iov_len;
-				}
-				else
-				{
-					memset((char *) iov[i].iov_base, 0, iov[i].iov_len);
-				}
-			}
+				uint32		ok_part = new_result - processed;
 
-			ioh->result = new_result;
+				memset((char *) iov[i].iov_base + ok_part, 0, iov[i].iov_len - ok_part);
+				processed += iov[i].iov_len;
+			}
+			else
+			{
+				memset((char *) iov[i].iov_base, 0, iov[i].iov_len);
+			}
 		}
+
+		ioh->result = new_result;
 	}
 }
 
+void
+inj_io_completion_hook(const char *name, const void *private_data, void *arg)
+{
+	inj_io_completion_wait_hook(name, private_data, arg);
+	inj_io_short_read_hook(name, private_data, arg);
+}
+
 void
 inj_io_reopen(const char *name, const void *private_data, void *arg)
 {
@@ -762,6 +973,39 @@ inj_io_reopen(const char *name, const void *private_data, void *arg)
 }
 #endif
 
+PG_FUNCTION_INFO_V1(inj_io_completion_wait);
+Datum
+inj_io_completion_wait(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+	inj_io_error_state->enabled_completion_wait = true;
+	inj_io_error_state->completion_wait_pid =
+		PG_ARGISNULL(0) ? 0 : PG_GETARG_INT32(0);
+	inj_io_error_state->completion_wait_relfilenode =
+		PG_ARGISNULL(1) ? InvalidOid : PG_GETARG_OID(1);
+#else
+	elog(ERROR, "injection points not supported");
+#endif
+
+	PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_completion_continue);
+Datum
+inj_io_completion_continue(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+	inj_io_error_state->enabled_completion_wait = false;
+	inj_io_error_state->completion_wait_pid = 0;
+	inj_io_error_state->completion_wait_relfilenode = InvalidOid;
+	ConditionVariableBroadcast(&inj_io_error_state->cv);
+#else
+	elog(ERROR, "injection points not supported");
+#endif
+
+	PG_RETURN_VOID();
+}
+
 PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
 Datum
 inj_io_short_read_attach(PG_FUNCTION_ARGS)
@@ -771,6 +1015,10 @@ inj_io_short_read_attach(PG_FUNCTION_ARGS)
 	inj_io_error_state->short_read_result_set = !PG_ARGISNULL(0);
 	if (inj_io_error_state->short_read_result_set)
 		inj_io_error_state->short_read_result = PG_GETARG_INT32(0);
+	inj_io_error_state->short_read_pid =
+		PG_ARGISNULL(1) ? 0 : PG_GETARG_INT32(1);
+	inj_io_error_state->short_read_relfilenode =
+		PG_ARGISNULL(2) ? 0 : PG_GETARG_OID(2);
 #else
 	elog(ERROR, "injection points not supported");
 #endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 174e2798443..340662cf72c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -305,6 +305,7 @@ BlockSampler
 BlockSamplerData
 BlockedProcData
 BlockedProcsData
+BlocksReadStreamData
 BlocktableEntry
 BloomBuildState
 BloomFilter
-- 
2.43.0



  [text/x-patch] v6-0003-Fix-off-by-one-error-in-read-IO-tracing.patch (2.4K, 4-v6-0003-Fix-off-by-one-error-in-read-IO-tracing.patch)
  download | inline diff:
From 0606856c97cec2da29a70fa5fedfb0ec4bbed842 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Mon, 16 Mar 2026 16:50:56 -0400
Subject: [PATCH v6 3/8] Fix off-by-one error in read IO tracing

AsyncReadBuffer()'s no-IO needed path passed
TRACE_POSTGRESQL_BUFFER_READ_DONE the wrong block number because it had
already incremented operation->nblocks_done. Fix by folding the
nblocks_done offset into the blocknum local variable at initialization.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/u73un3xeljr4fiidzwi4ikcr6vm7oqugn4fo5vqpstjio6anl2%40hph6fvdiiria
---
 src/backend/storage/buffer/bufmgr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 00bc609529a..10afae1990b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1875,10 +1875,10 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 {
 	Buffer	   *buffers = &operation->buffers[0];
 	int			flags = operation->flags;
-	BlockNumber blocknum = operation->blocknum;
 	ForkNumber	forknum = operation->forknum;
 	char		persistence = operation->persistence;
 	int16		nblocks_done = operation->nblocks_done;
+	BlockNumber blocknum = operation->blocknum + nblocks_done;
 	Buffer	   *io_buffers = &operation->buffers[nblocks_done];
 	int			io_buffers_len = 0;
 	PgAioHandle *ioh;
@@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		 * must have started out as a miss in PinBufferForBlock(). The other
 		 * backend will track this as a 'read'.
 		 */
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum,
 										  operation->smgr->smgr_rlocator.locator.spcOid,
 										  operation->smgr->smgr_rlocator.locator.dbOid,
 										  operation->smgr->smgr_rlocator.locator.relNumber,
@@ -2062,7 +2062,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		 */
 		io_start = pgstat_prepare_io_time(track_io_timing);
 		smgrstartreadv(ioh, operation->smgr, forknum,
-					   blocknum + nblocks_done,
+					   blocknum,
 					   io_pages, io_buffers_len);
 		pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
 								io_start, 1, io_buffers_len * BLCKSZ);
-- 
2.43.0



  [text/x-patch] v6-0004-Pass-io_object-and-io_context-through-to-PinBuffe.patch (3.4K, 5-v6-0004-Pass-io_object-and-io_context-through-to-PinBuffe.patch)
  download | inline diff:
From fbfd1d9df11d81903e5810c5165ca9e234a6aa26 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 17 Mar 2026 15:49:52 -0400
Subject: [PATCH v6 4/8] Pass io_object and io_context through to
 PinBufferForBlock()

PinBufferForBlock() is always_inline and called in a loop in
StartReadBuffersImpl(). Previously it computed io_context and io_object
internally, which required calling IOContextForStrategy() -- a non-inline
function the compiler cannot prove is side-effect-free. This could
potential cause unneeded redundant function calls.

Compute io_context and io_object in the callers instead, allowing
StartReadBuffersImpl() to do so once before entering the loop.

Suggested-by: Andres Freund <[email protected]>
---
 src/backend/storage/buffer/bufmgr.c | 45 ++++++++++++++++++++---------
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 10afae1990b..ab9c2a4b904 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1223,11 +1223,11 @@ PinBufferForBlock(Relation rel,
 				  ForkNumber forkNum,
 				  BlockNumber blockNum,
 				  BufferAccessStrategy strategy,
+				  IOObject io_object,
+				  IOContext io_context,
 				  bool *foundPtr)
 {
 	BufferDesc *bufHdr;
-	IOContext	io_context;
-	IOObject	io_object;
 
 	Assert(blockNum != P_NEW);
 
@@ -1236,17 +1236,6 @@ PinBufferForBlock(Relation rel,
 			persistence == RELPERSISTENCE_PERMANENT ||
 			persistence == RELPERSISTENCE_UNLOGGED));
 
-	if (persistence == RELPERSISTENCE_TEMP)
-	{
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-	}
-	else
-	{
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-	}
-
 	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
 									   smgr->smgr_rlocator.locator.spcOid,
 									   smgr->smgr_rlocator.locator.dbOid,
@@ -1339,9 +1328,23 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 				 mode == RBM_ZERO_AND_LOCK))
 	{
 		bool		found;
+		IOContext	io_context;
+		IOObject	io_object;
+
+		if (persistence == RELPERSISTENCE_TEMP)
+		{
+			io_context = IOCONTEXT_NORMAL;
+			io_object = IOOBJECT_TEMP_RELATION;
+		}
+		else
+		{
+			io_context = IOContextForStrategy(strategy);
+			io_object = IOOBJECT_RELATION;
+		}
 
 		buffer = PinBufferForBlock(rel, smgr, persistence,
-								   forkNum, blockNum, strategy, &found);
+								   forkNum, blockNum, strategy,
+								   io_object, io_context, &found);
 		ZeroAndLockBuffer(buffer, mode, found);
 		return buffer;
 	}
@@ -1379,11 +1382,24 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
 	int			actual_nblocks = *nblocks;
 	int			maxcombine = 0;
 	bool		did_start_io;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	Assert(*nblocks == 1 || allow_forwarding);
 	Assert(*nblocks > 0);
 	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
 
+	if (operation->persistence == RELPERSISTENCE_TEMP)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
 	for (int i = 0; i < actual_nblocks; ++i)
 	{
 		bool		found;
@@ -1432,6 +1448,7 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
 										   operation->forknum,
 										   blockNum + i,
 										   operation->strategy,
+										   io_object, io_context,
 										   &found);
 		}
 
-- 
2.43.0



  [text/x-patch] v6-0005-Make-buffer-hit-helper.patch (5.0K, 6-v6-0005-Make-buffer-hit-helper.patch)
  download | inline diff:
From 9e57a5deff40b0e3272809a048fc6950646f8146 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 23 Jan 2026 13:54:02 -0500
Subject: [PATCH v6 5/8] Make buffer hit helper

Already two places count buffer hits, requiring quite a few lines of
code since we do accounting in so many places. Future commits will add
more locations, so refactor into a helper.

Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
---
 src/backend/storage/buffer/bufmgr.c | 84 ++++++++++++++---------------
 1 file changed, 42 insertions(+), 42 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ab9c2a4b904..fa85570a791 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -648,6 +648,11 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 									  bool *foundPtr, IOContext io_context);
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
+
+static pg_attribute_always_inline void TrackBufferHit(IOObject io_object,
+													  IOContext io_context,
+													  Relation rel, char persistence, SMgrRelation smgr,
+													  ForkNumber forknum, BlockNumber blocknum);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
@@ -1243,18 +1248,14 @@ PinBufferForBlock(Relation rel,
 									   smgr->smgr_rlocator.backend);
 
 	if (persistence == RELPERSISTENCE_TEMP)
-	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
-		if (*foundPtr)
-			pgBufferUsage.local_blks_hit++;
-	}
 	else
-	{
 		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
 							 strategy, foundPtr, io_context);
-		if (*foundPtr)
-			pgBufferUsage.shared_blks_hit++;
-	}
+
+	if (*foundPtr)
+		TrackBufferHit(io_object, io_context, rel, persistence, smgr, forkNum, blockNum);
+
 	if (rel)
 	{
 		/*
@@ -1263,21 +1264,6 @@ PinBufferForBlock(Relation rel,
 		 * zeroed instead), the per-relation stats always count them.
 		 */
 		pgstat_count_buffer_read(rel);
-		if (*foundPtr)
-			pgstat_count_buffer_hit(rel);
-	}
-	if (*foundPtr)
-	{
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
-
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  true);
 	}
 
 	return BufferDescriptorGetBuffer(bufHdr);
@@ -1712,6 +1698,37 @@ ReadBuffersCanStartIO(Buffer buffer, bool nowait)
 	return ReadBuffersCanStartIOOnce(buffer, nowait);
 }
 
+/*
+ * We track various stats related to buffer hits. Because this is done in a
+ * few separate places, this helper exists for convenience.
+ */
+static pg_attribute_always_inline void
+TrackBufferHit(IOObject io_object, IOContext io_context,
+			   Relation rel, char persistence, SMgrRelation smgr,
+			   ForkNumber forknum, BlockNumber blocknum)
+{
+	TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum,
+									  blocknum,
+									  smgr->smgr_rlocator.locator.spcOid,
+									  smgr->smgr_rlocator.locator.dbOid,
+									  smgr->smgr_rlocator.locator.relNumber,
+									  smgr->smgr_rlocator.backend,
+									  true);
+
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_hit += 1;
+	else
+		pgBufferUsage.shared_blks_hit += 1;
+
+	pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
+
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageHit;
+
+	if (rel)
+		pgstat_count_buffer_hit(rel);
+}
+
 /*
  * Helper for WaitReadBuffers() that processes the results of a readv
  * operation, raising an error if necessary.
@@ -2007,25 +2024,8 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		 * must have started out as a miss in PinBufferForBlock(). The other
 		 * backend will track this as a 'read'.
 		 */
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum,
-										  operation->smgr->smgr_rlocator.locator.spcOid,
-										  operation->smgr->smgr_rlocator.locator.dbOid,
-										  operation->smgr->smgr_rlocator.locator.relNumber,
-										  operation->smgr->smgr_rlocator.backend,
-										  true);
-
-		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_hit += 1;
-		else
-			pgBufferUsage.shared_blks_hit += 1;
-
-		if (operation->rel)
-			pgstat_count_buffer_hit(operation->rel);
-
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
-
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		TrackBufferHit(io_object, io_context, operation->rel, persistence,
+					   operation->smgr, forknum, blocknum);
 	}
 	else
 	{
-- 
2.43.0



  [text/x-patch] v6-0006-Restructure-AsyncReadBuffers.patch (9.9K, 7-v6-0006-Restructure-AsyncReadBuffers.patch)
  download | inline diff:
From b73b896febc35253ca2607cb0fe143355b91256f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 18 Mar 2026 11:09:58 -0400
Subject: [PATCH v6 6/8] Restructure AsyncReadBuffers()

Restructure AsyncReadBuffers() to use early return when the head buffer
is already valid, instead of using a did_start_io flag and if/else
branches. Also move around a bit of the code to be located closer to
where it is used. This is a refactor only.
---
 src/backend/storage/buffer/bufmgr.c | 208 ++++++++++++++--------------
 1 file changed, 103 insertions(+), 105 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa85570a791..a9995b75917 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1920,21 +1920,12 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	void	   *io_pages[MAX_IO_COMBINE_LIMIT];
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		did_start_io;
-
-	/*
-	 * When this IO is executed synchronously, either because the caller will
-	 * immediately block waiting for the IO or because IOMETHOD_SYNC is used,
-	 * the AIO subsystem needs to know.
-	 */
-	if (flags & READ_BUFFERS_SYNCHRONOUSLY)
-		ioh_flags |= PGAIO_HF_SYNCHRONOUS;
+	instr_time	io_start;
 
 	if (persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		ioh_flags |= PGAIO_HF_REFERENCES_LOCAL;
 	}
 	else
 	{
@@ -1942,35 +1933,6 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		io_object = IOOBJECT_RELATION;
 	}
 
-	/*
-	 * If zero_damaged_pages is enabled, add the READ_BUFFERS_ZERO_ON_ERROR
-	 * flag. The reason for that is that, hopefully, zero_damaged_pages isn't
-	 * set globally, but on a per-session basis. The completion callback,
-	 * which may be run in other processes, e.g. in IO workers, may have a
-	 * different value of the zero_damaged_pages GUC.
-	 *
-	 * XXX: We probably should eventually use a different flag for
-	 * zero_damaged_pages, so we can report different log levels / error codes
-	 * for zero_damaged_pages and ZERO_ON_ERROR.
-	 */
-	if (zero_damaged_pages)
-		flags |= READ_BUFFERS_ZERO_ON_ERROR;
-
-	/*
-	 * For the same reason as with zero_damaged_pages we need to use this
-	 * backend's ignore_checksum_failure value.
-	 */
-	if (ignore_checksum_failure)
-		flags |= READ_BUFFERS_IGNORE_CHECKSUM_FAILURES;
-
-
-	/*
-	 * To be allowed to report stats in the local completion callback we need
-	 * to prepare to report stats now. This ensures we can safely report the
-	 * checksum failure even in a critical section.
-	 */
-	pgstat_prepare_report_checksum_failure(operation->smgr->smgr_rlocator.locator.dbOid);
-
 	/*
 	 * Get IO handle before ReadBuffersCanStartIO(), as pgaio_io_acquire()
 	 * might block, which we don't want after setting IO_IN_PROGRESS.
@@ -1992,7 +1954,6 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	if (unlikely(!ioh))
 	{
 		pgaio_submit_staged();
-
 		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
 	}
 
@@ -2017,91 +1978,128 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 
 		pgaio_io_release(ioh);
 		pgaio_wref_clear(&operation->io_wref);
-		did_start_io = false;
 
 		/*
 		 * Report and track this as a 'hit' for this backend, even though it
 		 * must have started out as a miss in PinBufferForBlock(). The other
 		 * backend will track this as a 'read'.
 		 */
-		TrackBufferHit(io_object, io_context, operation->rel, persistence,
-					   operation->smgr, forknum, blocknum);
+		TrackBufferHit(io_object, io_context,
+					   operation->rel, operation->persistence,
+					   operation->smgr, operation->forknum,
+					   blocknum);
+		return false;
 	}
-	else
+
+	/*
+	 * When this IO is executed synchronously, either because the caller will
+	 * immediately block waiting for the IO or because IOMETHOD_SYNC is used,
+	 * the AIO subsystem needs to know.
+	 */
+	if (flags & READ_BUFFERS_SYNCHRONOUSLY)
+		ioh_flags |= PGAIO_HF_SYNCHRONOUS;
+
+	if (persistence == RELPERSISTENCE_TEMP)
+		ioh_flags |= PGAIO_HF_REFERENCES_LOCAL;
+
+	/*
+	 * If zero_damaged_pages is enabled, add the READ_BUFFERS_ZERO_ON_ERROR
+	 * flag. The reason for that is that, hopefully, zero_damaged_pages isn't
+	 * set globally, but on a per-session basis. The completion callback,
+	 * which may be run in other processes, e.g. in IO workers, may have a
+	 * different value of the zero_damaged_pages GUC.
+	 *
+	 * XXX: We probably should eventually use a different flag for
+	 * zero_damaged_pages, so we can report different log levels / error codes
+	 * for zero_damaged_pages and ZERO_ON_ERROR.
+	 */
+	if (zero_damaged_pages)
+		flags |= READ_BUFFERS_ZERO_ON_ERROR;
+
+	/*
+	 * For the same reason as with zero_damaged_pages we need to use this
+	 * backend's ignore_checksum_failure value.
+	 */
+	if (ignore_checksum_failure)
+		flags |= READ_BUFFERS_IGNORE_CHECKSUM_FAILURES;
+
+	/*
+	 * To be allowed to report stats in the local completion callback we need
+	 * to prepare to report stats now. This ensures we can safely report the
+	 * checksum failure even in a critical section.
+	 */
+	pgstat_prepare_report_checksum_failure(operation->smgr->smgr_rlocator.locator.dbOid);
+
+	Assert(io_buffers[0] == buffers[nblocks_done]);
+	io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
+	io_buffers_len = 1;
+
+	/*
+	 * How many neighboring-on-disk blocks can we scatter-read into other
+	 * buffers at the same time?  In this case we don't wait if we see an I/O
+	 * already in progress.  We already set BM_IO_IN_PROGRESS for the head
+	 * block, so we should get on with that I/O as soon as possible.
+	 */
+	for (int i = nblocks_done + 1; i < operation->nblocks; i++)
 	{
-		instr_time	io_start;
+		/* Must be consecutive block numbers. */
+		Assert(BufferGetBlockNumber(buffers[i - 1]) ==
+			   BufferGetBlockNumber(buffers[i]) - 1);
 
-		/* We found a buffer that we need to read in. */
-		Assert(io_buffers[0] == buffers[nblocks_done]);
-		io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
-		io_buffers_len = 1;
+		if (!ReadBuffersCanStartIO(buffers[i], true))
+			break;
 
-		/*
-		 * How many neighboring-on-disk blocks can we scatter-read into other
-		 * buffers at the same time?  In this case we don't wait if we see an
-		 * I/O already in progress.  We already set BM_IO_IN_PROGRESS for the
-		 * head block, so we should get on with that I/O as soon as possible.
-		 */
-		for (int i = nblocks_done + 1; i < operation->nblocks; i++)
-		{
-			if (!ReadBuffersCanStartIO(buffers[i], true))
-				break;
-			/* Must be consecutive block numbers. */
-			Assert(BufferGetBlockNumber(buffers[i - 1]) ==
-				   BufferGetBlockNumber(buffers[i]) - 1);
-			Assert(io_buffers[io_buffers_len] == buffers[i]);
+		Assert(io_buffers[io_buffers_len] == buffers[i]);
 
-			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
-		}
+		io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+	}
 
-		/* get a reference to wait for in WaitReadBuffers() */
-		pgaio_io_get_wref(ioh, &operation->io_wref);
+	/* get a reference to wait for in WaitReadBuffers() */
+	pgaio_io_get_wref(ioh, &operation->io_wref);
 
-		/* provide the list of buffers to the completion callbacks */
-		pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
+	/* provide the list of buffers to the completion callbacks */
+	pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
 
-		pgaio_io_register_callbacks(ioh,
-									persistence == RELPERSISTENCE_TEMP ?
-									PGAIO_HCB_LOCAL_BUFFER_READV :
-									PGAIO_HCB_SHARED_BUFFER_READV,
-									flags);
+	pgaio_io_register_callbacks(ioh,
+								persistence == RELPERSISTENCE_TEMP ?
+								PGAIO_HCB_LOCAL_BUFFER_READV :
+								PGAIO_HCB_SHARED_BUFFER_READV,
+								flags);
 
-		pgaio_io_set_flag(ioh, ioh_flags);
+	pgaio_io_set_flag(ioh, ioh_flags);
 
-		/* ---
-		 * Even though we're trying to issue IO asynchronously, track the time
-		 * in smgrstartreadv():
-		 * - if io_method == IOMETHOD_SYNC, we will always perform the IO
-		 *   immediately
-		 * - the io method might not support the IO (e.g. worker IO for a temp
-		 *   table)
-		 * ---
-		 */
-		io_start = pgstat_prepare_io_time(track_io_timing);
-		smgrstartreadv(ioh, operation->smgr, forknum,
-					   blocknum,
-					   io_pages, io_buffers_len);
-		pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
-								io_start, 1, io_buffers_len * BLCKSZ);
+	/* ---
+	 * Even though we're trying to issue IO asynchronously, track the time
+	 * in smgrstartreadv():
+	 * - if io_method == IOMETHOD_SYNC, we will always perform the IO
+	 *   immediately
+	 * - the io method might not support the IO (e.g. worker IO for a temp
+	 *   table)
+	 * ---
+	 */
+	io_start = pgstat_prepare_io_time(track_io_timing);
+	smgrstartreadv(ioh, operation->smgr, forknum,
+				   blocknum,
+				   io_pages, io_buffers_len);
+	pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
+							io_start, 1, io_buffers_len * BLCKSZ);
 
-		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_read += io_buffers_len;
-		else
-			pgBufferUsage.shared_blks_read += io_buffers_len;
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_read += io_buffers_len;
+	else
+		pgBufferUsage.shared_blks_read += io_buffers_len;
 
-		/*
-		 * Track vacuum cost when issuing IO, not after waiting for it.
-		 * Otherwise we could end up issuing a lot of IO in a short timespan,
-		 * despite a low cost limit.
-		 */
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	/*
+	 * Track vacuum cost when issuing IO, not after waiting for it. Otherwise
+	 * we could end up issuing a lot of IO in a short timespan, despite a low
+	 * cost limit.
+	 */
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
 
-		*nblocks_progress = io_buffers_len;
-		did_start_io = true;
-	}
+	*nblocks_progress = io_buffers_len;
 
-	return did_start_io;
+	return true;
 }
 
 /*
-- 
2.43.0



  [text/x-patch] v6-0007-Introduce-PrepareHeadBufferReadIO-and-PrepareAddi.patch (8.1K, 8-v6-0007-Introduce-PrepareHeadBufferReadIO-and-PrepareAddi.patch)
  download | inline diff:
From 200af0d589054f8d015a1ed4ae347c684149bde8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 18 Mar 2026 11:13:25 -0400
Subject: [PATCH v6 7/8] Introduce PrepareHeadBufferReadIO() and
 PrepareAdditionalBufferReadIO()

Replace ReadBuffersCanStartIO() and ReadBuffersCanStartIOOnce() with
new explicit helper functions that inline the logic from
StartBufferIO() and StartLocalBufferIO().

Besides the inlined logic being easier to reason, StartBufferIO()
doesn't distinguish between 'already valid' and 'IO in progress' (and
explicitly states it does not want to), which is required to defer
waiting for in-progress IO. A future commit will implement deferred
waiting for in-progress IO.
---
 src/backend/storage/buffer/bufmgr.c | 171 +++++++++++++++++++++++-----
 1 file changed, 141 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a9995b75917..2179ade07cc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1659,43 +1659,150 @@ CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete)
 #endif
 }
 
-/* helper for ReadBuffersCanStartIO(), to avoid repetition */
-static inline bool
-ReadBuffersCanStartIOOnce(Buffer buffer, bool nowait)
+/*
+ * Local version of PrepareHeadBufferReadIO(). Here instead of localbuf.c to
+ * avoid an external function call.
+ */
+static bool
+PrepareHeadLocalBufferReadIO(Buffer buffer)
 {
-	if (BufferIsLocal(buffer))
-		return StartLocalBufferIO(GetLocalBufferDescriptor(-buffer - 1),
-								  true, nowait);
-	else
-		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+	BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
+	uint64		buf_state = pg_atomic_read_u64(&desc->state);
+
+	/*
+	 * The buffer could already be valid if a prior IO by this backend was
+	 * completed and reclaimed incidentally (e.g. while acquiring a new AIO
+	 * handle). Only the owning backend can set BM_VALID on a local buffer.
+	 */
+	if (buf_state & BM_VALID)
+		return false;
+
+	/*
+	 * Submit any staged IO before checking for in-progress IO. Without this,
+	 * the wref check below could find IO that this backend staged but hasn't
+	 * submitted yet. Waiting on that would PANIC because the owner can't wait
+	 * on its own staged IO.
+	 */
+	pgaio_submit_staged();
+
+	/* Wait for in-progress IO */
+	if (pgaio_wref_valid(&desc->io_wref))
+	{
+		PgAioWaitRef iow = desc->io_wref;
+
+		pgaio_wref_wait(&iow);
+
+		buf_state = pg_atomic_read_u64(&desc->state);
+	}
+
+	/*
+	 * If BM_VALID is set, we waited on IO and it completed successfully.
+	 * Otherwise, we'll initiate IO on the buffer.
+	 */
+	return !(buf_state & BM_VALID);
 }
 
 /*
- * Helper for AsyncReadBuffers that tries to get the buffer ready for IO.
+ * Try to start IO on the first buffer in a new run of blocks. If AIO is in
+ * progress, be it in this backend or another backend, we wait for it to
+ * finish and then check the result.
+ *
+ * Returns true if the buffer is ready for IO, false if the buffer is already
+ * valid.
  */
-static inline bool
-ReadBuffersCanStartIO(Buffer buffer, bool nowait)
+static bool
+PrepareHeadBufferReadIO(Buffer buffer)
 {
-	/*
-	 * If this backend currently has staged IO, we need to submit the pending
-	 * IO before waiting for the right to issue IO, to avoid the potential for
-	 * deadlocks (and, more commonly, unnecessary delays for other backends).
-	 */
-	if (!nowait && pgaio_have_staged())
+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+		return PrepareHeadLocalBufferReadIO(buffer);
+
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	desc = GetBufferDescriptor(buffer - 1);
+
+	for (;;)
 	{
-		if (ReadBuffersCanStartIOOnce(buffer, true))
-			return true;
+		buf_state = LockBufHdr(desc);
+
+		Assert(buf_state & BM_TAG_VALID);
+
+		/* Already valid, no work to do */
+		if (buf_state & BM_VALID)
+		{
+			UnlockBufHdr(desc);
+			return false;
+		}
+
+		if (buf_state & BM_IO_IN_PROGRESS)
+		{
+			UnlockBufHdr(desc);
+
+			/*
+			 * If this backend currently has staged IO, submit it before
+			 * waiting for in-progress IO, to avoid potential deadlocks and
+			 * unnecessary delays.
+			 */
+			pgaio_submit_staged();
+			WaitIO(desc);
+			continue;
+		}
 
 		/*
-		 * Unfortunately StartBufferIO() returning false doesn't allow to
-		 * distinguish between the buffer already being valid and IO already
-		 * being in progress. Since IO already being in progress is quite
-		 * rare, this approach seems fine.
+		 * No IO in progress and not already valid; We will start IO. It's
+		 * possible that the IO was in progress and never became valid because
+		 * the IO errored out. We'll do the IO ourselves.
 		 */
-		pgaio_submit_staged();
+		UnlockBufHdrExt(desc, buf_state, BM_IO_IN_PROGRESS, 0, 0);
+		ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+									  BufferDescriptorGetBuffer(desc));
+
+		return true;
+	}
+}
+
+/*
+ * When building a new IO from multiple buffers, we won't include buffers
+ * that are already valid or already in progress. This function should only be
+ * used for additional adjacent buffers following the head buffer in a new IO.
+ *
+ * This function must never wait for IO to avoid deadlocks. The head buffer
+ * already has BM_IO_IN_PROGRESS set, so we'll just issue that IO and come
+ * back in lieu of waiting here.
+ *
+ * Returns true if the buffer was successfully prepared for IO and false if it
+ * is rejected and the read IO should not include this buffer.
+ */
+static bool
+PrepareAdditionalBufferReadIO(Buffer buffer)
+{
+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+	{
+		desc = GetLocalBufferDescriptor(-buffer - 1);
+		buf_state = pg_atomic_read_u64(&desc->state);
+		/* Local buffers don't use BM_IO_IN_PROGRESS */
+		if (buf_state & BM_VALID || pgaio_wref_valid(&desc->io_wref))
+			return false;
+	}
+	else
+	{
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		desc = GetBufferDescriptor(buffer - 1);
+		buf_state = LockBufHdr(desc);
+		if (buf_state & (BM_VALID | BM_IO_IN_PROGRESS))
+		{
+			UnlockBufHdr(desc);
+			return false;
+		}
+		UnlockBufHdrExt(desc, buf_state, BM_IO_IN_PROGRESS, 0, 0);
+		ResourceOwnerRememberBufferIO(CurrentResourceOwner, buffer);
 	}
 
-	return ReadBuffersCanStartIOOnce(buffer, nowait);
+	return true;
 }
 
 /*
@@ -1934,8 +2041,10 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	}
 
 	/*
-	 * Get IO handle before ReadBuffersCanStartIO(), as pgaio_io_acquire()
-	 * might block, which we don't want after setting IO_IN_PROGRESS.
+	 * We must get an IO handle before PrepareHeadBufferReadIO(), as
+	 * pgaio_io_acquire() might block, which we don't want after setting
+	 * IO_IN_PROGRESS. If we don't need to do the IO, we'll release the
+	 * handle.
 	 *
 	 * If we need to wait for IO before we can get a handle, submit
 	 * already-staged IO first, so that other backends don't need to wait.
@@ -1957,6 +2066,8 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
 	}
 
+	pgaio_wref_clear(&operation->io_wref);
+
 	/*
 	 * Check if we can start IO on the first to-be-read buffer.
 	 *
@@ -1964,10 +2075,10 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	 * for the outcome: either done, or something went wrong and we will
 	 * retry.
 	 */
-	if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
+	if (!PrepareHeadBufferReadIO(buffers[nblocks_done]))
 	{
 		/*
-		 * Someone else has already completed this block, we're done.
+		 * Someone has already completed this block, we're done.
 		 *
 		 * When IO is necessary, ->nblocks_done is updated in
 		 * ProcessReadBuffersResult(), but that is not called if no IO is
@@ -2046,7 +2157,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		Assert(BufferGetBlockNumber(buffers[i - 1]) ==
 			   BufferGetBlockNumber(buffers[i]) - 1);
 
-		if (!ReadBuffersCanStartIO(buffers[i], true))
+		if (!PrepareAdditionalBufferReadIO(buffers[i]))
 			break;
 
 		Assert(io_buffers[io_buffers_len] == buffers[i]);
-- 
2.43.0



  [text/x-patch] v6-0008-AIO-Don-t-wait-for-already-in-progress-IO.patch (13.2K, 9-v6-0008-AIO-Don-t-wait-for-already-in-progress-IO.patch)
  download | inline diff:
From 63cb731176a62320d296f968b12a5d4d36e703d0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 18 Mar 2026 11:17:57 -0400
Subject: [PATCH v6 8/8] AIO: Don't wait for already in-progress IO

When a backend attempts to start a read IO and finds the first buffer
already has I/O in progress, previously it waited for that I/O to
complete before initiating reads for any of the subsequent buffers.

Although the backend must wait for the I/O to finish when acquiring the
buffer, there's no reason for it to wait when setting up the read
operation. Waiting at this point prevents the backend from starting I/O
on subsequent buffers and can significantly reduce concurrency.

This matters in two workloads: when multiple backends scan the same
relation concurrently, and when a single backend requests the same block
multiple times within the readahead distance.

If backends wait each time they encounter an in-progress read,
the access pattern effectively degenerates into synchronous I/O.

To fix this, when encountering an already in-progress IO for the head
buffer, a backend now records the buffer's wait reference and defers
waiting until WaitReadBuffers(), when it actually needs to acquire the
buffer.

In rare cases, a backend may still need to wait synchronously at IO
start time: if another backend has set BM_IO_IN_PROGRESS on the buffer
but has not yet set the wait reference. Such windows should be brief and
uncommon.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
---
 src/backend/storage/buffer/bufmgr.c | 201 +++++++++++++++++++---------
 src/include/storage/bufmgr.h        |   4 +-
 src/tools/pgindent/typedefs.list    |   1 +
 3 files changed, 145 insertions(+), 61 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2179ade07cc..31d1563a69f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -185,6 +185,20 @@ typedef struct SMgrSortArray
 	SMgrRelation srel;
 } SMgrSortArray;
 
+/*
+ * In AsyncReadBuffers(), when preparing a buffer for reading and setting
+ * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may
+ * already contain the desired block. AsyncReadBuffers() must distinguish
+ * between these cases (and the case where it should initiate I/O) so it can
+ * mark an in-progress buffer as foreign I/O rather than waiting on it.
+ */
+typedef enum PrepareReadBufferStatus
+{
+	READ_BUFFER_ALREADY_DONE,
+	READ_BUFFER_IN_PROGRESS,
+	READ_BUFFER_READY_FOR_IO,
+} PrepareReadBufferStatus;
+
 /* GUC variables */
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
@@ -1663,8 +1677,9 @@ CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete)
  * Local version of PrepareHeadBufferReadIO(). Here instead of localbuf.c to
  * avoid an external function call.
  */
-static bool
-PrepareHeadLocalBufferReadIO(Buffer buffer)
+static PrepareReadBufferStatus
+PrepareHeadLocalBufferReadIO(ReadBuffersOperation *operation,
+							 Buffer buffer)
 {
 	BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
 	uint64		buf_state = pg_atomic_read_u64(&desc->state);
@@ -1675,49 +1690,60 @@ PrepareHeadLocalBufferReadIO(Buffer buffer)
 	 * handle). Only the owning backend can set BM_VALID on a local buffer.
 	 */
 	if (buf_state & BM_VALID)
-		return false;
+		return READ_BUFFER_ALREADY_DONE;
 
 	/*
 	 * Submit any staged IO before checking for in-progress IO. Without this,
 	 * the wref check below could find IO that this backend staged but hasn't
-	 * submitted yet. Waiting on that would PANIC because the owner can't wait
-	 * on its own staged IO.
+	 * submitted yet. If we returned READ_BUFFER_IN_PROGRESS and
+	 * WaitReadBuffers() then tried to wait on it, we'd PANIC because the
+	 * owner can't wait on its own staged IO.
 	 */
 	pgaio_submit_staged();
 
-	/* Wait for in-progress IO */
+	/* We've already asynchronously started this IO, so join it */
 	if (pgaio_wref_valid(&desc->io_wref))
 	{
-		PgAioWaitRef iow = desc->io_wref;
-
-		pgaio_wref_wait(&iow);
-
-		buf_state = pg_atomic_read_u64(&desc->state);
+		operation->io_wref = desc->io_wref;
+		operation->foreign_io = true;
+		return READ_BUFFER_IN_PROGRESS;
 	}
 
-	/*
-	 * If BM_VALID is set, we waited on IO and it completed successfully.
-	 * Otherwise, we'll initiate IO on the buffer.
-	 */
-	return !(buf_state & BM_VALID);
+	/* Prepare to start IO on this buffer */
+	return READ_BUFFER_READY_FOR_IO;
 }
 
 /*
  * Try to start IO on the first buffer in a new run of blocks. If AIO is in
- * progress, be it in this backend or another backend, we wait for it to
- * finish and then check the result.
+ * progress, be it in this backend or another backend, we just associate the
+ * wait reference with the operation and wait in WaitReadBuffers(). This turns
+ * out to be important for performance in two workloads:
+ *
+ * 1) A read stream that has to read the same block multiple times within the
+ *    readahead distance. This can happen e.g. for the table accesses of an
+ *    index scan.
+ *
+ * 2) Concurrent scans by multiple backends on the same relation.
+ *
+ * If we were to synchronously wait for the in-progress IO, we'd not be able
+ * to keep enough I/O in flight.
+ *
+ * If we do find there is ongoing I/O for the buffer, we set up a 1-block
+ * ReadBuffersOperation that WaitReadBuffers then can wait on.
  *
- * Returns true if the buffer is ready for IO, false if the buffer is already
- * valid.
+ * It's possible that another backend has started IO on the buffer but not yet
+ * set its wait reference. In this case, we have no choice but to wait for
+ * either the wait reference to be valid or the IO to be done.
  */
-static bool
-PrepareHeadBufferReadIO(Buffer buffer)
+static PrepareReadBufferStatus
+PrepareHeadBufferReadIO(ReadBuffersOperation *operation,
+						Buffer buffer)
 {
 	uint64		buf_state;
 	BufferDesc *desc;
 
 	if (BufferIsLocal(buffer))
-		return PrepareHeadLocalBufferReadIO(buffer);
+		return PrepareHeadLocalBufferReadIO(operation, buffer);
 
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 	desc = GetBufferDescriptor(buffer - 1);
@@ -1732,11 +1758,25 @@ PrepareHeadBufferReadIO(Buffer buffer)
 		if (buf_state & BM_VALID)
 		{
 			UnlockBufHdr(desc);
-			return false;
+			return READ_BUFFER_ALREADY_DONE;
 		}
 
 		if (buf_state & BM_IO_IN_PROGRESS)
 		{
+			/* Join existing read */
+			if (pgaio_wref_valid(&desc->io_wref))
+			{
+				operation->io_wref = desc->io_wref;
+				operation->foreign_io = true;
+				UnlockBufHdr(desc);
+				return READ_BUFFER_IN_PROGRESS;
+			}
+
+			/*
+			 * If the wait ref is not valid but the IO is in progress, someone
+			 * else started IO but hasn't set the wait ref yet. We have no
+			 * choice but to wait until the IO completes.
+			 */
 			UnlockBufHdr(desc);
 
 			/*
@@ -1758,7 +1798,7 @@ PrepareHeadBufferReadIO(Buffer buffer)
 		ResourceOwnerRememberBufferIO(CurrentResourceOwner,
 									  BufferDescriptorGetBuffer(desc));
 
-		return true;
+		return READ_BUFFER_READY_FOR_IO;
 	}
 }
 
@@ -1939,8 +1979,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			 * b) reports some time as waiting, even if we never waited
 			 *
 			 * we first check if we already know the IO is complete.
+			 *
+			 * Note that operation->io_return is uninitialized for foreign IO,
+			 * so we cannot use the cheaper PGAIO_RS_UNKNOWN pre-check.
 			 */
-			if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
+			if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
 				!pgaio_wref_check_done(&operation->io_wref))
 			{
 				instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
@@ -1959,11 +2002,45 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				Assert(pgaio_wref_check_done(&operation->io_wref));
 			}
 
-			/*
-			 * We now are sure the IO completed. Check the results. This
-			 * includes reporting on errors if there were any.
-			 */
-			ProcessReadBuffersResult(operation);
+			if (unlikely(operation->foreign_io))
+			{
+				Buffer		buffer = operation->buffers[operation->nblocks_done];
+				BufferDesc *desc = BufferIsLocal(buffer) ?
+					GetLocalBufferDescriptor(-buffer - 1) :
+					GetBufferDescriptor(buffer - 1);
+				uint64		buf_state = pg_atomic_read_u64(&desc->state);
+
+				if (buf_state & BM_VALID)
+				{
+					BlockNumber blocknum = operation->blocknum + operation->nblocks_done;
+
+					operation->nblocks_done += 1;
+					Assert(operation->nblocks_done <= operation->nblocks);
+
+					/*
+					 * Track this as a 'hit' for this backend. The backend
+					 * performing the IO will track it as a 'read'.
+					 */
+					TrackBufferHit(io_object, io_context,
+								   operation->rel, operation->persistence,
+								   operation->smgr, operation->forknum,
+								   blocknum);
+				}
+
+				/*
+				 * If the foreign IO failed and left the buffer invalid,
+				 * nblocks_done is not incremented. The retry loop below will
+				 * call AsyncReadBuffers() which will attempt the IO itself.
+				 */
+			}
+			else
+			{
+				/*
+				 * We now are sure the IO completed. Check the results. This
+				 * includes reporting on errors if there were any.
+				 */
+				ProcessReadBuffersResult(operation);
+			}
 		}
 
 		/*
@@ -2009,7 +2086,8 @@ WaitReadBuffers(ReadBuffersOperation *operation)
  * affected by the call. If the first buffer is valid, *nblocks_progress is
  * set to 1 and operation->nblocks_done is incremented.
  *
- * Returns true if IO was initiated, false if no IO was necessary.
+ * Returns true if IO was initiated or is already in progress (foreign IO),
+ * false if the buffer was already valid.
  */
 static bool
 AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
@@ -2028,6 +2106,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	IOContext	io_context;
 	IOObject	io_object;
 	instr_time	io_start;
+	PrepareReadBufferStatus status;
 
 	if (persistence == RELPERSISTENCE_TEMP)
 	{
@@ -2066,40 +2145,42 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
 	}
 
+	operation->foreign_io = false;
 	pgaio_wref_clear(&operation->io_wref);
 
-	/*
-	 * Check if we can start IO on the first to-be-read buffer.
-	 *
-	 * If an I/O is already in progress in another backend, we want to wait
-	 * for the outcome: either done, or something went wrong and we will
-	 * retry.
-	 */
-	if (!PrepareHeadBufferReadIO(buffers[nblocks_done]))
+	/* Check if we can start IO on the first to-be-read buffer */
+	status = PrepareHeadBufferReadIO(operation, buffers[nblocks_done]);
+	if (status != READ_BUFFER_READY_FOR_IO)
 	{
-		/*
-		 * Someone has already completed this block, we're done.
-		 *
-		 * When IO is necessary, ->nblocks_done is updated in
-		 * ProcessReadBuffersResult(), but that is not called if no IO is
-		 * necessary. Thus update here.
-		 */
-		operation->nblocks_done += 1;
+		pgaio_io_release(ioh);
 		*nblocks_progress = 1;
+		if (status == READ_BUFFER_ALREADY_DONE)
+		{
+			/*
+			 * Someone has already completed this block, we're done.
+			 *
+			 * When IO is necessary, ->nblocks_done is updated in
+			 * ProcessReadBuffersResult(), but that is not called if no IO is
+			 * necessary. Thus update here.
+			 */
+			operation->nblocks_done += 1;
+			Assert(operation->nblocks_done <= operation->nblocks);
 
-		pgaio_io_release(ioh);
-		pgaio_wref_clear(&operation->io_wref);
+			/*
+			 * Report and track this as a 'hit' for this backend, even though
+			 * it must have started out as a miss in PinBufferForBlock(). The
+			 * other backend will track this as a 'read'.
+			 */
+			TrackBufferHit(io_object, io_context,
+						   operation->rel, operation->persistence,
+						   operation->smgr, operation->forknum,
+						   blocknum);
+			return false;
+		}
 
-		/*
-		 * Report and track this as a 'hit' for this backend, even though it
-		 * must have started out as a miss in PinBufferForBlock(). The other
-		 * backend will track this as a 'read'.
-		 */
-		TrackBufferHit(io_object, io_context,
-					   operation->rel, operation->persistence,
-					   operation->smgr, operation->forknum,
-					   blocknum);
-		return false;
+		/* The IO is already in-progress */
+		Assert(status == READ_BUFFER_IN_PROGRESS);
+		return true;
 	}
 
 	/*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4017896f951..dd41b92f944 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -144,9 +144,11 @@ struct ReadBuffersOperation
 	 */
 	Buffer	   *buffers;
 	BlockNumber blocknum;
-	int			flags;
+	uint16		flags;
 	int16		nblocks;
 	int16		nblocks_done;
+	/* true if waiting on another backend's IO */
+	bool		foreign_io;
 	PgAioWaitRef io_wref;
 	PgAioReturn io_return;
 };
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 340662cf72c..ffaea427952 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2365,6 +2365,7 @@ PredicateLockData
 PredicateLockTargetType
 PrefetchBufferResult
 PrepParallelRestorePtrType
+PrepareReadBufferStatus
 PrepareStmt
 PreparedStatement
 PresortedKeyData
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-18 20:16                 ` Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Andres Freund @ 2026-03-18 20:16 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

On 2026-03-18 12:59:11 -0400, Melanie Plageman wrote:
> On Tue, Mar 17, 2026 at 1:26 PM Andres Freund <[email protected]> wrote:
> >
> > > --- a/src/backend/storage/buffer/bufmgr.c
> > > +++ b/src/backend/storage/buffer/bufmgr.c
> > > @@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
> > >                * must have started out as a miss in PinBufferForBlock(). The other
> > >                * backend will track this as a 'read'.
> > >                */
> > > -             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
> > > +             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
> > >                                                                                 operation->smgr->smgr_rlocator.locator.spcOid,
> > >                                                                                 operation->smgr->smgr_rlocator.locator.dbOid,
> > >                                                                                 operation->smgr->smgr_rlocator.locator.relNumber,
> > > --
> >
> > Ah, the issue is that we already incremented nblocks_done, right?  Maybe it'd
> > be easier to understand if we stashed blocknum + nblocks_done into a local
> > var, and use it in in both branches of if (!ReadBuffersCanStartIO())?
> >
> > This probably needs to be backpatched...
> 
> 0003 in v6 does as you suggest. I'll backport it after a quick +1 here.

LGTM.



> > > @@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
> > >                                                                          smgr->smgr_rlocator.backend);
> > >
> > >       if (persistence == RELPERSISTENCE_TEMP)
> >
> > And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP
> > branch in CountBufferHit(), I suspect the compiler may not be able to optimize
> > it away.
> 
> And you think it is optimizing it away in PinBufferForBlock()?

It doesn't really have a choice :) - the
  pgBufferUsage.(local|shared)_blks_hit++
is within the already required
  if (persistence == RELPERSISTENCE_TEMP)

so there's not really a branch to optimize away.

But maybe I misunderstood?


> > At the very least I'd invert the call to CountBufferHit() and the
> > pgstat_count_buffer_read(), as the latter will probably prevent most
> > optimizations (due to the compiler not being able to prove that
> > (rel)->pgstat_info->counts.blocks_fetched is a different memory location as
> > *foundPtr).
> 
> I did this. I did not check the compiled code before or after though.

I checked after and it looks good (well, ok enough, but that's unrelated to
your changes).


Just to verify that any of this actually matters, I did some benchmarking with
the call to TrackBufferHit() removed, and with pg_prewarm() of a scale 100
pgbench_accounts I do see about ~3% of a gain from that.  I did also verify
that with the patch we're ever so slightly, but reproducibly, faster than
master.  There's future optimization potential for sure though.


> > > +CountBufferHit(BufferAccessStrategy strategy,
> > > +                        Relation rel, char persistence, SMgrRelation smgr,
> > > +                        ForkNumber forknum, BlockNumber blocknum)
> >
> > I don't think "Count*" is a great name for something that does tracepoints and
> > vacuum cost balance accounting, the latter actually changes behavior of the
> > program due to the sleeps it injects.
> >
> > The first alternative I have is AccountForBufferHit(), not great, but still
> > seems a bit better.
> 
> At some point, I had ProcessBufferHit(), but Bilal felt it implied the
> function did more than counting. I've changed it now to
> TrackBufferHit().

WFM.


> > > + * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
> > > + * avoid an external function call.
> > > + */
> > > +static PrepareReadBuffer_Status
> > > +PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
> > > +                                                     Buffer buffer)
> >
> > Hm, seems the test in 0002 should be extended to cover the the temp table case.
> 
> I did this. However, I was a bit lazy in how many cases I added
> because I used invalidate_rel_block(), which is pretty verbose (since
> evict_rel() doesn't work yet for local buffers).

Ah, yea, that's annoying.  I think some basic coverage is good enough for now.


> I don't think we'll be able to easily test READ_BUFFER_ALREADY_DONE
> (though perhaps we aren't testing it for shared buffers either?).

We do reach the READ_BUFFER_ALREADY_DONE in PrepareHeadBufferReadIO(), but
only due to io_method=sync peculiarities (as that only actually performs the
IO later when waiting, it's easy to have two IOs for the same block).


It's probably worth adding tests for that, although I suspect it should be in
001_aio.pl - no read stream required to hit it.  I can give it a shot, if you
want?


> > > +static PrepareReadBuffer_Status
> > > +PrepareNewReadBufferIO(ReadBuffersOperation *operation,
> > > +                                        Buffer buffer)
> > > +{
> >
> > I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" &
> > "Continue"? Or "First" & "Additional"?  Or ...
> 
> I changed the names to PrepareHeadBufferReadIO() and
> PrepareAdditionalBufferReadIO(). "Head" instead of "First" because
> First felt like it implied the first buffer ever and head seems to
> make it clear it is the first buffer of this new IO.

Head works!



> Subject: [PATCH v6 4/8] Pass io_object and io_context through to
>  PinBufferForBlock()

The duplication due to handling the RBM_ZERO_AND_CLEANUP_LOCK case is a bit
annoying, but I think it's still an improvement.



> Subject: [PATCH v6 5/8] Make buffer hit helper

LGTM.


> From b73b896febc35253ca2607cb0fe143355b91256f Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Wed, 18 Mar 2026 11:09:58 -0400
> Subject: [PATCH v6 6/8] Restructure AsyncReadBuffers()
> 
> Restructure AsyncReadBuffers() to use early return when the head buffer
> is already valid, instead of using a did_start_io flag and if/else
> branches. Also move around a bit of the code to be located closer to
> where it is used. This is a refactor only.

I think there should be as little work as posbile between setting
IO_IN_PROGRESS and starting the IO. Most of the work you deferred is cheap
enough that it shouldn't matter, but pgstat_prepare_report_checksum_failure()
might need to do a bit more (including taking lwlocks and stuff).

I'm also a bit doubtful that deferring the flag determinations is a good idea,
mostly because it adds a bunch of stuff between starting IO on the head and
subsequent buffers. Not that it's expensive, but it seems to make it more
likely that somebody would end up putting other code inbetween the head and
additional buffer IO starts.  And it's cheap enough that it doesn't matter to
waste it if we return early.


> +/*
> + * When building a new IO from multiple buffers, we won't include buffers
> + * that are already valid or already in progress. This function should only be
> + * used for additional adjacent buffers following the head buffer in a new IO.
> + *
> + * This function must never wait for IO to avoid deadlocks. The head buffer
> + * already has BM_IO_IN_PROGRESS set, so we'll just issue that IO and come
> + * back in lieu of waiting here.

"come back" is a bit odd, since you'd not actually come back to
PrepareAdditionalBufferReadIO().


Looks good otherwise.



> From 63cb731176a62320d296f968b12a5d4d36e703d0 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Wed, 18 Mar 2026 11:17:57 -0400
> Subject: [PATCH v6 8/8] AIO: Don't wait for already in-progress IO
> 
> When a backend attempts to start a read IO and finds the first buffer
> already has I/O in progress, previously it waited for that I/O to
> complete before initiating reads for any of the subsequent buffers.
> 
> Although the backend must wait for the I/O to finish when acquiring the
> buffer, there's no reason for it to wait when setting up the read
> operation. Waiting at this point prevents the backend from starting I/O
> on subsequent buffers and can significantly reduce concurrency.
> 
> This matters in two workloads: when multiple backends scan the same
> relation concurrently, and when a single backend requests the same block
> multiple times within the readahead distance.
> 
> If backends wait each time they encounter an in-progress read,
> the access pattern effectively degenerates into synchronous I/O.
> 
> To fix this, when encountering an already in-progress IO for the head
> buffer, a backend now records the buffer's wait reference and defers
> waiting until WaitReadBuffers(), when it actually needs to acquire the
> buffer.
> 
> In rare cases, a backend may still need to wait synchronously at IO
> start time: if another backend has set BM_IO_IN_PROGRESS on the buffer
> but has not yet set the wait reference. Such windows should be brief and
> uncommon.



> @@ -1663,8 +1677,9 @@ CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete)
>   * Local version of PrepareHeadBufferReadIO(). Here instead of localbuf.c to
>   * avoid an external function call.
>   */
> -static bool
> -PrepareHeadLocalBufferReadIO(Buffer buffer)
> +static PrepareReadBufferStatus
> +PrepareHeadLocalBufferReadIO(ReadBuffersOperation *operation,
> +							 Buffer buffer)
>  {
>  	BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
>  	uint64		buf_state = pg_atomic_read_u64(&desc->state);
> @@ -1675,49 +1690,60 @@ PrepareHeadLocalBufferReadIO(Buffer buffer)
>  	 * handle). Only the owning backend can set BM_VALID on a local buffer.
>  	 */
>  	if (buf_state & BM_VALID)
> -		return false;
> +		return READ_BUFFER_ALREADY_DONE;
>  
>  	/*
>  	 * Submit any staged IO before checking for in-progress IO. Without this,
>  	 * the wref check below could find IO that this backend staged but hasn't
> -	 * submitted yet. Waiting on that would PANIC because the owner can't wait
> -	 * on its own staged IO.
> +	 * submitted yet. If we returned READ_BUFFER_IN_PROGRESS and
> +	 * WaitReadBuffers() then tried to wait on it, we'd PANIC because the
> +	 * owner can't wait on its own staged IO.
>  	 */
>  	pgaio_submit_staged();
>  
> -	/* Wait for in-progress IO */
> +	/* We've already asynchronously started this IO, so join it */
>  	if (pgaio_wref_valid(&desc->io_wref))
>  	{
> -		PgAioWaitRef iow = desc->io_wref;
> -
> -		pgaio_wref_wait(&iow);
> -
> -		buf_state = pg_atomic_read_u64(&desc->state);
> +		operation->io_wref = desc->io_wref;
> +		operation->foreign_io = true;
> +		return READ_BUFFER_IN_PROGRESS;
>  	}
>  
> -	/*
> -	 * If BM_VALID is set, we waited on IO and it completed successfully.
> -	 * Otherwise, we'll initiate IO on the buffer.
> -	 */
> -	return !(buf_state & BM_VALID);
> +	/* Prepare to start IO on this buffer */
> +	return READ_BUFFER_READY_FOR_IO;
>  }


Hm. Is buf_state & BM_VALID actually not reachable anymore? What if the
pgaio_submit_staged() completed the IO? In that case there won't be a wref and
we'll get here with buf_state & BM_VALID.


> @@ -1732,11 +1758,25 @@ PrepareHeadBufferReadIO(Buffer buffer)
>  		if (buf_state & BM_VALID)
>  		{
>  			UnlockBufHdr(desc);
> -			return false;
> +			return READ_BUFFER_ALREADY_DONE;
>  		}
>  
>  		if (buf_state & BM_IO_IN_PROGRESS)
>  		{
> +			/* Join existing read */
> +			if (pgaio_wref_valid(&desc->io_wref))
> +			{
> +				operation->io_wref = desc->io_wref;
> +				operation->foreign_io = true;
> +				UnlockBufHdr(desc);
> +				return READ_BUFFER_IN_PROGRESS;
> +			}

Out of a strict sense of rule-following, I'd do the operation->foreign_io
after the UnlockBufHdr(), since it doesn't actually need to be in the locked
section.


> @@ -1959,11 +2002,45 @@ WaitReadBuffers(ReadBuffersOperation *operation)
>  				Assert(pgaio_wref_check_done(&operation->io_wref));
>  			}
>  
> -			/*
> -			 * We now are sure the IO completed. Check the results. This
> -			 * includes reporting on errors if there were any.
> -			 */
> -			ProcessReadBuffersResult(operation);
> +			if (unlikely(operation->foreign_io))
> +			{
> +				Buffer		buffer = operation->buffers[operation->nblocks_done];
> +				BufferDesc *desc = BufferIsLocal(buffer) ?
> +					GetLocalBufferDescriptor(-buffer - 1) :
> +					GetBufferDescriptor(buffer - 1);
> +				uint64		buf_state = pg_atomic_read_u64(&desc->state);
> +
> +				if (buf_state & BM_VALID)
> +				{
> +					BlockNumber blocknum = operation->blocknum + operation->nblocks_done;

Maybe we should assert that the buffer's block equals what we expect?



Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2026-03-20 19:50                   ` Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Melanie Plageman @ 2026-03-20 19:50 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Thu, Mar 19, 2026 at 6:22 PM Andres Freund <[email protected]> wrote:
>
> Thinking about it more, I also got worried about the duplicating of
> logic. It's perhaps acceptable with the patches as-is, but we'll soon need
> something very similar for AIO writes. Then we'd end up with like 5 variants,
> because we'd still need the existing StartBufferIO() for some cases where we
> do want to wait (e.g. the edge case in ExtendBufferedRelShared()).
>
> In the attached prototype I replaced your patch introducing
> PrepareHeadBufferReadIO()/ PrepareAdditionalBufferReadIO() with one that
> instead revises StartBufferIO() to have the enum return value you introduced
> and a PgAioWaitRef * argument that callers that would like to asynchronously
> wait for the IO to complete (others pass NULL). There are some other cleanups
> in it too, see the commit message for more details.

I've come around to this. The aspect I like least is that io_wref is
used both as an output parameter _and_ a decision input parameter.

Some callers never want to wait on in-progress IO (and don't have to),
while others must eventually wait but can defer that waiting as long
as they have a wait reference. If they can't get a wait reference,
they have no way to wait later, so they must wait now. The presence of
io_wref indicates this difference.

I think it's important to express that less mechanically than in your
current header comment and comment in the else block of
StartSharedBufferIO() where we do the waiting. Explaining first—before
detailing argument combinations—why a caller would want to pass
io_wref might help.

However, I do think you need to enumerate the different combinations
of wait and io_wref (as you've done) to make it clear what they are.

I, for example, find it very confusing what wait == false and io_wref
not NULL would mean. If IO is in progress on the buffer and the
io_wref is not valid yet, the caller would get the expected
BUFFER_IO_IN_PROGRESS return value but io_wref won't be set. I could
see callers easily misinterpreting the API and passing this
combination when what they want is wait == true and io_wref not NULL
-- because they don't want to synchronously wait.

I don't have any good suggestions despite thinking about it, though.

Two other things about 0007:

    for (int i = nblocks_done + 1; i < operation->nblocks; i++)
    {
        /* Must be consecutive block numbers. */
        Assert(BufferGetBlockNumber(buffers[i - 1]) ==
               BufferGetBlockNumber(buffers[i]) - 1);

        status = StartBufferIO(buffers[nblocks_done], true, false, NULL);

Copy-paste bug above, should be StartBufferIO(buffers[i],...

I would mention that currently BUFFER_IO_IN_PROGRESS is not used in
the first StartBufferIO() case, so that is dead code as of this commit

> I also updated "Restructure AsyncReadBuffers()" to move
> pgstat_prepare_report_checksum_failure() and the computation of flags to
> before the ReadBuffersCanStartIO().  And added a comment explaining why little
> should be added between the ReadBuffersCanStartIO() calls.
>
> Thoughts?

Yea, definitely think the comment is important.

- Melanie





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-25 15:15                     ` Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Andres Freund @ 2026-03-25 15:15 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

On 2026-03-20 15:50:59 -0400, Melanie Plageman wrote:
> On Thu, Mar 19, 2026 at 6:22 PM Andres Freund <[email protected]> wrote:
> >
> > Thinking about it more, I also got worried about the duplicating of
> > logic. It's perhaps acceptable with the patches as-is, but we'll soon need
> > something very similar for AIO writes. Then we'd end up with like 5 variants,
> > because we'd still need the existing StartBufferIO() for some cases where we
> > do want to wait (e.g. the edge case in ExtendBufferedRelShared()).
> >
> > In the attached prototype I replaced your patch introducing
> > PrepareHeadBufferReadIO()/ PrepareAdditionalBufferReadIO() with one that
> > instead revises StartBufferIO() to have the enum return value you introduced
> > and a PgAioWaitRef * argument that callers that would like to asynchronously
> > wait for the IO to complete (others pass NULL). There are some other cleanups
> > in it too, see the commit message for more details.
> 
> I've come around to this. The aspect I like least is that io_wref is
> used both as an output parameter _and_ a decision input parameter.

Do you have an alternative suggestion? We could add a dedicated parameter for
that, but then it just opens up different ways of calling the function with
the wrong arguments.


> Some callers never want to wait on in-progress IO (and don't have to),
> while others must eventually wait but can defer that waiting as long
> as they have a wait reference. If they can't get a wait reference,
> they have no way to wait later, so they must wait now. The presence of
> io_wref indicates this difference.
> 
> I think it's important to express that less mechanically than in your
> current header comment and comment in the else block of
> StartSharedBufferIO() where we do the waiting. Explaining first—before
> detailing argument combinations—why a caller would want to pass
> io_wref might help.

I'm not entirely sure what you'd like here.  Would the following comment
do the trick?

 * In several scenarios the buffer may already be undergoing I/O in this or
 * another backend. How to best handle that depends on the caller's
 * situation. It might be appropriate to wait synchronously (e.g., because the
 * buffer is about to be invalidated); wait asynchronously, using the buffer's
 * IO wait reference (e.g., because the caller is doing readahead and doesn't
 * need the buffer to be ready immediately); or to not wait at all (e.g.,
 * because the caller is trying to combine IO for this buffer with another
 * buffer).
 *
 * How and whether to wait is controlled by the wait in io_wref parameters. In
 * detail:
 *
 * <existing comment>



> However, I do think you need to enumerate the different combinations
> of wait and io_wref (as you've done) to make it clear what they are.
> 
> I, for example, find it very confusing what wait == false and io_wref
> not NULL would mean. If IO is in progress on the buffer and the
> io_wref is not valid yet, the caller would get the expected
> BUFFER_IO_IN_PROGRESS return value but io_wref won't be set. I could
> see callers easily misinterpreting the API and passing this
> combination when what they want is wait == true and io_wref not NULL
> -- because they don't want to synchronously wait.

Hm. I started out proposing that we should just add an assert documenting this
is a nonsensical combination. But when writing the comment for that I realized
that it theoretically could make sense to pass in wait == false and io_wref !=
NULL, if you wanted to get a wait reference, but would not want to do a
WaitIO() if there's no wait reference set.

I don't think that's something we need right now, but ...


> I don't have any good suggestions despite thinking about it, though.
> 
> Two other things about 0007:
> 
>     for (int i = nblocks_done + 1; i < operation->nblocks; i++)
>     {
>         /* Must be consecutive block numbers. */
>         Assert(BufferGetBlockNumber(buffers[i - 1]) ==
>                BufferGetBlockNumber(buffers[i]) - 1);
> 
>         status = StartBufferIO(buffers[nblocks_done], true, false, NULL);
> 
> Copy-paste bug above, should be StartBufferIO(buffers[i],...
> 
> I would mention that currently BUFFER_IO_IN_PROGRESS is not used in
> the first StartBufferIO() case, so that is dead code as of this commit

Whaaat.  Why did this even pass tests???  I guess it just always failed to
start IO because there already was IO on the buffer and that was good enough
to get through.

Clearly a testing gap.

Not entirely trivial to test though :(.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2026-03-25 15:33                       ` Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Melanie Plageman @ 2026-03-25 15:33 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Wed, Mar 25, 2026 at 11:15 AM Andres Freund <[email protected]> wrote:
>
> > > In the attached prototype I replaced your patch introducing
> > > PrepareHeadBufferReadIO()/ PrepareAdditionalBufferReadIO() with one that
> > > instead revises StartBufferIO() to have the enum return value you introduced
> > > and a PgAioWaitRef * argument that callers that would like to asynchronously
> > > wait for the IO to complete (others pass NULL). There are some other cleanups
> > > in it too, see the commit message for more details.
> >
> > I've come around to this. The aspect I like least is that io_wref is
> > used both as an output parameter _and_ a decision input parameter.
>
> Do you have an alternative suggestion? We could add a dedicated parameter for
> that, but then it just opens up different ways of calling the function with
> the wrong arguments.

Yea, I don't know. The cleanest way to handle callers with different
intentions and tolerances is to have different functions that allow
different behavior. But that's the opposite of what we're currently
trying to do :) I tried to come up with some intention-related enum
input argument, but that seems like a bit of over-engineering.

> > Some callers never want to wait on in-progress IO (and don't have to),
> > while others must eventually wait but can defer that waiting as long
> > as they have a wait reference. If they can't get a wait reference,
> > they have no way to wait later, so they must wait now. The presence of
> > io_wref indicates this difference.
> >
> > I think it's important to express that less mechanically than in your
> > current header comment and comment in the else block of
> > StartSharedBufferIO() where we do the waiting. Explaining first—before
> > detailing argument combinations—why a caller would want to pass
> > io_wref might help.
>
> I'm not entirely sure what you'd like here.  Would the following comment
> do the trick?
>
>  * In several scenarios the buffer may already be undergoing I/O in this or
>  * another backend. How to best handle that depends on the caller's
>  * situation. It might be appropriate to wait synchronously (e.g., because the
>  * buffer is about to be invalidated); wait asynchronously, using the buffer's
>  * IO wait reference (e.g., because the caller is doing readahead and doesn't
>  * need the buffer to be ready immediately); or to not wait at all (e.g.,
>  * because the caller is trying to combine IO for this buffer with another
>  * buffer).
>  *
>  * How and whether to wait is controlled by the wait in io_wref parameters. In
>  * detail:
>  *
>  * <existing comment>

Sounds good to me.

> > However, I do think you need to enumerate the different combinations
> > of wait and io_wref (as you've done) to make it clear what they are.
> >
> > I, for example, find it very confusing what wait == false and io_wref
> > not NULL would mean. If IO is in progress on the buffer and the
> > io_wref is not valid yet, the caller would get the expected
> > BUFFER_IO_IN_PROGRESS return value but io_wref won't be set. I could
> > see callers easily misinterpreting the API and passing this
> > combination when what they want is wait == true and io_wref not NULL
> > -- because they don't want to synchronously wait.
>
> Hm. I started out proposing that we should just add an assert documenting this
> is a nonsensical combination. But when writing the comment for that I realized
> that it theoretically could make sense to pass in wait == false and io_wref !=
> NULL, if you wanted to get a wait reference, but would not want to do a
> WaitIO() if there's no wait reference set.
>
> I don't think that's something we need right now, but ...

We should somehow express this in the comment.

- Melanie





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-25 21:58                         ` Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Andres Freund @ 2026-03-25 21:58 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

Attached is an updated set of patches.  I fixed the bug that Melanie noticed,
updated the comments a bit further and added a new commit that adds a decent
bit of coverage of StartReadBuffers(), including all the cases in
StartSharedBufferIO and most of the cases in StartLocalBufferIO().

I'm planning to commit 0001 soon - it hasn't changed in a while. Then I'd like
to get 0002 committed soon after, but I'll hold off for that until tomorrow,
given that nobody has looked at it (as it's new).  I think 0004-0007 can be
committed too, but I am not sure if you (Melanie) want to do so.

I'd like to get the rest committed tomorrow too.

Greetings,

Andres Freund


^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2026-03-26 21:43                           ` Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Melanie Plageman @ 2026-03-26 21:43 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Wed, Mar 25, 2026 at 5:58 PM Andres Freund <[email protected]> wrote:
>
> I'm planning to commit 0001 soon - it hasn't changed in a while. Then I'd like
> to get 0002 committed soon after, but I'll hold off for that until tomorrow,
> given that nobody has looked at it (as it's new).  I think 0004-0007 can be
> committed too, but I am not sure if you (Melanie) want to do so.

This is a review of 0002

in read_buffers():
- you forgot to initialize operation->forknum
- I think the output columns could really use a bit of documentation
- I assume that using uint16 for nblocks and nios is to make sure it
plays nice with the input nblocks which is an int32 -- it doesn't
matter for the range of values you're using, but it would be better if
you could just use uint32 everywhere but I guess because we only have
int4 in SQL you can't.
anyway, I find all the many different types of ints and uints in
read_buffers() pretty sus. Like why do you need this cast to int16?
that seems...wrong and unnecessary?
        values[3] = Int32GetDatum((int16) nblocks_this_io);

in the evict_rel() refactor:
- you need invalidate_one_block() to use the forknum parameter because
otherwise temp tables won't evict any forks except the main fork

for the tests themselves:
- for the first test,
        # check that one larger read is done as multiple reads
isn't the comment actually the opposite of what it is testing?

0|0|t|2 -- would be 1 2 block io starting at 0, no?

seems like something like
# check that consecutive misses are combined into one read
would be better

- for this comment:
# but if we do it again, i.e. it's in s_b, there will be two operations
technically you are also doing this for temp tables, so the comment
isn't entirely correct.

- For this test:
# Verify that we aren't doing reads larger than io_combine_limit
isn't this more just covering the logic in read_buffers()? AFAICT
StartReadBuffers() only worries about the max IOs it can combine if it
is near the segment boundary

- For this:
$psql_a->query_safe(qq|SELECT invalidate_rel_block('$table', 1)|);
$psql_a->query_safe(qq|SELECT invalidate_rel_block('$table', 2)|);
$psql_a->query_safe(qq|SELECT * FROM read_buffers('$table', 3, 2)|);
psql_like(
        $io_method,
        $psql_a,
        "$persistency: read buffers, miss 1-2, hit 3-4",
        qq|SELECT blockoff, blocknum, needs_io, nblocks FROM
read_buffers('$table', 1, 4)|,
        qr/^0\|1\|t\|2\n2\|3\|f\|1\n3\|4\|f\|1$/,
        qr/^$/);

I think this is a duplicate. There is one before and after the "verify
we aren't doing reads larger than io_combine_limit"

- It may be worth adding one more test case which is IO in progress on
the last block since you have in-progress as the first and the middle
blocks but not as the last block

# Test in-progress IO on the last block of the range
$psql_a->query_safe(qq|SELECT evict_rel('$table')|);
$psql_a->query_safe(
    qq|SELECT read_rel_block_ll('$table', 3, wait_complete=>false)|);
psql_like(
    $io_method,
    $psql_a,
    "$persistency: read buffers, in-progress 3, read 1-3",
    qq|SELECT blockoff, blocknum, needs_io, nblocks FROM
read_buffers('$table', 1, 3)|,
    qr/^0\|1\|t\|2\n2\|3\|f\|1$/,
    qr/^$/);

- Melanie





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-27 00:12                             ` Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Andres Freund @ 2026-03-27 00:12 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

On 2026-03-26 17:43:17 -0400, Melanie Plageman wrote:
> On Wed, Mar 25, 2026 at 5:58 PM Andres Freund <[email protected]> wrote:
> >
> > I'm planning to commit 0001 soon - it hasn't changed in a while. Then I'd like
> > to get 0002 committed soon after, but I'll hold off for that until tomorrow,
> > given that nobody has looked at it (as it's new).  I think 0004-0007 can be
> > committed too, but I am not sure if you (Melanie) want to do so.
> 
> This is a review of 0002
> 
> in read_buffers():
> - you forgot to initialize operation->forknum

Indeed. I happens to work because of palloc0 and MAIN_FORKNUM == 0, but
clearly that's not right.


> - I think the output columns could really use a bit of documentation

You mean in the C function? On the SQL level the column names seem good
enough for something in a test?

Added a comment for every output column.  Also renamed need_io to
did_io. Thought about renaming nblocks to nblocks_this_io, but that just
seemed too long.


> - I assume that using uint16 for nblocks and nios is to make sure it
> plays nice with the input nblocks which is an int32 -- it doesn't
> matter for the range of values you're using, but it would be better if
> you could just use uint32 everywhere but I guess because we only have
> int4 in SQL you can't.

I actually had the parameters as int16 (or rather int2), but then the callers
needed casts to actually be able to call the function, so I went back to int4.


> anyway, I find all the many different types of ints and uints in
> read_buffers() pretty sus. Like why do you need this cast to int16?
> that seems...wrong and unnecessary?
>         values[3] = Int32GetDatum((int16) nblocks_this_io);

Yea, I went back and forth on the return types and apparently forgot to remove
the cast when going back to to just using int32.


> in the evict_rel() refactor:
> - you need invalidate_one_block() to use the forknum parameter because
> otherwise temp tables won't evict any forks except the main fork

Oops. Fixed.


> for the tests themselves:
> - for the first test,
>         # check that one larger read is done as multiple reads
> isn't the comment actually the opposite of what it is testing?
> 
> 0|0|t|2 -- would be 1 2 block io starting at 0, no?

Yea, not sure how I ended up with that comment. Adopted yours.


> - for this comment:
> # but if we do it again, i.e. it's in s_b, there will be two operations
> technically you are also doing this for temp tables, so the comment
> isn't entirely correct.

Did s/s_b/buffer pool/


> - For this test:
> # Verify that we aren't doing reads larger than io_combine_limit
> isn't this more just covering the logic in read_buffers()? AFAICT
> StartReadBuffers() only worries about the max IOs it can combine if it
> is near the segment boundary

Fair.  I adjusted the comment to remark upon that, but I'm kinda inclined to
keep the test. Just having io_combine_limit sized IOs seems worthwhile?


> - For this:
> $psql_a->query_safe(qq|SELECT invalidate_rel_block('$table', 1)|);
> $psql_a->query_safe(qq|SELECT invalidate_rel_block('$table', 2)|);
> $psql_a->query_safe(qq|SELECT * FROM read_buffers('$table', 3, 2)|);
> psql_like(
>         $io_method,
>         $psql_a,
>         "$persistency: read buffers, miss 1-2, hit 3-4",
>         qq|SELECT blockoff, blocknum, needs_io, nblocks FROM
> read_buffers('$table', 1, 4)|,
>         qr/^0\|1\|t\|2\n2\|3\|f\|1\n3\|4\|f\|1$/,
>         qr/^$/);
> 
> I think this is a duplicate. There is one before and after the "verify
> we aren't doing reads larger than io_combine_limit"

Yep.


> - It may be worth adding one more test case which is IO in progress on
> the last block since you have in-progress as the first and the middle
> blocks but not as the last block

Added.


One test used did_io=(t|f). That was actually only needed once "aio: Don't
wait for already in-progress IO" is in, as we might join the foreign IO. I
chose to hide that by making that part of the query "did_io and not
foreign_io", so we would detect if we were to falsely start IO ourselves.


Still need to extend the test as part of the "don't wait" commit, to actually
ensure that we reach the path for joining foreign IO.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2026-03-27 17:29                               ` Andres Freund <[email protected]>
  2026-03-27 21:17                                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 22:59                                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  0 siblings, 2 replies; 31+ messages in thread

From: Andres Freund @ 2026-03-27 17:29 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

I think I forgot to update the thread in my last message to note that I
had committed some of the preliminary changes.


On 2026-03-26 20:12:30 -0400, Andres Freund wrote:
> One test used did_io=(t|f). That was actually only needed once "aio: Don't
> wait for already in-progress IO" is in, as we might join the foreign IO. I
> chose to hide that by making that part of the query "did_io and not
> foreign_io", so we would detect if we were to falsely start IO ourselves.

I ended up not liking did_io, as that seems misleading when we just needed to
wait for a foreign IO.  I instead named it io_reqd.


> Still need to extend the test as part of the "don't wait" commit, to actually
> ensure that we reach the path for joining foreign IO.

That's done now.  I've added verification that we don't wrongly recognize
in-progress-ios without a wref as a foreign IO and an injection point based
test that verifies that we do see the foreign IO.

I've also done a bunch of cleanup in the commits. A few typos in commit
messages and the actual code changes and a few larger changes in the test code
& infrastructure. Mostly as part of allowing the aforementioned testing
(read_buffers() now only waits at the end, to make some of the tests
possible), but also just making the modified code a bit cleaner.

Greetings,

Andres Freund


^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2026-03-27 21:17                                 ` Melanie Plageman <[email protected]>
  2026-03-27 21:37                                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  1 sibling, 1 reply; 31+ messages in thread

From: Melanie Plageman @ 2026-03-27 21:17 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Fri, Mar 27, 2026 at 1:29 PM Andres Freund <[email protected]> wrote:
>
> On 2026-03-26 20:12:30 -0400, Andres Freund wrote:
> > One test used did_io=(t|f). That was actually only needed once "aio: Don't
> > wait for already in-progress IO" is in, as we might join the foreign IO. I
> > chose to hide that by making that part of the query "did_io and not
> > foreign_io", so we would detect if we were to falsely start IO ourselves.
>
> I ended up not liking did_io, as that seems misleading when we just needed to
> wait for a foreign IO.  I instead named it io_reqd.

0001 looks good to me except I don't get why you are still passing
MAIN_FORKNUM to PrefetchBuffer() in invalidate_one_block()

In 0002, the test cases look good to me. I haven't gained more
knowledge about injection point related code since my last review, so
still no comment there (inj_io_completion_hook(), etc).

I didn't see anything amiss reviewing by eye. Running it through AI,
it suggested that you should clear stdout between test cases in
test_inject_foreign. I think this seems most relevant because in two
back-to-back tests you are looking for the same output pattern.

It also pointed out that there is a pre-existing bug in
inj_io_short_read_hook() where you pass the wrong parameter to the log
message.

ereport(LOG, errmsg("short read injection point called, is enabled: %d",
                inj_io_error_state->enabled_reopen),
                errhidestmt(true), errhidecontext(true));

should be

ereport(LOG, errmsg("short read injection point called, is enabled: %d",
                inj_io_error_state->enabled_short_read),
                errhidestmt(true), errhidecontext(true));

0003 LGTM.

I am still in the process of reviewing 0004.

- Melanie





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 21:17                                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-27 21:37                                   ` Andres Freund <[email protected]>
  0 siblings, 0 replies; 31+ messages in thread

From: Andres Freund @ 2026-03-27 21:37 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

On 2026-03-27 17:17:25 -0400, Melanie Plageman wrote:
> On Fri, Mar 27, 2026 at 1:29 PM Andres Freund <[email protected]> wrote:
> >
> > On 2026-03-26 20:12:30 -0400, Andres Freund wrote:
> > > One test used did_io=(t|f). That was actually only needed once "aio: Don't
> > > wait for already in-progress IO" is in, as we might join the foreign IO. I
> > > chose to hide that by making that part of the query "did_io and not
> > > foreign_io", so we would detect if we were to falsely start IO ourselves.
> >
> > I ended up not liking did_io, as that seems misleading when we just needed to
> > wait for a foreign IO.  I instead named it io_reqd.
> 
> 0001 looks good to me except I don't get why you are still passing
> MAIN_FORKNUM to PrefetchBuffer() in invalidate_one_block()

Because I am stupid.


> In 0002, the test cases look good to me. I haven't gained more
> knowledge about injection point related code since my last review, so
> still no comment there (inj_io_completion_hook(), etc).
> 
> I didn't see anything amiss reviewing by eye. Running it through AI,
> it suggested that you should clear stdout between test cases in
> test_inject_foreign. I think this seems most relevant because in two
> back-to-back tests you are looking for the same output pattern.

Yea, that's a good call.

I'll give the BF a bit more time to digest f39cb8c0110 and then will push
0001/0002.


> It also pointed out that there is a pre-existing bug in
> inj_io_short_read_hook() where you pass the wrong parameter to the log
> message.
> 
> ereport(LOG, errmsg("short read injection point called, is enabled: %d",
>                 inj_io_error_state->enabled_reopen),
>                 errhidestmt(true), errhidecontext(true));
> 
> should be
> 
> ereport(LOG, errmsg("short read injection point called, is enabled: %d",
>                 inj_io_error_state->enabled_short_read),
>                 errhidestmt(true), errhidecontext(true));

I'll fix this as part of 0002 which touches related code, an injection point
debug message fixup doesn't seem to deserve its own commit message.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2026-03-27 22:59                                 ` Melanie Plageman <[email protected]>
  2026-03-28 00:01                                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  1 sibling, 1 reply; 31+ messages in thread

From: Melanie Plageman @ 2026-03-27 22:59 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Fri, Mar 27, 2026 at 1:29 PM Andres Freund <[email protected]> wrote:
>
> Hi,
>
> I think I forgot to update the thread in my last message to note that I
> had committed some of the preliminary changes.
<--snip-->
> I've also done a bunch of cleanup in the commits. A few typos in commit
> messages and the actual code changes and a few larger changes in the test code
> & infrastructure. Mostly as part of allowing the aforementioned testing
> (read_buffers() now only waits at the end, to make some of the tests
> possible), but also just making the modified code a bit cleaner.

Review of 0004:

I noticed I use the word "backend" _a lot_ in this commit message.
Here is an attempt at using it less:

aio: Don't wait for already in-progress IO

When a backend attempts to start a read IO and finds the first buffer
already has I/O in progress, previously it waited for that I/O to
complete before initiating reads for any of the subsequent buffers.

Although it must wait for the I/O to finish when acquiring the buffer,
there's no reason to wait when setting up the read operation. Waiting at
this point prevents starting I/O on subsequent buffers and can
significantly reduce concurrency.

This matters in two workloads: when multiple backends scan the same
relation concurrently, and when a single backend requests the same block
multiple times within the readahead distance.

Waiting each time an in-progress read is encountered effectively
degenerates the access pattern into synchronous I/O.

To fix this, when encountering an already in-progress IO for the head
buffer, the wait reference is now recorded and waiting is deferred until
WaitReadBuffers(), when the buffer actually needs to be acquired.

In rare cases, it may still be necessary to wait synchronously at IO
start time: if another backend has set BM_IO_IN_PROGRESS on the buffer
but has not yet set the wait reference. Such windows should be brief and
uncommon.

Also, I'd say the tests (and the original prototype) qualify you for
co-authorship of the patch, but you do you.

Code and test stuff:

in query_wait_block()

-   $node->poll_query_until('postgres',
-       qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid),
-       $waitfor);
+   my $waitquery;
+   if ($wait_current_session)
+   {
+       $waitquery =
+         qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid);
+   }
+   else
+   {
+       $waitquery =
+         qq(SELECT wait_event FROM pg_stat_activity WHERE wait_event
= '$waitfor');
+   }
+
+   note "polling for completion with $waitquery";
+   $node->poll_query_until('postgres', $waitquery, $waitfor);

I guess you need WHERE wait_event = $waitfor to keep from getting more
than one row returned and then failing to parse properly. But I got
tripped up on this thinking won't poll_query_until() already give you
that?

I started thinking maybe wait_current_session should be a hash so we
can pass it with a name and it will make the query_wait_block() call
sites less inscrutable, but maybe that's over-engineering.

# Because no IO wref was assigned, block 2 should not report foreign IO
pump_until($psql_a->{run}, $psql_a->{timeout}, \$psql_a->{stdout},
        qr/0\|1\|t\|f\|2\n2\|3\|t\|f\|3/);

you mean block 3

# Because no IO wref was assigned, block 2 should not report foreign IO
pump_until($psql_a->{run}, $psql_a->{timeout}, \$psql_a->{stdout},

should say block 3

# Tests for StartReadBuffers() that dependent on injection point support
s/dependent/depend

You could change the first test
# Test if a read buffers encounters AIO in progress by another backend, it
# recognizes that other IO as a foreign IO.

To have 0 as the foreign IO (which is a slightly different code path
than non-head blocks being the foreign IO) and then you still
basically have coverage of a non-head block being a foreign IO in the
following test case that looks for multiple contiguous blocks being
foreign IO.

In test_read_buffers_inject:
# recognizes that other IO as a foreign IO. This time we encounter the
# foreign IO multiple times.

I find "foriegn IO multiple times" hard to parse. I prefer something
like "multiple
buffers undergoing foreign IO"

# B: Read block 2 and wait for the completion hook to be reached (which could
# be in B itself or in an IO worker)

should say blocks 2-3

I wonder if it is also worth testing a failed foreign IO (i.e.
operation->foreign_io && !(buf_state & BM_VALID) in
WaitReadBuffers()). I don't know that it is much different than the
other failed IO_IN_PROGRESS cases you are already testing.

Otherwise, I think we're ready to go!

- Melanie

^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 22:59                                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-28 00:01                                   ` Andres Freund <[email protected]>
  2026-03-30 19:00                                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Alexander Lakhin <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Andres Freund @ 2026-03-28 00:01 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

On 2026-03-27 18:59:36 -0400, Melanie Plageman wrote:
> Review of 0004:
> 
> I noticed I use the word "backend" _a lot_ in this commit message.
> Here is an attempt at using it less:

Mostly adopted.


> Code and test stuff:
> 
> in query_wait_block()
> 
> -   $node->poll_query_until('postgres',
> -       qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid),
> -       $waitfor);
> +   my $waitquery;
> +   if ($wait_current_session)
> +   {
> +       $waitquery =
> +         qq(SELECT wait_event FROM pg_stat_activity WHERE pid = $pid);
> +   }
> +   else
> +   {
> +       $waitquery =
> +         qq(SELECT wait_event FROM pg_stat_activity WHERE wait_event
> = '$waitfor');
> +   }
> +
> +   note "polling for completion with $waitquery";
> +   $node->poll_query_until('postgres', $waitquery, $waitfor);
> 
> I guess you need WHERE wait_event = $waitfor to keep from getting more
> than one row returned and then failing to parse properly.

Right.  For some reason poll_query_until() doesn't allow pattern matching, so
there's a pretty hard limit to allowing any variability in output.


> But I got tripped up on this thinking won't poll_query_until() already give
> you that?

Not as far as I can tell.


> I started thinking maybe wait_current_session should be a hash so we
> can pass it with a name and it will make the query_wait_block() call
> sites less inscrutable, but maybe that's over-engineering.

I'm not even seeing how that would work - the problem is that the wait event
might be in an IO worker (in case of io_method=worker) or in the frontend
process (in case of io_method=io_uring).


> # Because no IO wref was assigned, block 2 should not report foreign IO
> pump_until($psql_a->{run}, $psql_a->{timeout}, \$psql_a->{stdout},
>         qr/0\|1\|t\|f\|2\n2\|3\|t\|f\|3/);
> 
> you mean block 3
> 
> # Because no IO wref was assigned, block 2 should not report foreign IO
> pump_until($psql_a->{run}, $psql_a->{timeout}, \$psql_a->{stdout},
> 
> should say block 3

Ooops.


> You could change the first test
> # Test if a read buffers encounters AIO in progress by another backend, it
> # recognizes that other IO as a foreign IO.
> 
> To have 0 as the foreign IO (which is a slightly different code path
> than non-head blocks being the foreign IO) and then you still
> basically have coverage of a non-head block being a foreign IO in the
> following test case that looks for multiple contiguous blocks being
> foreign IO.

Did that, except I also offset to start reading from 1, in the unlikely case
we have something in the path loosing track of block numbers.


> In test_read_buffers_inject:
> # recognizes that other IO as a foreign IO. This time we encounter the
> # foreign IO multiple times.
> 
> I find "foriegn IO multiple times" hard to parse. I prefer something
> like "multiple
> buffers undergoing foreign IO"

Somehow the modified version seemed harder to read to me, so I left it as is.


> # B: Read block 2 and wait for the completion hook to be reached (which could
> # be in B itself or in an IO worker)
> 
> should say blocks 2-3

Fixed.


> I wonder if it is also worth testing a failed foreign IO (i.e.
> operation->foreign_io && !(buf_state & BM_VALID) in
> WaitReadBuffers()). I don't know that it is much different than the
> other failed IO_IN_PROGRESS cases you are already testing.

It might be, but it's getting late, and I would like to see this go in, to
unblock the prefetching thread.  And it'd be somewhat cumbersome to write,
unfortunately. So I'll forgo that for now.


> Otherwise, I think we're ready to go!

Yay.

Pushed.

Thanks for the collab!

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 22:59                                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-28 00:01                                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2026-03-30 19:00                                     ` Alexander Lakhin <[email protected]>
  2026-03-30 19:14                                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Alexander Lakhin @ 2026-03-30 19:00 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; Melanie Plageman <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hello Andres,

28.03.2026 02:01, Andres Freund wrote:
> Pushed.

As copperhead showed [1], tests added in 020c02bd9 fail when postgres is
built without --enable-cassert. I've reproduced the failure locally with:

./configure -q --enable-debug --enable-tap-tests && make -s -j12 &&
make -s check -C src/test/modules/test_aio/

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2026-03-29%2022%3A01%3A20

Best regards,
Alexander

^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 22:59                                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-28 00:01                                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-30 19:00                                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Alexander Lakhin <[email protected]>
@ 2026-03-30 19:14                                       ` Melanie Plageman <[email protected]>
  2026-03-30 22:37                                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Melanie Plageman @ 2026-03-30 19:14 UTC (permalink / raw)
  To: Alexander Lakhin <[email protected]>; +Cc: Andres Freund <[email protected]>; Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Mon, Mar 30, 2026 at 3:00 PM Alexander Lakhin <[email protected]> wrote:
>
> As copperhead showed [1], tests added in 020c02bd9 fail when postgres is
> built without --enable-cassert. I've reproduced the failure locally with:

Yes, it's because read_buffers() (in test_aio.c) uses
operation->nblocks and that's only intialized for buffer hits in
assert builds. The test code could just use the correctly initialized
nblocks out parameter.

- Melanie





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 22:59                                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-28 00:01                                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-30 19:00                                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Alexander Lakhin <[email protected]>
  2026-03-30 19:14                                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-30 22:37                                         ` Melanie Plageman <[email protected]>
  2026-03-31 18:25                                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Melanie Plageman @ 2026-03-30 22:37 UTC (permalink / raw)
  To: Alexander Lakhin <[email protected]>; +Cc: Andres Freund <[email protected]>; Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Mon, Mar 30, 2026 at 3:14 PM Melanie Plageman
<[email protected]> wrote:
>
> On Mon, Mar 30, 2026 at 3:00 PM Alexander Lakhin <[email protected]> wrote:
> >
> > As copperhead showed [1], tests added in 020c02bd9 fail when postgres is
> > built without --enable-cassert. I've reproduced the failure locally with:
>
> Yes, it's because read_buffers() (in test_aio.c) uses
> operation->nblocks and that's only intialized for buffer hits in
> assert builds. The test code could just use the correctly initialized
> nblocks out parameter.

Fix was a little more invasive than that. Looks like we were using
operation in more places than I thought. See attached.

- Melanie


Attachments:

  [text/x-patch] v1-0001-Fix-test_aio-read_buffers-to-work-without-cassert.patch (3.6K, 2-v1-0001-Fix-test_aio-read_buffers-to-work-without-cassert.patch)
  download | inline diff:
From 8db56562d13300ca8e1620fbc4ea4e6e3102e3fb Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Mon, 30 Mar 2026 18:25:46 -0400
Subject: [PATCH v1] Fix test_aio read_buffers() to work without cassert

In a production build, StartReadBuffers() doesn't population all fields
of a ReadBuffersOperation for a buffer hit because no callers use them
(they are populated in assert builds).

The read_buffers() test function relied on some of these fields, though,
so AIO tests failed on non-assert builds (discovered on the
buildfarm after commit 020c02bd908).

Fix by tracking the required information ourselves in read_buffers() and
avoiding reliance on the ReadBuffersOperation unless we know that we did
IO.

Reported-by: Alexander Lakhin <[email protected]>
Discussion: https://postgr.es/m/9ce8f5d8-8ab2-4aa2-b062-c5d74161069c%40gmail.com
---
 src/test/modules/test_aio/test_aio.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
index a8267192cb7..d7530681192 100644
--- a/src/test/modules/test_aio/test_aio.c
+++ b/src/test/modules/test_aio/test_aio.c
@@ -719,6 +719,7 @@ read_buffers(PG_FUNCTION_ARGS)
 	Buffer	   *buffers;
 	Datum	   *buffers_datum;
 	bool	   *io_reqds;
+	int		   *nblocks_per_io;
 
 	Assert(nblocks > 0);
 
@@ -729,6 +730,7 @@ read_buffers(PG_FUNCTION_ARGS)
 	buffers = palloc0(sizeof(Buffer) * nblocks);
 	buffers_datum = palloc0(sizeof(Datum) * nblocks);
 	io_reqds = palloc0(sizeof(bool) * nblocks);
+	nblocks_per_io = palloc0(sizeof(int) * nblocks);
 
 	rel = relation_open(relid, AccessShareLock);
 	smgr = RelationGetSmgr(rel);
@@ -754,6 +756,7 @@ read_buffers(PG_FUNCTION_ARGS)
 										  startblock + nblocks_done,
 										  &nblocks_this_io,
 										  0);
+		nblocks_per_io[nios] = nblocks_this_io;
 		nios++;
 		nblocks_done += nblocks_this_io;
 	}
@@ -777,7 +780,7 @@ read_buffers(PG_FUNCTION_ARGS)
 	for (int nio = 0; nio < nios; nio++)
 	{
 		ReadBuffersOperation *operation = &operations[nio];
-		int			nblocks_this_io = operation->nblocks;
+		int			nblocks_this_io = nblocks_per_io[nio];
 		Datum		values[6] = {0};
 		bool		nulls[6] = {0};
 		ArrayType  *buffers_arr;
@@ -785,9 +788,8 @@ read_buffers(PG_FUNCTION_ARGS)
 		/* convert buffer array to datum array */
 		for (int i = 0; i < nblocks_this_io; i++)
 		{
-			Buffer		buf = operation->buffers[i];
+			Buffer		buf = buffers[nblocks_disp + i];
 
-			Assert(buffers[nblocks_disp + i] == buf);
 			Assert(BufferGetBlockNumber(buf) == startblock + nblocks_disp + i);
 
 			buffers_datum[nblocks_disp + i] = Int32GetDatum(buf);
@@ -809,8 +811,8 @@ read_buffers(PG_FUNCTION_ARGS)
 		values[2] = BoolGetDatum(io_reqds[nio]);
 		nulls[2] = false;
 
-		/* foreign IO */
-		values[3] = BoolGetDatum(operation->foreign_io);
+		/* foreign IO - only valid when IO was required */
+		values[3] = BoolGetDatum(io_reqds[nio] ? operation->foreign_io : false);
 		nulls[3] = false;
 
 		/* nblocks */
@@ -827,13 +829,8 @@ read_buffers(PG_FUNCTION_ARGS)
 	}
 
 	/* release pins on all the buffers */
-	for (int nio = 0; nio < nios; nio++)
-	{
-		ReadBuffersOperation *operation = &operations[nio];
-
-		for (int i = 0; i < operation->nblocks; i++)
-			ReleaseBuffer(operation->buffers[i]);
-	}
+	for (int i = 0; i < nblocks_done; i++)
+		ReleaseBuffer(buffers[i]);
 
 	/*
 	 * Free explicitly, to have a chance to detect potential issues with too
@@ -843,6 +840,7 @@ read_buffers(PG_FUNCTION_ARGS)
 	pfree(buffers);
 	pfree(buffers_datum);
 	pfree(io_reqds);
+	pfree(nblocks_per_io);
 
 	relation_close(rel, NoLock);
 
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 22:59                                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-28 00:01                                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-30 19:00                                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Alexander Lakhin <[email protected]>
  2026-03-30 19:14                                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-30 22:37                                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-31 18:25                                           ` Melanie Plageman <[email protected]>
  2026-03-31 18:49                                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Melanie Plageman @ 2026-03-31 18:25 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Alexander Lakhin <[email protected]>; Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Tue, Mar 31, 2026 at 8:43 AM Andres Freund <[email protected]> wrote:
>
> Looks good to me.
>
> Will you push?

I was going to push but then Bilal asked me off-list if there was some
reason not to set the members of ReadBuffersOperation outside of
assert builds. I agree with him that it seems like a future user of
StartReadBuffersImpl() could make this same mistake. Both of us
vaguely recall this being done for performance reasons. Before
committing this test change, I wanted to confirm that we don't want to
modify the actual prod code the way he does in [1].

- Melanie

[1] https://www.postgresql.org/message-id/CAN55FZ2-bKNKmMSRDx1xH3SyqwBVMZ8HFG2YNipQ7LCdKm7eKA%40mail.gma...





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 22:59                                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-28 00:01                                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-30 19:00                                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Alexander Lakhin <[email protected]>
  2026-03-30 19:14                                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-30 22:37                                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-31 18:25                                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-03-31 18:49                                             ` Andres Freund <[email protected]>
  2026-03-31 19:07                                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 31+ messages in thread

From: Andres Freund @ 2026-03-31 18:49 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Alexander Lakhin <[email protected]>; Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

Hi,

On 2026-03-31 14:25:49 -0400, Melanie Plageman wrote:
> On Tue, Mar 31, 2026 at 8:43 AM Andres Freund <[email protected]> wrote:
> >
> > Looks good to me.
> >
> > Will you push?
> 
> I was going to push but then Bilal asked me off-list if there was some
> reason not to set the members of ReadBuffersOperation outside of
> assert builds. I agree with him that it seems like a future user of
> StartReadBuffersImpl() could make this same mistake. Both of us
> vaguely recall this being done for performance reasons. Before
> committing this test change, I wanted to confirm that we don't want to
> modify the actual prod code the way he does in [1].

I'd be wary of doing that without performance validation. My memory of the
read stream introduction is that it was pretty hard to not regress the fully
cached path, and that relatively small additions showed up.  But I do agree
it'd be nicer if they were valid.

So I'd be inclined to push your fix for now.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-05 16:56     ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-03 19:47       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-06 13:18         ` Re: Don't synchronously wait for already-in-progress IO in read stream Nazir Bilal Yavuz <[email protected]>
  2026-03-16 21:45           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-17 17:26             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-18 16:59               ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-18 20:16                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-20 19:50                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 15:15                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-25 15:33                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-25 21:58                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-26 21:43                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-27 00:12                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 17:29                               ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-27 22:59                                 ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-28 00:01                                   ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2026-03-30 19:00                                     ` Re: Don't synchronously wait for already-in-progress IO in read stream Alexander Lakhin <[email protected]>
  2026-03-30 19:14                                       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-30 22:37                                         ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-31 18:25                                           ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-03-31 18:49                                             ` Re: Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
@ 2026-03-31 19:07                                               ` Melanie Plageman <[email protected]>
  0 siblings, 0 replies; 31+ messages in thread

From: Melanie Plageman @ 2026-03-31 19:07 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Alexander Lakhin <[email protected]>; Nazir Bilal Yavuz <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Peter Geoghegan <[email protected]>; Tomas Vondra <[email protected]>

On Tue, Mar 31, 2026 at 2:49 PM Andres Freund <[email protected]> wrote:
>
> So I'd be inclined to push your fix for now.

Done.





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
@ 2026-02-23 19:27     ` Peter Geoghegan <[email protected]>
  2026-03-03 19:48       ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  1 sibling, 1 reply; 31+ messages in thread

From: Peter Geoghegan @ 2026-02-23 19:27 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Thomas Munro <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>

On Fri, Jan 23, 2026 at 4:04 PM Melanie Plageman
<[email protected]> wrote:
> Attached v3 basically does what you suggested above. Now, we should
> only have to wait if the backend encounters a buffer after another
> backend has set BM_IO_IN_PROGRESS but before that other backend has
> set the buffer descriptor's wait reference.

Have you considered making ProcessBufferHit into an inline function? I
find that doing so meaningfully improves performance with the index
prefetching patch set. This is particularly true for cached index-only
scans with many VM buffer hits. And it seems to have no downside.

Right now, without any inlining, running perf against a backend that
executes such an index-only scan shows the function/symbol
"ProcessBufferHit.isra.0" as very hot. Apparently gcc does this isra
business ("Interprocedural Scalar Replacement of Aggregates") as an
optimization. Instead of passing the whole struct or pointer, the
caller is rewritten to extract just the necessary scalar values (like
an int or a bool) and pass those directly in registers. But we seem to
be better off fully inlining the function.

--
Peter Geoghegan

^ permalink  raw  reply  [nested|flat] 31+ messages in thread

* Re: Don't synchronously wait for already-in-progress IO in read stream
  2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
  2025-11-09 22:20 ` Re: Don't synchronously wait for already-in-progress IO in read stream Thomas Munro <[email protected]>
  2026-01-23 21:03   ` Re: Don't synchronously wait for already-in-progress IO in read stream Melanie Plageman <[email protected]>
  2026-02-23 19:27     ` Re: Don't synchronously wait for already-in-progress IO in read stream Peter Geoghegan <[email protected]>
@ 2026-03-03 19:48       ` Melanie Plageman <[email protected]>
  0 siblings, 0 replies; 31+ messages in thread

From: Melanie Plageman @ 2026-03-03 19:48 UTC (permalink / raw)
  To: Peter Geoghegan <[email protected]>; +Cc: Thomas Munro <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>

On Mon, Feb 23, 2026 at 2:27 PM Peter Geoghegan <[email protected]> wrote:
>
> Have you considered making ProcessBufferHit into an inline function? I
> find that doing so meaningfully improves performance with the index
> prefetching patch set. This is particularly true for cached index-only
> scans with many VM buffer hits. And it seems to have no downside.

Done in recently posted v4. Thanks for the report!

- Melanie





^ permalink  raw  reply  [nested|flat] 31+ messages in thread

end of thread, other threads:[~2026-03-31 19:07 UTC | newest]

Thread overview: 31+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-09-11 21:46 Don't synchronously wait for already-in-progress IO in read stream Andres Freund <[email protected]>
2025-11-09 19:51 ` Peter Geoghegan <[email protected]>
2025-11-09 22:20 ` Thomas Munro <[email protected]>
2026-01-23 21:03   ` Melanie Plageman <[email protected]>
2026-02-05 16:56     ` Nazir Bilal Yavuz <[email protected]>
2026-03-03 19:47       ` Melanie Plageman <[email protected]>
2026-03-03 20:07         ` Melanie Plageman <[email protected]>
2026-03-06 13:18         ` Nazir Bilal Yavuz <[email protected]>
2026-03-16 21:45           ` Melanie Plageman <[email protected]>
2026-03-17 17:26             ` Andres Freund <[email protected]>
2026-03-18 16:59               ` Melanie Plageman <[email protected]>
2026-03-18 20:16                 ` Andres Freund <[email protected]>
2026-03-20 19:50                   ` Melanie Plageman <[email protected]>
2026-03-25 15:15                     ` Andres Freund <[email protected]>
2026-03-25 15:33                       ` Melanie Plageman <[email protected]>
2026-03-25 21:58                         ` Andres Freund <[email protected]>
2026-03-26 21:43                           ` Melanie Plageman <[email protected]>
2026-03-27 00:12                             ` Andres Freund <[email protected]>
2026-03-27 17:29                               ` Andres Freund <[email protected]>
2026-03-27 21:17                                 ` Melanie Plageman <[email protected]>
2026-03-27 21:37                                   ` Andres Freund <[email protected]>
2026-03-27 22:59                                 ` Melanie Plageman <[email protected]>
2026-03-28 00:01                                   ` Andres Freund <[email protected]>
2026-03-30 19:00                                     ` Alexander Lakhin <[email protected]>
2026-03-30 19:14                                       ` Melanie Plageman <[email protected]>
2026-03-30 22:37                                         ` Melanie Plageman <[email protected]>
2026-03-31 18:25                                           ` Melanie Plageman <[email protected]>
2026-03-31 18:49                                             ` Andres Freund <[email protected]>
2026-03-31 19:07                                               ` Melanie Plageman <[email protected]>
2026-02-23 19:27     ` Peter Geoghegan <[email protected]>
2026-03-03 19:48       ` Melanie Plageman <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox