Feedback-ID: id4a34324:Fastmail
Date: Wed, 18 Mar 2026 16:16:14 -0400
From: Andres Freund <andres@anarazel.de>
To: Melanie Plageman <melanieplageman@gmail.com>
Cc: Nazir Bilal Yavuz <byavuz81@gmail.com>,
	Thomas Munro <thomas.munro@gmail.com>, pgsql-hackers@postgresql.org,
 Peter Geoghegan <pg@bowt.ie>,
	Tomas Vondra <tv@fuzzy.cz>
Subject: Re: Don't synchronously wait for already-in-progress IO in read
 stream
Message-ID: <prkj4fv7dvyew4lbad2g5l346nn6u66inndwx7nzzap5c3bdmh@fds54c2hx3yd>
References: <zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv>
 <CA+hUKGKwzPJNcxCFUswe0K=0e7rG79dvqt9o17-E3soxLPxF+Q@mail.gmail.com>
 <CAAKRu_atdnPCCy=kfxqWT62Ckaiz3G5t=S97tW24CuL3i3fFfQ@mail.gmail.com>
 <CAN55FZ17ScU--nGM8cjjPtwPMOMonKZ9vMehPcc2sOQNLVskvA@mail.gmail.com>
 <CAAKRu_ZM1epxTdt2=4-g4ff6UC09zne+0xg_gNv3d7LEcxEvNA@mail.gmail.com>
 <CAN55FZ0ymY=XfaHQXHKG+Mbit6BZzZ9d-UWb6dmRXYm_xCvfQg@mail.gmail.com>
 <CAAKRu_YV0D_9wWgGkHdWgQsooqmcc08Uqb3tqxzrwcDXHajDtA@mail.gmail.com>
 <u73un3xeljr4fiidzwi4ikcr6vm7oqugn4fo5vqpstjio6anl2@hph6fvdiiria>
 <CAAKRu_bCZfWpmfmfUiQjDgFK12T8FUox-hV9YFDy6XHTh=i0ew@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: 
 <CAAKRu_bCZfWpmfmfUiQjDgFK12T8FUox-hV9YFDy6XHTh=i0ew@mail.gmail.com>
Archived-At: 
 <https://www.postgresql.org/message-id/prkj4fv7dvyew4lbad2g5l346nn6u66inndwx7nzzap5c3bdmh%40fds54c2hx3yd>
Precedence: bulk

Hi,

On 2026-03-18 12:59:11 -0400, Melanie Plageman wrote:
> On Tue, Mar 17, 2026 at 1:26 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > > --- a/src/backend/storage/buffer/bufmgr.c
> > > +++ b/src/backend/storage/buffer/bufmgr.c
> > > @@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
> > >                * must have started out as a miss in PinBufferForBlock(). The other
> > >                * backend will track this as a 'read'.
> > >                */
> > > -             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
> > > +             TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
> > >                                                                                 operation->smgr->smgr_rlocator.locator.spcOid,
> > >                                                                                 operation->smgr->smgr_rlocator.locator.dbOid,
> > >                                                                                 operation->smgr->smgr_rlocator.locator.relNumber,
> > > --
> >
> > Ah, the issue is that we already incremented nblocks_done, right?  Maybe it'd
> > be easier to understand if we stashed blocknum + nblocks_done into a local
> > var, and use it in in both branches of if (!ReadBuffersCanStartIO())?
> >
> > This probably needs to be backpatched...
> 
> 0003 in v6 does as you suggest. I'll backport it after a quick +1 here.

LGTM.


> > > @@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
> > >                                                                          smgr->smgr_rlocator.backend);
> > >
> > >       if (persistence == RELPERSISTENCE_TEMP)
> >
> > And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP
> > branch in CountBufferHit(), I suspect the compiler may not be able to optimize
> > it away.
> 
> And you think it is optimizing it away in PinBufferForBlock()?

It doesn't really have a choice :) - the
  pgBufferUsage.(local|shared)_blks_hit++
is within the already required
  if (persistence == RELPERSISTENCE_TEMP)

so there's not really a branch to optimize away.

But maybe I misunderstood?


> > At the very least I'd invert the call to CountBufferHit() and the
> > pgstat_count_buffer_read(), as the latter will probably prevent most
> > optimizations (due to the compiler not being able to prove that
> > (rel)->pgstat_info->counts.blocks_fetched is a different memory location as
> > *foundPtr).
> 
> I did this. I did not check the compiled code before or after though.

I checked after and it looks good (well, ok enough, but that's unrelated to
your changes).


Just to verify that any of this actually matters, I did some benchmarking with
the call to TrackBufferHit() removed, and with pg_prewarm() of a scale 100
pgbench_accounts I do see about ~3% of a gain from that.  I did also verify
that with the patch we're ever so slightly, but reproducibly, faster than
master.  There's future optimization potential for sure though.


> > > +CountBufferHit(BufferAccessStrategy strategy,
> > > +                        Relation rel, char persistence, SMgrRelation smgr,
> > > +                        ForkNumber forknum, BlockNumber blocknum)
> >
> > I don't think "Count*" is a great name for something that does tracepoints and
> > vacuum cost balance accounting, the latter actually changes behavior of the
> > program due to the sleeps it injects.
> >
> > The first alternative I have is AccountForBufferHit(), not great, but still
> > seems a bit better.
> 
> At some point, I had ProcessBufferHit(), but Bilal felt it implied the
> function did more than counting. I've changed it now to
> TrackBufferHit().

WFM.


> > > + * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
> > > + * avoid an external function call.
> > > + */
> > > +static PrepareReadBuffer_Status
> > > +PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
> > > +                                                     Buffer buffer)
> >
> > Hm, seems the test in 0002 should be extended to cover the the temp table case.
> 
> I did this. However, I was a bit lazy in how many cases I added
> because I used invalidate_rel_block(), which is pretty verbose (since
> evict_rel() doesn't work yet for local buffers).

Ah, yea, that's annoying.  I think some basic coverage is good enough for now.


> I don't think we'll be able to easily test READ_BUFFER_ALREADY_DONE
> (though perhaps we aren't testing it for shared buffers either?).

We do reach the READ_BUFFER_ALREADY_DONE in PrepareHeadBufferReadIO(), but
only due to io_method=sync peculiarities (as that only actually performs the
IO later when waiting, it's easy to have two IOs for the same block).


It's probably worth adding tests for that, although I suspect it should be in
001_aio.pl - no read stream required to hit it.  I can give it a shot, if you
want?


> > > +static PrepareReadBuffer_Status
> > > +PrepareNewReadBufferIO(ReadBuffersOperation *operation,
> > > +                                        Buffer buffer)
> > > +{
> >
> > I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" &
> > "Continue"? Or "First" & "Additional"?  Or ...
> 
> I changed the names to PrepareHeadBufferReadIO() and
> PrepareAdditionalBufferReadIO(). "Head" instead of "First" because
> First felt like it implied the first buffer ever and head seems to
> make it clear it is the first buffer of this new IO.

Head works!


> Subject: [PATCH v6 4/8] Pass io_object and io_context through to
>  PinBufferForBlock()

The duplication due to handling the RBM_ZERO_AND_CLEANUP_LOCK case is a bit
annoying, but I think it's still an improvement.


> Subject: [PATCH v6 5/8] Make buffer hit helper

LGTM.


> From b73b896febc35253ca2607cb0fe143355b91256f Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 18 Mar 2026 11:09:58 -0400
> Subject: [PATCH v6 6/8] Restructure AsyncReadBuffers()
> 
> Restructure AsyncReadBuffers() to use early return when the head buffer
> is already valid, instead of using a did_start_io flag and if/else
> branches. Also move around a bit of the code to be located closer to
> where it is used. This is a refactor only.

I think there should be as little work as posbile between setting
IO_IN_PROGRESS and starting the IO. Most of the work you deferred is cheap
enough that it shouldn't matter, but pgstat_prepare_report_checksum_failure()
might need to do a bit more (including taking lwlocks and stuff).

I'm also a bit doubtful that deferring the flag determinations is a good idea,
mostly because it adds a bunch of stuff between starting IO on the head and
subsequent buffers. Not that it's expensive, but it seems to make it more
likely that somebody would end up putting other code inbetween the head and
additional buffer IO starts.  And it's cheap enough that it doesn't matter to
waste it if we return early.


> +/*
> + * When building a new IO from multiple buffers, we won't include buffers
> + * that are already valid or already in progress. This function should only be
> + * used for additional adjacent buffers following the head buffer in a new IO.
> + *
> + * This function must never wait for IO to avoid deadlocks. The head buffer
> + * already has BM_IO_IN_PROGRESS set, so we'll just issue that IO and come
> + * back in lieu of waiting here.

"come back" is a bit odd, since you'd not actually come back to
PrepareAdditionalBufferReadIO().


Looks good otherwise.


> From 63cb731176a62320d296f968b12a5d4d36e703d0 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 18 Mar 2026 11:17:57 -0400
> Subject: [PATCH v6 8/8] AIO: Don't wait for already in-progress IO
> 
> When a backend attempts to start a read IO and finds the first buffer
> already has I/O in progress, previously it waited for that I/O to
> complete before initiating reads for any of the subsequent buffers.
> 
> Although the backend must wait for the I/O to finish when acquiring the
> buffer, there's no reason for it to wait when setting up the read
> operation. Waiting at this point prevents the backend from starting I/O
> on subsequent buffers and can significantly reduce concurrency.
> 
> This matters in two workloads: when multiple backends scan the same
> relation concurrently, and when a single backend requests the same block
> multiple times within the readahead distance.
> 
> If backends wait each time they encounter an in-progress read,
> the access pattern effectively degenerates into synchronous I/O.
> 
> To fix this, when encountering an already in-progress IO for the head
> buffer, a backend now records the buffer's wait reference and defers
> waiting until WaitReadBuffers(), when it actually needs to acquire the
> buffer.
> 
> In rare cases, a backend may still need to wait synchronously at IO
> start time: if another backend has set BM_IO_IN_PROGRESS on the buffer
> but has not yet set the wait reference. Such windows should be brief and
> uncommon.


> @@ -1663,8 +1677,9 @@ CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete)
>   * Local version of PrepareHeadBufferReadIO(). Here instead of localbuf.c to
>   * avoid an external function call.
>   */
> -static bool
> -PrepareHeadLocalBufferReadIO(Buffer buffer)
> +static PrepareReadBufferStatus
> +PrepareHeadLocalBufferReadIO(ReadBuffersOperation *operation,
> +							 Buffer buffer)
>  {
>  	BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
>  	uint64		buf_state = pg_atomic_read_u64(&desc->state);
> @@ -1675,49 +1690,60 @@ PrepareHeadLocalBufferReadIO(Buffer buffer)
>  	 * handle). Only the owning backend can set BM_VALID on a local buffer.
>  	 */
>  	if (buf_state & BM_VALID)
> -		return false;
> +		return READ_BUFFER_ALREADY_DONE;
>  
>  	/*
>  	 * Submit any staged IO before checking for in-progress IO. Without this,
>  	 * the wref check below could find IO that this backend staged but hasn't
> -	 * submitted yet. Waiting on that would PANIC because the owner can't wait
> -	 * on its own staged IO.
> +	 * submitted yet. If we returned READ_BUFFER_IN_PROGRESS and
> +	 * WaitReadBuffers() then tried to wait on it, we'd PANIC because the
> +	 * owner can't wait on its own staged IO.
>  	 */
>  	pgaio_submit_staged();
>  
> -	/* Wait for in-progress IO */
> +	/* We've already asynchronously started this IO, so join it */
>  	if (pgaio_wref_valid(&desc->io_wref))
>  	{
> -		PgAioWaitRef iow = desc->io_wref;
> -
> -		pgaio_wref_wait(&iow);
> -
> -		buf_state = pg_atomic_read_u64(&desc->state);
> +		operation->io_wref = desc->io_wref;
> +		operation->foreign_io = true;
> +		return READ_BUFFER_IN_PROGRESS;
>  	}
>  
> -	/*
> -	 * If BM_VALID is set, we waited on IO and it completed successfully.
> -	 * Otherwise, we'll initiate IO on the buffer.
> -	 */
> -	return !(buf_state & BM_VALID);
> +	/* Prepare to start IO on this buffer */
> +	return READ_BUFFER_READY_FOR_IO;
>  }


Hm. Is buf_state & BM_VALID actually not reachable anymore? What if the
pgaio_submit_staged() completed the IO? In that case there won't be a wref and
we'll get here with buf_state & BM_VALID.


> @@ -1732,11 +1758,25 @@ PrepareHeadBufferReadIO(Buffer buffer)
>  		if (buf_state & BM_VALID)
>  		{
>  			UnlockBufHdr(desc);
> -			return false;
> +			return READ_BUFFER_ALREADY_DONE;
>  		}
>  
>  		if (buf_state & BM_IO_IN_PROGRESS)
>  		{
> +			/* Join existing read */
> +			if (pgaio_wref_valid(&desc->io_wref))
> +			{
> +				operation->io_wref = desc->io_wref;
> +				operation->foreign_io = true;
> +				UnlockBufHdr(desc);
> +				return READ_BUFFER_IN_PROGRESS;
> +			}

Out of a strict sense of rule-following, I'd do the operation->foreign_io
after the UnlockBufHdr(), since it doesn't actually need to be in the locked
section.


> @@ -1959,11 +2002,45 @@ WaitReadBuffers(ReadBuffersOperation *operation)
>  				Assert(pgaio_wref_check_done(&operation->io_wref));
>  			}
>  
> -			/*
> -			 * We now are sure the IO completed. Check the results. This
> -			 * includes reporting on errors if there were any.
> -			 */
> -			ProcessReadBuffersResult(operation);
> +			if (unlikely(operation->foreign_io))
> +			{
> +				Buffer		buffer = operation->buffers[operation->nblocks_done];
> +				BufferDesc *desc = BufferIsLocal(buffer) ?
> +					GetLocalBufferDescriptor(-buffer - 1) :
> +					GetBufferDescriptor(buffer - 1);
> +				uint64		buf_state = pg_atomic_read_u64(&desc->state);
> +
> +				if (buf_state & BM_VALID)
> +				{
> +					BlockNumber blocknum = operation->blocknum + operation->nblocks_done;

Maybe we should assert that the buffer's block equals what we expect?


Greetings,

Andres Freund