Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w2xJa-000lOv-2h for pgsql-hackers@arkaria.postgresql.org; Wed, 18 Mar 2026 20:16:27 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w2xJZ-00E8EQ-2Q for pgsql-hackers@arkaria.postgresql.org; Wed, 18 Mar 2026 20:16:25 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w2xJY-00E8Du-1m for pgsql-hackers@lists.postgresql.org; Wed, 18 Mar 2026 20:16:25 +0000 Received: from fhigh-a8-smtp.messagingengine.com ([103.168.172.159]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1w2xJP-00000000Q9o-1InL for pgsql-hackers@postgresql.org; Wed, 18 Mar 2026 20:16:18 +0000 Received: from phl-compute-06.internal (phl-compute-06.internal [10.202.2.46]) by mailfhigh.phl.internal (Postfix) with ESMTP id 870011400155; Wed, 18 Mar 2026 16:16:15 -0400 (EDT) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-06.internal (MEProxy); Wed, 18 Mar 2026 16:16:15 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-transfer-encoding:content-type:content-type:date :date:from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to; s=fm1; t=1773864975; x=1773951375; bh=suDHg2Rr+2NcSmkZJTnBdOd5CQ7vkSMObI/fYja1YPY=; b= tZSyZhGtyrSiFFOM5DO+Y7AbvLfz/QA/n/QIR6gIHwg0YpDjdWmkTctv/4SJlagj oQlpeq162HKr2PEVukGrmDXdFXlSJs8VZwDWJUToeG2yx0Qg+8tLxVIELyR7KYcL JGnsZIrWRytIInbjYMNq6EJsenssqBBFjd296GeVorztdlpgdaNVU/aS69P3Jkbz YcUA1G9jX3QO6fCER6GZ461sVRDzo3FTOQChqUeFNwR2AH2+K9AUT1ZRrd4xAPc2 aQqzJE1XOaNxpnb2LCTEIfGsQ828BzBKchzgiHUVjrs3SCIwtk2nliBs4lIysmYq Zd/kiMmIuNL2PO3BvNDXmA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-transfer-encoding :content-type:content-type:date:date:feedback-id:feedback-id :from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to:x-me-proxy :x-me-sender:x-me-sender:x-sasl-enc; s=fm1; t=1773864975; x= 1773951375; bh=suDHg2Rr+2NcSmkZJTnBdOd5CQ7vkSMObI/fYja1YPY=; b=B qrN7U/b04l3hu/nHYgn5SRBb9UzkmfbPKaOLslKeXXEePg6TZFxC5zSVWL7zEW8f 3KiTNptc0aRufT3ijXzWkfc3eOOZNgSPpDVbuKAGzD4o7Q1BjHNmIpdAHRVXCMSz W6X25q2qE+pr9UFKuMVMU6KFj22aSfdcqJ8oHS4kkQAruUc8YwTKYOthMMtHroVO owdgDEAfScdPJaaxI5KVrYwDitKNDxE9L6WLNivPFrQR8f9J9eFW8V83HJRZIyhx 4dttHUkArOVH3RbUUb5o47JAaU5vt/0Bs1qGWaYh4CqhdMJItk6K4+d5bsUZy8/T Z22s4P9vrgoi5p570Sozw== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgdeftdehtdejucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggugfgjsehtkefstddttdejnecuhfhrohhmpeetnhgurhgv shcuhfhrvghunhguuceorghnughrvghssegrnhgrrhgriigvlhdruggvqeenucggtffrrg htthgvrhhnpedtleelvdfgjedvffeiueekfeeuleffhfegfffhgfffkeevueehieehhfei gffhvdenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpe grnhgurhgvshesrghnrghrrgiivghlrdguvgdpnhgspghrtghpthhtohepiedpmhhouggv pehsmhhtphhouhhtpdhrtghpthhtohepphhgsegsohifthdrihgvpdhrtghpthhtohepth hvsehfuhiiiiihrdgtiidprhgtphhtthhopegshigrvhhuiiekudesghhmrghilhdrtgho mhdprhgtphhtthhopehmvghlrghnihgvphhlrghgvghmrghnsehgmhgrihhlrdgtohhmpd hrtghpthhtohepthhhohhmrghsrdhmuhhnrhhosehgmhgrihhlrdgtohhmpdhrtghpthht ohepphhgshhqlhdqhhgrtghkvghrshesphhoshhtghhrvghsqhhlrdhorhhg X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Wed, 18 Mar 2026 16:16:14 -0400 (EDT) Date: Wed, 18 Mar 2026 16:16:14 -0400 From: Andres Freund To: Melanie Plageman Cc: Nazir Bilal Yavuz , Thomas Munro , pgsql-hackers@postgresql.org, Peter Geoghegan , Tomas Vondra Subject: Re: Don't synchronously wait for already-in-progress IO in read stream Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-03-18 12:59:11 -0400, Melanie Plageman wrote: > On Tue, Mar 17, 2026 at 1:26 PM Andres Freund wrote: > > > > > --- a/src/backend/storage/buffer/bufmgr.c > > > +++ b/src/backend/storage/buffer/bufmgr.c > > > @@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress) > > > * must have started out as a miss in PinBufferForBlock(). The other > > > * backend will track this as a 'read'. > > > */ > > > - TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done, > > > + TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1, > > > operation->smgr->smgr_rlocator.locator.spcOid, > > > operation->smgr->smgr_rlocator.locator.dbOid, > > > operation->smgr->smgr_rlocator.locator.relNumber, > > > -- > > > > Ah, the issue is that we already incremented nblocks_done, right? Maybe it'd > > be easier to understand if we stashed blocknum + nblocks_done into a local > > var, and use it in in both branches of if (!ReadBuffersCanStartIO())? > > > > This probably needs to be backpatched... > > 0003 in v6 does as you suggest. I'll backport it after a quick +1 here. LGTM. > > > @@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel, > > > smgr->smgr_rlocator.backend); > > > > > > if (persistence == RELPERSISTENCE_TEMP) > > > > And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP > > branch in CountBufferHit(), I suspect the compiler may not be able to optimize > > it away. > > And you think it is optimizing it away in PinBufferForBlock()? It doesn't really have a choice :) - the pgBufferUsage.(local|shared)_blks_hit++ is within the already required if (persistence == RELPERSISTENCE_TEMP) so there's not really a branch to optimize away. But maybe I misunderstood? > > At the very least I'd invert the call to CountBufferHit() and the > > pgstat_count_buffer_read(), as the latter will probably prevent most > > optimizations (due to the compiler not being able to prove that > > (rel)->pgstat_info->counts.blocks_fetched is a different memory location as > > *foundPtr). > > I did this. I did not check the compiled code before or after though. I checked after and it looks good (well, ok enough, but that's unrelated to your changes). Just to verify that any of this actually matters, I did some benchmarking with the call to TrackBufferHit() removed, and with pg_prewarm() of a scale 100 pgbench_accounts I do see about ~3% of a gain from that. I did also verify that with the patch we're ever so slightly, but reproducibly, faster than master. There's future optimization potential for sure though. > > > +CountBufferHit(BufferAccessStrategy strategy, > > > + Relation rel, char persistence, SMgrRelation smgr, > > > + ForkNumber forknum, BlockNumber blocknum) > > > > I don't think "Count*" is a great name for something that does tracepoints and > > vacuum cost balance accounting, the latter actually changes behavior of the > > program due to the sleeps it injects. > > > > The first alternative I have is AccountForBufferHit(), not great, but still > > seems a bit better. > > At some point, I had ProcessBufferHit(), but Bilal felt it implied the > function did more than counting. I've changed it now to > TrackBufferHit(). WFM. > > > + * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to > > > + * avoid an external function call. > > > + */ > > > +static PrepareReadBuffer_Status > > > +PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation, > > > + Buffer buffer) > > > > Hm, seems the test in 0002 should be extended to cover the the temp table case. > > I did this. However, I was a bit lazy in how many cases I added > because I used invalidate_rel_block(), which is pretty verbose (since > evict_rel() doesn't work yet for local buffers). Ah, yea, that's annoying. I think some basic coverage is good enough for now. > I don't think we'll be able to easily test READ_BUFFER_ALREADY_DONE > (though perhaps we aren't testing it for shared buffers either?). We do reach the READ_BUFFER_ALREADY_DONE in PrepareHeadBufferReadIO(), but only due to io_method=sync peculiarities (as that only actually performs the IO later when waiting, it's easy to have two IOs for the same block). It's probably worth adding tests for that, although I suspect it should be in 001_aio.pl - no read stream required to hit it. I can give it a shot, if you want? > > > +static PrepareReadBuffer_Status > > > +PrepareNewReadBufferIO(ReadBuffersOperation *operation, > > > + Buffer buffer) > > > +{ > > > > I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" & > > "Continue"? Or "First" & "Additional"? Or ... > > I changed the names to PrepareHeadBufferReadIO() and > PrepareAdditionalBufferReadIO(). "Head" instead of "First" because > First felt like it implied the first buffer ever and head seems to > make it clear it is the first buffer of this new IO. Head works! > Subject: [PATCH v6 4/8] Pass io_object and io_context through to > PinBufferForBlock() The duplication due to handling the RBM_ZERO_AND_CLEANUP_LOCK case is a bit annoying, but I think it's still an improvement. > Subject: [PATCH v6 5/8] Make buffer hit helper LGTM. > From b73b896febc35253ca2607cb0fe143355b91256f Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Wed, 18 Mar 2026 11:09:58 -0400 > Subject: [PATCH v6 6/8] Restructure AsyncReadBuffers() > > Restructure AsyncReadBuffers() to use early return when the head buffer > is already valid, instead of using a did_start_io flag and if/else > branches. Also move around a bit of the code to be located closer to > where it is used. This is a refactor only. I think there should be as little work as posbile between setting IO_IN_PROGRESS and starting the IO. Most of the work you deferred is cheap enough that it shouldn't matter, but pgstat_prepare_report_checksum_failure() might need to do a bit more (including taking lwlocks and stuff). I'm also a bit doubtful that deferring the flag determinations is a good idea, mostly because it adds a bunch of stuff between starting IO on the head and subsequent buffers. Not that it's expensive, but it seems to make it more likely that somebody would end up putting other code inbetween the head and additional buffer IO starts. And it's cheap enough that it doesn't matter to waste it if we return early. > +/* > + * When building a new IO from multiple buffers, we won't include buffers > + * that are already valid or already in progress. This function should only be > + * used for additional adjacent buffers following the head buffer in a new IO. > + * > + * This function must never wait for IO to avoid deadlocks. The head buffer > + * already has BM_IO_IN_PROGRESS set, so we'll just issue that IO and come > + * back in lieu of waiting here. "come back" is a bit odd, since you'd not actually come back to PrepareAdditionalBufferReadIO(). Looks good otherwise. > From 63cb731176a62320d296f968b12a5d4d36e703d0 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Wed, 18 Mar 2026 11:17:57 -0400 > Subject: [PATCH v6 8/8] AIO: Don't wait for already in-progress IO > > When a backend attempts to start a read IO and finds the first buffer > already has I/O in progress, previously it waited for that I/O to > complete before initiating reads for any of the subsequent buffers. > > Although the backend must wait for the I/O to finish when acquiring the > buffer, there's no reason for it to wait when setting up the read > operation. Waiting at this point prevents the backend from starting I/O > on subsequent buffers and can significantly reduce concurrency. > > This matters in two workloads: when multiple backends scan the same > relation concurrently, and when a single backend requests the same block > multiple times within the readahead distance. > > If backends wait each time they encounter an in-progress read, > the access pattern effectively degenerates into synchronous I/O. > > To fix this, when encountering an already in-progress IO for the head > buffer, a backend now records the buffer's wait reference and defers > waiting until WaitReadBuffers(), when it actually needs to acquire the > buffer. > > In rare cases, a backend may still need to wait synchronously at IO > start time: if another backend has set BM_IO_IN_PROGRESS on the buffer > but has not yet set the wait reference. Such windows should be brief and > uncommon. > @@ -1663,8 +1677,9 @@ CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete) > * Local version of PrepareHeadBufferReadIO(). Here instead of localbuf.c to > * avoid an external function call. > */ > -static bool > -PrepareHeadLocalBufferReadIO(Buffer buffer) > +static PrepareReadBufferStatus > +PrepareHeadLocalBufferReadIO(ReadBuffersOperation *operation, > + Buffer buffer) > { > BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1); > uint64 buf_state = pg_atomic_read_u64(&desc->state); > @@ -1675,49 +1690,60 @@ PrepareHeadLocalBufferReadIO(Buffer buffer) > * handle). Only the owning backend can set BM_VALID on a local buffer. > */ > if (buf_state & BM_VALID) > - return false; > + return READ_BUFFER_ALREADY_DONE; > > /* > * Submit any staged IO before checking for in-progress IO. Without this, > * the wref check below could find IO that this backend staged but hasn't > - * submitted yet. Waiting on that would PANIC because the owner can't wait > - * on its own staged IO. > + * submitted yet. If we returned READ_BUFFER_IN_PROGRESS and > + * WaitReadBuffers() then tried to wait on it, we'd PANIC because the > + * owner can't wait on its own staged IO. > */ > pgaio_submit_staged(); > > - /* Wait for in-progress IO */ > + /* We've already asynchronously started this IO, so join it */ > if (pgaio_wref_valid(&desc->io_wref)) > { > - PgAioWaitRef iow = desc->io_wref; > - > - pgaio_wref_wait(&iow); > - > - buf_state = pg_atomic_read_u64(&desc->state); > + operation->io_wref = desc->io_wref; > + operation->foreign_io = true; > + return READ_BUFFER_IN_PROGRESS; > } > > - /* > - * If BM_VALID is set, we waited on IO and it completed successfully. > - * Otherwise, we'll initiate IO on the buffer. > - */ > - return !(buf_state & BM_VALID); > + /* Prepare to start IO on this buffer */ > + return READ_BUFFER_READY_FOR_IO; > } Hm. Is buf_state & BM_VALID actually not reachable anymore? What if the pgaio_submit_staged() completed the IO? In that case there won't be a wref and we'll get here with buf_state & BM_VALID. > @@ -1732,11 +1758,25 @@ PrepareHeadBufferReadIO(Buffer buffer) > if (buf_state & BM_VALID) > { > UnlockBufHdr(desc); > - return false; > + return READ_BUFFER_ALREADY_DONE; > } > > if (buf_state & BM_IO_IN_PROGRESS) > { > + /* Join existing read */ > + if (pgaio_wref_valid(&desc->io_wref)) > + { > + operation->io_wref = desc->io_wref; > + operation->foreign_io = true; > + UnlockBufHdr(desc); > + return READ_BUFFER_IN_PROGRESS; > + } Out of a strict sense of rule-following, I'd do the operation->foreign_io after the UnlockBufHdr(), since it doesn't actually need to be in the locked section. > @@ -1959,11 +2002,45 @@ WaitReadBuffers(ReadBuffersOperation *operation) > Assert(pgaio_wref_check_done(&operation->io_wref)); > } > > - /* > - * We now are sure the IO completed. Check the results. This > - * includes reporting on errors if there were any. > - */ > - ProcessReadBuffersResult(operation); > + if (unlikely(operation->foreign_io)) > + { > + Buffer buffer = operation->buffers[operation->nblocks_done]; > + BufferDesc *desc = BufferIsLocal(buffer) ? > + GetLocalBufferDescriptor(-buffer - 1) : > + GetBufferDescriptor(buffer - 1); > + uint64 buf_state = pg_atomic_read_u64(&desc->state); > + > + if (buf_state & BM_VALID) > + { > + BlockNumber blocknum = operation->blocknum + operation->nblocks_done; Maybe we should assert that the buffer's block equals what we expect? Greetings, Andres Freund