Date: Fri, 3 Apr 2026 16:36:03 -0400
From: Andres Freund
To: Melanie Plageman
Cc: pgsql-hackers@postgresql.org,
    Thomas Munro, Peter Geoghegan, Tomas Vondra, Nazir Bilal Yavuz
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
Message-ID: <24bjkmnkuapbs7wvcecvtrb3gvbrzg3extlkzpbg2f7dwt7h42@3e4vg6cd33iw>

Hi,

On 2026-04-03 15:04:51 -0400, Melanie Plageman wrote:
> On Fri, Apr 3, 2026 at 1:30 PM Andres Freund wrote:
> >
> > > - why not remove the combine_distance requirement from the fast path
> > >   entry criteria (could save resume_combine_distance in the fast path
> > >   and restore it after a miss)
> >
> > Because entering the fast path prevents IO combining from happening. So it's
> > absolutely crucial that we do *not* enter it while we would combine.
>
> But if it is a buffer hit, obviously we can't do IO combining anyway,
> or am I misunderstanding the fast path's common case?

It's true that we can't do combining in the fast path, but the problem is that
with eic=0/1 (or a recent history that leaves us with low distances or a low
pin limit), we will not start the next IO until there are no more buffers
pinned.

Imagine that we started one 16 block IO and have a readahead_distance of 1.
After consuming 15 buffers, we will have one more buffer pinned, but
read_stream_look_ahead() will not yet start another IO, due to the
readahead_distance condition (or max_pinned_buffers or ...).

Without the stream->combine_distance == 1 check, the subsequent check for
read_stream_next_buffer() would consider this a valid case for entering
fast-path.
> > > You mentioned that we don't want to read too far ahead (including for
> > > a single combined IO) in part because:
> > >
> > > > The resowner and private refcount mechanisms take more CPU cycles if you have
> > > > more buffers pinned
> > >
> > > But I don't see how either distance is responding to this or
> > > self-calibrating with this in mind
> >
> > Using the minimal required distance to avoid needing to wait for IO completion
> > is responding to that, no? Without these patches we read ahead as far as
> > possible, even if all the data is in the page cache, which makes this issue
> > way worse (without these patches it's a major source of regressions in the
> > index prefetching patch).
>
> But we aren't using the minimal distance to avoid needing to wait for
> IO completion. We are also using a higher distance to try and get IO
> combining and to allow for async copying into the kernel buffer cache,
> etc, etc.

My testing suggests that doing IO combining for a reasonable io_combine_limit
is pretty much always a win in a steady-state stream (i.e. not a short one
that's not fully consumed); the gain from avoiding the larger number of
syscalls is sufficiently large.

Once we start doing async copying from the kernel page cache, we will have to
wait for the completion of that async work, which will lead to
readahead_distance being increased if necessary.

> There's a lot of different considerations; it isn't just two opposing
> forces.

It's not, but I think always performing io_combine_limit sized IOs after a
ramp-up and increasing the distance based on needing to wait is a pretty
decent heuristic.

For best results it does require pgaio_uring_should_use_async() to trigger, as
otherwise we do not get the parallelized memory copy. Which means it may
never trigger if we don't occasionally reach the size based condition. Luckily
it does not seem like using async is beneficial for small IOs.
> And, I'd imagine that the relationship between the
> number of buffers pinned and CPU cycles required for resowner/refcount
> isn't perfectly linear.

It's definitely not.

> I'm not saying that we don't do IO combining at high distances, I'm
> more saying that it is confusing that combine_distance controls how
> far we look ahead when readahead_distance is low but when
> readahead_distance is high, it controls when we issue the IO and not
> how far we look ahead. I don't think we should change course now, but
> I wanted to call out that this felt a little uncomfortable to me.

I'm not sure I see an alternative. I tried to at least improve the comments
around this.

Attached is a revised set of commits. The largest changes are:

- Reordered the series to put "read_stream: Only increase read-ahead distance
  when waiting for IO" after "stream: Split decision about look ahead for AIO
  and combining"

  Previously I thought it'd be too awkward from a comment perspective, but
  there's only one comment where it is a bit odd. Think it's much clearer this
  way.

- Largely rewrote "Hacky implementation of making read_stream_reset()/end()
  not wait for IO". Looks a lot saner now.

  Think this needs a few more tests, in particular for the read stream and
  foreign_io paths. Will do that in the next version.

- Tried to address most of Bilal's and Melanie's feedback

- Removed some redundant checks from read_stream_should_issue_now()

- Lots of comment polishing, including revising the top-level read_stream.c
  comment

Greetings,

Andres Freund

--mq5mpczlv4j3v4zj
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="v5-0001-aio-io_uring-Trigger-async-processing-for-large-I.patch"