Date: Wed, 1 Apr 2026 19:20:40 -0400
From: Andres Freund
To: Melanie Plageman
Cc: pgsql-hackers@postgresql.org,
 Thomas Munro, Peter Geoghegan, Tomas Vondra, Nazir Bilal Yavuz
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="ik3l34cft6pr4f53"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Precedence: bulk

--ik3l34cft6pr4f53
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Hi,

On 2026-03-31 16:59:14 -0400, Melanie Plageman wrote:
> On Tue, Mar 31, 2026 at 12:02 PM Andres Freund wrote:
> >
> > 0005+0006: Only increase distance when waiting for IO
>
> In "aio: io_uring: Trigger async processing for large IOs" (0005), the
> first sentence of the commit message is incomplete.

Oops.

> Is there any reason for both the io size and inflight IOs threshold to
> be 4? If they should be the same, I think it would be better if this
> was a macro.

No, there's no real reason they have to be the same. It's just what a bit of
experimenting showed working independently for either.

> This may not matter, but the old code checked in_flight_before > 5
> before incrementing it for the current IO. The new code counts it
> after pushing the current IO onto the submission list. So the new way
> is slightly more aggressive.

Hm. True. Not sure it matters. I didn't really see a significant difference
for anything between 3 and 7, it was only outside of that that I saw worse
performance. I have done more validation with the new cutoff value than with
the old one, so I'm ever so mildly inclined to use the value currently in the
patch, but I won't at all insist on it.

> > Unfortunately with io_uring the situation is more complicated, because
> > io_uring performs reads synchronously during submission if the data is in
> > the kernel page cache.
> > This can reduce performance substantially compared to
> > worker, because it prevents parallelizing the copy from the page cache.
> > There is an existing heuristic for that in method_io_uring.c that adds a
> > flag to the IO submissions forcing the IO to be processed asynchronously,
> > allowing for parallelism. Unfortunately the heuristic is triggered by the
> > number of IOs in flight - which will never become big enough to trigger
> > after using "needed to wait" to control how far to read ahead.
> >
> > So 0005 expands the io_uring heuristic to also trigger based on the sizes
> > of IOs - but that's decidedly not perfect, we e.g. have some experiments
> > showing it regressing some parallel bitmap heap scan cases. It may be
> > better to somehow tweak the logic to only trigger for worker.
>
> Trigger which logic only for worker, you mean only increasing the
> distance when waiting?

Yea.

> > As is this has another issue, which is that it prevents IO combining in
> > situations where it shouldn't, because right now we're using the distance
> > to control both. See 0008 for an attempt at splitting those concerns.
>
> Even if you can't combine into a single IO, it seems like a low
> distance is problematic because it degrades batching and causes us to
> have to call io_uring_enter for every block (I think).

I don't think it actually does change the situation around that
significantly, because we already end up with "too few" IOs once we hit the
distance maximum, as we'll submit another IO as soon as we can.

I think we will eventually need some logic to only start submitting again
once multiple IOs are possible. But that's another set of heuristics, so a
project for another day :)

In my experiments the batching was primarily useful for reducing the command
submission & interrupt overhead when interacting with storage, i.e. when
actually doing IO, not just copying from the page cache.
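
(To make the 0005 heuristic discussed further up concrete, here is a minimal
sketch. It is not the code in method_io_uring.c: the function and macro names
are hypothetical, the unit of the IO size (blocks) and the exact comparison
operators are assumptions, and only the cutoff value of 4 for both thresholds
comes from this thread. With io_uring, "processed asynchronously" would
correspond to setting the IOSQE_ASYNC flag on the submission.)

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical names. The thread only establishes that both cutoffs are
 * currently 4 and that they were tuned independently of each other.
 */
#define SKETCH_ASYNC_INFLIGHT_THRESHOLD 4
#define SKETCH_ASYNC_IO_BLOCKS_THRESHOLD 4

/*
 * Decide whether a read should be forced to be processed asynchronously.
 *
 * in_flight: number of IOs in flight, counted after adding the current IO
 * io_blocks: size of the current IO in blocks (assumed unit)
 *
 * Whether the real comparisons use > or >= is an assumption of this sketch.
 */
static bool
sketch_force_async(int in_flight, int io_blocks)
{
	assert(in_flight >= 1 && io_blocks >= 1);

	return in_flight > SKETCH_ASYNC_INFLIGHT_THRESHOLD ||
		io_blocks >= SKETCH_ASYNC_IO_BLOCKS_THRESHOLD;
}
```

(Keeping the two cutoffs as separate macros reflects that they don't have to
be the same value.)
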
I have seen it help due to reducing syscalls too, but the amount of batching
and/or combining seems to have a relatively low ceiling at which it stops
helping.

> Setting aside more complicated prefetching systems, what it seems like
> we are saying is that for all "miss" cases (not in SB) a distance of
> above 1 is advantageous (unless we are only doing 1 IO). I wonder if
> there is something hacky we can do like not decaying distance below
> io_combine_limit if there has been a recent miss or growing it up to
> at least io_combine_limit if we aren't getting all hits.

I think it's true that if IO execution was all that mattered, we would want a
bit more IO in flight at all times. However looking ahead quite deeply also
has costs:

1) The resowner and private refcount mechanisms take more CPU cycles if you
   have more buffers pinned

2) The CPU cache hit ratio goes down if there's a longer time between copying
   data into s_b and consuming it

3) If you have a scan that won't be consumed to completion, you're wasting
   more the deeper you look ahead

This is actually not hard to show:

SET max_parallel_workers_per_gather = 0;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM pgbench_accounts WHERE abalance = 3;

(on a tree that has the explain stuff, but not the patches in this thread
applied)

eic     time in ms
1       1326.525
2       1325.240
4       1335.073
8       1343.440
16      1346.189
32      1356.598
64      1398.326
128     1635.081
256     1674.685
512     1677.264
1000    1680.050

This one mainly shows 2) from above, I think, but the others are measurable
in other workloads.

> > 0007: Make read_stream_reset()/end() not wait for IO
> >
> > This is a quite experimental, not really correct as-is, patch to avoid
> > unnecessarily waiting for in-flight IO when read_stream_reset() is done
> > while there's in-flight IO.
> > This is useful for things like nestloop
> > antijoins with quals on the inner side (without the qual we'd not trigger
> > any readahead, as that's deferred in the index prefetching patch).
> >
> > As-is this will leave IOs visible in pg_aios for a while, potentially
> > until the backends exit. That's not right.
>
> Separating the problems: the handle slot exhaustion seems like it
> could be solved by having the backend process discard IOs when it
> needs one and there isn't any.

I don't think it could lead to exhaustion of handles, pgaio_io_acquire() will
call pgaio_io_wait_for_free() which will wait for the oldest IO.

> The pg_aios view problems seem solvable with a flag on the IO like
> "DISCARDED". But the buffers staying pinned is different. It seems
> like you'll need the backend to process the discarded IOs at some
> point. Maybe it should do that before idling waiting for input?

The easiest way would be to actually leave the IOs registered with the
resowner and have it wait for completion at command or transaction end if not
already done. But we currently don't really do that with resowners in the
!abort case. I'm not sure if anybody would mind doing it differently here.

Another approach would be to do it in AtEOXact_Aio(), but that would mean the
IOs could hang around for a while.

A third approach could be to do one of the above, but add some additional
places that go through in-flight IOs and check if they completed, e.g. in
pgaio_io_acquire_nb().

> When discarding IOs, I don't really understand why the foreign IO
> path doesn't just clear its own wait ref (not the buffer descriptor
> one) and move on -- instead you have it wait.

That'd be ok to do, I just didn't want to think about that more complicated
and less common case :)

> I haven't finished reviewing 0008 yet.

I've attached a version of what was 0008 split into two, one to introduce the
new helpers, one to introduce the separate combine distance.

I've pushed what was 0001 and 0002.
Will push the former 0003 shortly.

Greetings,

Andres Freund

--ik3l34cft6pr4f53
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="v4-0001-read_stream-Issue-IO-synchronously-while-in-fast-.patch"