Date: Thu, 2 Apr 2026 11:47:39 -0400
From: Andres Freund
To: Melanie Plageman
Cc: pgsql-hackers@postgresql.org,
    Thomas Munro, Peter Geoghegan, Tomas Vondra, Nazir Bilal Yavuz
Subject: Re: AIO / read stream heuristics adjustments for index prefetching

Hi,

On 2026-04-02 10:31:50 -0400, Melanie Plageman wrote:
> On Tue, Mar 31, 2026 at 12:02 PM Andres Freund wrote:
> >
> > 0005+0006: Only increase distance when waiting for IO
> >
> > Until now we have increased the read ahead distance whenever we
> > needed to do IO (doubling the distance every miss). But that will often be
> > way too aggressive, with the IO subsystem being able to keep up with a
> > much lower distance.
> >
> > The idea here is to use information about whether we needed to wait for IO
> > before returning the buffer in read_stream_next_buffer() to control
> > whether we should increase the readahead distance.
> >
> > This seems to work extremely well for worker.
> >
> > Unfortunately with io_uring the situation is more complicated, because
> > io_uring performs reads synchronously during submission if the data is in
> > the kernel page cache. This can reduce performance substantially compared
> > to worker, because it prevents parallelizing the copy from the page cache.
> > There is an existing heuristic for that in method_io_uring.c that adds a
> > flag to the IO submissions forcing the IO to be processed asynchronously,
> > allowing for parallelism. Unfortunately the heuristic is triggered by the
> > number of IOs in flight - which will never become big enough to trigger
> > after using "needed to wait" to control how far to read ahead.
>
> On some level, relying on worker mode overhead feels fragile. If
> worker overhead decreases, say, by moving to IO worker threads, we won't
> be able to rely on this to keep the distance to an advantageous level.

I don't see why lower overhead would prevent this from working?

> If io_uring async copying is advantageous even when the consumer never
> needs to wait, then it seems like parallelizing copying to/from the
> kernel buffer cache will always be advantageous to do at some level.

It's not universally advantageous, unfortunately - there's a nontrivial
increase in latency (and also some CPU) due to it. That matters mostly with a
shallow look-ahead depth (like at the start of a stream), because then the
latency impact will directly influence query performance.

Setup:

CREATE EXTENSION IF NOT EXISTS test_aio;
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
DROP TABLE IF EXISTS pattern_random_pgbench;
CREATE TABLE pattern_random_pgbench AS
    SELECT ARRAY(
        SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4
        FROM generate_series(1, 500)) AS pattern;

Workload:

SET io_combine_limit = 1;
SET effective_io_concurrency = 1;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;

(and then repeated for eic 2, 4, 8, 16)

eic     time plain (ms)    time w/ forced async (ms)
  1           2.331                  5.366
  2           2.164                  3.210
  4           2.151                  2.677
  8           2.155                  2.749
 16           2.151                  2.742
 32           2.141                  2.732
 64           2.161                  2.739
128           2.153                  2.652

Note that forced async never quite catches up.

If I instead make the pattern 50k blocks long:

eic     time plain (ms)    time w/ forced async (ms)
  1         210.678                454.132
  2         209.210                281.452
  4         208.775                198.496
  8         208.755                198.131
 16         209.477                195.799
 32         203.497                183.297
 64         203.002                173.799
128         202.885                166.548

> The case where it is not (as you've stated before) is when the
> consumer doesn't need the extra blocks, so it is just wasted time
> spent acquiring them.

That's one reason, but as shown above, it's also that the increase in latency
can hurt, particularly in the first few blocks (where we are ramping up the
distance) and when effective_io_concurrency is too low to allow a read-ahead
deep enough to hide the latency increase.

> So, it feels odd to try and find a heuristic that allows the readahead
> distance to increase even when the consumer is not having to wait.

Do you still feel like that with the added context from the above?

> I'm not saying we should do this for this release, but I'm just wondering if
> in the medium term, we should try to find a better way to identify the
> situation where async processing is not beneficial because the blocks won't
> be needed.

I think we certainly can do better than today with some help, e.g. from the
planner, to identify cases where we should be more careful about reading ahead
too far, e.g. due to being on the inner side of a nestloop antijoin.

> > So 0005 expands the io_uring heuristic to also trigger based on the sizes
> > of IOs - but that's decidedly not perfect, we e.g. have some experiments
> > showing it regressing some parallel bitmap heap scan cases. It may be
> > better to somehow tweak the logic to only trigger for worker.
> >
> > As is, this has another issue, which is that it prevents IO combining in
> > situations where it shouldn't, because right now we're using the distance
> > to control both. See 0008 for an attempt at splitting those concerns.
>
> Yea, I think running ahead far enough to get bigger IOs needs to
> happen and can't be based on the consumer having to wait.

What do you think about the updated patch I posted to achieve that?

Greetings,

Andres Freund
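
The two timing tables in this mail can be compared directly. Below is a minimal sketch that computes the forced-async/plain ratio per effective_io_concurrency value and finds the first setting at which forced async is faster. It is not part of the original benchmark: the numbers are copied from the tables, while the helper names (ratios, first_win) are made up for illustration.

```python
# Timings in milliseconds, copied from the two tables in this mail.
EIC = [1, 2, 4, 8, 16, 32, 64, 128]

# 500-block pattern
SHORT_PLAIN = [2.331, 2.164, 2.151, 2.155, 2.151, 2.141, 2.161, 2.153]
SHORT_ASYNC = [5.366, 3.210, 2.677, 2.749, 2.742, 2.732, 2.739, 2.652]

# 50k-block pattern
LONG_PLAIN = [210.678, 209.210, 208.775, 208.755, 209.477, 203.497, 203.002, 202.885]
LONG_ASYNC = [454.132, 281.452, 198.496, 198.131, 195.799, 183.297, 173.799, 166.548]


def ratios(plain, forced):
    """Forced-async time divided by plain time; > 1 means forced async is slower."""
    return [f / p for p, f in zip(plain, forced)]


def first_win(eic, plain, forced):
    """Smallest eic at which forced async beats plain, or None if it never does."""
    for e, p, f in zip(eic, plain, forced):
        if f < p:
            return e
    return None


if __name__ == "__main__":
    cases = (
        ("500 blocks", SHORT_PLAIN, SHORT_ASYNC),
        ("50k blocks", LONG_PLAIN, LONG_ASYNC),
    )
    for label, plain, forced in cases:
        print(label)
        for e, r in zip(EIC, ratios(plain, forced)):
            print(f"  eic={e:>3}  forced/plain={r:.2f}")
        print("  first eic where forced async wins:", first_win(EIC, plain, forced))
```

For the short pattern the ratio stays above 1 at every setting, which is the "never quite catches up" case; for the long pattern forced async starts winning once the look-ahead is deep enough to hide the added latency.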