Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w8IAl-000ODU-1S for pgsql-hackers@arkaria.postgresql.org; Thu, 02 Apr 2026 13:33:24 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w8IAk-00656I-0k for pgsql-hackers@arkaria.postgresql.org; Thu, 02 Apr 2026 13:33:22 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w8IAj-006569-0Y for pgsql-hackers@lists.postgresql.org; Thu, 02 Apr 2026 13:33:22 +0000 Received: from fout-b2-smtp.messagingengine.com ([202.12.124.145]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1w8IAh-00000000Bwd-0gu5 for pgsql-hackers@postgresql.org; Thu, 02 Apr 2026 13:33:20 +0000 Received: from phl-compute-02.internal (phl-compute-02.internal [10.202.2.42]) by mailfout.stl.internal (Postfix) with ESMTP id DD3311D00285; Thu, 2 Apr 2026 09:33:17 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-02.internal (MEProxy); Thu, 02 Apr 2026 09:33:18 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm2; t=1775136797; x=1775223197; bh=P6BP7NFCJC /fm1cXrWIOv9gGltJBquuRuEZdDfR+L5g=; b=E7P+tLyZuynpzUu0PbIGjyUjuY o5HS7TtXybi2oXROVgdox3ajRS7WbnIbQ+3TbEfFrmKOYDLCXJu1I2NvqiHxf5zA 4yBHARgiGFoSIoO8WXmATdUC4g5Cj+sG8M982gAJjoaMwEasL0Jw1SNS1ARZJwvA OAMV7GVNB4xOOwFE6lvkiWvJDDsT2O7ywgr5F5woxMECrJ38tT/8qXOs6BfLkHjl oCZm8EPnIoaQ36eKnNYnH3D7xEaP2TuB+J7TsvuoAEPMp9BsQgIQi7szay9zf7Ve cdYSMQ1ZGeH0phUI1uu/OY3td5pvELdHJxnVJ6idhnMX69mHKvo+4RFf2JRQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm2; t= 1775136797; x=1775223197; bh=P6BP7NFCJC/fm1cXrWIOv9gGltJBquuRuEZ dDfR+L5g=; b=N28/O2jTFJiDbqYwK6fa0nDYtR9HZY6xMLoHZn0imfp8OmxW5u0 eeEgSd0iDL2x2LM6W56sN61WpQWnDRySCD1wDTwjqn5zXw6GnIWoWMsQZylWs7TW DpDyAylf/EjKHPvh8jF3Tv9cA1QupDgwQbZSytmPyKaIkQulknEU1Q64tNKJERF6 UItnQMKFt8J7og7vMI8eOd00juEpq/5Et0qhRO5M2QlxHws5QZqJXmq2cyH+Ne/Q X/eJ1Rg/vp1M8BUNKem1o98FUILgMROVJQ6y0I0qyEdoSkv0wNhuv9P7JIDk0uSs LkSVP4lxWa+q5KN9e6pSnT27xlzoWa+G3kA== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefhedrtddtgdeiudeiucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceurghi lhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujfgurh epfffhvfevuffkfhggtggujgesthdtsfdttddtvdenucfhrhhomheptehnughrvghsucfh rhgvuhhnugcuoegrnhgurhgvshesrghnrghrrgiivghlrdguvgeqnecuggftrfgrthhtvg hrnhepfeffgfelvdffgedtveelgfdtgefghfdvkefggeetieevjeekteduleevjefhueeg necuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomheprghnug hrvghssegrnhgrrhgriigvlhdruggvpdhnsggprhgtphhtthhopeeipdhmohguvgepshhm thhpohhuthdprhgtphhtthhopehpghessghofihtrdhivgdprhgtphhtthhopehtvhesfh huiiiihidrtgiipdhrtghpthhtohepsgihrghvuhiikedusehgmhgrihhlrdgtohhmpdhr tghpthhtohepmhgvlhgrnhhivghplhgrghgvmhgrnhesghhmrghilhdrtghomhdprhgtph htthhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtthhopehp ghhsqhhlqdhhrggtkhgvrhhssehpohhsthhgrhgvshhqlhdrohhrgh X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Thu, 2 Apr 2026 09:33:17 -0400 (EDT) Date: Thu, 2 Apr 2026 09:33:16 -0400 From: Andres Freund To: Nazir Bilal Yavuz Cc: Melanie Plageman , pgsql-hackers@postgresql.org, Thomas Munro , Peter Geoghegan , Tomas Vondra Subject: Re: AIO / read stream heuristics adjustments for index prefetching Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-04-02 15:30:26 +0300, Nazir Bilal Yavuz wrote: > 0002: read_stream: Prevent distance from decaying too quickly > > + /* > + * As we needed IO, prevent distance from being reduced within our > + * maximum look-ahead window. This avoids having distance collapse too > + * quickly in workloads where most of the required blocks are cached, > + * but where the remaining IOs are a sufficient enough factor to cause > + * a substantial slowdown if executed synchronously. > + * > + * There are valid arguments for preventing decay for max_ios or for > + * max_pinned_buffers. But the argument for max_pinned_buffers seems > + * clearer - if we can't see any misses within the maximum look-ahead > + * distance, we can't do any useful read-ahead. > + */ > + stream->distance_decay_holdoff = stream->max_pinned_buffers; > > That is already committed but I have a question. Did you think about > setting stream->distance_decay_holdoff to current stream->distance? > This will also fix 'miss followed by a hit' (it won't fix double > missed followed by a hit, though). I am fairly certain that would not work well, because the whole point of the mechanism it to prevent the distance from staying too small when the distance is low. We want the distance to grow for cases like mis, hit{10}, miss, ... because that will typically still be bottlenecked by IO latency. And something like setting the decay holdoff to the current distance wouldn't work for that. > 0003: aio: io_uring: Trigger async processing for large IOs > > There is a small optimization opportunity, we don't need to calculate > io_size for the DIO since pgaio_uring_should_use_async() will always > return false. Do you think it is worth implementing this? Other than > that, LGTM. I doubt it. Compared to actually doing IO the cost of this is neglegible. > 0005: WIP: read_stream: Move logic about IO combining & issuing to helpers > > /* never pin more buffers than allowed */ > if (stream->pinned_buffers + stream->pending_read_nblocks >= > stream->max_pinned_buffers) > return false; > > /* > * Don't start more read-ahead if that'd put us over the distance limit > * for doing read-ahead. > */ > if (stream->pinned_buffers + stream->pending_read_nblocks >= > stream->distance) > return false; > > AFAIK, stream->distance can't be higher than > stream->max_pinned_buffers [1], so I think we can remove the first if > block. If we are not sure about [1], stream->max_pinned_buffers most > likely will be higher than stream->distance, it would make sense to > change the order of these. > > Aha, I understood this change after looking at 0006. It is nitpick but > 'if (stream->pinned_buffers + stream->pending_read_nblocks >= > stream->max_pinned_buffers)' block can be added in 0006. Other than > these, LGTM. Hm. I guess that could make sense. > 0006: WIP: read stream: Split decision about look ahead for AIO and combining > > I liked the idea of being more aggressive to do IO combining. What is > the reason for gradually increasing combine_distance, is it to not do > unnecessary IOs at the start? Yea. It'd perhaps not be too bad with the existing users, but it'd *really* hurt with index scan prefetching, because of query plans where we only consume part of the index scan (e.g. a nested loop antijoin). > + /* > + * XXX: Should we actually reduce this at any time other than > + * a reset? For now we have to, as this is also a condition > + * for re-enabling fast_path. > + */ > + if (stream->combine_distance > 1) > + stream->combine_distance--; > > I don't think we need to reduce this other than reset. Hm. I go back and forth on that one :) > + /* > + * Allow looking further ahead if we have an the process of building a > + * larger IO, the IO is not yet big enough and we don't yet have IO in > + * flight. Note that this is allowed even if we are reaching the > + * read-ahead limit (but not the buffer pin limit). > + * > + * This is important for cases where either effective_io_concurrency is > + * low or we never need to wait for IO and thus are not increasing the > + * distance. Without this we would end up with lots of small IOs. > + */ > + if (stream->pending_read_nblocks > 0 && > + stream->pinned_buffers == 0 && > + stream->pending_read_nblocks < stream->combine_distance) > + return true; > > Do we need to check 'stream->pinned_buffers == 0' here? I think we can > continue to look ahead although there are pinned buffers as we already > know 'stream->pinned_buffers + stream->pending_read_nblocks < > stream->max_pinned_buffers'. We could, but I don't think there would be a benefit in doing so. In my mind, what combine_distance is used for, is to control how far to look ahead to allow for IO combining when the readahead_distance is too low to allow for IO combining. But if pinned_buffers > 0, we already have another IO in flight, so the combine_distance mechanism doesn't have to kick in. I've been experimenting with going the other way, by having read_stream_should_look_ahead() do a check like /* * If we already have IO in flight, but are close enough to to the * distance limit that we would not start a fully sized IO, don't even * start a pending read until later. * * This avoids calling the call thing the next block callback in cases we * would not start the pending read anyway. For some users the work to * determine the next block is non-trivial, so we don't want to do so * earlier than necessary. * * A secondary benefit of this is that some callers use parallel workers * with each their own read stream to process a global list of blocks, and * only calling the next block callback when ready to actually issue IO * makes it more likely for one backend to get consecutive blocks. */ if (stream->pinned_buffers > 0 && stream->pending_read_nblocks == 0 && stream->pinned_buffers + stream->combine_distance >= stream->readahead_distance) return false; I do see that improve performance with some bitmap heap scan queries (e.g. Tomas' "cyclic" ones). But I'm worried it'd hurt other queries when using a low effective_io_concurrency. So that's probably something for later. > + /* same if capped not by io_combine_limit but combine_distance */ > + if (stream->combine_distance > 0 && > + pending_read_nblocks >= stream->combine_distance) > + return true; > > I think we don't need to check 'stream->combine_distance > 0', it > shouldn't be less or equal than zero when the > 'stream->readahead_distance' is not zero, right? Yea. I guess we could just assert that. > + { > + stream->readahead_distance = Min(max_pinned_buffers, > stream->io_combine_limit); > + stream->combine_distance = stream->io_combine_limit; > + } > > I think the stream->combine_distance should be equal to > stream->readahead_distance as we don't want to surpass > max_pinned_buffers. I don't think it could lead to exceeding max_pinned_buffers, due to the /* never pin more buffers than allowed */ check in read_stream_should_look_ahead(), but there's no reason to rely on that. Greetings, Andres Freund