Date: Tue, 31 Mar 2026 12:02:51 -0400
From: Andres Freund
To: pgsql-hackers@postgresql.org, Thomas Munro, Peter Geoghegan, Melanie Plageman, Tomas Vondra, Nazir Bilal Yavuz
Subject: AIO / read
stream heuristics adjustments for index prefetching
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="biispz4aupeya6ku"
Content-Disposition: inline

--biispz4aupeya6ku
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi,

The index prefetching patchset [1] contains a few adjustments to the read
stream logic for readahead. It seemed better to discuss them separately than
in that already very large thread. The first two patches are also a
dependency of the explain read stream patches [2].

There are two main areas that prefetching of table data during an index scan
is more sensitive to than existing read stream users:

1) Prefetching for index scans is much more sensitive to overly aggressive
   readahead, due to plans that involve running index scans to partial
   completion, rather than full completion. Consider e.g. nestloop antijoins
   or such, where the scan on the inner side will be started but often not
   completed. If we unnecessarily read ahead too aggressively, a lot of IO
   could be wasted.

   While it's of course possible to have partially consumed read streams
   with sequential scans or bitmap heap scans, it's not as common / cost
   sensitive. For seqscans it likely mostly happens with a LIMIT above the
   seqscan, but that probably won't be happening many times within a query
   on a table of any size. For bitmap heap scans it's not as common because
   the startup cost, i.e. the building of the bitmap, is far from cheap;
   doing that over and over does not make a lot of sense.

2) Prefetching for index scans is much more likely to have complicated mixes
   of hits and misses. Whereas a seqscan or a bitmap heap scan accesses each
   table block exactly once, with index scans it's very common to have
   repeated accesses to some table blocks, while still having misses on
   other blocks.
   This means that index scans are more sensitive to patterns of hits and
   misses decreasing the readahead distance so much that we don't do
   aggressive enough readahead to avoid waiting for IO anymore.

   While more pronounced with index prefetching, it was already an issue
   with the existing users, particularly for bitmap heap scans. In fact, a
   similar patch to what's included here was first discussed somewhere
   around the BHS prefetching work.

There are a few sets of changes here:

0001+0002: Return whether WaitReadBuffers() needed to wait

The first patch allows pgaio_wref_check_done() to work more reliably with
io_uring. Until now it was only able to return true if userspace had already
consumed the kernel's completion event, and returned false otherwise. That's
not really incorrect, just suboptimal.

The second patch returns whether WaitReadBuffers() needed to wait for
IO. This is useful for a) instrumentation like in [2] and b) to provide
information to the read_stream heuristics to control how aggressively to
perform readahead.

0003: read_stream: Issue IO synchronously while in fast path

When read stream is in fast path mode (where it short-circuits the readahead
logic, to reduce CPU overhead in s_b resident workloads) and encounters a
miss, we until now performed the read asynchronously. Unfortunately, with
worker, that can lead to slowdowns, because dispatching to workers has a
latency impact. When doing "real" readahead, that's a price worth paying,
because the latency should be hidden by issuing the reads early enough. But
when just coming out of fast path mode, we're not ahead of what's needed, so
the dispatch latency can't be hidden.

We already have infrastructure to mark IOs to be executed synchronously. So
we just need to use that here.

0004: read_stream: Prevent distance from decaying too quickly

This, quite simple, patch reduces issue 2) from above, by preventing the
look-ahead distance from being reduced, after each miss, for as many blocks
as the maximum look-ahead distance.
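To make the holdoff idea concrete, here is a minimal standalone sketch. It is
not the code from the patch: SketchStream, sketch_init() and
sketch_account_block() are made-up names, and the "double on miss, decrement
by one on hit" policy is a simplifying assumption about how the existing
distance logic behaves. It only shows that hits inside the holdoff window
leave the distance alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the read stream's look-ahead state. */
typedef struct SketchStream
{
	int			distance;		/* current look-ahead distance, in blocks */
	int			max_distance;	/* upper bound, think "max_pinned_buffers" */
	int			holdoff;		/* blocks left before distance may decay */
} SketchStream;

static void
sketch_init(SketchStream *s, int max_distance)
{
	assert(max_distance >= 1);
	s->distance = 1;
	s->max_distance = max_distance;
	s->holdoff = 0;
}

/* Account for one block handed to the consumer. */
static void
sketch_account_block(SketchStream *s, bool miss)
{
	if (miss)
	{
		/* A miss doubles the distance (capped) and re-arms the holdoff. */
		s->distance *= 2;
		if (s->distance > s->max_distance)
			s->distance = s->max_distance;
		s->holdoff = s->max_distance;
	}
	else if (s->holdoff > 0)
	{
		/* A hit inside the holdoff window does not reduce the distance. */
		s->holdoff--;
	}
	else if (s->distance > 1)
	{
		/* Only once the window has passed does a hit shrink the distance. */
		s->distance--;
	}
}
```

With max_distance = 4, two misses followed by four hits leave the distance at
4; only the fifth hit reduces it to 3. Without the holdoff the same sequence
would already have decayed the distance back to 1.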
While this may seem overly aggressive, a single effectively synchronous read
can take a long time compared to the CPU time needed for processing page
hits. On cloud storage the IO latency is somewhere between 0.5ms and 4ms. A
halfway modern CPU can do a few heap_hot_search_buffer()s on 1000s of pages
within 1 ms.

While this one is my patch, several others have written variations of it
before. We should probably have committed one already.

There are two minor questions here:

- Should read_stream_pause()/read_stream_resume() restore the "holdoff"
  counter? I doubt it matters for the prospective user, since it will only
  be used when the lookahead distance is very large.

- For how long to hold off distance reductions? Initially I was torn between
  using "max_pinned_buffers" (Min(max_ios * io_combine_limit, cap)) and
  "max_ios" ([maintenance_]effective_io_concurrency). But I think the former
  makes more sense, as we otherwise won't allow for far enough readahead
  when doing IO combining, and it does seem to make sense to hold off decay
  for long enough that the maximum lookahead could not theoretically allow
  us to start an IO.

0005+0006: Only increase distance when waiting for IO

Until now we have increased the readahead distance whenever we needed to do
IO (doubling the distance every miss). But that will often be way too
aggressive, with the IO subsystem being able to keep up with a much lower
distance.

The idea here is to use information about whether we needed to wait for IO
before returning the buffer in read_stream_next_buffer() to control whether
we should increase the readahead distance.

This seems to work extremely well for worker.

Unfortunately with io_uring the situation is more complicated, because
io_uring performs reads synchronously during submission if the data is in
the kernel page cache. This can reduce performance substantially compared to
worker, because it prevents parallelizing the copy from the page cache.
There is an existing heuristic for that in method_io_uring.c that adds a
flag to the IO submissions forcing the IO to be processed asynchronously,
allowing for parallelism. Unfortunately the heuristic is triggered by the
number of IOs in flight - which will never become big enough to trigger
after using "needed to wait" to control how far to read ahead.

So 0005 expands the io_uring heuristic to also trigger based on the sizes of
IOs - but that's decidedly not perfect, we e.g. have some experiments
showing it regressing some parallel bitmap heap scan cases. It may be better
to somehow tweak the logic to only trigger for worker.

As is this has another issue, which is that it prevents IO combining in
situations where it shouldn't, because right now we use the distance to
control both. See 0008 for an attempt at splitting those concerns.

0007: Make read_stream_reset()/end() not wait for IO

This is a quite experimental, not really correct as-is, patch to avoid
unnecessarily waiting for in-flight IO when read_stream_reset() is done
while there's in-flight IO. This is useful for things like nestloop
antijoins with quals on the inner side (without the qual we'd not trigger
any readahead, as that's deferred in the index prefetching patch).

As-is this will leave IOs visible in pg_aios for a while, potentially until
the backends exit. That's not right.

0008: WIP: read stream: Split decision about look ahead for AIO and combining

Until now read stream has used a single look-ahead distance to control
lookahead for both IO combining and read-ahead. That's sub-optimal, as we
want to do IO combining even when we don't need to do any readahead, as
avoiding the syscall overhead is important to reduce CPU overhead when data
is in the kernel page cache.

This is a prototype for what it could look like to split those decisions,
thereby fixing the regression mentioned in 0006.
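As a rough illustration of how "only increase distance when waiting"
(0005+0006) and the split between readahead and combining (0008) fit
together, here is a small standalone sketch. It is not taken from the
patches: SketchDistances and both functions are hypothetical names, and the
doubling policy for the combine distance is purely an assumption chosen to
keep the example short.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state: two separate distances instead of a single one. */
typedef struct SketchDistances
{
	int			readahead_distance; /* how far ahead to start IOs */
	int			combine_distance;	/* how far to look for mergeable blocks */
	int			max_readahead;		/* cap, think "max_pinned_buffers" */
	int			combine_limit;		/* cap, think io_combine_limit */
} SketchDistances;

static void
sketch_distances_init(SketchDistances *d, int max_readahead, int combine_limit)
{
	assert(max_readahead >= 1 && combine_limit >= 1);
	d->readahead_distance = 1;
	d->combine_distance = 1;
	d->max_readahead = max_readahead;
	d->combine_limit = combine_limit;
}

/*
 * Account for a block that needed IO.  needed_wait says whether the consumer
 * actually had to wait for that IO to complete before getting the buffer.
 */
static void
sketch_on_miss(SketchDistances *d, bool needed_wait)
{
	/* Combining pays off for every miss, as it saves syscalls (assumption). */
	d->combine_distance *= 2;
	if (d->combine_distance > d->combine_limit)
		d->combine_distance = d->combine_limit;

	/* Read further ahead only if the IO did not finish in time. */
	if (needed_wait)
	{
		d->readahead_distance *= 2;
		if (d->readahead_distance > d->max_readahead)
			d->readahead_distance = d->max_readahead;
	}
}
```

In this sketch, a run of misses that never had to wait (e.g. data in the
kernel page cache) grows only the combine distance, so larger reads are still
formed while the readahead distance stays at 1. The first miss that did have
to wait starts growing the readahead distance.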
One thing that's really annoying around this is that we have no
infrastructure for testing that the heuristics keep working. It's very easy
to improve one thing while breaking something else, without noticing,
because everything keeps working.

I'm wondering about something like a READ_STREAM_DEBUG_INSTRUMENT flag which
would trigger providing information about the IOs and their schedule via the
per-buffer-data mechanism. That would allow test_aio's
read_stream_for_blocks() to return that information, which in turn could be
used to verify that we are doing IO combining and looking ahead far enough
in some situations.

Greetings,

Andres Freund

[1] https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
[2] https://postgr.es/m/6f541abf-f9e1-4830-93cc-4a849dbf2ecf%40vondra.me

--biispz4aupeya6ku
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="v3-0001-aio-io_uring-Allow-IO-methods-to-check-if-IO-comp.patch"