Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vwqJ9-00Gf2H-01 for pgsql-hackers@arkaria.postgresql.org; Sun, 01 Mar 2026 23:34:43 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vwqI7-00EJxE-2d for pgsql-hackers@arkaria.postgresql.org; Sun, 01 Mar 2026 23:33:39 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vwqI7-00EJx6-1G for pgsql-hackers@lists.postgresql.org; Sun, 01 Mar 2026 23:33:39 +0000 Received: from relay4-d.mail.gandi.net ([217.70.183.196]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vwqI3-00000001w6p-008l for pgsql-hackers@lists.postgresql.org; Sun, 01 Mar 2026 23:33:38 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id 314233E9B9; Sun, 1 Mar 2026 23:33:32 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1772408014; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=a6XPBnfdU+QrWHpZaC8dy7fPe9yKQExRU0Kig1PPPGI=; b=pzy80NxAGUAvaaSvTfzfVpdTcPb01JddoPXz9LqtXUb7XEcgv32EXUqZ30u0lbuo4TcPwE mogRbsWiNl0ourX12SxOWPl1tlAJbzsHF5yyxJU2RfDMJcIgi8Ds0BeZqIcJ8sz4SvVws+ YSWllVPihl0+4QdNvEpxygbUUYit25O9DXHM6XygOoWBhvCM/xGsTZiSOzsK7bpfOqK25G ohrA5II8ozIL4jVqvwSU6GOLYfYG4wi/eaCJsSBxRQvkA4/wsR0d/oXUX66sJvdw3Ko62R 4VEQlKE8QWDXp4r6PuLZbtGC1uucZD+l3jMRo7Lh1Ksz6yqi9/+FMQ4i/QBbxw== Message-ID: Date: Mon, 2 Mar 2026 00:33:30 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Alexandre Felipe Cc: Andres Freund , Peter Geoghegan , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar References: <64a2re223ajj4popowsyu4xekbuvvyfwkrihn5yzyrkwsmsuvp@2lls3tpww5dl> <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me> <64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep> <7e707787-272a-4c52-b5f1-5ac990514ecc@vondra.me> Content-Language: en-US From: Tomas Vondra In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-GND-Sasl: tomas@vondra.me X-GND-Score: -100 X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddvheeiuddvucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjughrpefkffggfgfuvfevfhfhjggtgfesthekredttddvjeenucfhrhhomhepvfhomhgrshcugghonhgurhgruceothhomhgrshesvhhonhgurhgrrdhmvgeqnecuggftrfgrthhtvghrnheptedvlefhkeduffdvieeileeguddvheeufefgtedvvdekfeegtdeufffhtdfhiefhnecuffhomhgrihhnpehpohhsthhgrhgvshhqlhdrohhrghdpvhhonhgurhgrrdhmvgenucfkphepkeeirdegledrvdeftddrvddtieenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepihhnvghtpeekiedrgeelrddvfedtrddvtdeipdhhvghloheplgdutddrudefjedrtddrvdgnpdhmrghilhhfrhhomhepthhomhgrshesvhhonhgurhgrrdhmvgdpqhhiugepfedugedvfeefgfelueelpdhmohguvgepshhmthhpohhuthdpnhgspghrtghpthhtohepuddupdhrtghpthhtohepohgrlhgvgigrnhgurhgvfhgvlhhiphgvsehgmhgrihhlrdgtohhmpdhrtghpthhtoheprghnughrvghssegrnhgrrhgriigvlhdruggvpdhrtghpthhtohepphhgsegsohifthdrihgvpdhrtghpthhtohepthhhohhmrghsmhhunhhrohesghhmrghilhdrtghom hdprhgtphhtthhopegshigrvhhuiiekudesghhmrghilhdrtghomhdprhgtphhtthhopehrohgsvghrthhmhhgrrghssehgmhgrihhlrdgtohhm X-GND-State: clean List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 3/1/26 23:32, Alexandre Felipe wrote: > > On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra > wrote: > > Hi, > > I've decided to run a couple tests, trying to reproduce some of the > behaviors described in your (Felipe's) messages. > > > Thank you, > I will look into this data later. I am impressed with the number of IO > workers  > you used, my test was typically with 3. > 3 is extremely low for an I/O bound system. It's our tradition to pick defaults that work even on tiny systems, but need tuning on actual non-toy systems :-( The number of workers limits the number of I/O happening concurrently. So with 3 workers, there'll be only 3 concurrent I/Os (which is not a lot). At which point it can be better to keep doing synchronous I/O from multiple backends. It's a trade off. Having more workers is not "free", so we can't have infinite number of those. io_uring doesn't have this limitation. > I'm not trying to redo the tests exactly, because (a) I don't have a M1 > machine, and (b) there's not enough details about the hardware and > configuration to actually redo it properly. > > > Well I was running on a M1 because this is what I have in front of me > but I know that any serious database will run on linux. >   Sure, but even then it'd be better to have more details about the hardware. "M1" doesn't really say much (especially to people who don't use Apple stuff very much). There's a range of M1-based systems, from MacBook Air/Pro, Mini, Studio, ... > > I've focused on quantifying the impact of a couple things mentioned in > the previous message: > >   > I will have a look into this later and compute the effect size. > > The test varies the following parameters: > > * buffered or direct I/O > * io_method = (worker | io_uring) > * shared_buffers = (128MB | 8GB) > * enable_indexscan_prefetch = (on | off) > * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128) > * sequential / random data (1M rows, 550MB, ~15 rows per page) > > > > There are literally only 4 cases where prefetching does worse than > master, and those are for random data with distance limit 1. I claim > this is irrelevant, because it literally disables prefetching while > still paying the full cost (all 4 are for io_method=worker, where the > signal overhead can be high, so it's not a surprise). > >   > I agree with your claim, the idea of the distance limit was to separate > to have an idea of AIO overhead without the benefit of prefetch, because > I was seeing very similar results, but when I controlled the distance > the prefetch benefit became visible. And also the gradation would  > show if this has a U shape or the larger the distance the better the > performance. > OK, understood. I guess you could do very similar thing by adjusting effective_io_concurrency (although I admit it's not exactly the same thing). > It's a bit like buying a race horse, break its leg > and then complain it's not running very fast > >  😂 > > > The overhead of the instrumentation seems relatively small, probably > within 5% or so. That's a bit less than I expected, but I still don't > understand what this is meant to say us. It's measuring wall-time, and > it's no surprise that in an I/O-bound workload most of the time is spent > in functions doing (and waiting for) I/O. Like read_stream_next_buffer. > But it does not give any indication *why*. > > > Well, I was hoping to be able to create a self balancing mechanism > in read_stream_next_buffer > >  /* Do we have to wait for an associated I/O first? */ > if (stream->ios_in_progress > 0 && > stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index) > { >   // prefetch and increase the distance while we wait here > WaitReadBuffers(&stream->ios[io_index].op); >  ... > } > ... > // this call could be removed if we prefetched earlier. > read_stream_look_ahead(stream); > > > There same principle that guided the  >> Don't wait for already in-progress IO > patch. Here we should prioritise increasing the distance, and if it is not > possible (maybe we consumed all the buffers). We could take the  > opportunity to yield. > > IIUC the idea would be to (automatically) increase the distance just enough so that the IOs complete right before we actually need the buffers? That reminds me the "IO cost model" I mentioned a couple days ago [1]. But it's not clear to me how this profiling helps with it. > > multi-client test (multi-client.tgz) > ------------------------------------ > > The test varies the following parameters: > > * buffered or direct I/O > * io_method = (worker | io_uring) > * io_workers = (12 | 32) > * shared_buffers = (128MB | 8GB) > * enable_indexscan_prefetch = (on | off) > * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128) > * sequential / random data (1M rows, 550MB, ~15 rows per page) > * number of parallel workers (1, 2, 4, 8) > >   >  Are parallel workers here clients issuing queries? > Yes, clients running queries. But workers in a parallel query have mostly the same effect, really. > This all seems perfectly fine to me. The bad behavior would be if the > prefetching gets slower than master, because that would be a regression > affecting users. But that happens only in 4 cells in the table. > > > And in this case we have to take the other extremum, and run the queries > where the prefetch is not expected to help. In this sense I agree with Peter > that the yielding logic is important. We may be limiting the potential > of the > prefetch in some cases but excessive reads is the highest risk in my > opinion. > You may know better than me, but I talk about the workloads I have seen > or worked with, it is typically a high number of small queries. Not > these huge > scans.  > Large queries are rare, and when they come to our attention is because > they used too much memory and started to create temporary files. > You mean in the 4 cells that show regressions? Maybe. Or maybe we could look for ways to make the communication / signalling cheaper, to not overwhelm the workers like this. The problem with addressing this by yielding is that the query has no idea whether it runs alone, or with many other queries. There's no such context available in the read_stream, so it can't consider this when yielding. Moreover, why would yielding help with this? Yielding will just schedule some of the I/O later, how would it not make it *even slower*? Notice that in the multi-client-full.pdf the regression happens even with limit=1, at which point you *can't* yield anymore. > (But I'm speculating, I haven't investigated this in detail yet.) > > > Fair enough. > > Moreover, io_uring does not have this issue. Which is another indication > it's something about the signal overhead. > >   > That is interesting. >   > > In any case, these results clearly prefetching can be a huge improvement > even in environments with concurrent activity, etc. > > > If you see something different on the Mac, you need to investigate why. > It could be something in the OS, or maybe it it's hardware specific > thing (consumer SSDs can choke on too many requests). Hard to say. I > don't even know what kind of M1 machine you have, what SSD etc. > > > My guess is that the cause is IPC, I don't know well how the  > async IO works, but if it is a different process I think that MacOS is > by less efficient than linux.  But I don't know how to measure that. > I wrote about the signal overhead a couple months ago, with a simple benchmark simulating it [2]. I also wrote a brief explanation [3] about the AIO in PG18, which mentions that too. [1] https://www.postgresql.org/message-id/433c7359-3c5d-4fe3-a488-93a17f44d0bf%40vondra.me [2] https://www.postgresql.org/message-id/1c9302da-c834-4773-a527-1c1a7029c5a3%40vondra.me [3] https://vondra.me/posts/tuning-aio-in-postgresql-18/ regards -- Tomas Vondra