Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vs1HU-000upJ-1R for pgsql-hackers@arkaria.postgresql.org; Mon, 16 Feb 2026 16:17:04 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vs1HT-003pqD-0p for pgsql-hackers@arkaria.postgresql.org; Mon, 16 Feb 2026 16:17:03 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vs1HS-003ppg-1K for pgsql-hackers@lists.postgresql.org; Mon, 16 Feb 2026 16:17:03 +0000 Received: from fout-b3-smtp.messagingengine.com ([202.12.124.146]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vs1HP-00000000utx-1rpn for pgsql-hackers@lists.postgresql.org; Mon, 16 Feb 2026 16:17:01 +0000 Received: from phl-compute-06.internal (phl-compute-06.internal [10.202.2.46]) by mailfout.stl.internal (Postfix) with ESMTP id 6461B1D0019D; Mon, 16 Feb 2026 11:16:59 -0500 (EST) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-06.internal (MEProxy); Mon, 16 Feb 2026 11:16:59 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm3; t=1771258619; x=1771345019; bh=E0FxdOTGli UvMABI3Do1CTV+xIIHJoO4bv39FCgTce8=; b=wOxdB0Enkof3iDs4xytpO5DHpa aLnEZLKk6bqSB6JZMicHfqOz3gN04mA4zzKk4IcMVondPcxl0XKv1SK/Ud/l4icE T3lrERF+Nd+vnm8wQm8VMEt/ZjyE5Z+WR6W36QCWXzrBL+DGliuljud7WLBa45YY DdMtyCul1r7gbiawLp2BwkcJfuPi6326JNIAb4WEUjiSDL9fguMOIzr3bu6Z824N YEvTpEwciaKZDIk0CdSbghZ/VCT485fq4vvn/MXYuOkYyzu3KfyHOAopy6lHiaB5 J/N6ErlHRwVxerUYs1tNrwmzkogJDrwkYCzHaeeK2TV6pNKuOZKmVmjS6g9A== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t= 1771258619; x=1771345019; bh=E0FxdOTGliUvMABI3Do1CTV+xIIHJoO4bv3 9FCgTce8=; b=KaTIPQK7HUiEtfNPUE9Cnhp9z5sf3iwPE/f7b1QZhnGY4rywkig IVoOa4awkcnqluJBC9q5BC/17q8oieljVQ9ct1yE5Q6wMLnUhJo0cAerjgWMOEw3 KswIlLdEb0HUsD2jAHgZInwWmpoOmV7SXGOcVivQDWQ1VN1rxFRnB7qGYLVfHnoo 3u46AqNxpQ7XuIbfUVczqJrLEO9RDCp+3GIKp8rjHkOTToN5bz++hSW4WWgxWKBB bT0FCRt9rTnYhTXPP9Gx25TYsk8mu3nBR6jeUgrv+TyvDI3DTrQvuObFx1Ptn8lJ PFWD2VR4gaAkKj2e+4LY4Ptxl8GHi5D4CQg== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddvudejfeefucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggujgesthdtsfdttddtvdenucfhrhhomheptehnughrvghs ucfhrhgvuhhnugcuoegrnhgurhgvshesrghnrghrrgiivghlrdguvgeqnecuggftrfgrth htvghrnhepfeffgfelvdffgedtveelgfdtgefghfdvkefggeetieevjeekteduleevjefh ueegnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomheprg hnughrvghssegrnhgrrhgriigvlhdruggvpdhnsggprhgtphhtthhopeduuddpmhhouggv pehsmhhtphhouhhtpdhrtghpthhtohepphhgsegsohifthdrihgvpdhrtghpthhtohepkh hnihiihhhnihhksehgrghrrhgvthdrrhhupdhrtghpthhtohepsgihrghvuhiikedusehg mhgrihhlrdgtohhmpdhrtghpthhtohepughilhhiphgsrghlrghuthesghhmrghilhdrtg homhdprhgtphhtthhopehmvghlrghnihgvphhlrghgvghmrghnsehgmhgrihhlrdgtohhm pdhrtghpthhtohepohdrrghlvgigrghnughrvgdrfhgvlhhiphgvsehgmhgrihhlrdgtoh hmpdhrtghpthhtoheprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomhdprhgtphht thhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtthhopehpgh hsqhhlqdhhrggtkhgvrhhssehlihhsthhsrdhpohhsthhgrhgvshhqlhdrohhrgh X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 16 Feb 2026 11:16:58 -0500 (EST) Date: Mon, 16 Feb 2026 11:16:57 -0500 From: Andres Freund To: Tomas Vondra Cc: Alexandre Felipe , Peter Geoghegan , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Subject: Re: index prefetching Message-ID: References: <9411f220-007d-4f1e-9c8f-ca8eb09e6788@vondra.me> <984dcf9e-ada0-4dff-ae58-1f97bc904ccb@vondra.me> <7herwtpae3ptqdng3s7tcft4ljkc23fyocp3mbrvc7xyk7s2lk@uq3qbm4blizo> <433c7359-3c5d-4fe3-a488-93a17f44d0bf@vondra.me> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <433c7359-3c5d-4fe3-a488-93a17f44d0bf@vondra.me> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-02-16 16:44:03 +0100, Tomas Vondra wrote: > On 2/15/26 23:39, Andres Freund wrote: > > The problem that this fixes is that the periodic workload has cache hits > > frequently, which reduce the stream->distance by 1. Then, on a miss, we double > > the distance. But that means that if you have the trivial pattern of one hit > > and one miss, which this workload very often has, you *never* get above > > 1. I.e. we increase the distance as quickly as we decrease it. > > > > I've described this exact behavior a couple months ago in this very > thread. The queue of batches has limited size, the prefetch needs to be > paused - at that point read_stream_pause did not exist, so it was done > by terminating the stream and then read_stream_reset to "resume" it. > > That has the unfortunate effect that it resets distance to 1, and so it > easily led exactly to the issue you describe. But without the pausing it > would work perfectly fine, in many cases. > > The way I'm thinking about it is that each data set with a uniform cache > miss rate has a distance threshold for stability. Let's say the miss > rate is 1%. That means that we'll do an I/O and then the next 99 blocks > are cache hits. So we double the distance, and then decay it by -1. To > make this stable, these two things need to balance. If we start at > distance=99, we'll go 198->99->198->99->198->... But if we start just a > little bit lower, at 98, we'll quickly decay to distance=1. This is what > I call the "stability threshold". > > With the resetting, this effect is pretty brutal. The queries often have > a period of increased "cache misses" at the beginning (because caches > are cold). The distance can "climb up" above the threshold, and keep up > sufficient prefetch distance. But the stream reset sets the distance=1, > without the cache misses to push it up again. > > Having read_stream_pause/resume solves this particular issue, but the > overall issue of "stability threshold" still exists. Even when the > distance gets adjusted differently (e.g. 2*d + 1), it'll still be > trivial to construct "nice" data sets that narrowly miss the threshold. > I don't see a way around that, unfortunately (short of not decaying the > distance at all). I think this specific issue is a bit different, because today you get drastically different behaviour if you have a) [miss, (miss, hit)+] vs b) [(miss, hit)+] because for a), the first miss will get to distance = 2, the second one to 4, and then we go to 3, to 6, to 5, to 10, ... I.e. things will work out nicely. But today, for b), we just oscillate between 1 and 2. Despite the cache hit/miss pattern being almost entirely the same. That's terrible. I think we need to fix this, even if there are more complicated scenarios that won't be addressed. This is a tiny difference in hit pattern having a huge performance impact, without there being any reason to behave that way. It doesn't have to be distance = distance * 2 + 1 I'd just has happy with distance = Min(distance * 2, 4) or such. We just need to get out of the situation that there's a different behaviour when distance = 1 than otherwise. I'd probably go further and make it so that the distance does not decay as long as there is any in-flight IO. I think that makes a lot of sense, if looking ahead allows to start IOs earlier, looking ahead is worthwhile. What the distance-- is trying to achieve is to avoid continuing to do deep lookahead when you start having a high cache hit ratio (e.g. because your entire table now is in memory). Whether anything in our lookahead distance actually needed to do IO is a pretty good predictor. > I think a "proper" solution would require some sort of cost model for > the I/O part, so that we can schedule the I/Os just so that the I/O > completes right before we actually need the page. I agree that that would be really useful (I've been collaborating with some folks from microsoft research towards developing something for that, FWIW)., but i think that's actually somewhat orthogonal. The main gain of something like that would be that it would avoid increasing the lookahead distance further, past where it can help. Right now we'll happily increase the distance to the max with a lot of misses, even though that's very often not going to increase throughput after some point. The reason I think such an algorithm is orthogonal to the problem I described is that even if we had such a cost model, you actually need to start out with heuristics, at the start of the scan there is no information with which to guide such a scheduler. The actual IO pattern and the user of the stream have a *huge* impact on how aggressively you need to read ahead, and that's not known at the start of the stream. Yet you need to start ramping up readahead at the start of the stream, not after thousands of IOs. Lastly, improving our pretty naive heuristics until we have such a model is helpful, even if there are always going to be cases where the heuristics don't work out. > Of course, that's a super simplified model. The point is it needs some > concept of actual I/O costs, I don't think there's a perfect formula > considering only the distance itself. Clearly there is not - but I don't think there needs to be perfection, just being noticeably better than what we have today is good enough. Greetings, Andres Freund