Feedback-ID: id4a34324:Fastmail
Date: Mon, 16 Feb 2026 11:16:57 -0500
From: Andres Freund <andres@anarazel.de>
To: Tomas Vondra <tomas@vondra.me>
Cc: Alexandre Felipe <o.alexandre.felipe@gmail.com>, 
	Peter Geoghegan <pg@bowt.ie>, Thomas Munro <thomas.munro@gmail.com>, 
	Nazir Bilal Yavuz <byavuz81@gmail.com>, Robert Haas <robertmhaas@gmail.com>, 
	Melanie Plageman <melanieplageman@gmail.com>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, 
	Georgios <gkokolatos@protonmail.com>, Konstantin Knizhnik <knizhnik@garret.ru>, 
	Dilip Kumar <dilipbalaut@gmail.com>
Subject: Re: index prefetching
Message-ID: <heav35ejt5bm6lqbeqdfksmkh3nqwcmfkhlqifoqfkzhrcaj2a@uugpdzay6tor>
References: <CAH2-Wz=6a7fGz2rALDX+xiFDuEaGQWpZ49xEaBUDKiPH8gcL+Q@mail.gmail.com>
 <CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com>
 <CAH2-WzmymSyOt5Y2RGbm6cJXg18J_ttfqjdcpodHe6Gp23ConQ@mail.gmail.com>
 <CAH2-Wznv9_KGqHQ1vCW2pkiA6QskBGcx5NC_-UXnD6GEQasvAQ@mail.gmail.com>
 <CAE8JnxN_EwnTLLMWGhvgwaomYZ0ysm7NeogA-BqBd=Rs3S7Oqw@mail.gmail.com>
 <9411f220-007d-4f1e-9c8f-ca8eb09e6788@vondra.me>
 <CAE8JnxNvssQBKJPmVhkeQBk_mEi3deR_ij10+7MAN+6y9WFDrQ@mail.gmail.com>
 <984dcf9e-ada0-4dff-ae58-1f97bc904ccb@vondra.me>
 <7herwtpae3ptqdng3s7tcft4ljkc23fyocp3mbrvc7xyk7s2lk@uq3qbm4blizo>
 <433c7359-3c5d-4fe3-a488-93a17f44d0bf@vondra.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <433c7359-3c5d-4fe3-a488-93a17f44d0bf@vondra.me>
Archived-At: <https://www.postgresql.org/message-id/heav35ejt5bm6lqbeqdfksmkh3nqwcmfkhlqifoqfkzhrcaj2a%40uugpdzay6tor>
Precedence: bulk

Hi,

On 2026-02-16 16:44:03 +0100, Tomas Vondra wrote:
> On 2/15/26 23:39, Andres Freund wrote:
> > The problem that this fixes is that the periodic workload has cache hits
> > frequently, which reduce the stream->distance by 1. Then, on a miss, we double
> > the distance.  But that means that if you have the trivial pattern of one hit
> > and one miss, which this workload very often has, you *never* get above
> > 1. I.e. we increase the distance as quickly as we decrease it.
> > 
> 
> I've described this exact behavior a couple months ago in this very
> thread. The queue of batches has limited size, the prefetch needs to be
> paused - at that point read_stream_pause did not exist, so it was done
> by terminating the stream and then read_stream_reset to "resume" it.
> 
> That has the unfortunate effect that it resets distance to 1, and so it
> easily led exactly to the issue you describe. But without the pausing it
> would work perfectly fine, in many cases.
> 
> The way I'm thinking about it is that each data set with a uniform cache
> miss rate has a distance threshold for stability. Let's say the miss
> rate is 1%. That means that we'll do an I/O and then the next 99 blocks
> are cache hits. So we double the distance, and then decay it by -1. To
> make this stable, these two things need to balance. If we start at
> distance=99, we'll go 198->99->198->99->198->... But if we start just a
> little bit lower, at 98, we'll quickly decay to distance=1. This is what
> I call the "stability threshold".
> 
> With the resetting, this effect is pretty brutal. The queries often have
> a period of increased "cache misses" at the beginning (because caches
> are cold). The distance can "climb up" above the threshold, and keep up
> sufficient prefetch distance. But the stream reset sets the distance=1,
> without the cache misses to push it up again.
> 
> Having read_stream_pause/resume solves this particular issue, but the
> overall issue of "stability threshold" still exists. Even when the
> distance gets adjusted differently (e.g. 2*d + 1), it'll still be
> trivial to construct "nice" data sets that narrowly miss the threshold.
> I don't see a way around that, unfortunately (short of not decaying the
> distance at all).

I think this specific issue is a bit different, because today you get
drastically different behaviour if you have

a) [miss, (miss, hit)+]
vs
b) [(miss, hit)+]

because for a), the first miss will get to distance = 2, the second one to 4,
and then we go to 3, to 6, to 5, to 10, ... I.e. things will work out nicely.

But today, for b), we just oscillate between 1 and 2. Despite the cache
hit/miss pattern being almost entirely the same. That's terrible.


I think we need to fix this, even if there are more complicated scenarios that
won't be addressed. This is a tiny difference in hit pattern having a huge
performance impact, without there being any reason to behave that way.

It doesn't have to be
    distance = distance * 2 + 1
I'd just has happy with
    distance = Min(distance * 2, 4)

or such. We just need to get out of the situation that there's a different
behaviour when distance = 1 than otherwise.


I'd probably go further and make it so that the distance does not decay as
long as there is any in-flight IO. I think that makes a lot of sense, if
looking ahead allows to start IOs earlier, looking ahead is worthwhile. What
the distance-- is trying to achieve is to avoid continuing to do deep
lookahead when you start having a high cache hit ratio (e.g. because your
entire table now is in memory). Whether anything in our lookahead distance
actually needed to do IO is a pretty good predictor.


> I think a "proper" solution would require some sort of cost model for
> the I/O part, so that we can schedule the I/Os just so that the I/O
> completes right before we actually need the page.

I agree that that would be really useful (I've been collaborating with some
folks from microsoft research towards developing something for that, FWIW).,
but i think that's actually somewhat orthogonal. The main gain of something
like that would be that it would avoid increasing the lookahead distance
further, past where it can help. Right now we'll happily increase the distance
to the max with a lot of misses, even though that's very often not going to
increase throughput after some point.

The reason I think such an algorithm is orthogonal to the problem I described
is that even if we had such a cost model, you actually need to start out with
heuristics, at the start of the scan there is no information with which to
guide such a scheduler.  The actual IO pattern and the user of the stream have
a *huge* impact on how aggressively you need to read ahead, and that's not
known at the start of the stream. Yet you need to start ramping up readahead
at the start of the stream, not after thousands of IOs.

Lastly, improving our pretty naive heuristics until we have such a model is
helpful, even if there are always going to be cases where the heuristics don't
work out.


> Of course, that's a super simplified model. The point is it needs some
> concept of actual I/O costs, I don't think there's a perfect formula
> considering only the distance itself.

Clearly there is not - but I don't think there needs to be perfection, just
being noticeably better than what we have today is good enough.

Greetings,

Andres Freund