From: Alexandre Felipe
Date: Mon, 2 Mar 2026 09:00:34 +0000
Subject: Re: index prefetching
To: Tomas Vondra
Cc: Andres Freund, Peter Geoghegan, Thomas Munro, Nazir Bilal Yavuz, Robert Haas, Melanie Plageman, PostgreSQL Hackers, Georgios, Konstantin Knizhnik, Dilip Kumar


On Sun, Mar 1, 2026 at 11:33 PM Tomas Vondra <tomas@vondra.me> wrote:
> On 3/1/26 23:32, Alexandre Felipe wrote:
> >
> > On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <tomas@vondra.me
> > <mailto:tomas@vondra.me>> wrote:
> >
> >     Hi,
> >
> >     I've decided to run a couple tests, trying to reproduce some of the
> >     behaviors described in your (Felipe's) messages.
> >
> >
> > Thank you,
> > I will look into this data later. I am impressed with the number of IO
> > workers
> > you used, my test was typically with 3.
> >
>
> 3 is extremely low for an I/O bound system. It's our tradition to pick
> defaults that work even on tiny systems, but need tuning on actual
> non-toy systems :-(
That was a surprise for me, because I am used to JavaScript, which
does everything in a single process (with a coroutine async model)
and does so with very little overhead.

Cold Cache (buffer eviction before each run):
io   pf=off  pf=on  d<=16  d<=64  d<=128
 3   0.68s   1.20s  1.29s  0.78s  0.68s
10   0.75s   1.02s  1.51s  1.62s  0.82s
30   0.75s   0.79s  2.95s  1.65s  1.43s

Warm Cache (no eviction):
io   pf=off  pf=on  d<=16  d<=64  d<=128
 3   0.42s   0.67s  0.40s  0.41s  0.41s
10   0.53s   0.96s  0.39s  0.38s  0.44s
30   0.39s   0.76s  0.39s  0.39s  0.40s

> Sure, but even then it'd be better to have more details about the
> hardware. "M1" doesn't really say much (especially to people who don't
> use Apple stuff very much). There's a range of M1-based systems, from
> MacBook Air/Pro, Mini, Studio, ...

I tried to find out some details of the hardware. It shows that I didn't
even know the (marketing) model name...

Chip: Apple M2, 8 (physical) cores, 24 GB, macOS Tahoe 26.1
Apple SSD AP2048Z (2 TB NVMe), block size 4 kB, protocol: Apple Fabric

> >
> > Well, I was hoping to be able to create a self balancing mechanism
> > in read_stream_next_buffer
> >
> > /* Do we have to wait for an associated I/O first? */
> > if (stream->ios_in_progress > 0 &&
> >     stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
> > {
> >     // prefetch and increase the distance while we wait here
> >     WaitReadBuffers(&stream->ios[io_index].op);
> >     ...
> > }
> > ...
> > // this call could be removed if we prefetched earlier.
> > read_stream_look_ahead(stream);
> >
> >
> > There same principle that guided the
> >> Don't wait for already in-progress IO
> > patch. Here we should prioritise increasing the distance, and if it is not
> > possible (maybe we consumed all the buffers). We could take the
> > opportunity to yield.
> >
> >

> IIUC the idea would be to (automatically) increase the distance just
> enough so that the IOs complete right before we actually need the
> buffers? That reminds me the "IO cost model" I mentioned a couple days
> ago [1]. But it's not clear to me how this profiling helps with it.

Exactly. My investigation actually started from your (Tomas's)
suggestion, more specifically this part:

> I think a "proper" solution would require some sort of cost model for
> the I/O part, so that we can schedule the I/Os just so that the I/O
> completes right before we actually need the page.

We are controlling the distance because we can, even knowing that
it is not what we really need. We need to prefetch enough to make
sure the buffers are ready when we need them, but on the other
hand we need to minimise the reads of buffers we don't need.

It is clear that reading unnecessary blocks could degrade other
queries in two ways: (1) by delaying concurrent tasks that require
I/O, be it queries, maintenance or replication; (2) by unnecessarily
evicting buffers that future queries would use.

If the system has queries reading N pages on average, and we read
just one additional page, we get 1/N overhead. For N=100 this is
negligible, for N<=2 it is not.

Yielding is important, but it is also important that the
executor yields to the prefetch again before the distance
gets too low.

DISTANCE INCREASE

The current distance increase mechanism captures the idea of
_completing right before we actually need the page_ by
increasing the distance when a wait is required (well, reflecting
on it, I am not sure that condition is free of false
positives). I think that increasing exponentially is too
aggressive. If we wait for 10 buffers in a row we end up with
a distance of ~1000 (a potential waste of 100x).
Wouldn't it make more sense to increase it in increments
based on the I/O combine limit and the I/O concurrency?
That way the waste is limited by a constant factor of the
required buffers.
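
To put numbers on the comparison above, here is a minimal sketch. It is plain C, not code from read_stream.c: the function names, the cap and the step size are assumptions made up for illustration.

```c
#include <assert.h>

/* Distance after `waits` consecutive waits if every wait doubles it. */
int
distance_after_doubling(int start, int waits, int cap)
{
    int d = start;

    assert(start > 0 && waits >= 0 && cap >= start);
    for (int i = 0; i < waits; i++)
        d = (d > cap / 2) ? cap : d * 2;    /* clamp instead of overflowing */
    return d;
}

/*
 * Distance after `waits` consecutive waits if every wait adds `step`.
 * `step` is hypothetical: something derived from the I/O combine limit
 * and the I/O concurrency.
 */
int
distance_after_stepping(int start, int waits, int step, int cap)
{
    int d = start;

    assert(start > 0 && waits >= 0 && step > 0 && cap >= start);
    for (int i = 0; i < waits; i++)
        d = (d > cap - step) ? cap : d + step;
    return d;
}
```

Starting from 1, ten waits in a row give 1024 with doubling, and 161 with an (assumed) step of 16.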

DISTANCE DECREASE

Also, the interpretation of the distance going up is different
from the interpretation of the distance going down.
When it goes up it is an upper bound for the distance;
when it goes down it tracks the actual distance.
In this sense I agree with something written in

This could explain Andres Freund's observation [1]

> It seems caused to a significant degree by waiting at low queue depths. If I
> comment out the stream->distance-- in read_stream_start_pending_read() the
> regression is reduced greatly.





CLOSED LOOP THOUGHTS

Here I dump some ideas that we could work on, but again,
this is something that could go in a separate commit, without
blocking this feature.

Let D be the distance that makes sure a given scan has its
buffers ready before consuming them. For a long enough scan
we can find d >= D.

Label each operation with the distance (d_io) at which it was started.
Every time we wait for a buffer, we know that d_io < D (by definition),
so we set d = max(d, d_io). Notice that the current model increments the
distance based on its current value, not the distance when the I/O
started. This gives room for even more overshoot, but the prefetch
is not immediate, which probably damps it, and I speculate that
the distance decay was added as a workaround for this.
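
A minimal sketch of that labelling, under stated assumptions: the structs and function names below are hypothetical bookkeeping invented for illustration, not the real ReadStream fields.

```c
#include <assert.h>

/* Hypothetical per-I/O label, not part of the real read stream code. */
typedef struct SketchIO
{
    int distance_at_start;  /* d_io: distance in effect when the I/O was issued */
} SketchIO;

/* Hypothetical stream state. */
typedef struct SketchStream
{
    int distance;           /* d: current look-ahead distance */
    int max_distance;       /* hard upper limit */
} SketchStream;

/* When an I/O is issued, remember the distance it was issued at. */
void
sketch_start_io(const SketchStream *stream, SketchIO *io)
{
    io->distance_at_start = stream->distance;
}

/*
 * When the consumer had to wait for `io`, the distance it was issued at
 * was too small, so apply d = max(d, d_io), clamped to the limit.
 */
void
sketch_on_wait(SketchStream *stream, const SketchIO *io)
{
    assert(stream->max_distance > 0);
    if (io->distance_at_start > stream->distance)
        stream->distance = io->distance_at_start;
    if (stream->distance > stream->max_distance)
        stream->distance = stream->max_distance;
}
```

The feedback here uses the distance recorded at issue time, so a wait never pushes d beyond what that particular I/O showed to be insufficient.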


On the other hand, decreasing the distance when we consume a buffer
brings us back to risky ground, below the value that was
previously seen to be necessary to ensure no waits. Maybe this
decay was introduced just to counteract the excessive increase
due to feedback on the current distance instead of the distance
when the operations started.

The problem with increasing exponentially is that we may end up with
a large gap between the computed distance and the actual required
distance. Increasing it in steps based on the maximum I/O concurrency,
we would still start as much I/O as could be useful, while keeping the
gap between the distance and the optimal distance small. It could also
make it even faster (just a little bit), by increasing the concurrency
on the initial reads, when the distance would otherwise be 1, 2, 4, 8.

TIME AGNOSTIC MODEL

A proper model would have to distinguish the distance of the I/O being
started, the distance of the I/O operation being finished, and the I/O
distance that we are aiming for.

start read:
  save io.distance = number of pinned buffers before this.
loop:
  if(have to wait)
    target_distance = io.distance + concurrency;
    start_prefetch
  if(distance <= min_distance)
    distance += prefetch(concurrency, combine_limit)
    if(buffer consumed)
      distance--
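
One possible reading of the draft above, as a small C sketch. Everything here is an assumption: the state struct, the function names, and in particular the choice to top up towards target_distance in batches of at most concurrency * combine_limit buffers.

```c
#include <assert.h>

/* Hypothetical state for the time agnostic model. */
typedef struct AgnosticState
{
    int distance;           /* buffers currently pinned ahead of the consumer */
    int target_distance;    /* distance we are aiming for */
} AgnosticState;

/* We had to wait for an I/O that was started at io_distance: aim higher. */
void
agnostic_on_wait(AgnosticState *st, int io_distance, int concurrency)
{
    assert(io_distance >= 0 && concurrency > 0);
    st->target_distance = io_distance + concurrency;
}

/* Top up towards the target; returns how many buffers to prefetch now. */
int
agnostic_refill(AgnosticState *st, int concurrency, int combine_limit)
{
    int want = st->target_distance - st->distance;
    int batch = concurrency * combine_limit;

    assert(concurrency > 0 && combine_limit > 0);
    if (want <= 0)
        return 0;
    if (want > batch)
        want = batch;       /* never start more than one batch at a time */
    st->distance += want;
    return want;
}

/* Consuming a buffer shrinks the distance; the next refill restores it. */
void
agnostic_on_consume(AgnosticState *st)
{
    if (st->distance > 0)
        st->distance--;
}
```

With this shape the distance only decays between refills, and each refill brings it back to the target instead of letting it drift down.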

TIME AWARE MODEL

The IO model could take delays into consideration. I know that
timing has a cost, but hopefully it would be acceptable
to add the timestamp of the IO when it starts. Then we could
build a model based on an approximation of some quantile of the
distance. If we know e.g. that 50% of the IOs take longer
than T(50%), then we could time the first few buffer consumptions
to have a guess; say, tc is the average time per buffer
required for this scan on this query. Then T(50%) / tc is
a distance that puts us within a 50% chance of requiring a read.
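
As a worked example of T(50%) / tc, here is a minimal sketch in C. The function names are hypothetical, the latency samples are made-up numbers, and the median is computed by simply sorting a small sample array, which is only one way to approximate a quantile.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for long values. */
static int
cmp_long(const void *a, const void *b)
{
    long x = *(const long *) a;
    long y = *(const long *) b;

    return (x > y) - (x < y);
}

/* Median of n latency samples (any time unit); sorts the array in place. */
long
latency_median(long *samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof(long), cmp_long);
    return (n % 2) ? samples[n / 2]
                   : (samples[n / 2 - 1] + samples[n / 2]) / 2;
}

/* Distance estimate T(50%) / tc, rounded up and never below 1. */
int
distance_for_latency(long t_quantile, long t_per_buffer)
{
    long d;

    assert(t_quantile >= 0 && t_per_buffer > 0);
    d = (t_quantile + t_per_buffer - 1) / t_per_buffer;
    return d < 1 ? 1 : (int) d;
}
```

For example, with sampled I/O latencies of 900, 300, 500, 700 and 400 ticks the median is 500, and with tc = 40 ticks per buffer the estimated distance is 13.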

But in order for this model to be effective we should keep it
global: if we make it per query, by the time we have tuned the
parameters the query has already gone too far. It could be saved in
the index stats, along with the number of scans, tuples, heap fetches.

A simpler, local version could be based on the time of the first
IO, but how do we time it? Currently we don't have a callback
to register when it is ready. Let's say that we time it just
after WaitReadBuffers.

A pseudo code draft, that might be clearer than what I said

init:
t_start = now();
buffer_count = 0;
loop:
buffer_count++;
if(has to wait){
  t_wait = now();
  // time spent per consumed buffer since the last wait
  t_per_buffer = (t_wait - t_start) / buffer_count;
  // is this wait blocking other scans??
  WaitReadBuffers(io);
  t_ready = now();
  // t_io_start: timestamp saved when this IO was started
  t_io = t_ready - t_io_start;
  distance_lower_bound = Max(distance_min, io.distance);
  distance_guess = Max(distance_guess, t_io / t_per_buffer);
  t_start = t_ready;
  buffer_count = 0;
}

We have the timing overhead, but notice that this is independent
of the time unit and of the epoch, so some CPU tick count could be
used as a cheaper alternative. And we may do this just a few times.

The distance lower bound would limit the distance decay: we already
know that we had to wait for an IO started at io.distance, so
let's stay above that (unless we approach the scan limit).
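
Both points can be checked with a small sketch. The names are hypothetical and the tick values are invented; the only claim is that the guess is a ratio of two durations, so the unit cancels out, and that the decay stops at the lower bound.

```c
#include <assert.h>

/*
 * Distance guess from one observed wait: t_io / (t_consume / buffers),
 * rearranged to stay in integers and rounded up. t_io and t_consume can
 * be in any unit (ns, us, CPU ticks) as long as both use the same one.
 */
int
distance_guess_from_ticks(long t_io, long t_consume, long buffers)
{
    assert(t_io >= 0 && t_consume > 0 && buffers > 0);
    return (int) ((t_io * buffers + t_consume - 1) / t_consume);
}

/* Decay by one buffer, but never below the observed lower bound. */
int
decay_distance(int distance, int lower_bound)
{
    int next = distance - 1;

    return next < lower_bound ? lower_bound : next;
}
```

An I/O that took 500 ticks while 10 buffers were consumed in 400 ticks gives a guess of 13, and the same measurement expressed in a unit 1000 times finer gives 13 as well.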

I hope this makes sense.

> I wrote about the signal overhead a couple months ago, with a simple
> benchmark simulating it [.]. I also wrote a brief explanation [.] about
> the AIO in PG18, which mentions that too.

Added to pending reads :)

[1] https://www.postgresql.org/message-id/qdl4fojnbfcnm2k7b4zpvgd6gwzwdgtbl5c7shpimrb76dbyy6%40scdnspus3ejh