MIME-Version: 1.0
References: 
 <CAE8JnxN_EwnTLLMWGhvgwaomYZ0ysm7NeogA-BqBd=Rs3S7Oqw@mail.gmail.com>
 <64a2re223ajj4popowsyu4xekbuvvyfwkrihn5yzyrkwsmsuvp@2lls3tpww5dl>
 <a67mvhyi2q45eg4eimhpwdg6l3s3dmpahti2svffvmvzwmss27@r4nohusvndbq>
 <c19a18fc-ef2e-4d91-b7ba-576d1891315d@vondra.me>
 <il7jtfowpatrlg33qb5plj7v7pferes4ogerq5fdczszi4kokh@sbwvb2ukfgos>
 <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me>
 <ws47e3wly6skt36b23zy5qfvcxzueo6od3uicunuodsqnxl7os@7v2qi7qkxzbz>
 <CAH2-Wzk-89uCvdJ1Q6NsM6LvDvUEt6Qy66T6A60J=D_voWxZDg@mail.gmail.com>
 <64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep>
 <CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com>
 <issqornf6vdn3vb64fjuoathypmu3e5pgputd3lpfuvoeqyvzr@qfordnhplp2v>
 <CAE8JnxOn4+xUAnce+M7LfZWOqfrMMxasMaEmSKwiKbQtZr65uA@mail.gmail.com>
 <7e707787-272a-4c52-b5f1-5ac990514ecc@vondra.me>
 <c96ba898-02fb-4756-a1c7-0ddb08159804@vondra.me>
 <CAE8JnxPtia9m8y7b5s+gOMjZ_3QP=pTo+A6p_HmtrAV4PMo3ZQ@mail.gmail.com>
 <e972a7ab-015c-4026-8ef7-5a49f59f10f8@vondra.me>
In-Reply-To: <e972a7ab-015c-4026-8ef7-5a49f59f10f8@vondra.me>
From: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Date: Mon, 2 Mar 2026 09:00:34 +0000
Message-ID: 
 <CAE8JnxOJ48NU3rwW+gS67NUDKgxDS5pKNUywbUBGCBJkgUf+Hg@mail.gmail.com>
Subject: Re: index prefetching
To: Tomas Vondra <tomas@vondra.me>
Cc: Andres Freund <andres@anarazel.de>, Peter Geoghegan <pg@bowt.ie>,
	Thomas Munro <thomas.munro@gmail.com>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>,
	Robert Haas <robertmhaas@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: multipart/alternative; boundary="00000000000093a4da064c06d2c6"
Archived-At: 
 <https://www.postgresql.org/message-id/CAE8JnxOJ48NU3rwW%2BgS67NUDKgxDS5pKNUywbUBGCBJkgUf%2BHg%40mail.gmail.com>
Precedence: bulk

--00000000000093a4da064c06d2c6
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sun, Mar 1, 2026 at 11:33 PM Tomas Vondra <tomas@vondra.me> wrote:

> On 3/1/26 23:32, Alexandre Felipe wrote:
> >
> > On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <tomas@vondra.me> wrote:
> >
> >     Hi,
> >
> >     I've decided to run a couple tests, trying to reproduce some of the
> >     behaviors described in your (Felipe's) messages.
> >
> >
> > Thank you,
> > I will look into this data later. I am impressed with the number of IO
> > workers
> > you used, my test was typically with 3.
> >
>
> 3 is extremely low for an I/O bound system. It's our tradition to pick
> defaults that work even on tiny systems, but need tuning on actual
> non-toy systems :-(
>

That was a surprise for me, because I am used to JavaScript,
which does everything in a single process (with a coroutine
async model) and does so with very little overhead.

Cold Cache (buffer eviction before each run):
io   pf=off  pf=on   d<=16   d<=64   d<=128
 3   0.68s   1.20s   1.29s   0.78s   0.68s
10   0.75s   1.02s   1.51s   1.62s   0.82s
30   0.75s   0.79s   2.95s   1.65s   1.43s

Warm Cache (no eviction):
io   pf=off  pf=on   d<=16   d<=64   d<=128
 3   0.42s   0.67s   0.40s   0.41s   0.41s
10   0.53s   0.96s   0.39s   0.38s   0.44s
30   0.39s   0.76s   0.39s   0.39s   0.40s


> Sure, but even then it'd be better to have more details about the
> hardware. "M1" doesn't really say much (especially to people who don't
> use Apple stuff very much). There's a range of M1-based systems, from
> MacBook Air/Pro, Mini, Studio, ...


I tried to find out some details of the hardware. It shows that I didn't
even know the (marketing) model name...

Chip: Apple M2, 8 (physical) cores, 24 GB, macOS Tahoe 26.1
Apple SSD AP2048Z (2 TB NVMe), block size 4 kB, protocol: Apple Fabric


> >
> > Well, I was hoping to be able to create a self balancing mechanism
> > in read_stream_next_buffer
> >
> >  /* Do we have to wait for an associated I/O first? */
> > if (stream->ios_in_progress > 0 &&
> > stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
> > {
> >   // prefetch and increase the distance while we wait here
> > WaitReadBuffers(&stream->ios[io_index].op);
> >  ...
> > }
> > ...
> > // this call could be removed if we prefetched earlier.
> > read_stream_look_ahead(stream);
> >
> >
> > There same principle that guided the
> >> Don't wait for already in-progress IO
> > patch. Here we should prioritise increasing the distance, and if it is not
> > possible (maybe we consumed all the buffers). We could take the
> > opportunity to yield.
> >
> >
>
> IIUC the idea would be to (automatically) increase the distance just
> enough so that the IOs complete right before we actually need the
> buffers? That reminds me the "IO cost model" I mentioned a couple days
> ago [1]. But it's not clear to me how this profiling helps with it.
>

Exactly. My investigation actually started from your (Tomas's)
suggestion, more specifically this part:

> I think a "proper" solution would require some sort of cost model for
> the I/O part, so that we can schedule the I/Os just so that the I/O
> completes right before we actually need the page.

We control the distance because it is the knob we have, even
knowing that the distance itself is not what we need. We need
to prefetch enough to make sure the buffers are ready when we
need them, but on the other hand we need to minimise the reads
of buffers we don't need.

It is clear that reading unnecessary blocks could degrade
other queries in two ways: (1) by delaying concurrent tasks
that require I/O, be it queries, maintenance or replication;
(2) by unnecessarily evicting buffers that future queries
would have used.
If the queries in the system read N pages on average, and
we read just one additional page, we get a 1/N overhead. For
N=100 this is negligible, for N<=2 it is not.

Yielding is important, but it is also important that the
executor yields back to the prefetch before the distance
gets too low.

DISTANCE INCREASE

The current distance increase mechanism captures the idea of
_completing right before we actually need the page_ by
increasing the distance when a wait is required (well, reflecting
on it, I am not sure that condition is free of false
positives). I think that increasing exponentially is too
aggressive. If you wait for 10 buffers in a row you will have
a distance of ~1000 (a potential waste of 100x).
Wouldn't it make more sense to increase the distance in
increments based on the I/O combine limit and I/O concurrency?
That way the waste is limited to a constant factor of the
required buffers.
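
To make the comparison concrete, here is a minimal sketch (Python,
purely illustrative, not read_stream.c code) of how the distance
grows over consecutive waits under the two policies. The step of 16
blocks is an arbitrary placeholder of mine, since I have not settled
on how to derive it from the I/O combine limit and I/O concurrency,
and the function name is hypothetical.

```python
def distance_after_waits(waits, policy, step=16, cap=None):
    """Return the distance after each of `waits` consecutive waits.

    policy "double": distance doubles on every wait.
    policy "step":   distance grows by a fixed `step` (assumed value).
    `cap` optionally limits the distance (e.g. a pin limit).
    """
    distance = 1
    history = []
    for _ in range(waits):
        if policy == "double":
            distance *= 2
        elif policy == "step":
            distance += step
        else:
            raise ValueError(f"unknown policy: {policy}")
        if cap is not None:
            distance = min(distance, cap)
        history.append(distance)
    return history


doubling = distance_after_waits(10, "double")
stepped = distance_after_waits(10, "step", step=16)
print("double:", doubling)
print("step:  ", stepped)
```

With these assumptions the stepped policy is ahead for the first few
waits (more initial concurrency) and ends far below the doubling one
after 10 waits.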

DISTANCE DECREASE

Also, the interpretation of the distance going up is different
from the interpretation of the distance going down.
When it goes up, it is an upper bound for the distance;
when it goes down, it tracks the actual distance.
In this sense I agree with something written in

This could explain Andres Freund's observation [1]:

> It seems caused to a significant degree by waiting at low queue depths. If I
> comment out the stream->distance-- in read_stream_start_pending_read() the
> regression is reduced greatly.


CLOSED LOOP THOUGHTS

Here I dump some ideas that we could work on, but again,
this is something that could go in a separate commit, without
blocking this feature.

Let D be the distance that ensures a given scan has its
buffers ready before consuming them. For a long enough scan
we can find d >= D.

Label each operation with the distance (d_io) at which it was started.
Every time we wait for a buffer, we know that d_io < D (by definition),
so set d = max(d, d_io). Notice that the current model increments the
distance based on its current value, not on the distance when the I/O
started. This gives room for even more overshoot, although the prefetch
is not immediate, which probably dampens it. I speculate that
the distance decay was added as a workaround for this.


On the other hand, decreasing the distance when we consume a
buffer brings us back to risky ground, below the value that was
previously seen to be necessary to ensure no waits. Maybe this
decay was introduced just to counteract the excessive increase
caused by feeding back the current distance instead of the
distance when the operations started.

The problem with increasing exponentially is that we may end up
with a large gap between the computed distance and the distance
actually required. Increasing it in steps based on the maximum
I/O concurrency, we would still start as much I/O as could be
useful, while keeping the gap between the distance and the
optimal distance small. It could also be even faster (just a
little bit), by increasing the concurrency on the initial reads,
where the distance would otherwise be 1, 2, 4, 8.

TIME AGNOSTIC MODEL

A proper model would have to distinguish between the distance of
the I/O being started, the distance of the I/O being finished, and
the distance that we are aiming for.

start read:
  save io.distance = number of pinned buffers before this.
loop:
  if(have to wait)
    target_distance = io.distance + concurrency;
    start_prefetch
  if(distance <= min_distance)
    distance += prefetch(concurrency, combine_limit)
    if(buffer consumed)
      distance--
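
A toy simulation of this idea, as a sketch under heavy assumptions
rather than a proposal for the real code: every read takes a fixed
number of ticks, consuming a buffer takes one tick, and the target
only ever goes up (I take the max so it stays monotone, which the
pseudocode does not spell out). The function and parameter names are
made up for the illustration.

```python
from collections import deque


def simulate(n_blocks, io_latency, concurrency=4, initial_target=1):
    """Consume n_blocks in order and return (final target, waits).

    Each I/O remembers the number of buffers already queued when it
    was started (io.distance in the pseudocode). When the consumer
    has to wait for an I/O, the target becomes at least that
    distance plus `concurrency`.
    """
    now = 0
    target = initial_target
    next_block = 0
    waits = 0
    queue = deque()  # entries are (ready_time, distance_at_start)

    for _ in range(n_blocks):
        # Look ahead: start reads until `target` buffers are queued.
        while len(queue) < target and next_block < n_blocks:
            queue.append((now + io_latency, len(queue)))
            next_block += 1

        ready_at, d_io = queue.popleft()
        if ready_at > now:
            # We had to wait, so d_io was too small for this scan.
            waits += 1
            target = max(target, d_io + concurrency)
            now = ready_at
        now += 1  # time spent consuming the buffer

    return target, waits


final_target, total_waits = simulate(n_blocks=1000, io_latency=10)
print(final_target, total_waits)
```

In this toy setup an I/O started with at least io_latency buffers
ahead of it never causes a wait, so the target stops growing within
`concurrency` of that value and the waits only happen while ramping up.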

TIME AWARE MODEL

The IO model could take delays into consideration. I know that
timing has a cost, but hopefully it would be acceptable
to record a timestamp when each IO starts. Then we could
build a model based on an approximation of some quantile of the
distance. If we know that, e.g., 50% of the IOs take longer
than T(50%), then we could time the first few buffer consumptions
to get an estimate, say tc, of the average time per buffer
required for this scan in this query. Then T(50%) / tc is a
distance that leaves us with a 50% chance of having to wait for a read.
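
As a worked example of the T(q) / tc arithmetic, here is a small
sketch. The latencies and the per-buffer time are made-up numbers
chosen only to show the calculation, not measurements, and the
nearest-rank quantile and the helper names are my own choices.

```python
import math


def latency_quantile(latencies, q):
    """Empirical q-quantile (nearest rank) of observed I/O latencies."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def distance_for_quantile(latencies, q, time_per_buffer):
    """Buffers consumed while an I/O with the q-quantile latency runs."""
    return math.ceil(latency_quantile(latencies, q) / time_per_buffer)


# Hypothetical I/O latencies and per-buffer consumption time (tc),
# both in microseconds.
io_latencies_us = [80, 90, 100, 120, 150, 200, 400, 1000]
tc_us = 10

median_distance = distance_for_quantile(io_latencies_us, 0.5, tc_us)
p90_distance = distance_for_quantile(io_latencies_us, 0.9, tc_us)
print(median_distance, p90_distance)
```

Picking a higher quantile trades more prefetched buffers for fewer
waits, which is the same waste versus stall trade-off as before.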

But for this model to be effective we should keep it global.
If we make it per query, then by the time we have tuned the
parameters the query has already gone too far. It could be saved
in the index stats, along with the number of scans, tuples and
heap fetches.

A simpler, local version could be based on the time of the first
IO, but how do we time it? Currently we don't have a callback
to register when it is ready. Let's say that we time it just
after the WaitReadBuffers.

A pseudocode draft that might be clearer than what I said:

init:
  t_start = now();
  buffer_count = 0;
loop:
  buffer_count++;
  if (has to wait) {
    t_wait = now();
    // average consumption time per buffer since the last wait
    t_per_buffer = (t_wait - t_start) / buffer_count;
    // is this wait blocking other scans??
    WaitReadBuffers(io);
    t_ready = now();
    // io.t_start is the timestamp saved when this IO was started
    t_io = t_ready - io.t_start;
    distance_lower_bound = Max(distance_min, io.distance);
    distance_guess = Max(distance_guess, t_io / t_per_buffer);
    t_start = t_ready;
    buffer_count = 0;
  }
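
The same bookkeeping as a small runnable sketch, with the timestamps
passed in explicitly so it can be exercised without a real clock. The
class and method names are hypothetical, and I assume the per-buffer
time is the elapsed time since the last wait divided by the number of
buffers consumed in that window.

```python
import math


class DistanceEstimator:
    """Local, per-scan estimate of the prefetch distance."""

    def __init__(self, now, distance_min=1):
        self.distance_min = distance_min
        self.distance_lower_bound = distance_min
        self.distance_guess = distance_min
        self.t_start = now
        self.buffer_count = 0

    def buffer_consumed(self):
        # Call once per loop iteration, before checking for a wait.
        self.buffer_count += 1

    def waited(self, t_wait, t_ready, io_t_start, io_distance):
        """Update the estimates after waiting from t_wait to t_ready
        for an I/O started at io_t_start with distance io_distance."""
        t_per_buffer = (t_wait - self.t_start) / max(self.buffer_count, 1)
        t_io = t_ready - io_t_start
        self.distance_lower_bound = max(self.distance_min, io_distance)
        if t_per_buffer > 0:
            guess = math.ceil(t_io / t_per_buffer)
            self.distance_guess = max(self.distance_guess, guess)
        # Start a new measurement window.
        self.t_start = t_ready
        self.buffer_count = 0


est = DistanceEstimator(now=0)
for _ in range(4):
    est.buffer_consumed()
# Four buffers in 40 time units, then a wait until t=100 for an I/O
# that was started at t=0 with 3 buffers ahead of it.
est.waited(t_wait=40, t_ready=100, io_t_start=0, io_distance=3)
print(est.distance_guess, est.distance_lower_bound)
```

Only ratios of time differences are used, so any monotonic counter
would do as the clock.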

We have the timing overhead, but notice that this is independent
of the time unit and of the epoch, so some CPU tick count could be
used as a cheaper alternative. And we may only need to do this a
few times.

The distance lower bound would limit the distance decay: we already
know that we had to wait for an IO started at io.distance, so
let's stay above that (unless we approach the scan limit).

I hope this makes sense.


> I wrote about the signal overhead a couple months ago, with a simple
> benchmark simulating it [.]. I also wrote a brief explanation [.] about
> the AIO in PG18, which mentions that too.
>

Added to pending reads :)

[1] https://www.postgresql.org/message-id/qdl4fojnbfcnm2k7b4zpvgd6gwzwdgtbl5c7shpimrb76dbyy6%40scdnspus3ejh

--00000000000093a4da064c06d2c6--