MIME-Version: 1.0
References: 
 <CAE8JnxN_EwnTLLMWGhvgwaomYZ0ysm7NeogA-BqBd=Rs3S7Oqw@mail.gmail.com>
 <64a2re223ajj4popowsyu4xekbuvvyfwkrihn5yzyrkwsmsuvp@2lls3tpww5dl>
 <a67mvhyi2q45eg4eimhpwdg6l3s3dmpahti2svffvmvzwmss27@r4nohusvndbq>
 <c19a18fc-ef2e-4d91-b7ba-576d1891315d@vondra.me>
 <il7jtfowpatrlg33qb5plj7v7pferes4ogerq5fdczszi4kokh@sbwvb2ukfgos>
 <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me>
 <ws47e3wly6skt36b23zy5qfvcxzueo6od3uicunuodsqnxl7os@7v2qi7qkxzbz>
 <CAH2-Wzk-89uCvdJ1Q6NsM6LvDvUEt6Qy66T6A60J=D_voWxZDg@mail.gmail.com>
 <64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep>
 <CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com>
 <issqornf6vdn3vb64fjuoathypmu3e5pgputd3lpfuvoeqyvzr@qfordnhplp2v>
 <CAE8JnxOn4+xUAnce+M7LfZWOqfrMMxasMaEmSKwiKbQtZr65uA@mail.gmail.com>
 <7e707787-272a-4c52-b5f1-5ac990514ecc@vondra.me>
 <c96ba898-02fb-4756-a1c7-0ddb08159804@vondra.me>
In-Reply-To: <c96ba898-02fb-4756-a1c7-0ddb08159804@vondra.me>
From: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Date: Sun, 1 Mar 2026 22:32:53 +0000
Message-ID: 
 <CAE8JnxPtia9m8y7b5s+gOMjZ_3QP=pTo+A6p_HmtrAV4PMo3ZQ@mail.gmail.com>
Subject: Re: index prefetching
To: Tomas Vondra <tomas@vondra.me>
Cc: Andres Freund <andres@anarazel.de>, Peter Geoghegan <pg@bowt.ie>,
	Thomas Munro <thomas.munro@gmail.com>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>,
	Robert Haas <robertmhaas@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: multipart/alternative; boundary="000000000000c7709d064bfe0d86"
Archived-At: 
 <https://www.postgresql.org/message-id/CAE8JnxPtia9m8y7b5s%2BgOMjZ_3QP%3DpTo%2BA6p_HmtrAV4PMo3ZQ%40mail.gmail.com>
Precedence: bulk

--000000000000c7709d064bfe0d86
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <tomas@vondra.me> wrote:

> Hi,
>
> I've decided to run a couple tests, trying to reproduce some of the
> behaviors described in your (Felipe's) messages.
>

Thank you, I will look into this data later. I am impressed by the number of
I/O workers you used; my tests typically ran with 3.

> I'm not trying to redo the tests exactly, because (a) I don't have a M1
> machine, and (b) there's not enough details about the hardware and
> configuration to actually redo it properly.
>

Well, I was running on an M1 because that is what I have in front of me,
but I know that any serious database will run on Linux.


> I've focused on quantifying the impact of a couple things mentioned in
> the previous message:


I will look into this later and compute the effect sizes.

> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)


> There are literally only 4 cases where prefetching does worse than
> master, and those are for random data with distance limit 1. I claim
> this is irrelevant, because it literally disables prefetching while
> still paying the full cost (all 4 are for io_method=worker, where the
> signal overhead can be high, so it's not a surprise).
>

I agree with your claim. The idea of the distance limit was to isolate the
AIO overhead from the benefit of prefetching: I was seeing very similar
results, but once I controlled the distance the prefetch benefit became
visible. The gradation of limits would also show whether performance has a
U shape, or whether a larger distance is always better.
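
To make the "U shape or larger is better" question concrete, here is a minimal sketch that classifies a series of timings ordered by increasing distance limit. The names classify_timings and curve_shape are hypothetical, it applies no tolerance for measurement noise (real benchmark data would need one), and any numbers passed to it are the caller's own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a timing-vs-distance curve. */
typedef enum curve_shape
{
    CURVE_LARGER_IS_BETTER, /* fastest run is at the largest distance */
    CURVE_U_SHAPED,         /* fastest run is at an interior distance */
    CURVE_OTHER             /* anything else, e.g. noisy or increasing */
} curve_shape;

/*
 * timings[i] is the run time measured at the i-th distance limit, with the
 * limits sorted in increasing order. The index of the fastest run is
 * returned through *best.
 */
static curve_shape
classify_timings(const double *timings, size_t n, size_t *best)
{
    size_t min_idx = 0;

    assert(timings != NULL && best != NULL && n > 0);

    /* Find the first occurrence of the smallest timing. */
    for (size_t i = 1; i < n; i++)
        if (timings[i] < timings[min_idx])
            min_idx = i;
    *best = min_idx;

    /* Timings must not increase on the way down to the minimum ... */
    for (size_t i = 1; i <= min_idx; i++)
        if (timings[i] > timings[i - 1])
            return CURVE_OTHER;

    /* ... and must not decrease again after it. */
    for (size_t i = min_idx + 1; i < n; i++)
        if (timings[i] < timings[i - 1])
            return CURVE_OTHER;

    if (min_idx == n - 1)
        return CURVE_LARGER_IS_BETTER;
    if (min_idx > 0)
        return CURVE_U_SHAPED;
    return CURVE_OTHER;
}
```

Feeding it the per-limit timings of one configuration (one row of the results table) would say which of the two shapes that row follows, and at which limit the best time occurs.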

> It's a bit like buying a race horse, break its leg
> and then complain it's not running very fast
>
😂


> The overhead of the instrumentation seems relatively small, probably
> within 5% or so. That's a bit less than I expected, but I still don't
> understand what this is meant to tell us. It's measuring wall-time, and
> it's no surprise that in an I/O-bound workload most of the time is spent
> in functions doing (and waiting for) I/O. Like read_stream_next_buffer.
> But it does not give any indication *why*.
>

Well, I was hoping to be able to create a self-balancing mechanism
in read_stream_next_buffer:

    /* Do we have to wait for an associated I/O first? */
    if (stream->ios_in_progress > 0 &&
        stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
    {
        /* idea: prefetch and increase the distance while we wait here */
        WaitReadBuffers(&stream->ios[io_index].op);
        ...
    }
    ...
    /* this call could be removed if we prefetched earlier */
    read_stream_look_ahead(stream);


This is the same principle that guided the "Don't wait for already
in-progress IO" patch. Here we should prioritise increasing the distance,
and if that is not possible (maybe we have consumed all the buffers), we
could take the opportunity to yield.
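
A minimal sketch of that decision, written as a toy model rather than a patch against read_stream.c. The names toy_stream, toy_action and toy_wait_step are hypothetical, as are the struct fields, and doubling the distance on each wait is an assumption of the sketch.

```c
#include <assert.h>

/* Hypothetical toy state; this is not the real ReadStream struct. */
typedef struct toy_stream
{
    int distance;      /* current look-ahead distance, in blocks */
    int max_distance;  /* upper bound on the look-ahead distance */
    int pinned;        /* buffers currently pinned by this stream */
    int max_pinned;    /* buffer budget available to this stream */
} toy_stream;

typedef enum toy_action
{
    TOY_GROW,   /* more look-ahead was started while waiting */
    TOY_YIELD   /* nothing more can be started; let others run */
} toy_action;

/*
 * Called at the point where the consumer would otherwise block on the
 * oldest in-progress I/O. Growing the distance takes priority; yielding
 * is the fallback when the distance or the buffer budget is exhausted.
 */
static toy_action
toy_wait_step(toy_stream *s)
{
    assert(s->distance >= 1 && s->pinned <= s->max_pinned);

    if (s->distance < s->max_distance && s->pinned < s->max_pinned)
    {
        int grown = s->distance * 2;    /* assumed growth policy */

        if (grown > s->max_distance)
            grown = s->max_distance;
        s->distance = grown;
        s->pinned++;                    /* one more block is now in flight */
        return TOY_GROW;
    }

    return TOY_YIELD;
}
```

In this model a stream with a budget of two buffers grows twice and then reports TOY_YIELD, which is the point where a real implementation could give up its time slice instead of spinning.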


>
> multi-client test (multi-client.tgz)
> ------------------------------------
>
> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * io_workers = (12 | 32)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance =3D (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)
> * number of parallel workers (1, 2, 4, 8)
>

Are the parallel workers here clients issuing queries?

> This all seems perfectly fine to me. The bad behavior would be if the
> prefetching gets slower than master, because that would be a regression
> affecting users. But that happens only in 4 cells in the table.


In that case we also have to look at the other extreme, and run queries
where prefetching is not expected to help. In this sense I agree with Peter
that the yielding logic is important. We may be limiting the potential of
prefetching in some cases, but excessive reads are the highest risk in my
opinion.

You may know better than me, but in the workloads I have seen or worked
with, the typical case is a high number of small queries, not these huge
scans. Large queries are rare, and when they come to our attention it is
because they used too much memory and started to create temporary files.

> (But I'm speculating, I haven't investigated this in detail yet.)
>

Fair enough.

> Moreover, io_uring does not have this issue. Which is another indication
> it's something about the signal overhead.
>

That is interesting.


> In any case, these results clearly show that prefetching can be a huge
> improvement even in environments with concurrent activity, etc.
>
>
> If you see something different on the Mac, you need to investigate why.
> It could be something in the OS, or maybe it's a hardware-specific
> thing (consumer SSDs can choke on too many requests). Hard to say. I
> don't even know what kind of M1 machine you have, what SSD etc.
>

My guess is that the cause is IPC. I don't know in detail how the async I/O
works, but if the I/O happens in a different process, I suspect that macOS
is less efficient at it than Linux. I don't know how to measure that, though.
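
One rough way to probe this would be a ping-pong micro-benchmark between two processes. The sketch below bounces a single byte over a pair of pipes and reports the average round-trip time. The function name pipe_round_trip_ns is hypothetical, and it is only a proxy for the cost of waking another process: it does not reproduce how PostgreSQL's I/O workers are actually signalled.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Average round-trip time, in nanoseconds, for one byte sent to a child
 * process and echoed back over a pair of pipes. Returns -1.0 on error.
 */
static double
pipe_round_trip_ns(int iterations)
{
    int to_child[2];
    int to_parent[2];
    char byte = 'x';
    int ok = 1;
    struct timespec start, end;
    pid_t pid;

    assert(iterations > 0);

    if (pipe(to_child) != 0 || pipe(to_parent) != 0)
        return -1.0;

    pid = fork();
    if (pid < 0)
        return -1.0;

    if (pid == 0)
    {
        /* Child: echo each byte back until the parent closes its end. */
        close(to_child[1]);
        close(to_parent[0]);
        while (read(to_child[0], &byte, 1) == 1)
            if (write(to_parent[1], &byte, 1) != 1)
                break;
        _exit(0);
    }

    /* Parent: close the ends that only the child uses. */
    close(to_child[0]);
    close(to_parent[1]);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations && ok; i++)
    {
        if (write(to_child[1], &byte, 1) != 1 ||
            read(to_parent[0], &byte, 1) != 1)
            ok = 0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    close(to_child[1]);     /* the child sees EOF and exits */
    close(to_parent[0]);
    waitpid(pid, NULL, 0);

    if (!ok)
        return -1.0;

    return ((end.tv_sec - start.tv_sec) * 1e9 +
            (end.tv_nsec - start.tv_nsec)) / iterations;
}
```

Calling pipe_round_trip_ns(100000) on the Mac and on a Linux machine would give a first, crude comparison of cross-process wakeup latency. A large gap would support the IPC guess, while similar numbers would point back at the storage or the OS I/O path.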

Regards,
Alexandre

On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <tomas@vondra.me> wrote:

> Hi,
>
> I've decided to run a couple tests, trying to reproduce some of the
> behaviors described in your (Felipe's) messages.
>
> I'm not trying to redo the tests exactly, because (a) I don't have a M1
> machine, and (b) there's not enough details about the hardware and
> configuration to actually redo it properly.
>
> I've focused on quantifying the impact of a couple things mentioned in
> the previous message:
>
> 1) the distance limit
>
> 2) the profiling instrumentation
>
> 3) concurrency (multiple backends doing I/O)
>
> I wrote a couple scripts to run two benchmarks, one focusing on (1) and
> (2), and the second one focusing on (3).
>
> Both were run on four builds:
>
> 1) master
> 2) patched (index prefetch v11)
> 3) patched-limit (patched + distance limit)
> 4) patched-limit-instrument (patched-limit + instrumentation)
>
> The scripts initialize an instance, vary a couple important parameters
> (shared buffers, io_method, direct I/O, ...) and run index scans on a
> table with either sequential or random data.
>
> I'm attaching the full scripts, raw results, and PDFs with a nicer
> version of the results.
>
>
> single-client test (single-client.tgz)
> --------------------------------------
>
> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)
>
> This was done on an old Xeon machine from ~2016, with a WD Ultrastar DC
> SN640 960GB NVMe SSD.
>
> The single-client.pdf shows the timings for different combinations of
> parameters, branches and distance limit values. There's also a table
> with timing relative to master (100% means the same as master, green =
> good, red = bad).
>
> There are literally only 4 cases where prefetching does worse than
> master, and those are for random data with distance limit 1. I claim
> this is irrelevant, because it literally disables prefetching while
> still paying the full cost (all 4 are for io_method=worker, where the
> signal overhead can be high, so it's not a surprise).
>
> We ramp up the distance exactly for this reason, that's the solution for
> this overhead problem. I refuse to consider these regressions with
> limit=1 a problem. It's a bit like buying a race horse, break its leg
> and then complain it's not running very fast.
>
> The overhead of the instrumentation seems relatively small, probably
> within 5% or so. That's a bit less than I expected, but I still don't
> understand what this is meant to tell us. It's measuring wall-time, and
> it's no surprise that in an I/O-bound workload most of the time is spent
> in functions doing (and waiting for) I/O. Like read_stream_next_buffer.
> But it does not give any indication *why*.
>
>
> multi-client test (multi-client.tgz)
> ------------------------------------
>
> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * io_workers = (12 | 32)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance =3D (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)
> * number of parallel workers (1, 2, 4, 8)
>
> This was done on a Ryzen 9 machine from ~2023, with 4x Samsung 990 PRO
> 1TB drives in RAID0.
>
> The test prepares a separate table for each worker, and then runs the
> index scans concurrently (and "syncs" the workers to start at the same
> time). It measures the duration, and we can compare it to the timing
> from master (without prefetching).
>
> The multi-client-full.pdf has detailed results for all parameters, but
> as I said I don't think the distance limit (particularly for limit 1) is
> interesting.
>
> The multi-client-simple.pdf shows only results for limit=0 (i.e. without
> limit), and is hopefully easier to understand. The first table shows
> timings for each combination, the second table shows timing relative to
> master (for the same number of workers etc.).
>
> The results are pretty positive. For random data (which is about the
> worst case for I/O), it's consistently faster than master. Yes, the
> gains with 8 workers are not as significant as with 1 worker. For
> example, it may look like this:
>
>                master      prefetch   relative
>    1 worker:     2960          1898       64%
>    8 workers:    5585          5361       96%
>
> But that's not a huge surprise. The storage has a limited throughput,
> and at some point it gets saturated. Whether it's by prefetching, or by
> having multiple workers does not matter.
>
> For sequential data (which is what you did in your examples) it's much
> simpler. For buffered I/O there's not much benefit, because the page cache
> does read-ahead with mostly the same effect, while there's a nice consistent
> speedup for direct I/O.
>
> This all seems perfectly fine to me. The bad behavior would be if the
> prefetching gets slower than master, because that would be a regression
> affecting users. But that happens only in 4 cells in the table. My guess
> is it hits some limit on the number of signals the system can process.
> The random data set is not great for this, it's worse with more workers,
> and the 128MB buffers make that even worse. This is a bit of perfect
> storm, and it's already there - bitmap scans can hit that too, AFAICS.
>
> (But I'm speculating, I haven't investigated this in detail yet.)
>
> Moreover, io_uring does not have this issue. Which is another indication
> it's something about the signal overhead.
>
> In any case, these results clearly show that prefetching can be a huge
> improvement even in environments with concurrent activity, etc.
>
>
> If you see something different on the Mac, you need to investigate why.
> It could be something in the OS, or maybe it's a hardware-specific
> thing (consumer SSDs can choke on too many requests). Hard to say. I
> don't even know what kind of M1 machine you have, what SSD etc.
>
>
> regards
>
> --
> Tomas Vondra
>

--000000000000c7709d064bfe0d86--