From: Alexandre Felipe
Date: Sun, 1 Mar 2026 22:32:53 +0000
Subject: Re: index prefetching
To: Tomas Vondra
Cc: Andres Freund, Peter Geoghegan, Thomas Munro, Nazir Bilal Yavuz, Robert Haas, Melanie Plageman, PostgreSQL Hackers, Georgios, Konstantin Knizhnik, Dilip Kumar
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000c7709d064bfe0d86"

--000000000000c7709d064bfe0d86
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <tomas@vondra.me> wrote:
> Hi,
>
> I've decided to run a couple tests, trying to reproduce some of the
> behaviors described in your (Felipe's) messages.

Thank you,

I will look into this data later. I am impressed with the number of IO workers
you used; my tests typically ran with 3.

> I'm not trying to redo the tests exactly, because (a) I don't have a M1
> machine, and (b) there's not enough details about the hardware and
> configuration to actually redo it properly.

Well, I was running on an M1 because this is what I have in front of me,
but I know that any serious database will run on Linux.

> I've focused on quantifying the impact of a couple things mentioned in
> the previous message:

I will have a look into this later and compute the effect size.

> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)

> There are literally only 4 cases where prefetching does worse than
> master, and those are for random data with distance limit 1. I claim
> this is irrelevant, because it literally disables prefetching while
> still paying the full cost (all 4 are for io_method=worker, where the
> signal overhead can be high, so it's not a surprise).

I agree with your claim. The idea of the distance limit was to isolate
the AIO overhead from the benefit of prefetching, because I was seeing
very similar results, but when I controlled the distance the prefetch
benefit became visible. The gradation would also show whether the curve
has a U shape or whether a larger distance always performs better.
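
A minimal sketch of that last check, assuming one timing per distance limit, ordered by increasing limit. The names `CurveShape` and `classify_distance_curve` are made up for illustration, and a single timing per limit is an assumption; real runs would need repetitions to separate the shape from noise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a timing-versus-distance curve. */
typedef enum CurveShape
{
    CURVE_LARGER_IS_BETTER,     /* fastest run is at the largest distance */
    CURVE_U_SHAPED,             /* fastest run is at an interior distance */
    CURVE_SMALLER_IS_BETTER     /* fastest run is at the smallest distance */
} CurveShape;

/*
 * timings_ms[i] is the runtime measured with the i-th distance limit,
 * with the limits sorted in increasing order.
 */
static CurveShape
classify_distance_curve(const double *timings_ms, size_t n)
{
    size_t      best = 0;
    size_t      i;

    assert(timings_ms != NULL && n >= 2);

    /* Find the position of the fastest run. */
    for (i = 1; i < n; i++)
    {
        if (timings_ms[i] < timings_ms[best])
            best = i;
    }

    if (best == n - 1)
        return CURVE_LARGER_IS_BETTER;
    if (best == 0)
        return CURVE_SMALLER_IS_BETTER;
    return CURVE_U_SHAPED;
}
```

Looking only at where the minimum falls is deliberately crude; it answers the "U shape or not" question without assuming any particular model of the overhead.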

> It's a bit like buying a race horse, break its leg
> and then complain it's not running very fast

😂


> The overhead of the instrumentation seems relatively small, probably
> within 5% or so. That's a bit less than I expected, but I still don't
> understand what this is meant to tell us. It's measuring wall-time, and
> it's no surprise that in an I/O-bound workload most of the time is spent
> in functions doing (and waiting for) I/O. Like read_stream_next_buffer.
> But it does not give any indication *why*.

Well, I was hoping to be able to create a self-balancing mechanism
in read_stream_next_buffer:

    /* Do we have to wait for an associated I/O first? */
    if (stream->ios_in_progress > 0 &&
        stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
    {
        // prefetch and increase the distance while we wait here
        WaitReadBuffers(&stream->ios[io_index].op);
        ...
    }
    ...
    // this call could be removed if we prefetched earlier.
    read_stream_look_ahead(stream);

This is the same principle that guided the "Don't wait for already
in-progress IO" patch. Here we should prioritise increasing the distance,
and if that is not possible (maybe we consumed all the buffers), we could
take the opportunity to yield.
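
The decision described above could be sketched roughly as follows. This is a toy model and not PostgreSQL code: `ToyStream`, `ToyAction` and `toy_before_wait` are hypothetical names, the fields are simplified stand-ins for the read stream state, and doubling the distance is just one assumed growth policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the read stream state. */
typedef struct ToyStream
{
    int         distance;           /* current look-ahead distance */
    int         max_distance;       /* cap, e.g. limited by available buffers */
    int         ios_in_progress;    /* I/Os started but not yet waited for */
    int         max_ios;            /* cap on concurrent I/Os */
} ToyStream;

typedef enum ToyAction
{
    TOY_NO_WAIT,        /* nothing in flight, no need to wait */
    TOY_LOOK_AHEAD,     /* started more I/O before waiting */
    TOY_YIELD           /* could not grow, so yield before waiting */
} ToyAction;

/*
 * Called at the point where the consumer is about to wait for the oldest
 * I/O. Prefer growing the look-ahead; yield only if growing is impossible.
 */
static ToyAction
toy_before_wait(ToyStream *stream)
{
    assert(stream != NULL);

    if (stream->ios_in_progress == 0)
        return TOY_NO_WAIT;

    if (stream->distance < stream->max_distance &&
        stream->ios_in_progress < stream->max_ios)
    {
        /* Double the distance (assumed policy), clamped to the cap. */
        int         grown = stream->distance > 0 ? stream->distance * 2 : 1;

        if (grown > stream->max_distance)
            grown = stream->max_distance;
        stream->distance = grown;
        stream->ios_in_progress++;  /* models issuing one more read */
        return TOY_LOOK_AHEAD;
    }

    /* Buffers or I/O slots are exhausted. */
    return TOY_YIELD;
}
```

The point of the sketch is only the ordering of the two choices: try to issue more look-ahead first, and fall back to yielding when the caps are reached.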



> multi-client test (multi-client.tgz)
> ------------------------------------
>
> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * io_workers = (12 | 32)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)
> * number of parallel workers (1, 2, 4, 8)

Are parallel workers here clients issuing queries?

> This all seems perfectly fine to me. The bad behavior would be if the
> prefetching gets slower than master, because that would be a regression
> affecting users. But that happens only in 4 cells in the table.
In this case we also have to look at the other extreme and run queries
where prefetching is not expected to help. In this sense I agree with Peter
that the yielding logic is important. We may be limiting the potential of
prefetching in some cases, but excessive reads are the highest risk in my
opinion. You may know better than me, but the workloads I have seen or
worked with typically consist of a high number of small queries, not these
huge scans. Large queries are rare, and when they come to our attention it
is because they used too much memory and started to create temporary files.

> (But I'm speculating, I haven't investigated this in detail yet.)

Fair enough.

> Moreover, io_uring does not have this issue. Which is another indication
> it's something about the signal overhead.

That is interesting.

> In any case, these results clearly show prefetching can be a huge improvement
> even in environments with concurrent activity, etc.


> If you see something different on the Mac, you need to investigate why.
> It could be something in the OS, or maybe it's a hardware specific
> thing (consumer SSDs can choke on too many requests). Hard to say. I
> don't even know what kind of M1 machine you have, what SSD etc.

My guess is that the cause is IPC. I don't know well how the async IO
works, but if it involves a different process, I think macOS is less
efficient than Linux at that. But I don't know how to measure that.
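
One rough proxy would be to time a one-byte ping-pong between two processes over a pair of pipes and compare the result on both systems. This is only a sketch under the assumption that cross-process wakeup latency is the relevant cost; it does not exercise the actual worker signalling path in PostgreSQL, and `pipe_roundtrip_ns` is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Returns the mean round-trip time in nanoseconds for one byte sent to a
 * child process and echoed back, or -1 on error.
 */
static int64_t
pipe_roundtrip_ns(int iterations)
{
    int         to_child[2];
    int         to_parent[2];
    char        byte = 'x';
    struct timespec start;
    struct timespec end;
    int64_t     elapsed_ns;
    pid_t       pid;
    int         ok = 1;
    int         i;

    if (iterations <= 0)
        return -1;
    if (pipe(to_child) != 0)
        return -1;
    if (pipe(to_parent) != 0)
    {
        close(to_child[0]);
        close(to_child[1]);
        return -1;
    }

    pid = fork();
    if (pid < 0)
    {
        close(to_child[0]);
        close(to_child[1]);
        close(to_parent[0]);
        close(to_parent[1]);
        return -1;
    }

    if (pid == 0)
    {
        /* Child: echo every byte back until the parent closes its end. */
        close(to_child[1]);
        close(to_parent[0]);
        while (read(to_child[0], &byte, 1) == 1)
        {
            if (write(to_parent[1], &byte, 1) != 1)
                break;
        }
        _exit(0);
    }

    /* Parent: keep only the ends it uses. */
    close(to_child[0]);
    close(to_parent[1]);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations && ok; i++)
    {
        if (write(to_child[1], &byte, 1) != 1 ||
            read(to_parent[0], &byte, 1) != 1)
            ok = 0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    close(to_child[1]);         /* EOF makes the child exit its loop */
    close(to_parent[0]);
    waitpid(pid, NULL, 0);

    if (!ok)
        return -1;

    elapsed_ns = (int64_t) (end.tv_sec - start.tv_sec) * 1000000000LL +
        (int64_t) (end.tv_nsec - start.tv_nsec);
    assert(elapsed_ns >= 0);    /* the monotonic clock must not go backwards */

    return elapsed_ns / iterations;
}
```

Calling `pipe_roundtrip_ns(100000)` on the Mac and on a Linux machine would give two numbers to compare, with the caveat that different hardware also changes the result.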

Regards,
Alexandre

On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <tomas@vondra.me> wrote:
> Hi,
>
> I've decided to run a couple tests, trying to reproduce some of the
> behaviors described in your (Felipe's) messages.
>
> I'm not trying to redo the tests exactly, because (a) I don't have a M1
> machine, and (b) there's not enough details about the hardware and
> configuration to actually redo it properly.
>
> I've focused on quantifying the impact of a couple things mentioned in
> the previous message:
>
> 1) the distance limit
>
> 2) the profiling instrumentation
>
> 3) concurrency (multiple backends doing I/O)
>
> I wrote a couple scripts to run two benchmarks, one focusing on (1) and
> (2), and the second one focusing on (3).
>
> Both were run on four builds:
>
> 1) master
> 2) patched (index prefetch v11)
> 3) patched-limit (patched + distance limit)
> 4) patched-limit-instrument (patched-limit + instrumentation)
>
> The scripts initialize an instance, vary a couple important parameters
> (shared buffers, io_method, direct I/O, ...) and run index scans on a
> table with either sequential or random data.
>
> I'm attaching the full scripts, raw results, and PDFs with a nicer
> version of the results.


> single-client test (single-client.tgz)
> --------------------------------------
>
> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)
>
> This was done on an old Xeon machine from ~2016, with a WD Ultrastar DC
> SN640 960GB NVMe SSD.
>
> The single-client.pdf shows the timings for different combinations of
> parameters, branches and distance limit values. There's also a table
> with timing relative to master (100% means the same as master, green =
> good, red = bad).
>
> There are literally only 4 cases where prefetching does worse than
> master, and those are for random data with distance limit 1. I claim
> this is irrelevant, because it literally disables prefetching while
> still paying the full cost (all 4 are for io_method=worker, where the
> signal overhead can be high, so it's not a surprise).
>
> We ramp up the distance exactly for this reason, that's the solution for
> this overhead problem. I refuse to consider these regressions with
> limit=1 a problem. It's a bit like buying a race horse, break its leg
> and then complain it's not running very fast.
>
> The overhead of the instrumentation seems relatively small, probably
> within 5% or so. That's a bit less than I expected, but I still don't
> understand what this is meant to tell us. It's measuring wall-time, and
> it's no surprise that in an I/O-bound workload most of the time is spent
> in functions doing (and waiting for) I/O. Like read_stream_next_buffer.
> But it does not give any indication *why*.


> multi-client test (multi-client.tgz)
> ------------------------------------
>
> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * io_workers = (12 | 32)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)
> * number of parallel workers (1, 2, 4, 8)
>
> This was done on a Ryzen 9 machine from ~2023, with 4x Samsung 990 PRO
> 1TB drives in RAID0.
>
> The test prepares a separate table for each worker, and then runs the
> index scans concurrently (and "syncs" the workers to start at the same
> time). It measures the duration, and we can compare it to the timing
> from master (without prefetching).
>
> The multi-client-full.pdf has detailed results for all parameters, but
> as I said I don't think the distance limit (particularly for limit 1) is
> interesting.
>
> The multi-client-simple.pdf shows only results for limit=0 (i.e. without
> limit), and is hopefully easier to understand. The first table shows
> timings for each combination, the second table shows timing relative to
> master (for the same number of workers etc.).
>
> The results are pretty positive. For random data (which is about the
> worst case for I/O), it's consistently faster than master. Yes, the
> gains with 8 workers are not as significant as with 1 worker. For
> example, it may look like this:
>
>                 master    prefetch
>    1 worker:      2960        1898     64%
>    8 workers:     5585        5361     96%
>
> But that's not a huge surprise. The storage has a limited throughput,
> and at some point it gets saturated. Whether it's by prefetching, or by
> having multiple workers does not matter.
>
> For sequential data (which is what you did in your examples) it's much
> simpler. For buffered there's not much benefit, because page cache does
> read-ahead with mostly the same effect, or there's a nice consistent
> speedup for direct I/O.

> This all seems perfectly fine to me. The bad behavior would be if the
> prefetching gets slower than master, because that would be a regression
> affecting users. But that happens only in 4 cells in the table. My guess
> is it hits some limit on the number of signals the system can process.
> The random data set is not great for this, it's worse with more workers,
> and the 128MB buffers make that even worse. This is a bit of perfect
> storm, and it's already there - bitmap scans can hit that too, AFAICS.
>
> (But I'm speculating, I haven't investigated this in detail yet.)
>
> Moreover, io_uring does not have this issue. Which is another indication
> it's something about the signal overhead.
>
> In any case, these results clearly show prefetching can be a huge improvement
> even in environments with concurrent activity, etc.
>
>
> If you see something different on the Mac, you need to investigate why.
> It could be something in the OS, or maybe it's a hardware specific
> thing (consumer SSDs can choke on too many requests). Hard to say. I
> don't even know what kind of M1 machine you have, what SSD etc.
>
>
> regards
>
> --
> Tomas Vondra
--000000000000c7709d064bfe0d86--