Feedback-ID: id4a34324:Fastmail
Date: Mon, 16 Feb 2026 12:15:20 -0500
From: Andres Freund <andres@anarazel.de>
To: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Cc: Tomas Vondra <tomas@vondra.me>, Peter Geoghegan <pg@bowt.ie>, 
	Thomas Munro <thomas.munro@gmail.com>, Nazir Bilal Yavuz <byavuz81@gmail.com>, 
	Robert Haas <robertmhaas@gmail.com>, Melanie Plageman <melanieplageman@gmail.com>, 
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, Georgios <gkokolatos@protonmail.com>, 
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Subject: Re: index prefetching
Message-ID: <uwwucxl5psl5ycwnebhn3pwfyb7jdjrrwgif2yqqtboeuscsfh@lo33ijtfdrbd>
References: <CAH2-WzmH7pVQ0-mYAxb82aWbz29_BiBPq2wV5p7+1o2sRFqDRQ@mail.gmail.com>
 <CAH2-Wz=6a7fGz2rALDX+xiFDuEaGQWpZ49xEaBUDKiPH8gcL+Q@mail.gmail.com>
 <CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com>
 <CAH2-WzmymSyOt5Y2RGbm6cJXg18J_ttfqjdcpodHe6Gp23ConQ@mail.gmail.com>
 <CAH2-Wznv9_KGqHQ1vCW2pkiA6QskBGcx5NC_-UXnD6GEQasvAQ@mail.gmail.com>
 <CAE8JnxN_EwnTLLMWGhvgwaomYZ0ysm7NeogA-BqBd=Rs3S7Oqw@mail.gmail.com>
 <64a2re223ajj4popowsyu4xekbuvvyfwkrihn5yzyrkwsmsuvp@2lls3tpww5dl>
 <a67mvhyi2q45eg4eimhpwdg6l3s3dmpahti2svffvmvzwmss27@r4nohusvndbq>
 <c19a18fc-ef2e-4d91-b7ba-576d1891315d@vondra.me>
 <CAE8JnxNOV9kOgmU1-WUWts9Q-Jj_Nf0K480wyEwJXUQYMnYu3g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CAE8JnxNOV9kOgmU1-WUWts9Q-Jj_Nf0K480wyEwJXUQYMnYu3g@mail.gmail.com>
Archived-At: <https://www.postgresql.org/message-id/uwwucxl5psl5ycwnebhn3pwfyb7jdjrrwgif2yqqtboeuscsfh%40lo33ijtfdrbd>
Precedence: bulk

Hi,

On 2026-02-16 05:30:00 +0000, Alexandre Felipe wrote:
> ## Large table MacOS test
> 
> I did a 1000x larger test in MacOS, with the sql script . prefetch had
> negligible effect for random and periodic, and made sequential 16% slower
> [a].

Quite surprised by that, I don't see that at all.


> ## Small scale linux test.
>
> I have repeated the tests with a 10k table row on a linux system, this time
> I tried using either a SSD or a HDD, with shared_buffers of either 128kB or
> 128MB, and either psql or python with psycopg. I used a slightly different
> table [b].

128kB is a useless test. That's 16 buffers. Just about any real workload will
trigger errors about having too many buffer pinned, independent of this
patch. We support it because it's useful for some testing, not because anybody
should ever use postgres that way.

We should not spend a single second optimizing performance on such toy setups.


Even 128MB borders on irrelevant for anything performance sensitive. It
actually will limit how much AIO you can do (due to now allowing one session
to pin all buffers). If you have a workload that is at all IO bound, you
should use a significantly larger s_b.


> Prefetch makes...
>
> SDD with low available buffers is significantly slower, except for
> sequential reads.
> SDD with cold OS reads is significantly faster for random access.
> HDD faster, but not reliably (high variance).
>
> I am not questioning the usefulness of the patch, and I know that there is
> a log of work already put into it. The reason why I decided to review it is
> because I believe this can be one important step forward. But I hope that
> it is not hard to agree that these (counter)examples suggest that there are
> some edges to be pruned. Where I work, most of the queries will access
> at most a few hundred lines and are expected to be perceived as
> instantaneous.

I think some of the tests do show something useful (like readahead not
happening for some cases where it should).  But testing with pointless
configurations is not useful.


> If the tests I am doing are pointless, should we consider having something
> in the planner to prevent these scans from using prefetch?

No, that'd just be wasted cycles.


> Should we introduce centralized coordination for IO?

I don't think it's any of the next few steps. The amount of contention you can
create by doing that is significant, so you have to be really careful.


There *are* interesting bits around more central coordination of IOs:

I'd e.g. love allowing autovacuum to do IO more aggressively as long as the
system otherwise is idle IO wise. That's much better than having an upper cap
on how much IO AV is allowed to do at all time, because it allows to use spare
IO capacity when available *and* scales down AV activity when the system is
otherwise very busy.

But it's decidedly nontrivial to do right, and it requires every IO to go
through the AIO subsystem, which isn't yet the case (buffered writes & all WAL
IO don't yet, for example).


> As far as I know this is an area where we just let each query request what
> they need and hope for the best. What happens if we have two sequential
> scans in different tables?  the disk access could interleave pages of the
> two scans, falling into a random access pattern, right?

Each stream would generate 128kB sized reads (that's the default
io_combine_limit) that would then be interspersed between the two
streams. Because of the larger size that's fine.

Central coordination of IOs couldn't do anything better either. You have,
after all, two streams processing data that are on different parts of the
disk. So you will have some non-sequential IO.


> Notes:
>
> [a] I did some profiling with sample [4], and tried to spot functions with
> the highest increase or decrease in run time, but I was too confused, no
> point in dumping raw logs here.
>
> [b] This time I used a (SELECT string_agg((i*j)::text, '+') FROM
> generate_series(1, 50)) instead of repeat('x', 100), just to prevent it
> from compressing to nothing when I try larger payloads, and hit the TOAST
> thresholds. I removed the primary key `id` because it was annoying to take
> 20 minutes to insert the data in the large scale test.

FWIW, because your query uses length(payload), it actually ends up spending a
lot of CPU time in that, reducing the effect size of IO boundedness.

postgres[1077180][1]=# EXPLAIN (ANALYZE) SELECT length(payload) FROM prefetch_test_data_1000000 ORDER BY sequential;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                               QUERY PLAN                                                                               │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan using idx_sequential_1000000 on prefetch_test_data_1000000  (cost=0.42..59733.43 rows=1000000 width=8) (actual time=0.046..543.910 rows=1000000.00 loops=1) │
│   Index Searches: 1                                                                                                                                                    │
│   Buffers: shared hit=33985                                                                                                                                            │
│ Planning Time: 0.125 ms                                                                                                                                                │
│ Execution Time: 576.192 ms                                                                                                                                             │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


postgres[1077180][1]=# EXPLAIN (ANALYZE) SELECT id FROM prefetch_test_data_1000000 ORDER BY sequential;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                               QUERY PLAN                                                                               │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan using idx_sequential_1000000 on prefetch_test_data_1000000  (cost=0.42..57233.43 rows=1000000 width=8) (actual time=0.036..165.802 rows=1000000.00 loops=1) │
│   Index Searches: 1                                                                                                                                                    │
│   Buffers: shared hit=33985                                                                                                                                            │
│ Planning Time: 0.119 ms                                                                                                                                                │
│ Execution Time: 199.965 ms                                                                                                                                             │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


Greetings,

Andres Freund