Message-ID: <e3f17ba9-7160-4fc3-a756-03ca27d148f1@vondra.me>
Date: Mon, 16 Feb 2026 09:41:01 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: index prefetching
To: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Cc: Andres Freund <andres@anarazel.de>, Peter Geoghegan <pg@bowt.ie>,
 Thomas Munro <thomas.munro@gmail.com>, Nazir Bilal Yavuz
 <byavuz81@gmail.com>, Robert Haas <robertmhaas@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>,
 PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
 Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
References: <CAH2-WzkC_EdGB_tdEQr63BiOuJbcvZiV=xO+OJYux4wOaxPD8w@mail.gmail.com>
 <CAH2-Wzn8whv-RE4E5L2WQB_Ha_jy0UReJq1yypu83gDYHeXq5g@mail.gmail.com>
 <CAH2-Wzm7-QuDOs6TcqfhhDsGEZCuHtn=D-SriOTnTZ_fiXNBvA@mail.gmail.com>
 <CAH2-WzmH7pVQ0-mYAxb82aWbz29_BiBPq2wV5p7+1o2sRFqDRQ@mail.gmail.com>
 <CAH2-Wz=6a7fGz2rALDX+xiFDuEaGQWpZ49xEaBUDKiPH8gcL+Q@mail.gmail.com>
 <CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com>
 <CAH2-WzmymSyOt5Y2RGbm6cJXg18J_ttfqjdcpodHe6Gp23ConQ@mail.gmail.com>
 <CAH2-Wznv9_KGqHQ1vCW2pkiA6QskBGcx5NC_-UXnD6GEQasvAQ@mail.gmail.com>
 <CAE8JnxN_EwnTLLMWGhvgwaomYZ0ysm7NeogA-BqBd=Rs3S7Oqw@mail.gmail.com>
 <64a2re223ajj4popowsyu4xekbuvvyfwkrihn5yzyrkwsmsuvp@2lls3tpww5dl>
 <a67mvhyi2q45eg4eimhpwdg6l3s3dmpahti2svffvmvzwmss27@r4nohusvndbq>
 <c19a18fc-ef2e-4d91-b7ba-576d1891315d@vondra.me>
 <CAE8JnxNOV9kOgmU1-WUWts9Q-Jj_Nf0K480wyEwJXUQYMnYu3g@mail.gmail.com>
Content-Language: en-US
From: Tomas Vondra <tomas@vondra.me>
In-Reply-To: <CAE8JnxNOV9kOgmU1-WUWts9Q-Jj_Nf0K480wyEwJXUQYMnYu3g@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Archived-At: <https://www.postgresql.org/message-id/e3f17ba9-7160-4fc3-a756-03ca27d148f1%40vondra.me>
Precedence: bulk


On 2/16/26 06:30, Alexandre Felipe wrote:
> Thank you for your comments Tomas and Andres,
> 
> I will try to be brief.
> 
> ## Large table MacOS test
> 
> I did a 1000x larger test in MacOS, with the sql script . prefetch had
> negligible effect for random and periodic, and made sequential 16%
> slower [a].
> 

How did you do that? Did you increase the number of rows, make the rows
wider (by increasing the 'repeat' parameter in the script), or something
else? Did you verify the table really is 1000x larger?

> 
> ## Small scale linux test.
> 
> I have repeated the tests with a 10k table row on a linux system, this
> time I tried using either a SSD or a HDD, with shared_buffers of either
> 128kB or 128MB, and either psql or python with psycopg. I used a
> slightly different table [b].
> 

The "10k table row" means repeat('x',10000) when generating data? Oh, I
see you're using some string_agg(), to make it not compress. But note
that if it's TOASTed, it become entirely irrelevant for the prefetching
test because it's in a separate relation.

Unfortunately, you have not included the new script, so we can't try
reproducing your results.

128kB shared buffers is a little bit ... insane. I refuse to optimize
anything for this value, and I don't even call about regressions. Even
128MB is not really practical, any serious system caring about
performance will use tens or hundreds of GBs of shared buffers.

> Prefetch makes...
> 
> SDDwith low available buffers is significantly slower, except for
> sequential reads.
> SDD with cold OS reads is significantly faster for random access.
> HDD faster, but not reliably (high variance).
> > I am not questioning the usefulness of the patch, and I know that there
> is a log of work already put into it. The reason why I decided to review
> it is because I believe this can be one important step forward. But I
> hope that it is not hard to agree that these (counter)examples suggest
> that there are some edges to be pruned. Where I work, most of the
> queries will access at most a few hundred lines and are expected to be
> perceived as instantaneous.
> 

No comment unless I can reproduce these results.

> If the tests I am doing are pointless, should we consider having
> something in the planner to prevent these scans from using prefetch?
> 

How would you do that? Please explain.

The planner would need to know things it fundamentally does not (and
can't) know - for example it does not know if it'll touch only cached
data, or the particular I/O pattern (which may matter a lot).

There are runtime provisions to not enable prefetching in some cases,
and we may improve them in the future. But all of this is heuristics,
and every heuristics has a set of adversary cases.

> Should we introduce centralized coordination for IO? As far as I know
> this is an area where we just let each query request what they need and
> hope for the best. What happens if we have two sequential scans in
> different tables? the disk access could interleave pages of the two
> scans, falling into a random access pattern, right?
> 

The patch does nothing about that, it's the same as now. And we have no
ambition to work on that.

> ### Cache control
> 
> This is a way to make the script run without sudo in linux, you give
> ownership to root, and then you pin this program to the owner
> % gcc drop_cache.c -o drop_cache;
> % sudo chown root:root drop_cache;
> % sudo chmod 4755 drop_cache;
> 
> In MacOS  purge in the sudoers[2] temporarily, similar to [3]
> user ALL=(ALL) NOPASSWD: /usr/bin/purge
> So that I don't need to give sudo privileges to the script (that imports
> code that I am not even aware of).
> 

... or you could modify the script to simply use sudo.

> 
> Notes:
> 
> [a] I did some profiling with sample [4], and tried to spot functions
> with the highest increase or decrease in run time, but I was too
> confused, no point in dumping raw logs here.
> 
> [b] This time I used a (SELECT string_agg((i*j)::text, '+') FROM
> generate_series(1, 50)) instead of repeat('x', 100), just to prevent it
> from compressing to nothing when I try larger payloads, and hit the
> TOAST thresholds. I removed the primary key `id` because it was annoying
> to take 20 minutes to insert the data in the large scale test.
> 

It's better to load the data first, and then create indexes.


regards

-- 
Tomas Vondra