Message-ID: <0dd33755-cab8-49c8-b1ed-698732577fbb@vondra.me>
Date: Tue, 26 Aug 2025 17:06:11 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: index prefetching
To: Andres Freund <andres@anarazel.de>
Cc: Peter Geoghegan <pg@bowt.ie>, Thomas Munro <thomas.munro@gmail.com>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>, Robert Haas <robertmhaas@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>,
 PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
 Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
References: <DC2DFSYXBDSJ.2N7WXLD4JT05H@bowt.ie>
 <6wyxbnry2unm3kbcu2sabhzhs7baoedlg77xqm42chpofjq45g@igst42zpl7ok>
 <CAH2-WzntgDeopLJpyEbUh23Qr1vgoYv5jbFkYsymTScEKxBj7A@mail.gmail.com>
 <CAH2-WzkaTHg2X9R-gLRNBEoL82t2mkrQq-3f=y3GAzrj40fFZw@mail.gmail.com>
 <kvyser45imw3xmisfvpeoshisswazlzw35el3fq5zg73zblpql@f56enfj45nf7>
 <CAH2-WzkWNtCRTcUajGYrCkp9-+btteAthg21BzxbKV09AJuSrA@mail.gmail.com>
 <CAH2-WzkgkvbN_GqR+pfE7uKwhWxQ6h4jst7Rpjgrt68Vc1=FDA@mail.gmail.com>
 <5v2wuxg65l5e3s6uf373zskcqqoukmraxiucnvgn4t7b5cmeqx@5mhqsurdj6xn>
 <6butbqln6ewi5kuxz3kfv2mwomnlgtate4mb4lpa7gb2l63j4t@stlwbi2dvvev>
 <ce485718-f49e-4cbc-99de-5b935fe98926@vondra.me>
 <y32ow7vfdrrcchxrwgfr5tbwg3ncm4f3rlp65svcbqorbm3tmn@cwvipbplnqyt>
Content-Language: en-US
From: Tomas Vondra <tomas@vondra.me>
In-Reply-To: <y32ow7vfdrrcchxrwgfr5tbwg3ncm4f3rlp65svcbqorbm3tmn@cwvipbplnqyt>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Archived-At: 
 <https://www.postgresql.org/message-id/0dd33755-cab8-49c8-b1ed-698732577fbb%40vondra.me>
Precedence: bulk


On 8/26/25 01:48, Andres Freund wrote:
> Hi,
> 
> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote:
>> Thanks. Based on the testing so far, the patch seems to be a substantial
>> improvement. What's needed to make this prototype committable?
> 
> Mainly some testing infrastructure that can trigger this kind of stream. The
> logic is too finnicky for me to commit it without that.
> 

So, what would that look like? The "naive" approach to testing is to
simply generate a table/index, producing the right sequence of blocks.
That shouldn't be too hard, it'd be enough to have an index that

- has ~2-3 rows per value, on different heap pages
- the values "overlap", e.g. like this (value,page)

   (A,1), (A,2), (A,3), (B,2), (B,3), (B,4), ...

Another approach would be to test this at C level, sidestepping the
query execution entirely. We'd have a "stream generator" that just
generates a sequence of blocks of our own choosing (could be hard-coded,
some pattern, read from a file ...), and feed it into a read stream.

But how would we measure success for these tests? I don't think we want
to look at query duration, that's very volatile.

> 
>> I assume this is PG19+ improvement, right? It probably affects PG18 too,
>> but it's harder to hit / the impact is not as bad as on PG19.
> 
> Yea. It does apply to 18 too, but I can't come up with realistic scenarios
> where it's a real issue. I can repro a slowdown when using many parallel
> seqscans with debug_io_direct=data - but that's even slower in 17...
> 

Makes sense.

> 
>> On a related note, my test that generates random datasets / queries, and
>> compares index prefetching with different io_method values found a
>> pretty massive difference between worker and io_uring. I wonder if this
>> might be some issue in io_method=worker.
> 
>> while with index prefetching (with the aio prototype patch), it looks
>> like this:
>>
>>                                 QUERY PLAN
>>   ----------------------------------------------------------------------
>>    Index Scan using idx on t (actual rows=9048576.00 loops=1)
>>      Index Cond: ((a >= 16150) AND (a <= 4540437))
>>      Index Searches: 1
>>      Prefetch Distance: 2.032
>>      Prefetch Count: 868165
>>      Prefetch Stalls: 2140228
>>      Prefetch Skips: 6039906
>>      Prefetch Resets: 0
>>      Stream Ungets: 0
>>      Stream Forwarded: 4
>>      Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>>      Buffers: shared hit=2577599 read=455610
>>    Planning:
>>      Buffers: shared hit=78 read=26 dirtied=1
>>    Planning Time: 1.032 ms
>>    Execution Time: 3150.578 ms
>>   (16 rows)
>>
>> So it's about 2x slower. The prefetch distance collapses, because
>> there's a lot of cache hits (about 50% of requests seem to be hits of
>> already visited blocks). I think that's a problem with how we adjust the
>> distance, but I'll post about that separately.
>>
>> Let's try to simply set io_method=io_uring:
>>
>>                                 QUERY PLAN
>>   ----------------------------------------------------------------------
>>    Index Scan using idx on t  (actual rows=9048576.00 loops=1)
>>      Index Cond: ((a >= 16150) AND (a <= 4540437))
>>      Index Searches: 1
>>      Prefetch Distance: 2.032
>>      Prefetch Count: 868165
>>      Prefetch Stalls: 2140228
>>      Prefetch Skips: 6039906
>>      Prefetch Resets: 0
>>      Stream Ungets: 0
>>      Stream Forwarded: 4
>>      Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>>      Buffers: shared hit=2577599 read=455610
>>    Planning:
>>      Buffers: shared hit=78 read=26
>>    Planning Time: 2.212 ms
>>    Execution Time: 1837.615 ms
>>   (16 rows)
>>
>> That's much closer to master (and the difference could be mostly noise).
>>
>> I'm not sure what's causing this, but almost all regressions my script
>> is finding look like this - always io_method=worker, with distance close
>> to 2.0. Is this some inherent io_method=worker overhead?
> 
> I think what you might be observing might be the inherent IPC / latency
> overhead of the worker based approach. This is particularly pronounced if the
> workers are idle (and the CPU they get scheduled on is clocked down). The
> latency impact of that is small, but if you never actually get to do much
> readahead it can be visible.
> 

Yeah, that's quite possible. If I understand the mechanics of this, this
can behave in a rather unexpected way - lowering the load (i.e. issuing
fewer I/O requests) can make the workers "more idle" and therefore more
likely to get suspended ...

Is there a good way to measure if this is what's happening, and the
impact? For example, it'd be interesting to know how long it took for a
submitted process to get picked up by a worker. And % of time a worker
spent handling I/O.


regards

-- 
Tomas Vondra