public inbox for [email protected]  
help / color / mirror / Atom feed
From: Tomas Vondra <[email protected]>
To: Andres Freund <[email protected]>
Cc: Peter Geoghegan <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Konstantin Knizhnik <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Tue, 26 Aug 2025 17:06:11 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <y32ow7vfdrrcchxrwgfr5tbwg3ncm4f3rlp65svcbqorbm3tmn@cwvipbplnqyt>
References: <[email protected]>
	<6wyxbnry2unm3kbcu2sabhzhs7baoedlg77xqm42chpofjq45g@igst42zpl7ok>
	<CAH2-WzntgDeopLJpyEbUh23Qr1vgoYv5jbFkYsymTScEKxBj7A@mail.gmail.com>
	<CAH2-WzkaTHg2X9R-gLRNBEoL82t2mkrQq-3f=y3GAzrj40fFZw@mail.gmail.com>
	<kvyser45imw3xmisfvpeoshisswazlzw35el3fq5zg73zblpql@f56enfj45nf7>
	<CAH2-WzkWNtCRTcUajGYrCkp9-+btteAthg21BzxbKV09AJuSrA@mail.gmail.com>
	<CAH2-WzkgkvbN_GqR+pfE7uKwhWxQ6h4jst7Rpjgrt68Vc1=FDA@mail.gmail.com>
	<5v2wuxg65l5e3s6uf373zskcqqoukmraxiucnvgn4t7b5cmeqx@5mhqsurdj6xn>
	<6butbqln6ewi5kuxz3kfv2mwomnlgtate4mb4lpa7gb2l63j4t@stlwbi2dvvev>
	<[email protected]>
	<y32ow7vfdrrcchxrwgfr5tbwg3ncm4f3rlp65svcbqorbm3tmn@cwvipbplnqyt>



On 8/26/25 01:48, Andres Freund wrote:
> Hi,
> 
> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote:
>> Thanks. Based on the testing so far, the patch seems to be a substantial
>> improvement. What's needed to make this prototype committable?
> 
> Mainly some testing infrastructure that can trigger this kind of stream. The
> logic is too finnicky for me to commit it without that.
> 

So, what would that look like? The "naive" approach to testing is to
simply generate a table/index, producing the right sequence of blocks.
That shouldn't be too hard, it'd be enough to have an index that

- has ~2-3 rows per value, on different heap pages
- the values "overlap", e.g. like this (value,page)

   (A,1), (A,2), (A,3), (B,2), (B,3), (B,4), ...

Another approach would be to test this at C level, sidestepping the
query execution entirely. We'd have a "stream generator" that just
generates a sequence of blocks of our own choosing (could be hard-coded,
some pattern, read from a file ...), and feed it into a read stream.

But how would we measure success for these tests? I don't think we want
to look at query duration, that's very volatile.

> 
>> I assume this is PG19+ improvement, right? It probably affects PG18 too,
>> but it's harder to hit / the impact is not as bad as on PG19.
> 
> Yea. It does apply to 18 too, but I can't come up with realistic scenarios
> where it's a real issue. I can repro a slowdown when using many parallel
> seqscans with debug_io_direct=data - but that's even slower in 17...
> 

Makes sense.

> 
>> On a related note, my test that generates random datasets / queries, and
>> compares index prefetching with different io_method values found a
>> pretty massive difference between worker and io_uring. I wonder if this
>> might be some issue in io_method=worker.
> 
>> while with index prefetching (with the aio prototype patch), it looks
>> like this:
>>
>>                                 QUERY PLAN
>>   ----------------------------------------------------------------------
>>    Index Scan using idx on t (actual rows=9048576.00 loops=1)
>>      Index Cond: ((a >= 16150) AND (a <= 4540437))
>>      Index Searches: 1
>>      Prefetch Distance: 2.032
>>      Prefetch Count: 868165
>>      Prefetch Stalls: 2140228
>>      Prefetch Skips: 6039906
>>      Prefetch Resets: 0
>>      Stream Ungets: 0
>>      Stream Forwarded: 4
>>      Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>>      Buffers: shared hit=2577599 read=455610
>>    Planning:
>>      Buffers: shared hit=78 read=26 dirtied=1
>>    Planning Time: 1.032 ms
>>    Execution Time: 3150.578 ms
>>   (16 rows)
>>
>> So it's about 2x slower. The prefetch distance collapses, because
>> there's a lot of cache hits (about 50% of requests seem to be hits of
>> already visited blocks). I think that's a problem with how we adjust the
>> distance, but I'll post about that separately.
>>
>> Let's try to simply set io_method=io_uring:
>>
>>                                 QUERY PLAN
>>   ----------------------------------------------------------------------
>>    Index Scan using idx on t  (actual rows=9048576.00 loops=1)
>>      Index Cond: ((a >= 16150) AND (a <= 4540437))
>>      Index Searches: 1
>>      Prefetch Distance: 2.032
>>      Prefetch Count: 868165
>>      Prefetch Stalls: 2140228
>>      Prefetch Skips: 6039906
>>      Prefetch Resets: 0
>>      Stream Ungets: 0
>>      Stream Forwarded: 4
>>      Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>>      Buffers: shared hit=2577599 read=455610
>>    Planning:
>>      Buffers: shared hit=78 read=26
>>    Planning Time: 2.212 ms
>>    Execution Time: 1837.615 ms
>>   (16 rows)
>>
>> That's much closer to master (and the difference could be mostly noise).
>>
>> I'm not sure what's causing this, but almost all regressions my script
>> is finding look like this - always io_method=worker, with distance close
>> to 2.0. Is this some inherent io_method=worker overhead?
> 
> I think what you might be observing might be the inherent IPC / latency
> overhead of the worker based approach. This is particularly pronounced if the
> workers are idle (and the CPU they get scheduled on is clocked down). The
> latency impact of that is small, but if you never actually get to do much
> readahead it can be visible.
> 

Yeah, that's quite possible. If I understand the mechanics of this, this
can behave in a rather unexpected way - lowering the load (i.e. issuing
fewer I/O requests) can make the workers "more idle" and therefore more
likely to get suspended ...

Is there a good way to measure if this is what's happening, and the
impact? For example, it'd be interesting to know how long it took for a
submitted process to get picked up by a worker. And % of time a worker
spent handling I/O.


regards

-- 
Tomas Vondra






view thread (348+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: index prefetching
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox