Re: index prefetching - Andres Freund

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Andres Freund <[email protected]>
To: Peter Geoghegan <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Konstantin Knizhnik <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Fri, 15 Aug 2025 11:09:41 -0400
Message-ID: <6butbqln6ewi5kuxz3kfv2mwomnlgtate4mb4lpa7gb2l63j4t@stlwbi2dvvev> (raw)
In-Reply-To: <5v2wuxg65l5e3s6uf373zskcqqoukmraxiucnvgn4t7b5cmeqx@5mhqsurdj6xn>
References: <[email protected]>
	<CAH2-WzmuGzTH-62EWTgQ4F66XEBJtJk25psF4GDuAGqeC4a34g@mail.gmail.com>
	<[email protected]>
	<6wyxbnry2unm3kbcu2sabhzhs7baoedlg77xqm42chpofjq45g@igst42zpl7ok>
	<CAH2-WzntgDeopLJpyEbUh23Qr1vgoYv5jbFkYsymTScEKxBj7A@mail.gmail.com>
	<CAH2-WzkaTHg2X9R-gLRNBEoL82t2mkrQq-3f=y3GAzrj40fFZw@mail.gmail.com>
	<kvyser45imw3xmisfvpeoshisswazlzw35el3fq5zg73zblpql@f56enfj45nf7>
	<CAH2-WzkWNtCRTcUajGYrCkp9-+btteAthg21BzxbKV09AJuSrA@mail.gmail.com>
	<CAH2-WzkgkvbN_GqR+pfE7uKwhWxQ6h4jst7Rpjgrt68Vc1=FDA@mail.gmail.com>
	<5v2wuxg65l5e3s6uf373zskcqqoukmraxiucnvgn4t7b5cmeqx@5mhqsurdj6xn>

Hi,

On 2025-08-14 19:36:49 -0400, Andres Freund wrote:
> On 2025-08-14 17:55:53 -0400, Peter Geoghegan wrote:
> > On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <[email protected]> wrote:
> > > > We can optimize that by deferring the StartBufferIO() if we're encountering a
> > > > buffer that is undergoing IO, at the cost of some complexity.  I'm not sure
> > > > real-world queries will often encounter the pattern of the same block being
> > > > read in by a read stream multiple times in close proximity sufficiently often
> > > > to make that worth it.
> > >
> > > We definitely need to be prepared for duplicate prefetch requests in
> > > the context of index scans.
> >
> > Can you (or anybody else) think of a quick and dirty way of working
> > around the problem on the read stream side? I would like to prioritize
> > getting the patch into a state where its overall performance profile
> > "feels right". From there we can iterate on fixing the underlying
> > issues in more principled ways.
>
> I think I can see a way to fix the issue, below read stream. Basically,
> whenever AsyncReadBuffers() finds a buffer that has ongoing IO, instead of
> waiting, as we do today, copy the wref to the ReadBuffersOperation() and set a
> new flag indicating that we are waiting for an IO that was not started by the
> wref. Then, in WaitReadBuffers(), we wait for such foreign started IOs. That
> has to be somewhat different code from today, because we have to deal with the
> fact of the "foreign" IO potentially having failed.
>
> I'll try writing a prototype for that tomorrow. I think to actually get that
> into a committable shape we need a test harness (probably a read stream
> controlled by an SQL function that gets an array of buffers).

Attached is a prototype of this approach. It does seem to fix this issue.

New code disabled:

    #### backwards sequential table ####
    ┌──────────────────────────────────────────────────────────────────────┐
    │                              QUERY PLAN                              │
    ├──────────────────────────────────────────────────────────────────────┤
    │ Index Scan Backward using t_pk on t (actual rows=1048576.00 loops=1) │
    │   Index Cond: ((a >= 16336) AND (a <= 49103))                        │
    │   Index Searches: 1                                                  │
    │   Buffers: shared hit=10291 read=49933                               │
    │   I/O Timings: shared read=213.277                                   │
    │ Planning:                                                            │
    │   Buffers: shared hit=91 read=19                                     │
    │   I/O Timings: shared read=2.124                                     │
    │ Planning Time: 3.269 ms                                              │
    │ Execution Time: 1023.279 ms                                          │
    └──────────────────────────────────────────────────────────────────────┘
    (10 rows)


New code enabled:

    #### backwards sequential table ####
    ┌──────────────────────────────────────────────────────────────────────┐
    │                              QUERY PLAN                              │
    ├──────────────────────────────────────────────────────────────────────┤
    │ Index Scan Backward using t_pk on t (actual rows=1048576.00 loops=1) │
    │   Index Cond: ((a >= 16336) AND (a <= 49103))                        │
    │   Index Searches: 1                                                  │
    │   Buffers: shared hit=10291 read=49933                               │
    │   I/O Timings: shared read=217.225                                   │
    │ Planning:                                                            │
    │   Buffers: shared hit=91 read=19                                     │
    │   I/O Timings: shared read=2.009                                     │
    │ Planning Time: 2.685 ms                                              │
    │ Execution Time: 602.987 ms                                           │
    └──────────────────────────────────────────────────────────────────────┘
    (10 rows)


With the change enabled, the sequential query is faster than the random query:

    #### backwards random table ####
    ┌────────────────────────────────────────────────────────────────────────────────────────────┐
    │                                         QUERY PLAN                                         │
    ├────────────────────────────────────────────────────────────────────────────────────────────┤
    │ Index Scan Backward using t_randomized_pk on t_randomized (actual rows=1048576.00 loops=1) │
    │   Index Cond: ((a >= 16336) AND (a <= 49103))                                              │
    │   Index Searches: 1                                                                        │
    │   Buffers: shared hit=6085 read=77813                                                      │
    │   I/O Timings: shared read=347.285                                                         │
    │ Planning:                                                                                  │
    │   Buffers: shared hit=127 read=5                                                           │
    │   I/O Timings: shared read=1.001                                                           │
    │ Planning Time: 1.751 ms                                                                    │
    │ Execution Time: 820.544 ms                                                                 │
    └────────────────────────────────────────────────────────────────────────────────────────────┘
    (10 rows)



Greetings,

Andres Freund

view thread (348+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: index prefetching
  In-Reply-To: <6butbqln6ewi5kuxz3kfv2mwomnlgtate4mb4lpa7gb2l63j4t@stlwbi2dvvev>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox