Re: index prefetching - Andres Freund

From: Andres Freund <[email protected]>
To: Peter Geoghegan <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Konstantin Knizhnik <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Thu, 14 Aug 2025 14:53:42 -0400
Message-ID: <6wyxbnry2unm3kbcu2sabhzhs7baoedlg77xqm42chpofjq45g@igst42zpl7ok> (raw)
In-Reply-To: <[email protected]>
References: <CAH2-Wzko86NwiENCJGtakJ=fOhWpr-Yz-F+1oxgv2Ku1mvXwvA@mail.gmail.com>
	<[email protected]>
	<c7a77pcyc5eynme376wvyojryijtlieyxsu3bvxp4eiy6au6uf@caniulyi4jr5>
	<CAH2-Wz=L7h-koDKa3_NEg39Faw7MrOkSVOsodvQ4toSQahvWjQ@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAH2-WzmuGzTH-62EWTgQ4F66XEBJtJk25psF4GDuAGqeC4a34g@mail.gmail.com>
	<[email protected]>

Hi,

On 2025-08-14 14:44:44 -0400, Peter Geoghegan wrote:
> On Thu Aug 14, 2025 at 1:57 PM EDT, Peter Geoghegan wrote:
> > The only interesting thing about the flame graph is just how little
> > difference there seems to be (at least for this particular perf event
> > type).
>
> I captured method_io_uring.c DEBUG output from running each query in the
> server log, in the hope that it would shed some light on what's really going
> on here.  I think that it just might.
>
> I count a total of 12,401 distinct sleeps for the sequential/slow backwards
> scan test case:
>
> $ grep -E "wait_one with [1-9][0-9]* sleeps" sequential.txt | head
>  2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG:  00000: wait_one with 1 sleeps
>  2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG:  00000: wait_one with 1 sleeps
>  2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG:  00000: wait_one with 1 sleeps
>  2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG:  00000: wait_one with 1 sleeps
>  2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG:  00000: wait_one with 1 sleeps
>  2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG:  00000: wait_one with 1 sleeps
>  2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG:  00000: wait_one with 1 sleeps
>  2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG:  00000: wait_one with 1 sleeps
>  2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG:  00000: wait_one with 1 sleeps
>  2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG:  00000: wait_one with 1 sleeps
> $ grep -E "wait_one with [1-9][0-9]* sleeps" sequential.txt | awk '{ total += $11 } END { print total }'
> 12401
>
> But there are only 3 such sleeps seen when the random backwards scan query is
> run -- which might begin to explain the mystery of why it runs so much faster:
>
> $ grep -E "wait_one with [1-9][0-9]* sleeps" random.txt | awk '{ total += $11 } END { print total }'
> 104

I think this is just an indicator of being IO bound. That message is output
whenever we have to wait for IO to finish. So if one workload prints it 12k
times and another 104 times, that's because the latter mostly didn't have to
wait for IO to complete: the IO had already completed by the time we needed it
to have finished in order to continue.
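
To compare the two logs on both axes (how often we waited, and how many sleeps
those waits added up to), something like the following could be used. It's
only a sketch: summarize_waits is a made-up helper name, and it assumes the
DEBUG lines look exactly like the ones quoted above. It extracts the message
itself with grep -o, so it doesn't depend on how many fields the
log_line_prefix produces.

```shell
# Hypothetical helper: count "wait_one" messages with at least one sleep
# and sum their sleep counts for a single server log file.
summarize_waits() {
    # grep -o prints only the matched text, e.g. "wait_one with 1 sleeps",
    # so the sleep count is always the third field for awk.
    grep -oE 'wait_one with [1-9][0-9]* sleeps' "$1" |
        awk '{ waits++; total += $3 }
             END { printf "waits=%d sleeps=%d\n", waits, total }'
}
```

Running it as "summarize_waits sequential.txt" and "summarize_waits random.txt"
would show whether the two workloads differ in the number of waits, in the
sleeps per wait, or both.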


Factors potentially leading to slower IO:

- sometimes random IO *can* be faster for SSDs, because it allows different
  flash chips to work concurrently, rather than being bound by the speed of
  one flash chip

- it's possible that with your SSD the sequential IO leads to more IO
  combining. Larger IOs always have a higher latency than smaller IOs - but
  obviously fewer IOs are needed. The increased latency may be bad enough for
  your access pattern to trigger more waits.

  It's *not* necessarily enough to just lower io_combine_limit, as the OS can
  also do combining.

  I'd see what changes if you temporarily reduce
  /sys/block/nvme6n1/queue/max_sectors_kb to a smaller size.
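
A rough sketch of how the max_sectors_kb experiment could be scripted so that
the old limit is always restored afterwards. with_max_sectors_kb is a made-up
helper, the value 128 below is an arbitrary example rather than a
recommendation, and writing to the sysfs file needs root.

```shell
# Hypothetical helper: run a command with a temporarily changed
# max_sectors_kb, then put the previous value back.
# Usage: with_max_sectors_kb QUEUE_DIR SIZE_KB COMMAND [ARGS...]
with_max_sectors_kb() {
    queue_dir=$1
    size_kb=$2
    shift 2

    knob=$queue_dir/max_sectors_kb
    old=$(cat "$knob") || return 1      # remember the current limit

    echo "$size_kb" > "$knob" || return 1
    "$@"                                # run the workload being measured
    status=$?

    echo "$old" > "$knob"               # restore the previous limit
    return "$status"
}
```

For example, as root (query.sql is a placeholder for the test query):

    with_max_sectors_kb /sys/block/nvme6n1/queue 128 psql -f query.sql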


Could you show iostat for both cases?

Greetings,

Andres Freund
