Re: index prefetching - Tomas Vondra

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Tomas Vondra <[email protected]>
To: Peter Geoghegan <[email protected]>
To: Andres Freund <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Konstantin Knizhnik <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Wed, 13 Aug 2025 23:19:29 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAH2-Wz=L7h-koDKa3_NEg39Faw7MrOkSVOsodvQ4toSQahvWjQ@mail.gmail.com>
References: <[email protected]>
	<CAH2-WzmdiO4fHA1O06SYUjgMQZG7haysY7Tu5DS5z-CHsv5MLQ@mail.gmail.com>
	<[email protected]>
	<CAH2-Wz=Y-PsC6_tZOPhHWvPx0geGnrh9VKjUZ-168ezUM_XM2Q@mail.gmail.com>
	<CA+hUKGKMaZLmNQHaa_DZMw9MJJKGegjrqnTY3KOZB-_nvFa3wQ@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAH2-Wzko86NwiENCJGtakJ=fOhWpr-Yz-F+1oxgv2Ku1mvXwvA@mail.gmail.com>
	<[email protected]>
	<c7a77pcyc5eynme376wvyojryijtlieyxsu3bvxp4eiy6au6uf@caniulyi4jr5>
	<CAH2-Wz=L7h-koDKa3_NEg39Faw7MrOkSVOsodvQ4toSQahvWjQ@mail.gmail.com>



On 8/13/25 18:01, Peter Geoghegan wrote:
> On Wed, Aug 13, 2025 at 11:28 AM Andres Freund <[email protected]> wrote:
>>> With "sync" I always get this (after a restart):
>>>
>>>    Buffers: shared hit=7435 read=52801
>>>
>>> while with "worker" I get this:
>>>
>>>    Buffers: shared hit=4879 read=52801
>>>    Buffers: shared hit=5151 read=52801
>>>    Buffers: shared hit=4978 read=52801
>>>
>>> So not only it changes run to tun, it also does not add up to 60236.
>>
>> This is reproducible on master? If so, how?
> 
> AFAIK it is *not* reproducible on master.
> 
>> If this happens only with the prefetching patch applied, is is possible that
>> what happens here is that we occasionally re-request buffers that already in
>> the process of being read in? That would only happen with a read stream and
>> io_method != sync (since with sync we won't read ahead). If we have to start
>> reading in a buffer that's already undergoing IO we wait for the IO to
>> complete and count that access as a hit:
> 
> This theory seems quite plausible to me. Though it is a bit surprising
> that I see incorrect buffer hit counts on the "good" forwards scan
> case, rather than on the "bad" backwards scan case.
> 
> Here's what I mean by things being broken on the read stream side (at
> least with certain backwards scan cases):
> 
> When I add instrumentation to the read stream side, by adding elog
> debug calls that show the blocknum seen by read_stream_get_block, I
> see out-of-order and repeated blocknums with the "bad" backwards scan
> case ("SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a
> desc"):
> 
> ...
> NOTICE:  index_scan_stream_read_next: index 1163 TID (25052,21)
> WARNING:  prior lastBlock is 25053 for batchno 2856, new one: 25052
> WARNING:  blocknum: 25052, 0x55614810efb0
> WARNING:  blocknum: 25052, 0x55614810efb0
> NOTICE:  index_scan_stream_read_next: index 1161 TID (25053,3)
> WARNING:  prior lastBlock is 25052 for batchno 2856, new one: 25053
> WARNING:  blocknum: 25053, 0x55614810efb0
> NOTICE:  index_scan_stream_read_next: index 1160 TID (25052,19)
> WARNING:  prior lastBlock is 25053 for batchno 2856, new one: 25052
> WARNING:  blocknum: 25052, 0x55614810efb0
> WARNING:  blocknum: 25052, 0x55614810efb0
> NOTICE:  index_scan_stream_read_next: index 1141 TID (25051,21)
> WARNING:  prior lastBlock is 25052 for batchno 2856, new one: 25051
> WARNING:  blocknum: 25051, 0x55614810efb0
> ...
> 
> Notice that we see the same blocknum twice in close succession. Also
> notice that we're passed 25052 and then subsequently passed 25053,
> only to be passed 25053 once more.
> 

I did investigate this, and I don't think there's anything broken in
read_stream. It happens because ReadStream has a concept of "ungetting"
a block, which can happen after hitting some I/O limits.

In that case we "remember" the last block (in read_stream_look_ahead
calls read_stream_unget_block), and we return it again. It may seem as
if read_stream_get_block() produced the same block twice, but it's
really just the block from the last round.

All duplicates produced by read_stream_look_ahead were caused by this. I
suspected it's a bug in lastBlock optimization, but that's not the case,
it happens entirely within read_stream. And it's expected.

It's also not very surprising this happens with backwards scans more.
The I/O is apparently much slower (due to missing OS prefetch), so we're
much more likely to hit the I/O limits (max_ios and various other limits
in read_stream_start_pending_read).


regards

-- 
Tomas Vondra

view thread (348+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: index prefetching
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox