MIME-Version: 1.0
References: <ayjpwpm5cn6ng2bgedhz3ckbjrxocbsbywhlghwxxz2p6a5tgr@jubomhsjkvcl>
 <CAH2-Wznxu+AFz-EBOG-XiRA_R3nXLp45NEiGSD3ebx3h=OKPAw@mail.gmail.com>
 <vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx@mof5q6li3xzv>
 <CAH2-Wzn1j2a0p3OqmqrV6zADtWA_QpG82U6F9yCYG1Uschm_fA@mail.gmail.com>
 <CAH2-WzmCH+N2-H2oGSQcbn2fArbk7GXyD6rQN6kn5P=FX9R-_g@mail.gmail.com>
 <CAH2-WzkyG01682zwqyUTwV=Zq+M_qGgi1NbXwp1H-piRSfJsgQ@mail.gmail.com>
 <CAH2-Wz=HJc+QV2AZ9mUY43aKL+n+a1JQ-7OGE=MOkqSAtoKJug@mail.gmail.com>
 <t6mtqbv2mbfhjni4bvwdgoecppjmxvbyfwl6utovzv76xc2672@k3o5ryevaeqv>
 <bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh>
 <CAH2-Wz=9Wc7T6xMxynHNu6majG_Q+=1v_OXCyYC-PbvagsaTrQ@mail.gmail.com>
 <tshb2mmworpxvtmhehr45mhdvvp7wsceqbf5iycv7tisn73nq4@2ewrw6equ3cj>
In-Reply-To: <tshb2mmworpxvtmhehr45mhdvvp7wsceqbf5iycv7tisn73nq4@2ewrw6equ3cj>
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 24 Mar 2026 21:34:51 -0400
Message-ID: 
 <CAH2-Wz=r13YutKUoz+dt8hJQGQzrqO2U6A3XY=zScDd=qP+odA@mail.gmail.com>
Subject: Re: index prefetching
To: Andres Freund <andres@anarazel.de>
Cc: Tomas Vondra <tomas@vondra.me>,
 Alexandre Felipe <o.alexandre.felipe@gmail.com>,
	Thomas Munro <thomas.munro@gmail.com>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>,
	Robert Haas <robertmhaas@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>,
	Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAH2-Wz%3Dr13YutKUoz%2Bdt8hJQGQzrqO2U6A3XY%3DzScDd%3DqP%2BodA%40mail.gmail.com>
Precedence: bulk

On Tue, Mar 24, 2026 at 1:27 PM Andres Freund <andres@anarazel.de> wrote:
> > But that means that it won't be triggered when we don't enter the "if
> > (hscan->xs_blk != ItemPointerGetBlockNumber(tid))" block that contains
> > all this code. Besides, it just doesn't seem possible that
> > heap_page_prune_opt would release its caller's pin.
>
> I was more concerned about read_stream_next_buffer() returning the wrong
> block, due to prefetching somehow "desynchronizing" with the scan position and
> catching that when it's clear that we just read a new block, rather than in a
> place where it could be either the continuation of a scan on the same page or
> a new page.

Then I don't follow. The existing assertions will catch that (I should
know, they've failed enough times during development).

Basically, I don't get the concern about heap_page_prune_opt releasing
its caller's pin. Even if that happened, the existing assertions would
still catch it.

> I think I had largely missed the "danger" of index only scans here. I think
> it'd be good to call that out more explicitly in these comments.

Will do.

> > > Does this only happen when paused?
> >
> > This "prefetchPos->valid = false" stuff is approximately the opposite
> > of pausing. Pausing resolves the problem of prefetchPos getting so far
> > ahead of scanPos that the batch ring buffer runs out of slots. Whereas
> > this prefetchPos invalidation code helps the read stream deal with
> > prefetchPos falling behind scanPos.
>
> Because I had somewhat missed the real cause of the problem - not calling the
> read stream code due to index only scans - I thought that somehow we could end
> up in this state due to not resuming prefetching before the scan position
> overtakes the prefetch position. But I don't think that can actually happen.

Right, it can't happen. In any case the assertions we have are quite
effective at catching problems like that. For example, if we don't
resume prefetching and consume another batch, there's an assertion for
that. Actually, there's more than one. There's a direct assertion, on
the scan side. And the read stream callback itself has a precondition
assertion that the read stream is not paused.
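
To make those two checks concrete, here is a toy model of them. It is
only a sketch under loud assumptions: ToyScan, TOY_MAX_BATCHES and all
of the toy_* functions are made-up names, the ring buffer is reduced to
a counter, and none of this is the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_BATCHES 4		/* hypothetical ring buffer size */

/* Hypothetical scan state; the real structs look nothing like this */
typedef struct ToyScan
{
	int			batchesInUse;	/* ring buffer slots currently occupied */
	bool		paused;			/* prefetching paused: no free slot */
	int			batchesConsumed;	/* batches the scan side has consumed */
} ToyScan;

/*
 * Toy read stream callback: loads one more batch for prefetching.
 * Returns false (and pauses) when the ring buffer has no free slot.
 */
bool
toy_stream_callback(ToyScan *scan)
{
	/* Precondition assertion: never called while paused */
	assert(!scan->paused);

	if (scan->batchesInUse == TOY_MAX_BATCHES)
	{
		scan->paused = true;
		return false;
	}
	scan->batchesInUse++;
	return true;
}

/* Scan side: done with the oldest batch, which frees its slot */
void
toy_scan_finish_batch(ToyScan *scan)
{
	assert(scan->batchesInUse > 0);
	scan->batchesInUse--;
}

/* Scan side: resume prefetching once a slot is free again */
void
toy_resume_prefetch(ToyScan *scan)
{
	assert(scan->batchesInUse < TOY_MAX_BATCHES);
	scan->paused = false;
}

/* Scan side: start consuming another batch */
void
toy_scan_consume_next_batch(ToyScan *scan)
{
	/* Direct assertion: we must have resumed prefetching by now */
	assert(!scan->paused);
	scan->batchesConsumed++;
}
```

In this model, leaving out the toy_resume_prefetch() call after a pause
trips the scan-side assertion on the next
toy_scan_consume_next_batch(), and any later callback invocation trips
the precondition assertion as well.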

> > > Wonder if it's worth somehow asserting that after this the page is actually
> > > unguarded after the call.
> >
> > We used to, but the new layering forced me to remove it. Any ideas
> > about how to add it back?
>
> Adding an "isGuarded" field to IndexScanBatchData would be the easiest
> way. That way we can make assertions about the state without knowing anything
> about the internal mechanism of how guarding is implemented.
>
> I doubt setting/clearing that field even when assertions are disabled will be
> measurable, as long as you place it alongside the other booleans where there's
> padding space available.

I've prototyped that, and it works well. It'll be in v18.
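
Roughly, the idea has the shape of the standalone sketch below. This
is not the prototype itself: ToyBatch stands in for IndexScanBatchData,
toy_guard_batch/toy_unguard_batch are made-up names for whatever does
the guarding, and the "guard" here is just a counter. The only part
taken from the discussion is the isGuarded flag, kept next to the
other booleans and maintained even when assertions are disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for IndexScanBatchData; not the real struct */
typedef struct ToyBatch
{
	int			firstItem;
	int			lastItem;
	bool		finished;		/* some pre-existing boolean (illustrative) */
	bool		isGuarded;		/* new flag, sharing the booleans' padding */
} ToyBatch;

/* Hypothetical guard mechanism: just counts outstanding guards */
int			toy_guards_held = 0;

void
toy_guard_batch(ToyBatch *batch)
{
	assert(!batch->isGuarded);
	toy_guards_held++;
	/* Set unconditionally, not only in assert-enabled builds */
	batch->isGuarded = true;
}

void
toy_unguard_batch(ToyBatch *batch)
{
	assert(batch->isGuarded);
	toy_guards_held--;
	batch->isGuarded = false;
}

/*
 * What a caller can check after the call, without knowing anything
 * about how guarding is implemented.
 */
bool
toy_batch_is_unguarded(const ToyBatch *batch)
{
	return !batch->isGuarded;
}
```

A caller would then follow the unguard call with something like
assert(toy_batch_is_unguarded(batch)).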

> After replacing the pause with an error I found that it's surprisingly easy to
> hit on slow storage (or on fast storage if you set needed_wait=true in
> read_stream_next_buffer()).  I've not done any performance validation on
> whether that means the limit is too low.

It's been a while since I last validated performance to justify the
current maximum number of batches. I used buffered I/O for that. I'm
sure that a higher maximum with very slow storage and a very high
effective_io_concurrency will provide some benefit. But perfectly
handling that isn't essential for the first committed version of index
prefetching.

I must admit I'm unsure how to evaluate the maximum number of batches.
It can make sense to pursue diminishing returns. But up to what point,
and according to what principle?

-- 
Peter Geoghegan