Feedback-ID: id4a34324:Fastmail
Date: Tue, 24 Mar 2026 12:26:53 -0400
From: Andres Freund <andres@anarazel.de>
To: Peter Geoghegan <pg@bowt.ie>
Cc: Tomas Vondra <tomas@vondra.me>,
	Alexandre Felipe <o.alexandre.felipe@gmail.com>,
 Thomas Munro <thomas.munro@gmail.com>,
	Nazir Bilal Yavuz <byavuz81@gmail.com>, Robert Haas <robertmhaas@gmail.com>,
	Melanie Plageman <melanieplageman@gmail.com>,
 PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
	Georgios <gkokolatos@protonmail.com>,
 Konstantin Knizhnik <knizhnik@garret.ru>,
	Dilip Kumar <dilipbalaut@gmail.com>
Subject: Re: index prefetching
Message-ID: <jx7xsohhk3utl2tdvme4knj4ar6u5ujcgzrermfpqx3aahb2wr@hex6tuakcwyl>
References: 
 <CAH2-WznoD7vhjZNDj-5OrLp+1fjvW6ypEUwZ1=ieadefgWaTDQ@mail.gmail.com>
 <ayjpwpm5cn6ng2bgedhz3ckbjrxocbsbywhlghwxxz2p6a5tgr@jubomhsjkvcl>
 <CAH2-Wznxu+AFz-EBOG-XiRA_R3nXLp45NEiGSD3ebx3h=OKPAw@mail.gmail.com>
 <vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx@mof5q6li3xzv>
 <CAH2-Wzn1j2a0p3OqmqrV6zADtWA_QpG82U6F9yCYG1Uschm_fA@mail.gmail.com>
 <CAH2-WzmCH+N2-H2oGSQcbn2fArbk7GXyD6rQN6kn5P=FX9R-_g@mail.gmail.com>
 <CAH2-WzkyG01682zwqyUTwV=Zq+M_qGgi1NbXwp1H-piRSfJsgQ@mail.gmail.com>
 <CAH2-Wz=HJc+QV2AZ9mUY43aKL+n+a1JQ-7OGE=MOkqSAtoKJug@mail.gmail.com>
 <t6mtqbv2mbfhjni4bvwdgoecppjmxvbyfwl6utovzv76xc2672@k3o5ryevaeqv>
 <CAH2-Wz=D4Lru9BkvqaRnFRPDaZbfTOdWcxw13zyG6GVFTtz_vw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
 <CAH2-Wz=D4Lru9BkvqaRnFRPDaZbfTOdWcxw13zyG6GVFTtz_vw@mail.gmail.com>
Archived-At: 
 <https://www.postgresql.org/message-id/jx7xsohhk3utl2tdvme4knj4ar6u5ujcgzrermfpqx3aahb2wr%40hex6tuakcwyl>
Precedence: bulk

Hi,

On 2026-03-21 19:01:44 -0400, Peter Geoghegan wrote:
> I can immediately act on most of what you've said here. I'm planning
> to commit the first patch (and maybe the hash index fake LSN patch) in
> the next couple of days, ahead of posting a new v17 of the patch set.
> You can expect any item I reply to with "fixed" or similar to be in
> that version. Other items might not be addressed in v17 -- generally
> because I require more context or feedback to act.

Cool.


> > > This is preparatory work for an upcoming commit that will need xs_blk
> > > to manage buffer pin transfers between the scan and the executor slot.
> >
> > A subsequent commit adds an earlier ExecClearTuple(slot) to make the buffer
> > refcount decrement cheaper (due to hitting the one-element cache in
> > bufmgr.c). I wonder if it's worth pulling that into this commit? Mostly to
> > make that larger commit smaller.
> 
> Is it really worth doing that without also doing the
> xs_lastinblock/ExecStorePinnedBufferHeapTuple stuff? We need batches
> to do the latter.

I see perf benefits from it alone, yes.


> v17 will do it the other way around: it'll delay adding the
> ExecClearTuple(slot)  as well as the
> xs_lastinblock/ExecStorePinnedBufferHeapTuple stuff until a *later*
> commit. That seems more logical to me.

Shrugh, also works.


> > The commit message doesn't mention that this affects ammarkpos/amrestrpos.
> 
> It does not. But FWIW there's a prominent "Note" about it in the SGML docs.

Approximately nobody looking at the commit to see what they need to change
will see that...


> > > + #define HEAP_BATCH_VIS_CHECKED              0x01    /* checked item in VM? */
> > > + #define HEAP_BATCH_VIS_ALL_VISIBLE  0x02    /* block is known all-visible? */
> >
> > So we only 2 out of 8 bits. I guess it's probably not worth doing bit shift
> > stuff. But still curious if you tried?
> 
> No, I didn't try. I doubt saving space is worthwhile, since we'll need
> a relatively large allocation for batches used during index-only scans
> regardless.

Seems fine to not care for now. But, FWIW, the motivating reason wouldn't be
to to really save memory, it'd be to make it more likely the data fits into a
higher level of the cache.


> > > +typedef struct IndexScanBatchData
> > > +{
> > > +     XLogRecPtr      lsn;                    /* index page's LSN */
> >
> > Still doubtful this should be in generic infra (together with
> > indexam_util_batch_unlock()).
> 
> We're now advertising that indexam_util_batch_unlock is optional, and
> that index AMs can go there own way when needed.

Fair enough.


> > > +
> > > +     /*
> > > +      * heap blocks fetched counts (incremented by index_getnext_slot calls
> > > +      * within table AMs, though only during index-only scans)
> > > +      */
> > > +     uint64          nheapfetches;
> > >  } IndexScanInstrumentation;
> >
> > s/heap/table/ for anything new imo.
> 
> Usually I'd agree, but here we're using the same name as the one
> presented in EXPLAIN ANALYZE. IMV this should match that. (Maybe the
> EXPLAIN ANALYZE output should change, and this field name along with
> it, but that's another discussion entirely.)

I think if we add it into more and more places it'll get harder and harder to
eventually fix...


> > > + * Note on Memory Ordering Effects
> > > + * -------------------------------
> 
> > ISTM that moving this comment here is hiding it too much, I'm pretty sure
> > there are other AMs using visibilitymap out there.  Perhaps we should just
> > move it to the top of visibilitymap.c?
> 
> I think that the most natural place for it in visibilitymap.c is just
> above visibilitymap_get_status. They'll be moved over there in the
> next revision.

Agreed.


> > > +static pg_attribute_hot IndexScanBatch
> > > +heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction,
> > > +                                      IndexScanBatch priorBatch, BatchRingItemPos *pos)
> > > +{
> 
> > > +static pg_attribute_hot pg_attribute_always_inline ItemPointer
> > > +heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
> > > +                                              ScanDirection direction, bool *all_visible)
> > > +{
> 
> > A lot of this also seems like it should be generic code.  Where it actually
> > starts to be heapam specific is in heapam_batch_return_tid() - and there's two
> > calls to that.  So maybe there should be generic helper for the bulk of
> > heapam_batch_getnext_tid() that's called by the heapam version, which either
> > returns a NULL upwards or does a single heapam_batch_return_tid().
> 
> I agree that it might very well make sense to break down
> heapam_batch_getnext and heapam_batch_getnext_tid into inline
> functions that allow for some amount of code reuse.

Code reuse and making it easier for other AMs to adapt this...


> But I'm not going to have time to fix that before the next revision (too
> much performance validation is required).

Fair.


> > I also suspect it'd be worth creating a new heapam.c file for this new
> > code. heapam_index.c or such.
> 
> I had thought about that myself. How would that be structured, in
> terms of the commits?

I'm imagining something like heapam_iscan.c or heapam_indexfetch.c or such.

I'd introduce it by adding it in a commit that moves heap_hot_search_buffer(),
and heapam_index_fetch_{begin,reset,end}() into it.

Moving heap_hot_search_buffer() into the same file will be nice because it'll
allow partial inlining of it into some really performance sensitive functions.


> It's pretty clear that we'd have to move some existing heapam_handler.c code
> as part of this whole process (e.g., heapam_index_fetch_tuple).

> I think that it would likely be easiest if we added an extra commit,
> right after "heapam: Track heap block in IndexFetchHeapData using
> xs_blk" and right before "Add interfaces that enable index
> prefetching". This would be a strictly mechanical commit that moved
> the code to the new file without adding anything else.


Obviously agreed.


I think we should move the seq/tid scan stuff into its own file too, but
that's obviously a separate thread.


> > > @@ -281,7 +280,23 @@ index_beginscan(Relation heapRelation,
> > >        */
> > >       scan->heapRelation = heapRelation;
> > >       scan->xs_snapshot = snapshot;
> > > +     scan->MVCCScan = IsMVCCLikeSnapshot(snapshot);
> >
> > Elsewhere you have an xxx comment about making sure the snapshot is
> > pushed/registered - seems it should be here, not there...
> 
> That comment was copied from index_getnext_tid (it's still there in
> the patch/hasn't been moved). This was based on the theory that it
> made just as much sense in the new code paths, which generally won't
> call index_getnext_tid. And, there are numerous other identical
> comments, including in procarray.c.

Ah, right.


> I'm fine with consolidating all this, but I'd prefer it if you gave me
> guidance on what to do with all of them.

I guess we just need to fix this crap one of these days :(

So just ignore what I said on this, I don't see a great solution immediately.


> > Any reason this isn't in index_beginscan_internal(), given both
> > index_beginscan() and index_beginscan_parallel() need it?  I realize you'd
> > need to add arguments to index_beginscan_internal(), but I don't see a problem
> > with that.  Alternatively a helper for this seems like a possibility too.
> 
> Fixed, by moving much more of the initialization done by each variant
> (index_beginscan, index_beginscan_bitmap, index_beginscan_parallel)
> into index_beginscan_internal itself.

Nice.  Haven't checked out your new version yet - are you doing that as a
separate commit?


> > Hm. Would it be worth using a much wider ring position to avoid this kind of
> > danger?
> 
> I don't think so. It worked that way a few months ago, but I felt
> using a ring buffer with frequently overflowing offsets would be
> easier to test. Note that we reset all offsets to 0 to handle a change
> in scan direction, so the offsets aren't entirely immutable.

Well, a 64bit one would never overflow :)

But, jokes aside, fair enough, either there should never be overflows or they
should be frequent.


> > Couldn't it be that markBatch is one behind scanBatch, in which case wouldn't
> > need to throw out all the prefetched batches (in the future commits that do
> > prefetching)?
> 
> We can very often use the "scanBatch == markBatch" happy path, since
> in practice merge joins rarely need to restore a markPos that's very
> far behind scanPos. This is made more likely by some of the nbtree
> work from the past 8 years, such as suffix truncation and
> deduplication. Equal index tuples naturally avoid spanning multiple
> index leaf pages unless it's absolutely unavoidable because there are
> too many to fit on one page.


> > My understanding is that it'd be rather likely that, if the
> > markPos is not the current batch, it'd be in the last batch.
> > I agree: cases that cannot use the "scanBatch == markBatch" happy path
> are very likely to restore a markBatch for the batch that was most
> recently removed from the head of the ring buffer. However, I'd much
> rather avoid a special case where we don't remove markBatch from the
> batch ring buffer, just in case we restore the mark (which probably
> won't happen, since most marks taken are never restored). I get why
> you'd ask about this (naturally I thought of it myself), but I just
> can't see it paying for itself.
> 
> Consider the costs. The complexity stems from it breaking the
> "scanBatch == headBatch" invariant. That would affect pausing. When
> the read stream callback pauses, it expects the scan code to consume
> all remaining items from scanBatch and then remove scanBatch from the
> ring buffer, allowing the read stream to be resumed in passing. Like a
> cross-batch restore, pausing is itself a hard-to-hit code path;
> testing the intersection of those 2 things would be very tricky.
> (There are other reasons why I want to keep this invariant, which I
> won't detail now.)
> 
> I also don't see much benefit. We're already comfortably beating the
> master branch with merge joins that heavily use mark/restore, due to
> improved buffer pin management (particularly with index-only scans).
> The need to re-read index pages in these cases is nothing new. We
> reset the read stream whenever we restore a mark (even in the happy
> path), so there's relatively little risk of the distance to get out of
> hand afterwards (the index pages are very likely to still be in
> shared_buffers each time).

Fair.  I did see some large number of index IO page accesses that look like it
could be avoided by looking back one batch, but that was an intentional
torture query that I can't see as being realistic.


> > Do we have decent test coverage of queries changing scan direction? Seems like
> > something that could quite easily have bugs.
> 
> You're right to be concerned about potential bugs here. And about the
> existing test coverage.
> 
> While the regression tests have limited coverage, my personal test
> suite covers many variations. Some of these come from my work on the
> nbtree projects included in Postgres 17 and 18. Others were derived
> from a stress-testing suite developed by Tomas that generates
> cursor-based queries that randomly scroll back and forth, verifying
> that the patch agrees with master at every step.
> 
> Improving the core regression tests in this area would be difficult. A
> test case like this must focus on one specific problematic page
> transition (or a small series of related transitions). Obviously, that
> depends on the data being laid out precisely, which depends on many
> implementation details. And we usually need quite a bit of data to
> tickle the implementation in just the right/wrong way.

I don't think we necessarily need the coverage of your full torture test suite
in core, but I feel some basic sanity tests really ought to be in the core
tests. There's very little, from what I can tell.  Even just making sure we
have coverage for a index [only] scans going forward and backward, and the
same for merge & nestloop joins would be quite a win.


> > > +void
> > > +tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch)
> > > +{
> 
> > Any reason to not use indexam_util_batch_release() to implement the release?
> > Seems nicer to not have two copies of this code.
> 
> Yes: We don't always want to use the batch cache, even when it's
> available. In particular, we don't want to do that during simple point
> queries (like those from pgbench SELECT), when batches are freed only
> as the scan ends.

> IOW, when we're called through index_batchscan_end (specifically, when
> we're called through index_batchscan_reset when it is called by
> index_batchscan_end) we don't want to use the cache. It seemed to make
> sense to implement this in a way that didn't require any special
> handling from within indexam_util_batch_release (since it's not a
> concern of index AMs).

I am not really following.  Right now there's two close copies of this code:

void
indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch)
...
	 * Try to store caller's batch in this amgetbatch scan's cache of
	 * previously released batches first
	 */
	for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
	{
		if (scan->batchcache[i] == NULL)
		{
			/* found empty slot, we're done */
			scan->batchcache[i] = batch;
			return;
		}
	}

	/*
	 * Failed to find a free slot for this batch.  We'll just free it
	 * ourselves.
	 */
	if (batch->deadItems)
		pfree(batch->deadItems);
	pfree(batch_alloc_base(batch, scan));


void
tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch)
{
...
	/*
	 * Use cache, just like indexam_util_batch_release does it (unless scan is
	 * shutting down)
	 */
	if (scan->xs_heapfetch)
	{
		for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
		{
			if (scan->batchcache[i] == NULL)
			{
				/* found empty slot, we're done */
				scan->batchcache[i] = batch;
				return;
			}
		}
	}

	if (batch->deadItems)
		pfree(batch->deadItems);
	pfree(batch_alloc_base(batch, scan));


Yes, one of them has the additional "if (scan->xs_heapfetch)" condition, but
that hardly seems like a real problem preventing sharing the code.


I don't think that "if (scan->xs_heapfetch)" is a particularly obvious
signifier of the scan being terminated, FWIW. I'd probably either add a
parameter to tableam_util_free_batch() signifying whether caching is desired
or add an internal helper with that parameter that's then used internally in
indexbatch.c.

Greetings,

Andres Freund