Re: index prefetching - Andres Freund

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Andres Freund <[email protected]>
To: Peter Geoghegan <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Alexandre Felipe <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Konstantin Knizhnik <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Tue, 31 Mar 2026 16:22:36 -0400
Message-ID: <chsvntdxvsiyigxq4nng36gne4natvxwvsqnkvbjlpaw6bu7co@a6togdo4wbrj> (raw)
In-Reply-To: <CAH2-WzkFRoTjD9T8ykYDzOMxzGiWFqcAkbK8B=HjfpoMdM4E8A@mail.gmail.com>
References: <vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx@mof5q6li3xzv>
	<CAH2-Wzn1j2a0p3OqmqrV6zADtWA_QpG82U6F9yCYG1Uschm_fA@mail.gmail.com>
	<CAH2-WzmCH+N2-H2oGSQcbn2fArbk7GXyD6rQN6kn5P=FX9R-_g@mail.gmail.com>
	<CAH2-WzkyG01682zwqyUTwV=Zq+M_qGgi1NbXwp1H-piRSfJsgQ@mail.gmail.com>
	<CAH2-Wz=HJc+QV2AZ9mUY43aKL+n+a1JQ-7OGE=MOkqSAtoKJug@mail.gmail.com>
	<t6mtqbv2mbfhjni4bvwdgoecppjmxvbyfwl6utovzv76xc2672@k3o5ryevaeqv>
	<CAH2-Wz=D4Lru9BkvqaRnFRPDaZbfTOdWcxw13zyG6GVFTtz_vw@mail.gmail.com>
	<CAH2-Wz=Vxsgas35ZzOJJW1ceqp9TJ2DFhKmXULwUAcVpfD73xA@mail.gmail.com>
	<CAH2-Wz=kMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G=zrw@mail.gmail.com>
	<CAH2-WzkFRoTjD9T8ykYDzOMxzGiWFqcAkbK8B=HjfpoMdM4E8A@mail.gmail.com>

Hi,

On 2026-03-31 13:39:30 -0400, Peter Geoghegan wrote:
> On Fri, Mar 27, 2026 at 8:35 PM Peter Geoghegan <[email protected]> wrote:
> > Attached is v18, which addresses the tricky issues I skipped in v17.
>
> v19 is attached. It contains a few notable improvements over v18:
>
> * The first commit (the one that simply moves code into the new
> heapam_indexscan.c file) now moves heap_hot_search_buffer over there
> as well.
>
> Andres suggested this at one point, and I believe it wins us a small
> but significant performance benefit.

To me it also just makes sense.


> From 742dcbea7d184443d2c4ef3c095b60f47f870f72 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Wed, 25 Mar 2026 00:22:17 -0400
> Subject: [PATCH v19 01/18] Move heapam_handler.c index scan code to new file.
>
> Move the heapam index fetch callbacks (index_fetch_begin,
> index_fetch_reset, index_fetch_end, and index_fetch_tuple) into a new
> dedicated file.  Also move heap_hot_search_buffer over.  This is a
> purely mechanical move with no functional impact.

I don't love that this renames arguments (s/call_again/heap_continue/) while
moving. Makes it harder to verify this is just the mechanical change it claims
to be.


> +/*-------------------------------------------------------------------------
> + *
> + * heapam_indexscan.c
> + *	  heap table plain index scan and index-only scan code
> + *
> + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + *	  src/backend/access/heap/heapam_indexscan.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/heapam.h"
> +#include "storage/predicate.h"

Should probably include access/relscan.h for +IndexFetchTableData, rather than
relying on the indirect include.



> From 5ca838ac32a8d43510325cb400be4ecea47ea8d2 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Sun, 22 Mar 2026 02:36:57 -0400
> Subject: [PATCH v19 02/18] Add slot-based table AM index scan interface.
>
> Add table_index_getnext_slot, a new table AM callback that wraps both
> plain and index-only index scans that use amgettuple.  Two new
> TableAmRoutine callbacks are introduced -- one for plain scans and one
> for index-only scans -- which an upcoming commit that adds the
> amgetbatch interface will expand to four.  The appropriate callback is
> resolved once in index_beginscan, and called through a function pointer
> (xs_getnext_slot) on the IndexScanDesc when the table_index_getnext_slot
> shim function is called from executor nodes.
>
> This moves VM checks for index-only scans out of the executor and into
> heapam, enabling batching of visibility map lookups (though for now we
> continue to just perform retail lookups).

I'd also add that it's architecturally much cleaner this way.


> A small minority of callers (2 callers in total) continue to use the old
> table_index_fetch_tuple interface.  This is still necessary with callers
> that fundamentally need to pass a TID to the table AM.  Both callers
> perform some kind of constraint enforcement.

I think we may need to do something about this before all of this over. It's
just too confusing to have the generic sounding index_fetch_tuple() interfaces
that are barely ever used and *shouldn't* be used when, uhm, fetching a tuple
pointed to by an index.

At the very least it seems we should rename the ->index_fetch_tuple() to
->index_fetch_tid() or something and remove the call_again, all_dead
arguments, since they are never used in this context.  I suspect it should
*not* use the table_index_fetch_begin()/table_index_fetch_end() interface
either.


> XXX Do we still need IndexFetchTableData.rel?  We are now mostly using
> IndexScanDescData.heapRelation, but we could use it for absolutely
> everything if we were willing to revise the signature of the old
> table_index_fetch_tuple interface by giving it its own heapRelation arg.

I think we *should* change the argument of table_index_fetch_tuple().


> Index-only scan callers pass table_index_getnext_slot a TupleTableSlot
> (which the table AM needs internally for heap fetches), but continue to
> read their results from IndexScanDescData fields such as xs_itup (rather
> than from the slot itself).

Pretty this ain't.  But I guess it's been vaguely gross like this for quite a
while, so...


> This convention lets both plain and index-only scans use the same
> table_index_getnext_slot interface.

Not convinced that's really a useful goal.


> All callers can continue to rely on the scan descriptor's xs_heaptid field
> being set on each call. (execCurrentOf is the only caller that reads this
> field directly, to determine the current row of an index-only scan, but it
> seems like a good idea to do the same thing for all callers).

> XXX Should execCurrentOf continue to do this?  Can't it get the same TID
> from the table slot instead?

I think the tid from the slot is always the tid from the start of the HOT
chain, not the actual location of the tuple. That's important, as otherwise a
HOT pruning after determining the slot's tid would potentially make the tid
invalid.  Whereas, I think, xs_heaptid, is currently set to the offset of the
actual tuple, even if it's a just a HOT version.


> @@ -177,10 +181,13 @@ typedef struct IndexScanDescData
>  	struct TupleDescData *xs_hitupdesc; /* rowtype descriptor of xs_hitup */
>
>  	ItemPointerData xs_heaptid; /* result */
> -	bool		xs_heap_continue;	/* T if must keep walking, potential
> -									 * further results */
>  	IndexFetchTableData *xs_heapfetch;
>
> +	/* Resolved getnext_slot implementation, set by index_beginscan */
> +	bool		(*xs_getnext_slot) (struct IndexScanDescData *scan,
> +									ScanDirection direction,
> +									struct TupleTableSlot *slot);
> +
>  	bool		xs_recheck;		/* T means scan keys must be rechecked */
>
>  	/*

Hm. Why is it ok that we now don't have an equivalent of xs_heap_continue on
the scan level anymore?  I see there's a local heap_continue in
heapam_index_getnext_slot(), but isn't that insufficient e.g. for a
SNAPSHOT_ANY snapshot, where all the row versions would be visible?

Am I misunderstanding how this works?



> +/*
> + * Fetch the next tuple from an index scan into `slot`, scanning in the
> + * specified direction.  Returns true if a tuple satisfying the scan keys and
> + * the snapshot was found, false otherwise.  The tuple is stored in the
> + * specified slot.
> + *
> + * Dispatches through scan->xs_getnext_slot, which is resolved once by
> + * index_beginscan.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * released by a future table_index_getnext_slot or index_endscan call.
> + *
> + * Note: caller must check scan->xs_recheck, and perform rechecking of the
> + * scan keys if required.  We do not do that here because we don't have
> + * enough information to do it efficiently in the general case.
> + *
> + * For index-only scans, the callback also fills xs_itup/xs_itupdesc or
> + * xs_hitup/xs_hitupdesc (or both) so that index data can be returned without
> + * a heap fetch.
> + */
> +static inline bool
> +table_index_getnext_slot(IndexScanDesc iscan, ScanDirection direction,
> +						 TupleTableSlot *slot)
> +{
> +	return iscan->xs_getnext_slot(iscan, direction, slot);
> +}

Hm. Does this actually belong in tableam.h? I'm a bit conflicted.


> @@ -2553,6 +2554,8 @@ static const TableAmRoutine heapam_methods = {
>  	.index_fetch_begin = heapam_index_fetch_begin,
>  	.index_fetch_reset = heapam_index_fetch_reset,
>  	.index_fetch_end = heapam_index_fetch_end,
> +	.index_plain_amgettuple_getnext_slot = heapam_index_plain_amgettuple_getnext_slot,
> +	.index_only_amgettuple_getnext_slot = heapam_index_only_amgettuple_getnext_slot,
>  	.index_fetch_tuple = heapam_index_fetch_tuple,
>
>  	.tuple_insert = heapam_tuple_insert,

Wonder if it's perhaps worth deleting _slot and shortening getnext to
next. These are quite only names...


> +/* table_index_getnext_slot callback: amgettuple, plain index scan */
> +pg_attribute_hot bool
> +heapam_index_plain_amgettuple_getnext_slot(IndexScanDesc scan,
> +										   ScanDirection direction,
> +										   TupleTableSlot *slot)
> +{
> +	Assert(!scan->xs_want_itup);
> +	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
> +
> +	return heapam_index_getnext_slot(scan, direction, slot, false);
> +}
> +
> +/* table_index_getnext_slot callback: amgettuple, index-only scan */
> +pg_attribute_hot bool
> +heapam_index_only_amgettuple_getnext_slot(IndexScanDesc scan,
> +										  ScanDirection direction,
> +										  TupleTableSlot *slot)
> +{
> +	Assert(scan->xs_want_itup);
> +	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
> +
> +	return heapam_index_getnext_slot(scan, direction, slot, true);
> +}
> +
>  bool
>  heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>  						 ItemPointer tid,
> @@ -233,6 +274,18 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>  						 TupleTableSlot *slot,
>  						 bool *heap_continue, bool *all_dead)
>  {
> +
> +	return heapam_index_fetch_tuple_impl(scan, tid, snapshot, slot,
> +										 heap_continue, all_dead);
> +}
> +
> +static pg_attribute_always_inline bool
> +heapam_index_fetch_tuple_impl(struct IndexFetchTableData *scan,
> +							  ItemPointer tid,
> +							  Snapshot snapshot,
> +							  TupleTableSlot *slot,
> +							  bool *heap_continue, bool *all_dead)
> +{
>  	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
>  	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
>  	bool		got_heap_tuple;
> @@ -289,3 +342,172 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>
>  	return got_heap_tuple;
>  }
> +
> +/*
> + * Common implementation for both heapam_index_*_getnext_slot variants.
> + *
> + * The result is true if a tuple satisfying the scan keys and the snapshot was
> + * found, false otherwise.  The tuple is stored in the specified slot.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * dropped by a future call here (or by a later call to heapam_index_fetch_end
> + * through index_endscan).
> + *
> + * The index_only parameter is a compile-time constant at each call site,
> + * allowing the compiler to specialize the code for each variant.
> + */
> +static pg_attribute_always_inline bool
> +heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
> +						  TupleTableSlot *slot, bool index_only)

For heapam_index_fetch_tuple_impl() you added _impl, but here you didn't.
Mild preference for also adding _impl here.


> +{
> +	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> +	ItemPointer tid;
> +	bool		heap_continue = false;

As mentioned above, it's not clear to me that it is correct for all scan types
to have a short-lived heap_continue.


> +	bool		all_visible = false;
> +	BlockNumber last_visited_block = InvalidBlockNumber;
> +	uint8		n_visited_pages = 0,
> +				xs_visited_pages_limit = 0;
> +
> +	if (index_only)
> +		xs_visited_pages_limit = scan->xs_visited_pages_limit;

Did you check that it's useful to have xs_visited_pages_limit as a longer
lived variable? I suspect it's just going to add register pressure and will
need to be saved/restored across other calls, making it just as cheap to
always fetch it from scan.


> +	for (;;)
> +	{
> +		if (!heap_continue)
> +		{
> +			/* Get the next TID from the index */
> +			tid = index_getnext_tid(scan, direction);
> +
> +			/* If we're out of index entries, we're done */
> +			if (tid == NULL)
> +				break;
> +
> +			/* For index-only scans, check the visibility map */
> +			if (index_only)
> +				all_visible = VM_ALL_VISIBLE(scan->heapRelation,
> +											 ItemPointerGetBlockNumber(tid),
> +											 &hscan->xs_vmbuffer);
> +		}
> +
> +		Assert(ItemPointerIsValid(&scan->xs_heaptid));
> +
> +		if (index_only)
> +		{
> +			/*
> +			 * We can skip the heap fetch if the TID references a heap page on
> +			 * which all tuples are known visible to everybody.  In any case,
> +			 * we'll use the index tuple not the heap tuple as the data
> +			 * source.
> +			 */
> +			if (!all_visible)
> +			{
> +				/*
> +				 * Rats, we have to visit the heap to check visibility.
> +				 */
> +				if (scan->instrument)
> +					scan->instrument->ntablefetches++;
> +
> +				if (!heapam_index_fetch_heap(scan, slot, &heap_continue))
> +				{

Still find it somewhat weird that we have local 'tid' variable, that we do not
use pass to heapam_index_fetch_heap(), because heapam_index_fetch_heap()
relies on index_getnext_tid() having stored the to-be-fetched tid in
xs_heaptid.

But I guess that's something to change at a later date.


> +					/*
> +					 * No visible tuple.  If caller set a visited-pages limit
> +					 * (only selfuncs.c does this), count distinct heap pages
> +					 * and give up once we've visited too many.
> +					 */
> +					if (unlikely(xs_visited_pages_limit > 0))
> +					{
> +						BlockNumber blk = ItemPointerGetBlockNumber(tid);
> +
> +						if (blk != last_visited_block)
> +						{
> +							last_visited_block = blk;
> +							if (++n_visited_pages > xs_visited_pages_limit)
> +								return false;	/* give up */
> +						}
> +					}
> +					continue;	/* no visible tuple, try next index entry */
> +				}
> +
> +				ExecClearTuple(slot);

Previously the ExecClearTuple() here had at least this comment:
/* We don't actually need the heap tuple for anything */

Now it looks even more confusing...


> @@ -313,7 +313,32 @@ visibilitymap_set(BlockNumber heapBlk,
>   * since we don't lock the visibility map page either, it's even possible that
>   * someone else could have changed the bit just before we look at it, but yet
>   * we might see the old value.  It is the caller's responsibility to deal with
> - * all concurrency issues!
> + * all concurrency issues!  In practice it can't be stale enough to matter for
> + * the primary use case: index-only scans that check whether a heap fetch can
> + * be skipped.
> + *
> + * The argument for why it can't be stale enough to matter for the primary use
> + * case is as follows:
> + *
> + * Inserts: we need to detect that a VM bit was cleared by an insert right
> + * away, because the new tuple is present in the index but not yet visible.
> + * Reading the TID from the index page (under a shared lock on the index
> + * buffer) is serialized with the insertion of the TID into the index (under
> + * an exclusive lock on the same index buffer).  Because the VM bit is cleared
> + * before the index is updated, and locking/unlocking of the index page acts
> + * as a full memory barrier, we are sure to see the cleared bit whenever we
> + * see a recently-inserted TID.
> + *
> + * Deletes: the clearing of the VM bit by a delete is NOT serialized with the
> + * index page access, because deletes do not update the index page (only
> + * VACUUM removes the index TID).  So we may see a significantly stale value.
> + * However, we don't need to detect the delete right away, because the tuple
> + * remains visible until the deleting transaction commits or the statement
> + * ends (if it's our own transaction).  In either case, the lock on the VM
> + * buffer will have been released (acting as a write barrier) after clearing
> + * the bit.  And for us to have a snapshot that includes the deleting
> + * transaction (making the tuple invisible), we must have acquired
> + * ProcArrayLock after that time, acting as a read barrier.
>   */
>  uint8
>  visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)

Nice. Much better place for the comment.


> diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
> index 1408989c5..acc9f3e6a 100644
> --- a/src/backend/access/index/genam.c
> +++ b/src/backend/access/index/genam.c
> @@ -126,6 +126,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
>  	scan->xs_hitup = NULL;
>  	scan->xs_hitupdesc = NULL;
>
> +	scan->xs_visited_pages_limit = 0;
> +
>  	return scan;
>  }

Orthogonal: I suspect eventually we should pass the table to
RelationGetIndexScan(), have the tableam return how much space it needs, and
allocate it one chunk.




> From 85be7bf520fc2746f4ee73a0744c7aa04da55e52 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Tue, 10 Mar 2026 14:40:35 -0400
> Subject: [PATCH v19 03/18] heapam: Track heap block in IndexFetchHeapData.

LGTM.


> Subject: [PATCH v19 04/18] heapam: Keep buffer pins across index rescans.
>
> Avoid dropping the heap page pin (xs_cbuf) and visibility map pin
> (xs_vmbuffer) during heapam_index_fetch_reset.  Retaining these pins
> saves cycles during tight nested loop joins and merge joins that
> frequently restore a saved mark, since the next tuple fetched after a
> rescan often falls on the same heap page.  It can also avoid repeated
> pinning and unpinning of the same buffer when rescans happen to revisit
> the same page.
>
> Preparation for an upcoming patch that will add the amgetbatch
> interface to enable optimizations such as I/O prefetching.
>
> Author: Peter Geoghegan <[email protected]>
> Reviewed-By: Andres Freund <[email protected]>
> Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
> ---
>  src/backend/access/heap/heapam_indexscan.c | 27 ++++++++++++----------
>  src/backend/access/index/indexam.c         |  6 ++---
>  2 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
> index 4b5e2b30d..82f135e1e 100644
> --- a/src/backend/access/heap/heapam_indexscan.c
> +++ b/src/backend/access/heap/heapam_indexscan.c
> @@ -59,18 +59,15 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
>  {
>  	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
>
> -	if (BufferIsValid(hscan->xs_cbuf))
> -	{
> -		ReleaseBuffer(hscan->xs_cbuf);
> -		hscan->xs_cbuf = InvalidBuffer;
> -		hscan->xs_blk = InvalidBlockNumber;
> -	}
> +	/* Resets are a no-op (XXX amgetbatch commit resets xs_vm_items here) */
> +	(void) hscan;

I assume that's not a line you intend to commit?

LGTM otherwise.


Need to do something else for a while.  More another time.


Greetings,

Andres Freund

view thread (367+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: index prefetching
  In-Reply-To: <chsvntdxvsiyigxq4nng36gne4natvxwvsqnkvbjlpaw6bu7co@a6togdo4wbrj>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox