Re: index prefetching - Tomas Vondra

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Tomas Vondra <[email protected]>
To: Peter Geoghegan <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Konstantin Knizhnik <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Mon, 30 Sep 2024 23:16:25 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<CAH2-WznsJqDgr_0yUwApgYXi3cRZQbimFkiYRqqXhpMcw4s8ZQ@mail.gmail.com>
	<[email protected]>
	<CAH2-WznBuxhvsEgX3mYDjxKhQk9GFdF46vMfE2ugU6SUekHp_A@mail.gmail.com>
	<CAAKRu_ZKp1BCT+V324jENxKTfsetxJwxh309rJGWxebSggPisw@mail.gmail.com>
	<CAH2-Wzkrej9cXjERrA5p8pgD9QfR0LZwCCcgPPu6wiRgFpYVQQ@mail.gmail.com>
	<[email protected]>
	<CAH2-Wz=gMnsLQph1KM_xxTu-ZFRFqbDbK9tFBPTKcfXB1Z8=og@mail.gmail.com>
	<[email protected]>
	<CAH2-WzkpNN1+sovB8G=5dVwYW25=J6Qj4V9L7DzD26NTVQWM2w@mail.gmail.com>
	<[email protected]>
	<CAH2-WzkToUXuqfktW3CPoKq3odtNChkFwQFWHARz=n-h_Zm2Kw@mail.gmail.com>
	<[email protected]>
	<CAH2-Wzn3o6AE2x7v8Upmk8zPrDw8UUozpd4QJx8AiehHGmA02w@mail.gmail.com>
	<[email protected]>
	<[email protected]>

Hi,

Here's another version of this patch series, with a couple significant
improvements, mostly in the indexam.c and executor layers. The AM code
remains almost untouched.

I have focused on the simplification / cleanup of the executor code
(nodeIndexscan and nodeIndexonlyscan). In the previous version there was
quite a bit of duplicated code - both for the "regular" index scans and
index-only scans, the "while getnext" block was copied, calling either
the non-batched or batched functions.

That is now mostly gone. I managed to move 99% of the differences to the
indexam.c layer, so that the executor simply calls index_getnext_tid()
or index_getnext_slot(), and that decides *internally* whether to use
the batched version, or not. This means the only new function added to
the indexam API is index_batch_add(), which the index AMs use to add
items into the batch. For the executor the code remains the same.

The only exception is that index-only scans need a way to guide the
prefetching based on the visibility map (we don't want to prefetch
all-visible pages, because skipping those is the whole point of IOS).
And we also want a way to share the VM check, so that it doesn't need to
happen twice. Because for fully-cached workloads this is too expensive.

Doing the first part is trivial - we simply define a callback for the
batching, responsible for inspecting the VM and making a decision.
That's easy, and fairly clean. Passing the VM check result back is a bit
awkward, though. The current patch deals with it by just executing the
callback again (which just returns the cached result), or doing the VM
check locally (for non-batched version). It's not pretty, because it
leaks knowledge of the batching into the executor.

I'd appreciate ideas how to solve this in a nicer way.

I've also split the nbtree changes into a separate patch. It used to be
included in the first patch, but I've decided to keep it separate, just
like for the other AMs.

I'm now fairly happy with both the executor layer and the (much smaller)
indexam.c code, and I think it's in a good enough shape for a review.

The next item on my TODO is cleanup of the nbtree code, particularly the
mark/restore part in patch 0003. So I'll work on that next. I also plan
to get back to the index_batch_prefetch() code, which is not wrong but
would benefit from a bit of cleanup / clarification etc.


regards

-- 
Tomas Vondra


Attachments:

  [text/x-patch] v20240930-0001-WIP-index-batching-prefetching.patch (59.5K, 2-v20240930-0001-WIP-index-batching-prefetching.patch)
  download | inline diff:
From 88046ae9afd637cad22c5728f337794a210cfbc3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 30 Sep 2024 22:48:12 +0200
Subject: [PATCH v20240930 1/6] WIP: index batching / prefetching

Allows the index AM to provide items (TIDs and tuples) in batches, which
is then used to implement prefetching of heap tuples in index scans
(including index-only scans). This is similar to prefetching already
done in bitmap scans, and can result in significant speedups.

The index AM may implement an optional "amgetbatch" callback, returning
a batch of items. The indexam.c code then handles this transparently
through the existing "getnext" interface.

It is up to the index AM to return only batches that it can handle
internally. For example, most of the later patches adding support for
batching to relevant index AMs (btree, hash, gist, sp-gist) restrict the
batches to a single leaf page. This makes implementation of batching
much simpler, with only minimal changes to the index AMs, but it's not a
hard requirement. The index AM can produce batches spanning arbitrary
number of leaf pages. This is left as a possible future improvement.

Most of the batching/prefetching logic happens in indexam.c. This means
the executor code can continue to call the interface just like before.

The only "violation" happens in index-only scans, which need to check
the visibility map both when the prefetching pages (we don't want to
prefetch pages that are unnecessary) and later when reading the data.
For cached data the visibility map checks can be fairly expensive, so
it's desirable to keep and reuse the result of the first check.

At the moment, the prefetching does not handle mark/restore plans. This
is doable, but requires additional synchronization between the batching
and index AM code in the "opposite direction".

This patch does not actually add batching to any of the index AMs, it's
just the common infrastructure.

TODO Add the new index AM callback to sgml docs.
---
 src/backend/access/heap/heapam_handler.c      |   7 +-
 src/backend/access/index/genam.c              |  23 +-
 src/backend/access/index/indexam.c            | 808 +++++++++++++++++-
 src/backend/executor/execIndexing.c           |   7 +-
 src/backend/executor/execReplication.c        |   9 +-
 src/backend/executor/nodeIndexonlyscan.c      | 106 ++-
 src/backend/executor/nodeIndexscan.c          |  36 +-
 src/backend/utils/adt/selfuncs.c              |   7 +-
 src/backend/utils/misc/guc_tables.c           |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/amapi.h                    |   5 +
 src/include/access/genam.h                    |  14 +-
 src/include/access/relscan.h                  |  64 ++
 src/include/nodes/execnodes.h                 |   7 +
 src/test/regress/expected/sysviews.out        |   3 +-
 src/tools/pgindent/typedefs.list              |   2 +
 16 files changed, 1084 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1c6da286d43..86da23e41f4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -750,7 +750,12 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+
+		/*
+		 * XXX Maybe enable batching/prefetch for clustering. Seems like it
+		 * might be a pretty substantial win.
+		 */
+		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0, false);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 52fde5cc4d4..0186c251ef7 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -445,8 +445,18 @@ systable_beginscan(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
+		/*
+		 * No batching/prefetch for catalogs. We don't expect that to help
+		 * very much, because we usually need just one row, and even if we
+		 * need multiple rows, they tend to be colocated in heap.
+		 *
+		 * XXX Maybe we could do that, the prefetching only ramps up over
+		 * time. But then we need to be careful about infinite recursion when
+		 * looking up effective_io_concurrency for a tablespace in the
+		 * catalog.
+		 */
 		sysscan->iscan = index_beginscan(heapRelation, irel,
-										 snapshot, nkeys, 0);
+										 snapshot, nkeys, 0, false);
 		index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 		sysscan->scan = NULL;
 	}
@@ -708,8 +718,17 @@ systable_beginscan_ordered(Relation heapRelation,
 			elog(ERROR, "column is not in index");
 	}
 
+	/*
+	 * No batching/prefetch for catalogs. We don't expect that to help very
+	 * much, because we usually need just one row, and even if we need
+	 * multiple rows, they tend to be colocated in heap.
+	 *
+	 * XXX Maybe we could do that, the prefetching only ramps up over time.
+	 * But then we need to be careful about infinite recursion when looking up
+	 * effective_io_concurrency for a tablespace in the catalog.
+	 */
 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
-									 snapshot, nkeys, 0);
+									 snapshot, nkeys, 0, false);
 	index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 	sysscan->scan = NULL;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 1859be614c0..2849ab97cdf 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
  *		index_can_return	- does index support index-only scans?
  *		index_getprocid - get a support procedure OID
  *		index_getprocinfo - get a support procedure's lookup info
+ *		index_batch_add		- add an item (TID, itup) to the batch
  *
  * NOTES
  *		This file contains the index_ routines which used
@@ -54,10 +55,14 @@
 #include "pgstat.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "utils/memutils.h"
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
+#include "utils/spccache.h"
 #include "utils/syscache.h"
 
+/* enable reading batches / prefetching of TIDs from the index */
+bool		enable_indexscan_batching = false;
 
 /* ----------------------------------------------------------------
  *					macros used in index_ routines
@@ -109,6 +114,15 @@ static IndexScanDesc index_beginscan_internal(Relation indexRelation,
 											  ParallelIndexScanDesc pscan, bool temp_snap);
 static inline void validate_relation_kind(Relation r);
 
+/* index batching */
+static void index_batch_init(IndexScanDesc scan);
+static void index_batch_reset(IndexScanDesc scan);
+static bool index_batch_getnext(IndexScanDesc scan,
+								ScanDirection direction);
+static ItemPointer index_batch_getnext_tid(IndexScanDesc scan,
+										   ScanDirection direction);
+static void index_batch_prefetch(IndexScanDesc scan,
+								 ScanDirection direction);
 
 /* ----------------------------------------------------------------
  *				   index_ interface functions
@@ -256,7 +270,8 @@ IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
-				int nkeys, int norderbys)
+				int nkeys, int norderbys,
+				bool enable_batching)
 {
 	IndexScanDesc scan;
 
@@ -274,6 +289,24 @@ index_beginscan(Relation heapRelation,
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
 
+	/*
+	 * If explicitly requested and supported by both the index AM and the
+	 * plan, initialize batching info.
+	 *
+	 * XXX We do this after ambeginscan(), which means the AM can't init the
+	 * private data in there (it doesn't even know if batching will be used at
+	 * that point).
+	 *
+	 * XXX Maybe we should have a separate "amcanbatch" call, to let the AM
+	 * decide if batching is supported depending on the scan details.
+	 */
+	if ((indexRelation->rd_indam->amgetbatch != NULL) &&
+		enable_batching &&
+		enable_indexscan_batching)
+	{
+		index_batch_init(scan);
+	}
+
 	return scan;
 }
 
@@ -333,6 +366,12 @@ index_beginscan_internal(Relation indexRelation,
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
 
+	/*
+	 * No batching by default, so set it to NULL. Will be initialized later if
+	 * batching is requested and AM supports it.
+	 */
+	scan->xs_batch = NULL;
+
 	return scan;
 }
 
@@ -368,6 +407,18 @@ index_rescan(IndexScanDesc scan,
 
 	scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
 											orderbys, norderbys);
+
+	/*
+	 * Reset the batch, to make it look empty.
+	 *
+	 * Done after the amrescan() call, in case the AM needs some of the batch
+	 * info (e.g. to properly transfer the killed tuples).
+	 *
+	 * XXX This is a bit misleading, because index_batch_reset does not reset
+	 * the killed tuples. So if that's the only justification, we could have
+	 * done it before the call.
+	 */
+	index_batch_reset(scan);
 }
 
 /* ----------------
@@ -444,6 +495,18 @@ index_restrpos(IndexScanDesc scan)
 	scan->xs_heap_continue = false;
 
 	scan->indexRelation->rd_indam->amrestrpos(scan);
+
+	/*
+	 * Reset the batch, to make it look empty.
+	 *
+	 * Done after the amrescan() call, in case the AM needs some of the batch
+	 * info (e.g. to properly transfer the killed tuples).
+	 *
+	 * XXX This is a bit misleading, because index_batch_reset does not reset
+	 * the killed tuples. So if that's the only justification, we could have
+	 * done it before the call.
+	 */
+	index_batch_reset(scan);
 }
 
 /*
@@ -539,7 +602,8 @@ index_parallelrescan(IndexScanDesc scan)
  */
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
-						 int norderbys, ParallelIndexScanDesc pscan)
+						 int norderbys, ParallelIndexScanDesc pscan,
+						 bool enable_batching)
 {
 	Snapshot	snapshot;
 	IndexScanDesc scan;
@@ -562,6 +626,24 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heaprel);
 
+	/*
+	 * If explicitly requested and supported by both the index AM and the
+	 * plan, initialize batching info.
+	 *
+	 * XXX We do this after ambeginscan(), which means the AM can't init the
+	 * private data in there (it doesn't even know if batching will be used at
+	 * that point).
+	 *
+	 * XXX Maybe we should have a separate "amcanbatch" call, to let the AM
+	 * decide if batching is supported depending on the scan details.
+	 */
+	if ((indexrel->rd_indam->amgetbatch != NULL) &&
+		enable_batching &&
+		enable_indexscan_batching)
+	{
+		index_batch_init(scan);
+	}
+
 	return scan;
 }
 
@@ -583,6 +665,53 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	/* XXX: we should assert that a snapshot is pushed or registered */
 	Assert(TransactionIdIsValid(RecentXmin));
 
+	/*
+	 * When using batching (which may be disabled for various reasons (e.g.
+	 * through a GUC, the index AM not supporting it) do the old approach.
+	 *
+	 * XXX Maybe we should enable batching based on the plan too, so that we
+	 * don't do batching when it's probably useless (e.g. semijoins or queries
+	 * with LIMIT 1 etc.). But maybe the approach with slow ramp-up (starting
+	 * with small batches) will handle that well enough.
+	 *
+	 * XXX Perhaps it'd be possible to do both in index_getnext_slot(), i.e.
+	 * call either the original code without batching, or the new batching
+	 * code if supported/enabled. It's not great to have duplicated code.
+	 */
+	if (scan->xs_batch != NULL)
+	{
+batch_loaded:
+		/* Try getting a TID from the current batch (if we have one). */
+		while (index_batch_getnext_tid(scan, direction) != NULL)
+		{
+			/*
+			 * We've successfully loaded a TID from the batch, so issue
+			 * prefetches for future TIDs if needed.
+			 */
+			index_batch_prefetch(scan, direction);
+
+			return &scan->xs_heaptid;
+		}
+
+		/*
+		 * We either don't have any batch yet, or we've already processed
+		 * all items from the current batch. Try loading the next one.
+		 *
+		 * If we succeed, issue prefetches (using the current prefetch
+		 * distance without ramp up), and then go back to returning the
+		 * TIDs from the batch.
+		 *
+		 * XXX Maybe do this as a simple while/for loop without the goto.
+		 */
+		if (index_batch_getnext(scan, direction))
+		{
+			index_batch_prefetch(scan, direction);
+			goto batch_loaded;
+		}
+
+		return NULL;
+	}
+
 	/*
 	 * The AM's amgettuple proc finds the next index entry matching the scan
 	 * keys, and puts the TID into scan->xs_heaptid.  It should also set
@@ -651,7 +780,19 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
 	 * RelationGetIndexScan().
 	 */
 	if (!scan->xactStartedInRecovery)
-		scan->kill_prior_tuple = all_dead;
+	{
+		if (scan->xs_batch == NULL)
+		{
+			scan->kill_prior_tuple = all_dead;
+		}
+		else if (all_dead)
+		{
+			/* batch case - record the killed tuple in the batch */
+			if (scan->xs_batch->nKilledItems < scan->xs_batch->maxSize)
+				scan->xs_batch->killedItems[scan->xs_batch->nKilledItems++]
+					= scan->xs_batch->currIndex;
+		}
+	}
 
 	return found;
 }
@@ -1039,3 +1180,664 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
 
 	return build_local_reloptions(&relopts, attoptions, validate);
 }
+
+/*
+ * INDEX BATCHING AND PREFETCHING
+ *
+ * Allows reading chunks of items from an index, instead of reading them
+ * one by one. This reduces the overhead of accessing index pages, and
+ * also allows acting on "future" TIDs - e.g. we can prefetch heap pages
+ * that will be needed, etc.
+ *
+ *
+ * index AM contract
+ * -----------------
+ *
+ * To support batching, the index AM needs to implement an optional callback
+ * amgetbatch() which loads data into the batch (in the scan descriptor).
+ *
+ * The index AM also needs to ensure it can perform all optimizations for
+ * all TIDs in the current batch. A good example of is the kill_prior_tuple
+ * optimization - with batching, the index AM may receive the information
+ * which tuples are to be killed with a delay - when loading the next
+ * batch, when ending/restarting the scan, etc. The AM needs to ensure it
+ * can still process such information, to keep the optimization effective.
+ *
+ * The AM may also need to keep pins required by the whole batch (not just
+ * the last tuple), etc.
+ *
+ * What this means/requires is very dependent on the index AM, of course.
+ * For B-Tree (and most other index AMs), batches spanning multiple leaf
+ * pages would be problematic. Such batches would work for basic index
+ * scans, but the kill_prior_tuple would be an issue - the AMs keep only
+ * a single leaf pinned. We'd either need to keep multiple pins, or allow
+ * reading older leaf pages pages (which might have been modified). Index
+ * only-scans is challenging too - we keep IndexTuple pointers into the
+ * leaf pages, which requires keeping those pins too.
+ *
+ * To solve this, we give the AM some control over batch boundaries. It is
+ * up to the index AM to pick which range of index items to load into the
+ * batch, and how to ensure all the optimizations are possible, keep pins,
+ * and so on. The index AM may use information about the batch (in the
+ * scan descriptor, maintained by indexam.c code), and may also keep some
+ * private information (in the existing "opaque" scan field).
+ *
+ * For most index AMs the easiest way is to not load batches spanning
+ * multiple leaf pages. This may impact the efficiency, especially for
+ * indexes with wide index tuples, as it means batches close to the end
+ * of the leaf page may be smaller.
+ *
+ * Note: There already is a pipeline break for prefetching - as we are
+ * getting closer to the end of a batch, we can't prefetch further than
+ * that, and the effective prefetch distance drops to 0.
+ *
+ * The alternative would be to make the index AMs more complex, to keep
+ * more leaf pages pinned, etc. The current model does not prohibit the
+ * index AMs from implementing that - it's entirely possible to keep the
+ * additional information in the "opaque" structure (say, list of pinned
+ * pages, and other necessary details).
+ *
+ * But that does not seem like a good trade off, as it's subject to
+ * "diminishing returns" - we see significant gains initially (with even
+ * small batches / prefetch distance), and as the batch grows the gains
+ * get smaller and smaller. It does not seem worth the complexity of
+ * pinning more pages etc. at least for the first version.
+ *
+ * To deal with the "prefetch pipeline break", that could be addressed by
+ * allowing multiple in-fligt batches - e.g. with prefetch distance 64
+ * we might have three batches of 32 items each, to prefetch far ahead.
+ * But that's not what this patch does yet.
+ *
+ *
+ * batch = sliding window
+ * ----------------------
+ *
+ * A good way to visualize a batch is a sliding window over the array
+ * of items on a leaf page. In the simplest example (forward scan with no
+ * changes of direction), we slice the array into smaller chunks, and
+ * then process each of those chunks.
+ *
+ * The batch size is adaptive - it starts small (only 8 elements) and
+ * increases as we read more batches (up to 64 elements). We don't want
+ * to regress cases that only need a single item (e.g. LIMIT 1 queries),
+ * and loading/copying a lot of data might cause that. So we start small
+ * and increase the size - that still improves cases reading a lot of
+ * data from the index, without hurting small queries.
+ *
+ * Note: This gradual ramp up is for batch size, independent of what we
+ * do for prefetch. The prefetch distance is gradually increased too, but
+ * it's independent / orthogonal to the batch size. The batch size limits
+ * how far ahead we can prefetch, of course.
+ *
+ * Note: The current limits on batch size (initial 8, maximum 64) are
+ * quite arbitrary, it just seemed those values are sane. We could adjust
+ * the initial size, but I don't think it'd make a fundamental difference.
+ * Growing the batches faster/slower has bigger impact.
+ *
+ * The maximum batch size does not matter much - it's true a btree index can
+ * have up to ~1300 items per 8K leaf page, but in most cases the actual
+ * number is lower, perhaps ~300. That's not that far from 64.
+ *
+ * Each batch has a firstIndex/lastIndex to track which part of the leaf
+ * page it currently represents.
+ *
+ *
+ * kill_prior_tuples
+ * -----------------
+ *
+ * If we decide a tuple can be killed, the batch item is marked accordingly,
+ * and the flag is reset to false (so that the index AM does not do something
+ * silly to a random tuple it thinks is "current").
+ *
+ * Then the next time the AM decides it's time to kill tuples, the AM needs
+ * to look at the batch and consider the tuples marked to be killed. B-Tree
+ * simply adds those TIDs to the regular "killItems" array.
+ *
+ *
+ * mark/restore
+ * ------------
+ *
+ * With batching, the index AM does not know the the "current" position on
+ * the leaf page - we don't propagate this to the index AM while walking
+ * items in the batch. To make ammarkpos() work, the index AM has to check
+ * the current position in the batch, and translate it to the proper page
+ * position, using the private information (about items in the batch).
+ *
+ * XXX This needs more work, I don't quite like how the two layers interact,
+ * it seems quite wrong to look at the batch info directly.
+ */
+
+/*
+ * Comprehensive check of various invariants on the index batch. Makes sure
+ * the indexes are set as expected, the buffer size is within limits, and
+ * so on.
+ */
+static void
+AssertCheckBatchInfo(IndexScanDesc scan)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* all the arrays need to be allocated */
+	Assert((scan->xs_batch->heaptids != NULL) &&
+		   (scan->xs_batch->killedItems != NULL) &&
+		   (scan->xs_batch->privateData != NULL));
+
+	/* if IndexTuples expected, should be allocated too */
+	Assert(!(scan->xs_want_itup && (scan->xs_batch->itups == NULL)));
+
+	/* Various check on batch sizes */
+	Assert((scan->xs_batch->initSize >= 0) &&
+		   (scan->xs_batch->initSize <= scan->xs_batch->currSize) &&
+		   (scan->xs_batch->currSize <= scan->xs_batch->maxSize) &&
+		   (scan->xs_batch->maxSize <= 1024));	/* arbitrary limit */
+
+	/* Is the number of in the batch TIDs in a valid range? */
+	Assert((scan->xs_batch->nheaptids >= 0) &&
+		   (scan->xs_batch->nheaptids <= scan->xs_batch->maxSize));
+
+	/*
+	 * The current item must be between -1 and nheaptids. Those two extreme
+	 * values are starting points for forward/backward scans.
+	 */
+	Assert((scan->xs_batch->currIndex >= -1) &&
+		   (scan->xs_batch->currIndex <= scan->xs_batch->nheaptids));
+
+	/* check prefetch data */
+	Assert((scan->xs_batch->prefetchTarget >= 0) &&
+		   (scan->xs_batch->prefetchTarget <= scan->xs_batch->prefetchMaximum));
+
+	Assert((scan->xs_batch->prefetchIndex >= -1) &&
+		   (scan->xs_batch->prefetchIndex <= scan->xs_batch->nheaptids));
+
+	for (int i = 0; i < scan->xs_batch->nheaptids; i++)
+		Assert(ItemPointerIsValid(&scan->xs_batch->heaptids[i]));
+#endif
+}
+
+/* Is the batch full (TIDs up to capacity)? */
+#define	INDEX_BATCH_IS_FULL(scan)	\
+	((scan)->xs_batch->nheaptids == (scan)->xs_batch->currSize)
+
+/* Is the batch empty (no TIDs)? */
+#define	INDEX_BATCH_IS_EMPTY(scan)	\
+	((scan)->xs_batch->nheaptids == 0)
+
+/*
+ * Did we process all items? For forward scan it means the index points to the
+ * last item, for backward scans it has to point to the first one.
+ *
+ * This does not cover empty batches properly, because of backward scans.
+ */
+#define	INDEX_BATCH_IS_PROCESSED(scan, direction)	\
+	(ScanDirectionIsForward(direction) ? \
+		((scan)->xs_batch->nheaptids == ((scan)->xs_batch->currIndex + 1)) : \
+		((scan)->xs_batch->currIndex == 0))
+
+/* Does the batch items in the requested direction? */
+#define INDEX_BATCH_HAS_ITEMS(scan, direction) \
+	(!INDEX_BATCH_IS_EMPTY(scan) && !INDEX_BATCH_IS_PROCESSED(scan, direction))
+
+
+/* ----------------
+ *		index_batch_getnext - get the next batch of TIDs from a scan
+ *
+ * Returns true if we managed to read at least some TIDs into the batch,
+ * or false if there are no more TIDs in the scan. The xs_heaptids and
+ * xs_nheaptids fields contain the TIDS and the number of elements.
+ *
+ * XXX This only loads the TIDs and resets the various batch fields to
+ * fresh state. It does not set xs_heaptid/xs_itup/xs_hitup, that's the
+ * responsibility of the following index_batch_getnext_tid() calls.
+ * ----------------
+ */
+static bool
+index_batch_getnext(IndexScanDesc scan, ScanDirection direction)
+{
+	bool		found;
+
+	SCAN_CHECKS;
+	CHECK_SCAN_PROCEDURE(amgetbatch);
+
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
+
+	/* comprehensive checks of batching info */
+	AssertCheckBatchInfo(scan);
+
+	/*
+	 * We never read a new batch before we run out of items in the current
+	 * one. The current batch has to be either empty or we ran out of items
+	 * (in the given direction).
+	 */
+	Assert(!INDEX_BATCH_HAS_ITEMS(scan, direction));
+
+	/*
+	 * Reset the current/prefetch positions in the batch.
+	 *
+	 * XXX Done before calling amgetbatch(), so that it sees the index as
+	 * invalid, batch as empty, and can add items.
+	 *
+	 * XXX Intentionally does not reset the nheaptids, because the AM does
+	 * rely on that when processing killed tuples. Maybe store the killed
+	 * tuples differently?
+	 */
+	scan->xs_batch->currIndex = -1;
+	scan->xs_batch->prefetchIndex = 0;
+	scan->xs_batch->nheaptids = 0;
+
+	/*
+	 * Reset the memory context with all per-batch data, allocated by the AM.
+	 * This might be tuples, or anything else the AM needs.
+	 *
+	 * XXX Make sure to reset the tuples, because the AM may do something with
+	 * them later (e.g. release them, as getNextNearest in gist), but we may
+	 * release them by the MemoryContextReset() call.
+	 *
+	 * This might break the AM if it relies on them pointing to the last
+	 * tuple, but at least it has the chance to do the right thing by checking
+	 * if the pointer is NULL.
+	 */
+	scan->xs_itup = NULL;
+	scan->xs_hitup = NULL;
+
+	MemoryContextReset(scan->xs_batch->ctx);
+
+	/*
+	 * The AM's amgetbatch proc loads a chunk of TIDs matching the scan keys,
+	 * and puts the TIDs into scan->xs_batch->heaptids.  It should also set
+	 * scan->xs_recheck and possibly
+	 * scan->xs_batch->itups/scan->xs_batch->hitups, though we pay no
+	 * attention to those fields here.
+	 *
+	 * FIXME At the moment this does nothing with hitup. Needs to be fixed?
+	 */
+	found = scan->indexRelation->rd_indam->amgetbatch(scan, direction);
+
+	/* Reset kill flag immediately for safety */
+	scan->kill_prior_tuple = false;
+	scan->xs_heap_continue = false;
+
+	/* If we're out of index entries, we're done */
+	if (!found)
+	{
+		/* release resources (like buffer pins) from table accesses */
+		if (scan->xs_heapfetch)
+			table_index_fetch_reset(scan->xs_heapfetch);
+
+		return false;
+	}
+
+	/* We should have a non-empty batch with items. */
+	Assert(INDEX_BATCH_HAS_ITEMS(scan, direction));
+
+	pgstat_count_index_tuples(scan->indexRelation, scan->xs_batch->nheaptids);
+
+	/*
+	 * Set the prefetch index to the first item in the loaded batch (we expect
+	 * the index AM to set that).
+	 *
+	 * FIXME Maybe set the currIndex here, not in the index AM. It seems much
+	 * more like indexam.c responsibility rather than something every index AM
+	 * should be doing (in _bt_first_batch etc.).
+	 *
+	 * FIXME It's a bit unclear who (indexam.c or the index AM) is responsible
+	 * for setting which fields. This needs clarification.
+	 */
+	scan->xs_batch->prefetchIndex = scan->xs_batch->currIndex;
+
+	/*
+	 * Try to increase the size of the batch. Intentionally done after the AM
+	 * call, so that the new value applies to the next batch. Otherwise we
+	 * would always skip the initial batch size.
+	 */
+	scan->xs_batch->currSize = Min(scan->xs_batch->currSize + 1,
+								   scan->xs_batch->maxSize);
+
+	/* comprehensive checks of batching info */
+	AssertCheckBatchInfo(scan);
+
+	/* Return the batch of TIDs we found. */
+	return true;
+}
+
+/* ----------------
+ *		index_getnext_batch_tid - get the next TID from the current batch
+ *
+ * Same calling convention as index_getnext_tid(), except that NULL means
+ * no more items in the current batch, there may be more batches.
+ *
+ * XXX This only sets xs_heaptid and xs_itup (if requested). Not sure if
+ * we need to do something with xs_hitup.
+ *
+ * FIXME Should this set xs_hitup?
+ * ----------------
+ */
+static ItemPointer
+index_batch_getnext_tid(IndexScanDesc scan, ScanDirection direction)
+{
+	/* comprehensive checks of batching info */
+	AssertCheckBatchInfo(scan);
+
+	/*
+	 * Bail out if he batch does not have more items in the requested directio
+	 * (either empty or everthing processed).
+	 */
+	if (!INDEX_BATCH_HAS_ITEMS(scan, direction))
+		return NULL;
+
+	/*
+	 * Advance to the next batch item - we know it's not empty and there are
+	 * items to process, so this is valid.
+	 */
+	if (ScanDirectionIsForward(direction))
+		scan->xs_batch->currIndex++;
+	else
+		scan->xs_batch->currIndex--;
+
+	/*
+	 * Next TID from the batch, optionally also the IndexTuple/HeapTuple.
+	 *
+	 * XXX Not sure how to decide which of the tuples to set, seems easier to
+	 * just set both, one of them will be NULL.
+	 *
+	 * XXX Do we need to reset the itups/htups array between batches? Doesn't
+	 * seem necessary, but maybe we could get bogus data?
+	 */
+	scan->xs_heaptid = scan->xs_batch->heaptids[scan->xs_batch->currIndex];
+	if (scan->xs_want_itup)
+	{
+		scan->xs_itup = scan->xs_batch->itups[scan->xs_batch->currIndex];
+		scan->xs_hitup = scan->xs_batch->htups[scan->xs_batch->currIndex];
+	}
+
+	scan->xs_recheck = scan->xs_batch->recheck[scan->xs_batch->currIndex];
+
+	/*
+	 * If there are order-by clauses, point to the appropriate chunk in the
+	 * arrays.
+	 */
+	if (scan->numberOfOrderBys > 0)
+	{
+		int			idx = scan->numberOfOrderBys * scan->xs_batch->currIndex;
+
+		scan->xs_orderbyvals = &scan->xs_batch->orderbyvals[idx];
+		scan->xs_orderbynulls = &scan->xs_batch->orderbynulls[idx];
+	}
+
+	/* comprehensive checks of batching info */
+	AssertCheckBatchInfo(scan);
+
+	return &scan->xs_heaptid;
+}
+
+/* ----------------
+ *		index_batch_prefetch - prefetch pages for TIDs in current batch
+ *
+ * The prefetch distance is increased gradually, similar to what we do for
+ * bitmap heap scans. We start from distance 0 (no prefetch), and then in each
+ * iteration increment the distance up to prefetchMaximum.
+ *
+ * The prefetch distance is reset (to 0) only on rescans, not between batches.
+ *
+ * It's possible to provide an index_prefetch_callback callback, to affect
+ * which items need to be prefetched. With prefetch_callback=NULL, all
+ * items are prefetched. With the callback provided, the item is prefetched
+ * iff the callback and returns true.
+ *
+ * The "arg" argument is used to pass a state for the plan node invoking the
+ * function, and is then passed to the callback. This means the callback is
+ * specific to the plan state.
+ *
+ * XXX the prefetchMaximum depends on effective_io_concurrency, and also on
+ * tablespace options.
+ *
+ * XXX For accesses that change scan direction, we may do a lot of unnecessary
+ * prefetching (because we will re-issue prefetches for what we recently read).
+ * I'm not sure if there's a simple way to track what was already prefetched.
+ * Maybe we could count how far we got (in the forward direction), keep that
+ * as a watermark, and never prefetch again below it.
+ *
+ * XXX Maybe wrap this in ifdef USE_PREFETCH?
+ * ----------------
+ */
+static void
+index_batch_prefetch(IndexScanDesc scan, ScanDirection direction)
+{
+	int			prefetchStart,
+				prefetchEnd;
+
+	IndexPrefetchCallback	prefetch_callback = scan->xs_batch->prefetchCallback;
+	void *arg = scan->xs_batch->prefetchArgument;
+
+	if (ScanDirectionIsForward(direction))
+	{
+		/* Where should we start to prefetch? */
+		prefetchStart = Max(scan->xs_batch->currIndex,
+							scan->xs_batch->prefetchIndex);
+
+		/*
+		 * Where should we stop prefetching? this is the first item that we do
+		 * NOT prefetch, i.e. it can be the first item after the batch.
+		 */
+		prefetchEnd = Min((scan->xs_batch->currIndex + 1) + scan->xs_batch->prefetchTarget,
+						  scan->xs_batch->nheaptids);
+
+		/* FIXME should calculate in a way to make this unnecessary */
+		prefetchStart = Max(Min(prefetchStart, scan->xs_batch->nheaptids - 1), 0);
+		prefetchEnd = Max(Min(prefetchEnd, scan->xs_batch->nheaptids - 1), 0);
+
+		/* remember how far we prefetched / where to start the next prefetch */
+		scan->xs_batch->prefetchIndex = prefetchEnd;
+	}
+	else
+	{
+		/* Where should we start to prefetch? */
+		prefetchEnd = Min(scan->xs_batch->currIndex,
+						  scan->xs_batch->prefetchIndex);
+
+		/*
+		 * Where should we stop prefetching? this is the first item that we do
+		 * NOT prefetch, i.e. it can be the first item after the batch.
+		 */
+		prefetchStart = Max((scan->xs_batch->currIndex - 1) - scan->xs_batch->prefetchTarget,
+							-1);
+
+		/* FIXME should calculate in a way to make this unnecessary */
+		prefetchStart = Max(Min(prefetchStart, scan->xs_batch->nheaptids - 1), 0);
+		prefetchEnd = Max(Min(prefetchEnd, scan->xs_batch->nheaptids - 1), 0);
+
+		/* remember how far we prefetched / where to start the next prefetch */
+		scan->xs_batch->prefetchIndex = prefetchStart;
+	}
+
+	/*
+	 * It's possible we get inverted prefetch range after a restrpos() call,
+	 * because we intentionally don't reset the prefetchIndex - we don't want
+	 * to prefetch pages over and over in this case. We'll do nothing in that
+	 * case, except for the AssertCheckBatchInfo().
+	 *
+	 * FIXME I suspect this actually does not work correctly if we change the
+	 * direction, because the prefetchIndex will flip between two extremes
+	 * thanks to the Min/Max.
+	 */
+
+	/*
+	 * Increase the prefetch distance, but not beyond prefetchMaximum. We
+	 * intentionally do this after calculating start/end, so that we start
+	 * actually prefetching only after the first item.
+	 */
+	scan->xs_batch->prefetchTarget = Min(scan->xs_batch->prefetchTarget + 1,
+										 scan->xs_batch->prefetchMaximum);
+
+	/* comprehensive checks of batching info */
+	AssertCheckBatchInfo(scan);
+
+	/* finally, do the actual prefetching */
+	for (int i = prefetchStart; i < prefetchEnd; i++)
+	{
+		/* skip block if the provided callback says so */
+		if (prefetch_callback && !prefetch_callback(scan, arg, i))
+			continue;
+
+		PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
+					   ItemPointerGetBlockNumber(&scan->xs_batch->heaptids[i]));
+	}
+}
+
+/*
+ * index_batch_init
+ *		Initialize various fields / arrays needed by batching.
+ *
+ * FIXME This is a bit ad-hoc hodge podge, due to how I was adding more and
+ * more pieces. Some of the fields may be not quite necessary, needs cleanup.
+ */
+static void
+index_batch_init(IndexScanDesc scan)
+{
+	/* init batching info, but only if batch supported */
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+
+	scan->xs_batch = palloc0(sizeof(IndexScanBatchData));
+
+	/*
+	 * Set some reasonable batch size defaults.
+	 *
+	 * XXX Maybe should depend on prefetch distance, or something like that?
+	 * The initSize will affect how far ahead we can prefetch.
+	 */
+	scan->xs_batch->maxSize = 64;
+	scan->xs_batch->initSize = 8;
+	scan->xs_batch->currSize = scan->xs_batch->initSize;
+
+	/* initialize prefetching info */
+	scan->xs_batch->prefetchMaximum =
+		get_tablespace_io_concurrency(scan->heapRelation->rd_rel->reltablespace);
+	scan->xs_batch->prefetchTarget = 0;
+	scan->xs_batch->prefetchIndex = 0;
+
+	/* */
+	scan->xs_batch->currIndex = -1;
+
+	/* Preallocate the largest allowed array of TIDs. */
+	scan->xs_batch->nheaptids = 0;
+	scan->xs_batch->heaptids = palloc0(sizeof(ItemPointerData) * scan->xs_batch->maxSize);
+
+	/*
+	 * XXX We can't check scan->xs_want_itup, because that's set only after
+	 * the scan is initialized (and we initialize in beginscan). Maybe we
+	 * could (or should) allocate lazily.
+	 */
+	scan->xs_batch->itups = palloc(sizeof(IndexTuple) * scan->xs_batch->maxSize);
+	scan->xs_batch->htups = palloc(sizeof(HeapTuple) * scan->xs_batch->maxSize);
+
+	scan->xs_batch->recheck = palloc(sizeof(bool) * scan->xs_batch->maxSize);
+
+	/*
+	 * XXX Maybe use a more compact bitmap? We need just one bit per element,
+	 * not a bool. This is easier / more convenient to manipulate, though.
+	 *
+	 * XXX Maybe should allow more items thant the max batch size?
+	 */
+	scan->xs_batch->nKilledItems = 0;
+	scan->xs_batch->killedItems = (int *) palloc0(sizeof(int) * scan->xs_batch->maxSize);
+
+	/*
+	 * XXX Maybe allocate only when actually needed? Also, shouldn't we have a
+	 * memory context for the private data?
+	 */
+	scan->xs_batch->privateData = (Datum *) palloc0(sizeof(Datum) * scan->xs_batch->maxSize);
+
+	if (scan->numberOfOrderBys > 0)
+	{
+		int			cnt = (scan->xs_batch->maxSize * scan->numberOfOrderBys);
+
+		scan->xs_batch->orderbyvals = (Datum *) palloc0(sizeof(Datum) * cnt);
+		scan->xs_batch->orderbynulls = (bool *) palloc0(sizeof(Datum) * cnt);
+	}
+	else
+	{
+		scan->xs_batch->orderbyvals = NULL;
+		scan->xs_batch->orderbynulls = NULL;
+	}
+
+	scan->xs_batch->ctx = AllocSetContextCreate(CurrentMemoryContext,
+												"indexscan batch context",
+												ALLOCSET_DEFAULT_SIZES);
+
+	/* comprehensive checks */
+	AssertCheckBatchInfo(scan);
+}
+
+/*
+ * index_batch_reset
+ *		Reset the batch before reading the next chunk of data.
+ *
+ * FIXME Another bit in need of cleanup. The currIndex default (-1) is not quite
+ * correct, because for backwards scans is wrong.
+ */
+static void
+index_batch_reset(IndexScanDesc scan)
+{
+	/* bail out if batching not enabled */
+	if (!scan->xs_batch)
+		return;
+
+	scan->xs_batch->nheaptids = 0;
+	scan->xs_batch->prefetchIndex = 0;
+	scan->xs_batch->currIndex = -1;
+}
+
+/*
+ * index_batch_add
+ *		Add an item to the batch.
+ *
+ * The item is always a TID, and then also IndexTuple if requested (for IOS).
+ * Items are always added from the beginning (index 0).
+ *
+ * Returns true when adding the item was successful, or false when the batch
+ * is full (and the item should be added to the next batch).
+ */
+bool
+index_batch_add(IndexScanDesc scan, ItemPointerData tid, bool recheck,
+				IndexTuple itup, HeapTuple htup)
+{
+	/* comprehensive checks on the batch info */
+	AssertCheckBatchInfo(scan);
+
+	/* don't add TIDs beyond the current batch size */
+	if (INDEX_BATCH_IS_FULL(scan))
+		return false;
+
+	/*
+	 * There must be space for at least one entry.
+	 *
+	 * XXX Seems redundant with the earlier INDEX_BATCH_IS_FULL check.
+	 */
+	Assert(scan->xs_batch->nheaptids < scan->xs_batch->currSize);
+	Assert(scan->xs_batch->nheaptids >= 0);
+
+	scan->xs_batch->heaptids[scan->xs_batch->nheaptids] = tid;
+	scan->xs_batch->privateData[scan->xs_batch->nheaptids] = (Datum) 0;
+
+	if (scan->xs_want_itup)
+	{
+		scan->xs_batch->itups[scan->xs_batch->nheaptids] = itup;
+		scan->xs_batch->htups[scan->xs_batch->nheaptids] = htup;
+	}
+
+	scan->xs_batch->recheck[scan->xs_batch->nheaptids] = recheck;
+
+	if (scan->numberOfOrderBys > 0)
+	{
+		int			idx = scan->xs_batch->nheaptids * scan->numberOfOrderBys;
+
+		memcpy(&scan->xs_batch->orderbyvals[idx], scan->xs_orderbyvals, sizeof(Datum) * scan->numberOfOrderBys);
+		memcpy(&scan->xs_batch->orderbynulls[idx], scan->xs_orderbynulls, sizeof(bool) * scan->numberOfOrderBys);
+	}
+
+	scan->xs_batch->nheaptids++;
+
+	/* comprehensive checks on the batch info */
+	AssertCheckBatchInfo(scan);
+
+	return true;
+}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f9a2fac79e4..742a963bc29 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -809,7 +809,12 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+
+	/*
+	 * XXX Does not seem useful to do prefetching for checks of constraints.
+	 * We would probably need just the first item anyway.
+	 */
+	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, false);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
 	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 54025c9f150..6be3744361d 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -244,8 +244,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	/* Build scan key. */
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
-	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+	/*
+	 * Start an index scan.
+	 *
+	 * XXX No prefetching for replication identity. We expect to find just one
+	 * row, so prefetching is pointless.
+	 */
+	scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, false);
 
 retry:
 	found = false;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 612c6738950..c030c0df6fe 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -49,7 +49,12 @@
 static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
 static void StoreIndexTuple(IndexOnlyScanState *node, TupleTableSlot *slot,
 							IndexTuple itup, TupleDesc itupdesc);
+static bool ios_prefetch_block(IndexScanDesc scan, void *data, int index);
 
+/* values stored in ios_prefetch_block in the batch cache */
+#define		IOS_UNKNOWN_VISIBILITY		0	/* XXX default value */
+#define		IOS_ALL_VISIBLE				1
+#define		IOS_NOT_ALL_VISIBLE			2
 
 /* ----------------------------------------------------------------
  *		IndexOnlyNext
@@ -93,15 +98,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
 								   node->ioss_RelationDesc,
 								   estate->es_snapshot,
 								   node->ioss_NumScanKeys,
-								   node->ioss_NumOrderByKeys);
+								   node->ioss_NumOrderByKeys,
+								   node->ioss_CanBatch);
 
 		node->ioss_ScanDesc = scandesc;
 
-
 		/* Set it up for index-only scan */
 		node->ioss_ScanDesc->xs_want_itup = true;
 		node->ioss_VMBuffer = InvalidBuffer;
 
+		/* Also set the prefetch callback info, if baching enabled. */
+		if (scandesc->xs_batch != NULL)
+		{
+			scandesc->xs_batch->prefetchCallback = ios_prefetch_block;
+			scandesc->xs_batch->prefetchArgument = (void *) node;
+		}
+
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
 		 * pass the scankeys to the index AM.
@@ -119,10 +131,38 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	 */
 	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
 	{
+		bool		all_visible;
 		bool		tuple_from_heap = false;
 
 		CHECK_FOR_INTERRUPTS();
 
+		/* */
+		if (scandesc->xs_batch == NULL)
+		{
+			all_visible = VM_ALL_VISIBLE(scandesc->heapRelation,
+						  ItemPointerGetBlockNumber(tid),
+						  &node->ioss_VMBuffer);
+		}
+		else
+		{
+			/* Is the index of the current item valid for the batch? */
+			Assert((scandesc->xs_batch->currIndex >= 0) &&
+				   (scandesc->xs_batch->currIndex < scandesc->xs_batch->nheaptids));
+
+			/*
+			 * Reuse the previously determined page visibility info, or
+			 * calculate it now. If we decided not to prefetch the block, the
+			 * page has to be all-visible.
+			 *
+			 * XXX It's a bir weird we use the visibility to decide if we
+			 * should skip prefetching the block, and then deduce the
+			 * visibility from that. Maybe we could/should have a more direct
+			 * way?
+			 */
+			all_visible = !ios_prefetch_block(scandesc, node,
+											  scandesc->xs_batch->currIndex);
+		}
+
 		/*
 		 * We can skip the heap fetch if the TID references a heap page on
 		 * which all tuples are known visible to everybody.  In any case,
@@ -157,16 +197,14 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * It's worth going through this complexity to avoid needing to lock
 		 * the VM buffer, which could cause significant contention.
 		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
-							ItemPointerGetBlockNumber(tid),
-							&node->ioss_VMBuffer))
+		if (!all_visible)
 		{
 			/*
 			 * Rats, we have to visit the heap to check visibility.
 			 */
 			InstrCountTuples2(node, 1);
 			if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
-				continue;		/* no visible tuple, try next index entry */
+				continue;	/* no visible tuple, try next index entry */
 
 			ExecClearTuple(node->ioss_TableSlot);
 
@@ -574,6 +612,16 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	indexstate->recheckqual =
 		ExecInitQual(node->recheckqual, (PlanState *) indexstate);
 
+	/*
+	 * Can't do batching (and thus prefetching) when the plan requires mark
+	 * and restore. There's an issue with translating the mark/restore
+	 * positions between the batch in scan descriptor and the original
+	 * position recognized in the index AM.
+	 *
+	 * XXX Hopefully just a temporary limitation?
+	 */
+	indexstate->ioss_CanBatch = !(eflags & EXEC_FLAG_MARK);
+
 	/*
 	 * If we are just doing EXPLAIN (ie, aren't going to run the plan), stop
 	 * here.  This allows an index-advisor plugin to EXPLAIN a plan containing
@@ -735,12 +783,20 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 								  estate->es_snapshot,
 								  piscan);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+
+	/*
+	 * XXX do we actually want prefetching for parallel index scans? Maybe
+	 * not, but then we need to be careful not to call index_batch_getnext_tid
+	 * (which now can happen, because we'll call IndexOnlyNext even for
+	 * parallel plans).
+	 */
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->ioss_RelationDesc,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 node->ioss_CanBatch);
 	node->ioss_ScanDesc->xs_want_itup = true;
 	node->ioss_VMBuffer = InvalidBuffer;
 
@@ -780,12 +836,15 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 	ParallelIndexScanDesc piscan;
 
 	piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+
+	/* XXX do we actually want prefetching for parallel index scans? */
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->ioss_RelationDesc,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 node->ioss_CanBatch);
 	node->ioss_ScanDesc->xs_want_itup = true;
 
 	/*
@@ -797,3 +856,34 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 					 node->ioss_ScanKeys, node->ioss_NumScanKeys,
 					 node->ioss_OrderByKeys, node->ioss_NumOrderByKeys);
 }
+
+/*
+ * ios_prefetch_block
+ *		Callback to only prefetch blocks that are not all-visible.
+ *
+ * We don't want to inspect the visibility map repeatedly, so the result of
+ * VM_ALL_VISIBLE is stored in the batch private data. The values are set
+ * to 0 by default, so we use two constants to remember if all-visible or
+ * not all-visible.
+ */
+static bool
+ios_prefetch_block(IndexScanDesc scan, void *arg, int index)
+{
+	IndexOnlyScanState *node = (IndexOnlyScanState *) arg;
+
+	if (scan->xs_batch->privateData[index] == IOS_UNKNOWN_VISIBILITY)
+	{
+		bool		all_visible;
+		ItemPointer tid = &scan->xs_batch->heaptids[index];
+
+		all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+									 ItemPointerGetBlockNumber(tid),
+									 &node->ioss_VMBuffer);
+
+		scan->xs_batch->privateData[index]
+			= all_visible ? IOS_ALL_VISIBLE : IOS_NOT_ALL_VISIBLE;
+	}
+
+	/* prefetch only blocks that are not all-visible */
+	return (scan->xs_batch->privateData[index] == IOS_NOT_ALL_VISIBLE);
+}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 8000feff4c9..8bbd3606566 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -110,7 +110,8 @@ IndexNext(IndexScanState *node)
 								   node->iss_RelationDesc,
 								   estate->es_snapshot,
 								   node->iss_NumScanKeys,
-								   node->iss_NumOrderByKeys);
+								   node->iss_NumOrderByKeys,
+								   node->iss_CanBatch);
 
 		node->iss_ScanDesc = scandesc;
 
@@ -200,12 +201,15 @@ IndexNextWithReorder(IndexScanState *node)
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
+		 *
+		 * XXX Should we use batching here? And can we with reordering?
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->iss_RelationDesc,
 								   estate->es_snapshot,
 								   node->iss_NumScanKeys,
-								   node->iss_NumOrderByKeys);
+								   node->iss_NumOrderByKeys,
+								   false);
 
 		node->iss_ScanDesc = scandesc;
 
@@ -942,6 +946,20 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	indexstate->indexorderbyorig =
 		ExecInitExprList(node->indexorderbyorig, (PlanState *) indexstate);
 
+	/*
+	 * Can't do batching (and thus prefetching) when the plan requires mark
+	 * and restore. There's an issue with translating the mark/restore
+	 * positions between the batch in scan descriptor and the original
+	 * position recognized in the index AM.
+	 *
+	 * XXX Hopefully just a temporary limitation?
+	 *
+	 * XXX Maybe this should check if the index AM supports batching, or even
+	 * call something like "amcanbatch" (does not exist yet). Or check the
+	 * enable_indexscan_batching GUC? Now we check the GUC in index_beginscan.
+	 */
+	indexstate->iss_CanBatch = !(eflags & EXEC_FLAG_MARK);
+
 	/*
 	 * If we are just doing EXPLAIN (ie, aren't going to run the plan), stop
 	 * here.  This allows an index-advisor plugin to EXPLAIN a plan containing
@@ -1670,12 +1688,17 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 								  estate->es_snapshot,
 								  piscan);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+
+	/*
+	 * XXX do we actually want prefetching for parallel index scans?
+	 */
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->iss_RelationDesc,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 node->iss_CanBatch);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1713,12 +1736,17 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 	ParallelIndexScanDesc piscan;
 
 	piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+
+	/*
+	 * XXX do we actually want prefetching for parallel index scans?
+	 */
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->iss_RelationDesc,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 node->iss_CanBatch);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 03d7fb5f482..2c9c829c288 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6340,9 +6340,14 @@ get_actual_variable_endpoint(Relation heapRel,
 	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
 							  GlobalVisTestFor(heapRel));
 
+	/*
+	 * XXX I'm not sure about batching/prefetching here. In most cases we
+	 * expect to find the endpoints immediately, but sometimes we have a lot
+	 * of dead tuples - and then prefetching might help.
+	 */
 	index_scan = index_beginscan(heapRel, indexRel,
 								 &SnapshotNonVacuumable,
-								 1, 0);
+								 1, 0, false);
 	/* Set it up for index-only scan */
 	index_scan->xs_want_itup = true;
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 686309db58b..62a7f63a613 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -789,6 +789,16 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_indexscan_batching", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of index-scan batching."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&enable_indexscan_batching,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_indexonlyscan", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of index-only-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a2..b2509337755 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -398,6 +398,7 @@
 #enable_hashjoin = on
 #enable_incremental_sort = on
 #enable_indexscan = on
+#enable_indexscan_batching = on
 #enable_indexonlyscan = on
 #enable_material = on
 #enable_memoize = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index c51de742ea0..966a25d9ba3 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -184,6 +184,10 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 									 ScanDirection direction);
 
+/* next batch of valid tuples */
+typedef bool (*amgetbatch_function) (IndexScanDesc scan,
+									 ScanDirection direction);
+
 /* fetch all valid tuples */
 typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 									   TIDBitmap *tbm);
@@ -288,6 +292,7 @@ typedef struct IndexAmRoutine
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
 	amgettuple_function amgettuple; /* can be NULL */
+	amgetbatch_function amgetbatch; /* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
 	ammarkpos_function ammarkpos;	/* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c25f5d11b53..1d9a0868a9b 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -14,6 +14,7 @@
 #ifndef GENAM_H
 #define GENAM_H
 
+#include "access/itup.h"
 #include "access/sdir.h"
 #include "access/skey.h"
 #include "nodes/tidbitmap.h"
@@ -88,6 +89,7 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 
 /* struct definitions appear in relscan.h */
 typedef struct IndexScanDescData *IndexScanDesc;
+typedef struct IndexScanBatchData *IndexScanBatch;
 typedef struct SysScanDescData *SysScanDesc;
 
 typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
@@ -132,6 +134,8 @@ typedef struct IndexOrderByDistance
  * generalized index_ interface routines (in indexam.c)
  */
 
+extern PGDLLIMPORT bool enable_indexscan_batching;
+
 /*
  * IndexScanIsValid
  *		True iff the index scan is valid.
@@ -155,7 +159,8 @@ extern void index_insert_cleanup(Relation indexRelation,
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
 									 Snapshot snapshot,
-									 int nkeys, int norderbys);
+									 int nkeys, int norderbys,
+									 bool enable_batching);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 											Snapshot snapshot,
 											int nkeys);
@@ -173,7 +178,8 @@ extern void index_parallelscan_initialize(Relation heapRelation,
 extern void index_parallelrescan(IndexScanDesc scan);
 extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel, int nkeys, int norderbys,
-											  ParallelIndexScanDesc pscan);
+											  ParallelIndexScanDesc pscan,
+											  bool enable_batching);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
 struct TupleTableSlot;
@@ -182,6 +188,10 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 							   struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
+/* index batching/prefetching */
+extern bool index_batch_add(IndexScanDesc scan, ItemPointerData tid, bool recheck,
+							IndexTuple itup, HeapTuple htup);
+
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
 												IndexBulkDeleteResult *istat,
 												IndexBulkDeleteCallback callback,
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 114a85dc47c..3d767e14356 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -107,6 +107,9 @@ typedef struct IndexFetchTableData
 	Relation	rel;
 } IndexFetchTableData;
 
+/* Forward declaration, the prefetch callback needs IndexScanDescData. */
+typedef struct IndexScanBatchData IndexScanBatchData;
+
 /*
  * We use the same IndexScanDescData structure for both amgettuple-based
  * and amgetbitmap-based index scans.  Some fields are only relevant in
@@ -152,6 +155,9 @@ typedef struct IndexScanDescData
 
 	bool		xs_recheck;		/* T means scan keys must be rechecked */
 
+	/* Information used by batched index scans. */
+	IndexScanBatchData *xs_batch;
+
 	/*
 	 * When fetching with an ordering operator, the values of the ORDER BY
 	 * expressions of the last returned tuple, according to the index.  If
@@ -167,6 +173,64 @@ typedef struct IndexScanDescData
 	struct ParallelIndexScanDescData *parallel_scan;
 }			IndexScanDescData;
 
+/*
+ * Typedef for callback function to determine if an item in index scan should
+ * be prefetched.
+ */
+typedef bool (*IndexPrefetchCallback) (IndexScanDescData *scan,
+									   void *arg, int index);
+
+/*
+ * Data about the current TID batch returned by the index AM.
+ *
+ * XXX Maybe this should be a separate struct instead, and the scan
+ * descriptor would have only a pointer, initialized only when the
+ * batching is actually used?
+ *
+ * XXX It's not quite clear which part of this is managed by indexam and
+ * what's up to the actual index AM implementation. Needs some clearer
+ * boundaries.
+ *
+ * XXX Should we have a pointer for optional state managed by the AM? Some
+ * custom AMs may need more per-batch information, not just the fields we
+ * have here.
+ */
+typedef struct IndexScanBatchData
+{
+	/* batch size - maximum, initial, current (with ramp up) */
+	int			maxSize;
+	int			initSize;
+	int			currSize;
+
+	/* memory context for per-batch data */
+	MemoryContext ctx;
+
+	/* batch prefetching */
+	int			prefetchTarget; /* current prefetch distance */
+	int			prefetchMaximum;	/* maximum prefetch distance */
+	int			prefetchIndex;	/* next item to prefetch */
+
+	IndexPrefetchCallback	prefetchCallback;
+	void				   *prefetchArgument;
+
+	/* batch contents (TIDs, index tuples, kill bitmap, ...) */
+	int			currIndex;		/* index of the current item */
+	int			nheaptids;		/* number of TIDs in the batch */
+	ItemPointerData *heaptids;	/* TIDs in the batch */
+	IndexTuple *itups;			/* IndexTuples, if requested */
+	HeapTuple  *htups;			/* HeapTuples, if requested */
+	bool	   *recheck;		/* recheck flags */
+	Datum	   *privateData;	/* private data for batch */
+
+	/* xs_orderbyvals / xs_orderbynulls */
+	Datum	   *orderbyvals;
+	bool	   *orderbynulls;
+
+	/* list of killed items */
+	int			nKilledItems;	/* number of killedItems elements */
+	int		   *killedItems;	/* list of indexes to kill */
+} IndexScanBatchData;
+
 /* Generic structure for parallel scans */
 typedef struct ParallelIndexScanDescData
 {
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index aab59d681cf..aa8bc96cd37 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1650,6 +1650,7 @@ typedef struct
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
  *		PscanLen		   size of parallel index scan descriptor
+ *		CanBatch		   batching (and prefetching) enabled
  * ----------------
  */
 typedef struct IndexScanState
@@ -1677,6 +1678,10 @@ typedef struct IndexScanState
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
 	Size		iss_PscanLen;
+
+	/* batching/prefetching enabled? */
+	bool		iss_CanBatch;
+
 } IndexScanState;
 
 /* ----------------
@@ -1698,6 +1703,7 @@ typedef struct IndexScanState
  *		PscanLen		   size of parallel index-only scan descriptor
  *		NameCStringAttNums attnums of name typed columns to pad to NAMEDATALEN
  *		NameCStringCount   number of elements in the NameCStringAttNums array
+ *		CanBatch		   batching (and prefetching) enabled
  * ----------------
  */
 typedef struct IndexOnlyScanState
@@ -1719,6 +1725,7 @@ typedef struct IndexOnlyScanState
 	Size		ioss_PscanLen;
 	AttrNumber *ioss_NameCStringAttNums;
 	int			ioss_NameCStringCount;
+	bool		ioss_CanBatch;
 } IndexOnlyScanState;
 
 /* ----------------
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index fad7fc3a7e0..14b38ed4d46 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -157,6 +157,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_incremental_sort        | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
+ enable_indexscan_batching      | on
  enable_material                | on
  enable_memoize                 | on
  enable_mergejoin               | on
@@ -170,7 +171,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(22 rows)
+(23 rows)
 
 -- There are always wait event descriptions for various types.  InjectionPoint
 -- may be present or absent, depending on history since last postmaster start.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5fabb127d7e..3567a1c315f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1219,6 +1219,7 @@ IndexOrderByDistance
 IndexPath
 IndexRuntimeKeyInfo
 IndexScan
+IndexScanBatchData
 IndexScanDesc
 IndexScanState
 IndexStateFlagsAction
@@ -3284,6 +3285,7 @@ amendscan_function
 amestimateparallelscan_function
 amgetbitmap_function
 amgettuple_function
+amgetbatch_function
 aminitparallelscan_function
 aminsert_function
 aminsertcleanup_function
-- 
2.46.1



  [text/x-patch] v20240930-0002-WIP-batching-for-nbtree-indexes.patch (15.0K, 3-v20240930-0002-WIP-batching-for-nbtree-indexes.patch)
  download | inline diff:
From b67a0ec75b66780e0abb6ef78ff1cc63cf755f16 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 30 Sep 2024 22:48:39 +0200
Subject: [PATCH v20240930 2/6] WIP: batching for nbtree indexes

Adds batching/prefetching for btree indexes. Returns only batches from a
single leaf page. Does not support mark/restore yet.
---
 src/backend/access/nbtree/nbtree.c    |  65 ++++++
 src/backend/access/nbtree/nbtsearch.c | 280 ++++++++++++++++++++++++++
 src/include/access/nbtree.h           |  15 ++
 src/tools/pgindent/typedefs.list      |   1 +
 4 files changed, 361 insertions(+)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 56e502c4fc9..e67b3938122 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -142,6 +142,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->ambeginscan = btbeginscan;
 	amroutine->amrescan = btrescan;
 	amroutine->amgettuple = btgettuple;
+	amroutine->amgetbatch = btgetbatch;
 	amroutine->amgetbitmap = btgetbitmap;
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
@@ -260,6 +261,53 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 	return res;
 }
 
+/*
+ *	btgetbatch() -- Get the next batch of tuples in the scan.
+ *
+ * XXX Pretty much like btgettuple(), but for batches of tuples.
+ */
+bool
+btgetbatch(IndexScanDesc scan, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	bool		res;
+
+	/* btree indexes are never lossy */
+	scan->xs_recheck = false;
+
+	/* Each loop iteration performs another primitive index scan */
+	do
+	{
+		/*
+		 * If we've already initialized this scan, we can just advance it in
+		 * the appropriate direction.  If we haven't done so yet, we call
+		 * _bt_first() to get the first item in the scan.
+		 */
+		if (!BTScanPosIsValid(so->currPos))
+			res = _bt_first_batch(scan, dir);
+		else
+		{
+			/*
+			 * Check to see if we should kill tuples from the previous batch.
+			 */
+			_bt_kill_batch(scan);
+
+			/*
+			 * Now continue the scan.
+			 */
+			res = _bt_next_batch(scan, dir);
+		}
+
+		/* If we have a tuple, return it ... */
+		if (res)
+			break;
+
+		/* ... otherwise see if we need another primitive index scan */
+	} while (so->numArrayKeys && _bt_start_prim_scan(scan, dir));
+
+	return res;
+}
+
 /*
  * btgetbitmap() -- gets all matching tuples, and adds them to a bitmap
  */
@@ -339,6 +387,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->killedItems = NULL;		/* until needed */
 	so->numKilled = 0;
 
+	/* batch range, initially empty */
+	so->batch.firstIndex = -1;
+	so->batch.lastIndex = -1;
+
 	/*
 	 * We don't know yet whether the scan will be index-only, so we do not
 	 * allocate the tuple workspace arrays until btrescan.  However, we set up
@@ -365,6 +417,9 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	/* we aren't holding any read locks, but gotta drop the pins */
 	if (BTScanPosIsValid(so->currPos))
 	{
+		/* Transfer killed items from the batch to the regular array. */
+		_bt_kill_batch(scan);
+
 		/* Before leaving current page, deal with any killed items */
 		if (so->numKilled > 0)
 			_bt_killitems(scan);
@@ -407,6 +462,10 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
 	so->numberOfKeys = 0;		/* until _bt_preprocess_keys sets it */
 	so->numArrayKeys = 0;		/* ditto */
+
+	/* Reset the batch (even if not batched scan) */
+	so->batch.firstIndex = -1;
+	so->batch.lastIndex = -1;
 }
 
 /*
@@ -420,6 +479,9 @@ btendscan(IndexScanDesc scan)
 	/* we aren't holding any read locks, but gotta drop the pins */
 	if (BTScanPosIsValid(so->currPos))
 	{
+		/* Transfer killed items from the batch to the regular array. */
+		_bt_kill_batch(scan);
+
 		/* Before leaving current page, deal with any killed items */
 		if (so->numKilled > 0)
 			_bt_killitems(scan);
@@ -500,6 +562,9 @@ btrestrpos(IndexScanDesc scan)
 		 */
 		if (BTScanPosIsValid(so->currPos))
 		{
+			/* Transfer killed items from the batch to the regular array. */
+			_bt_kill_batch(scan);
+
 			/* Before leaving current page, deal with any killed items */
 			if (so->numKilled > 0)
 				_bt_killitems(scan);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fff7c89eadb..66ca064b7e8 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1531,6 +1531,283 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
 	return true;
 }
 
+
+static void
+AssertCheckBTBatchInfo(BTScanOpaque so)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* should be valid items (with respect to the current leaf page) */
+	Assert(so->currPos.firstItem <= so->batch.firstIndex);
+	Assert(so->batch.firstIndex <= so->batch.lastIndex);
+	Assert(so->batch.lastIndex <= so->currPos.lastItem);
+#endif
+}
+
+/*
+ * _bt_copy_batch
+ *		Copy a section of the leaf page into the batch.
+ */
+static void
+_bt_copy_batch(IndexScanDesc scan, ScanDirection dir, BTScanOpaque so,
+			   int start, int end)
+{
+	/*
+	 * We're reading the first batch, and there should always be at least one
+	 * item (otherwise _bt_first would return false). So we should never get
+	 * into situation with empty start/end range. In the worst case, there is
+	 * just a single item, in which case (start == end).
+	 */
+	Assert(start <= end);
+
+	/* The range of items should fit into the current batch size. */
+	Assert((end - start + 1) <= scan->xs_batch->currSize);
+
+	so->batch.firstIndex = start;
+	so->batch.lastIndex = end;
+
+	AssertCheckBTBatchInfo(so);
+
+	/*
+	 * Walk through the range of index tuples, copy them into the batch. If
+	 * requested, set the index tuple too.
+	 *
+	 * We don't know if the batch is full already - we just try to add it, and
+	 * bail out if it fails.
+	 *
+	 * FIXME This seems wrong, actually. We use currSize when calculating the
+	 * start/end range, so the add should always succeed.
+	 */
+	while (start <= end)
+	{
+		BTScanPosItem *currItem = &so->currPos.items[start];
+		IndexTuple	itup = NULL;
+
+		if (scan->xs_want_itup)
+			itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+		/* try to add it to batch, if there's space */
+		if (!index_batch_add(scan, currItem->heapTid, scan->xs_recheck, itup, NULL))
+			break;
+
+		start++;
+	}
+
+	/*
+	 * set the starting point
+	 *
+	 * XXX might be better done in indexam.c
+	 */
+	if (ScanDirectionIsForward(dir))
+		scan->xs_batch->currIndex = -1;
+	else
+		scan->xs_batch->currIndex = scan->xs_batch->nheaptids;
+
+	/* shouldn't be possible to end here with an empty batch */
+	Assert(scan->xs_batch->nheaptids > 0);
+}
+
+/*
+ *	_bt_first_batch() -- Find the first batch in a scan.
+ *
+ * A batch variant of _bt_first(). Most of the comments for that function
+ * apply here too.
+ *
+ * XXX This only populates the batch, it does not set any other fields like
+ * scan->xs_heaptid or scan->xs_itup. That happens in getnext_tid() calls.
+ *
+ * XXX I'm not sure it works to mix batched and non-batches calls, e.g. get
+ * a TID and then a batch of TIDs. It probably should work as long as we
+ * update itemIndex correctly, but we need to be careful about killed items
+ * (right now the two places use different ways to communicate which items
+ * should be killed).
+ */
+bool
+_bt_first_batch(IndexScanDesc scan, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	Assert(scan->xs_batch->nheaptids == 0);
+
+	/*
+	 * Mark the batch as empty.
+	 *
+	 * This might seems a bit strange, because surely the batch should be
+	 * empty before reading the first batch (we reset those indexes both in
+	 * btbeginscan and btrescan). So why not an assert? We can get here in
+	 * different ways too - not just after beginscan/rescan, but also when
+	 * iterating over ScalarArrayOps - in which case we'll see the last batch
+	 * of the preceding scan.
+	 */
+	so->batch.firstIndex = -1;
+	so->batch.lastIndex = -1;
+
+	/* we haven't visited any leaf pages yet, so proceed to reading one */
+	if (_bt_first(scan, dir))
+	{
+		/* range of the leaf to copy into the batch */
+		int			start,
+					end;
+
+		/* determine which part of the leaf page to extract */
+		if (ScanDirectionIsForward(dir))
+		{
+			start = so->currPos.firstItem;
+			end = Min(start + (scan->xs_batch->currSize - 1), so->currPos.lastItem);
+			so->currPos.itemIndex = (end + 1);
+		}
+		else
+		{
+			end = so->currPos.lastItem;
+			start = Max(end - (scan->xs_batch->currSize - 1),
+						so->currPos.firstItem);
+			so->currPos.itemIndex = (start - 1);
+		}
+
+		/*
+		 * Copy the selected range of items into the batch, set the batch
+		 * current index properly (before first / after last item, depending
+		 * on scan direction.
+		 */
+		_bt_copy_batch(scan, dir, so, start, end);
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ *	_bt_next_batch() -- Get the next batch of items in a scan.
+ *
+ * A batch variant of _bt_next(). Most of the comments for that function
+ * apply here too.
+ *
+ * We should only get here only when the current batch has no more items
+ * in the given direction. We don't get here with empty batches, that's
+ * handled by _bt_fist_batch().
+ *
+ * XXX See also the comments at _bt_first_batch() about returning a single
+ * batch for the page, etc.
+ */
+bool
+_bt_next_batch(IndexScanDesc scan, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	int			start,
+				end;
+
+	Assert(scan->xs_batch->nheaptids == 0);
+
+	AssertCheckBTBatchInfo(so);
+
+	/*
+	 * Check if we still have some items on the current leaf page. If yes,
+	 * load them into a batch and return.
+	 *
+	 * XXX try combining that with the next block, the inner while loop is
+	 * exactly the same.
+	 */
+	if (ScanDirectionIsForward(dir))
+	{
+		start = so->batch.lastIndex + 1;
+		end = Min(start + (scan->xs_batch->currSize - 1), so->currPos.lastItem);
+		so->currPos.itemIndex = (end + 1);
+	}
+	else
+	{
+		end = so->batch.firstIndex - 1;
+		start = Max(end - (scan->xs_batch->currSize - 1),
+					so->currPos.firstItem);
+		so->currPos.itemIndex = (start - 1);
+	}
+
+	/*
+	 * We have more items on the current leaf page.
+	 */
+	if (start <= end)
+	{
+		_bt_copy_batch(scan, dir, so, start, end);
+		return true;
+	}
+
+	/*
+	 * We've consumed all items from the current leaf page, so try reading the
+	 * next one, and process it.
+	 */
+	if (_bt_next(scan, dir))
+	{
+		/*
+		 * Check if we still have some items on the current leaf page. If yes,
+		 * load them into a batch and return.
+		 *
+		 * XXX try combining that with the next block, the inner while loop is
+		 * exactly the same.
+		 */
+		if (ScanDirectionIsForward(dir))
+		{
+			start = so->currPos.firstItem;
+			end = Min(start + (scan->xs_batch->currSize - 1), so->currPos.lastItem);
+			so->currPos.itemIndex = (end + 1);
+		}
+		else
+		{
+			end = so->currPos.lastItem;
+			start = Max(end - (scan->xs_batch->currSize - 1),
+						so->currPos.firstItem);
+			so->currPos.itemIndex = (start - 1);
+		}
+
+		_bt_copy_batch(scan, dir, so, start, end);
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ *	_bt_kill_batch() -- remember the items-to-be-killed from the current batch
+ *
+ * We simply translate the bitmap into the "regular" killedItems array, and let
+ * that to drive which items are killed.
+ */
+void
+_bt_kill_batch(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	/* bail out if batching not enabled */
+	if (!scan->xs_batch)
+		return;
+
+	for (int i = 0; i < scan->xs_batch->nKilledItems; i++)
+	{
+		int			index = (so->batch.firstIndex + scan->xs_batch->killedItems[i]);
+
+		/* make sure we have a valid index (in the current leaf page) */
+		Assert((so->batch.firstIndex <= index) &&
+			   (index <= so->batch.lastIndex));
+
+		/*
+		 * Yes, remember it for later. (We'll deal with all such tuples at
+		 * once right before leaving the index page.)  The test for numKilled
+		 * overrun is not just paranoia: if the caller reverses direction in
+		 * the indexscan then the same item might get entered multiple times.
+		 * It's not worth trying to optimize that, so we don't detect it, but
+		 * instead just forget any excess entries.
+		 */
+		if (so->killedItems == NULL)
+			so->killedItems = (int *)
+				palloc(MaxTIDsPerBTreePage * sizeof(int));
+		if (so->numKilled < MaxTIDsPerBTreePage)
+			so->killedItems[so->numKilled++] = index;
+	}
+
+	/* now reset the number of killed items */
+	scan->xs_batch->nKilledItems = 0;
+}
+
 /*
  *	_bt_readpage() -- Load data from current index page into so->currPos
  *
@@ -2052,6 +2329,9 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 
 	Assert(BTScanPosIsValid(so->currPos));
 
+	/* Transfer killed items from the batch to the regular array. */
+	_bt_kill_batch(scan);
+
 	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
 		_bt_killitems(scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d64300fb973..df32382b311 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1037,6 +1037,14 @@ typedef struct BTArrayKeyInfo
 	Datum	   *elem_values;	/* array of num_elems Datums */
 } BTArrayKeyInfo;
 
+/* Information about the current batch (in batched index scans) */
+typedef struct BTBatchInfo
+{
+	/* Current range of items in a batch (if used). */
+	int			firstIndex;
+	int			lastIndex;
+} BTBatchInfo;
+
 typedef struct BTScanOpaqueData
 {
 	/* these fields are set by _bt_preprocess_keys(): */
@@ -1056,6 +1064,9 @@ typedef struct BTScanOpaqueData
 	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
 
+	/* info about current batch */
+	BTBatchInfo batch;
+
 	/*
 	 * If we are doing an index-only scan, these are the tuple storage
 	 * workspaces for the currPos and markPos respectively.  Each is of size
@@ -1172,6 +1183,7 @@ extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
 extern Size btestimateparallelscan(int nkeys, int norderbys);
 extern void btinitparallelscan(void *target);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
+extern bool btgetbatch(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					 ScanKey orderbys, int norderbys);
@@ -1277,6 +1289,9 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_first_batch(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_next_batch(IndexScanDesc scan, ScanDirection dir);
+extern void _bt_kill_batch(IndexScanDesc scan);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3567a1c315f..be89be987a2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -188,6 +188,7 @@ BOOL
 BOOLEAN
 BOX
 BTArrayKeyInfo
+BTBatchInfo
 BTBuildState
 BTCycleId
 BTDedupInterval
-- 
2.46.1



  [text/x-patch] v20240930-0003-PoC-support-for-mark-restore-for-nbtree.patch (13.1K, 4-v20240930-0003-PoC-support-for-mark-restore-for-nbtree.patch)
  download | inline diff:
From 03f64b0a122da975598049715b0971f0f54a4365 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 30 Sep 2024 22:48:46 +0200
Subject: [PATCH v20240930 3/6] PoC: support for mark/restore for nbtree

Adds support for batching for mark/restore with btree indexes, which
makes prefetching work with mergejoins, for example.
---
 src/backend/access/index/indexam.c       |  40 ++++----
 src/backend/access/nbtree/nbtree.c       | 111 +++++++++++++++++++++++
 src/backend/access/nbtree/nbtsearch.c    |   2 +-
 src/backend/executor/nodeIndexonlyscan.c |  11 +--
 src/backend/executor/nodeIndexscan.c     |  11 +--
 src/include/access/nbtree.h              |   2 +
 src/include/access/relscan.h             |   6 ++
 7 files changed, 152 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 2849ab97cdf..b117853f8d0 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -497,16 +497,9 @@ index_restrpos(IndexScanDesc scan)
 	scan->indexRelation->rd_indam->amrestrpos(scan);
 
 	/*
-	 * Reset the batch, to make it look empty.
-	 *
-	 * Done after the amrescan() call, in case the AM needs some of the batch
-	 * info (e.g. to properly transfer the killed tuples).
-	 *
-	 * XXX This is a bit misleading, because index_batch_reset does not reset
-	 * the killed tuples. So if that's the only justification, we could have
-	 * done it before the call.
+	 * Don't reset the batch here - amrestrpos should have has already loaded
+	 * the new batch, so don't throw that away.
 	 */
-	index_batch_reset(scan);
 }
 
 /*
@@ -1299,12 +1292,9 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
  *
  * With batching, the index AM does not know the the "current" position on
  * the leaf page - we don't propagate this to the index AM while walking
- * items in the batch. To make ammarkpos() work, the index AM has to check
+ * items in the batch. To make ammarkpos() work, the index AM can check
  * the current position in the batch, and translate it to the proper page
  * position, using the private information (about items in the batch).
- *
- * XXX This needs more work, I don't quite like how the two layers interact,
- * it seems quite wrong to look at the batch info directly.
  */
 
 /*
@@ -1372,9 +1362,17 @@ AssertCheckBatchInfo(IndexScanDesc scan)
 		((scan)->xs_batch->nheaptids == ((scan)->xs_batch->currIndex + 1)) : \
 		((scan)->xs_batch->currIndex == 0))
 
-/* Does the batch items in the requested direction? */
+/*
+ * Does the batch items in the requested direction? The batch must be non-empty
+ * and we should not have reached the end of the batch (in the direction).
+ * Also, if we just restored the position after mark/restore, there should be
+ * at least one item to process (we won't advance  on the next call).
+ *
+ * XXX This is a bit confusing / ugly, probably should rethink how we track
+ * empty batches, and how we handle not advancing after a restore.
+ */
 #define INDEX_BATCH_HAS_ITEMS(scan, direction) \
-	(!INDEX_BATCH_IS_EMPTY(scan) && !INDEX_BATCH_IS_PROCESSED(scan, direction))
+	(!INDEX_BATCH_IS_EMPTY(scan) && (!INDEX_BATCH_IS_PROCESSED(scan, direction) || scan->xs_batch->restored))
 
 
 /* ----------------
@@ -1527,8 +1525,17 @@ index_batch_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	/*
 	 * Advance to the next batch item - we know it's not empty and there are
 	 * items to process, so this is valid.
+	 *
+	 * However, don't advance if this is the first getnext_tid() call after
+	 * amrestrpos(). That sets the position on the correct item, and advancing
+	 * here would skip it.
+	 *
+	 * XXX The "restored" flag is a bit weird. Can we do this without it? May
+	 * need to rethink when/how we advance the batch index. Not sure.
 	 */
-	if (ScanDirectionIsForward(direction))
+	if (scan->xs_batch->restored)
+		scan->xs_batch->restored = false;
+	else if (ScanDirectionIsForward(direction))
 		scan->xs_batch->currIndex++;
 	else
 		scan->xs_batch->currIndex--;
@@ -1784,6 +1791,7 @@ index_batch_reset(IndexScanDesc scan)
 	scan->xs_batch->nheaptids = 0;
 	scan->xs_batch->prefetchIndex = 0;
 	scan->xs_batch->currIndex = -1;
+	scan->xs_batch->restored = false;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e67b3938122..a328b2a7fe7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -518,6 +518,25 @@ btmarkpos(IndexScanDesc scan)
 	/* There may be an old mark with a pin (but no lock). */
 	BTScanPosUnpinIfPinned(so->markPos);
 
+	/*
+	 * With batched scans, we don't maintain the itemIndex when processing the
+	 * batch, so we need to calculate the current value.
+	 *
+	 * FIXME I don't like that this requires knowledge of batching details,
+	 * I'd very much prefer those to remain isolated in indexam.c. The best
+	 * idea I have is a function that returns the batch index, something like
+	 * index_batch_get_index(). That's close to how other places in AMs talk
+	 * to indexam.c (e.g. when setting distances / orderby in spgist).
+	 */
+	if (scan->xs_batch)
+	{
+		/* the index should be valid in the batch */
+		Assert(scan->xs_batch->currIndex >= 0);
+		Assert(scan->xs_batch->currIndex < scan->xs_batch->nheaptids);
+
+		so->currPos.itemIndex = so->batch.firstIndex + scan->xs_batch->currIndex;
+	}
+
 	/*
 	 * Just record the current itemIndex.  If we later step to next page
 	 * before releasing the marked position, _bt_steppage makes a full copy of
@@ -551,6 +570,59 @@ btrestrpos(IndexScanDesc scan)
 		 * accurate.
 		 */
 		so->currPos.itemIndex = so->markItemIndex;
+
+		/*
+		 * We're restoring to a different position on the same page, but the
+		 * wrong batch may be loaded. Check if the current batch includes
+		 * itemIndex, and load the correct batch if not. We don't know in
+		 * which direction the scan will move, so we try to put the current
+		 * index in the middle of a batch. For indexes close to the end of the
+		 * page we may load fewer items, but that seems acceptable.
+		 *
+		 * Then update the index of the current item, even if we already have
+		 * the correct batch loaded.
+		 *
+		 * FIXME Similar to btmarkpos() - I don't like how this leaks details
+		 * that should be specific to indexam.c. The "restored" flag is weird
+		 * too - but even if we need it, we could set it in indexam.c, right?
+		 */
+		if (scan->xs_batch != NULL)
+		{
+			if ((so->currPos.itemIndex < so->batch.firstIndex) ||
+				(so->currPos.itemIndex > so->batch.lastIndex))
+			{
+				int			start = Max(so->currPos.firstItem,
+										so->currPos.itemIndex - (scan->xs_batch->currSize / 2));
+				int			end = Min(so->currPos.lastItem,
+									  start + (scan->xs_batch->currSize - 1));
+
+				Assert(start <= end);
+				Assert((end - start + 1) <= scan->xs_batch->currSize);
+
+				/* make it look empty */
+				scan->xs_batch->nheaptids = 0;
+				scan->xs_batch->prefetchIndex = -1;
+
+				/*
+				 * XXX the scan direction is bogus / not important. It affects
+				 * only how we advance the currIndex, but we'll override that
+				 * anyway to point at the "correct" entry.
+				 */
+				_bt_copy_batch(scan, ForwardScanDirection, so, start, end);
+			}
+
+			/*
+			 * Set the batch index to the "correct" position in the batch,
+			 * even if we haven't re-loaded it from the page. Also remember we
+			 * just did this, so that the next call to
+			 * index_batch_getnext_tid() does not advance it again.
+			 *
+			 * XXX This is a bit weird. There should be a way to not need the
+			 * "restored" flag I think.
+			 */
+			scan->xs_batch->currIndex = (so->currPos.itemIndex - so->batch.firstIndex);
+			scan->xs_batch->restored = true;
+		}
 	}
 	else
 	{
@@ -588,6 +660,45 @@ btrestrpos(IndexScanDesc scan)
 				_bt_start_array_keys(scan, so->currPos.dir);
 				so->needPrimScan = false;
 			}
+
+			/*
+			 * With batched scans, we know the current batch is definitely
+			 * wrong (we've moved to a different leaf page). So empty the
+			 * batch, and load the right part of the page.
+			 *
+			 * Similarly to the block above, we place the current index in the
+			 * middle of the batch (if possible). And then we update the index
+			 * to point at the correct batch item.
+			 */
+			if (scan->xs_batch != NULL)
+			{
+				int			start = Max(so->currPos.firstItem,
+										so->currPos.itemIndex - (scan->xs_batch->currSize / 2));
+				int			end = Min(so->currPos.lastItem,
+									  start + (scan->xs_batch->currSize - 1));
+
+				Assert(start <= end);
+				Assert((end - start + 1) <= scan->xs_batch->currSize);
+
+				/* make it look empty */
+				scan->xs_batch->nheaptids = 0;
+				scan->xs_batch->prefetchIndex = -1;
+
+				/* XXX the scan direction is bogus */
+				_bt_copy_batch(scan, ForwardScanDirection, so, start, end);
+
+				/*
+				 * Set the batch index to the "correct" position in the batch,
+				 * even if we haven't re-loaded it from the page. Also
+				 * remember we just did this, so that the next call to
+				 * index_batch_getnext_tid() does not advance it again.
+				 *
+				 * XXX This is a bit weird. There should be a way to not need
+				 * the "restored" flag I think.
+				 */
+				scan->xs_batch->currIndex = (so->currPos.itemIndex - so->batch.firstIndex);
+				scan->xs_batch->restored = true;
+			}
 		}
 		else
 			BTScanPosInvalidate(so->currPos);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 66ca064b7e8..5eba7982f65 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1547,7 +1547,7 @@ AssertCheckBTBatchInfo(BTScanOpaque so)
  * _bt_copy_batch
  *		Copy a section of the leaf page into the batch.
  */
-static void
+void
 _bt_copy_batch(IndexScanDesc scan, ScanDirection dir, BTScanOpaque so,
 			   int start, int end)
 {
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index c030c0df6fe..8e55ffca197 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -613,14 +613,13 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 		ExecInitQual(node->recheckqual, (PlanState *) indexstate);
 
 	/*
-	 * Can't do batching (and thus prefetching) when the plan requires mark
-	 * and restore. There's an issue with translating the mark/restore
-	 * positions between the batch in scan descriptor and the original
-	 * position recognized in the index AM.
+	 * All index scans can do batching.
 	 *
-	 * XXX Hopefully just a temporary limitation?
+	 * XXX Maybe this should check if the index AM supports batching, or even
+	 * call something like "amcanbatch" (does not exist yet). Or check the
+	 * enable_indexscan_batching GUC?
 	 */
-	indexstate->ioss_CanBatch = !(eflags & EXEC_FLAG_MARK);
+	indexstate->ioss_CanBatch = true;
 
 	/*
 	 * If we are just doing EXPLAIN (ie, aren't going to run the plan), stop
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 8bbd3606566..0a6dd840012 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -947,18 +947,13 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 		ExecInitExprList(node->indexorderbyorig, (PlanState *) indexstate);
 
 	/*
-	 * Can't do batching (and thus prefetching) when the plan requires mark
-	 * and restore. There's an issue with translating the mark/restore
-	 * positions between the batch in scan descriptor and the original
-	 * position recognized in the index AM.
-	 *
-	 * XXX Hopefully just a temporary limitation?
+	 * All index scans can do batching.
 	 *
 	 * XXX Maybe this should check if the index AM supports batching, or even
 	 * call something like "amcanbatch" (does not exist yet). Or check the
-	 * enable_indexscan_batching GUC? Now we check the GUC in index_beginscan.
+	 * enable_indexscan_batching GUC?
 	 */
-	indexstate->iss_CanBatch = !(eflags & EXEC_FLAG_MARK);
+	indexstate->iss_CanBatch = true;
 
 	/*
 	 * If we are just doing EXPLAIN (ie, aren't going to run the plan), stop
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index df32382b311..3f500e254de 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1293,6 +1293,8 @@ extern bool _bt_first_batch(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next_batch(IndexScanDesc scan, ScanDirection dir);
 extern void _bt_kill_batch(IndexScanDesc scan);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
+extern void _bt_copy_batch(IndexScanDesc scan, ScanDirection dir, BTScanOpaque so,
+						   int start, int end);
 
 /*
  * prototypes for functions in nbtutils.c
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 3d767e14356..6c887c04f67 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -205,6 +205,12 @@ typedef struct IndexScanBatchData
 	/* memory context for per-batch data */
 	MemoryContext ctx;
 
+	/*
+	 * Was this batch just restored by restrpos? if yes, we don't advance on
+	 * the first iteration.
+	 */
+	bool		restored;
+
 	/* batch prefetching */
 	int			prefetchTarget; /* current prefetch distance */
 	int			prefetchMaximum;	/* maximum prefetch distance */
-- 
2.46.1



  [text/x-patch] v20240930-0004-WIP-batching-for-hash-indexes.patch (14.3K, 5-v20240930-0004-WIP-batching-for-hash-indexes.patch)
  download | inline diff:
From 82071d7f72fff03ea2e41d2f129e348751d39a0b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 30 Sep 2024 22:48:52 +0200
Subject: [PATCH v20240930 4/6] WIP: batching for hash indexes

---
 src/backend/access/hash/hash.c       |  43 ++++
 src/backend/access/hash/hashsearch.c | 283 +++++++++++++++++++++++++++
 src/include/access/hash.h            |  17 ++
 src/tools/pgindent/typedefs.list     |   1 +
 4 files changed, 344 insertions(+)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5ce36093943..d904548148d 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -98,6 +98,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambeginscan = hashbeginscan;
 	amroutine->amrescan = hashrescan;
 	amroutine->amgettuple = hashgettuple;
+	amroutine->amgetbatch = hashgetbatch;
 	amroutine->amgetbitmap = hashgetbitmap;
 	amroutine->amendscan = hashendscan;
 	amroutine->ammarkpos = NULL;
@@ -329,6 +330,42 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 }
 
 
+/*
+ *	hashgetbatch() -- Get the next batch of tuples in the scan.
+ */
+bool
+hashgetbatch(IndexScanDesc scan, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	bool		res;
+
+	/* Hash indexes are always lossy since we store only the hash code */
+	scan->xs_recheck = true;
+
+	/*
+	 * If we've already initialized this scan, we can just advance it in the
+	 * appropriate direction.  If we haven't done so yet, we call a routine to
+	 * get the first item in the scan.
+	 */
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first_batch(scan, dir);
+	else
+	{
+		/*
+		 * Check to see if we should kill tuples from the previous batch.
+		 */
+		_hash_kill_batch(scan);
+
+		/*
+		 * Now continue the scan.
+		 */
+		res = _hash_next_batch(scan, dir);
+	}
+
+	return res;
+}
+
+
 /*
  *	hashgetbitmap() -- get all tuples at once
  */
@@ -403,6 +440,9 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 
 	if (HashScanPosIsValid(so->currPos))
 	{
+		/* Transfer killed items from the batch to the regular array. */
+		_hash_kill_batch(scan);
+
 		/* Before leaving current page, deal with any killed items */
 		if (so->numKilled > 0)
 			_hash_kill_items(scan);
@@ -432,6 +472,9 @@ hashendscan(IndexScanDesc scan)
 
 	if (HashScanPosIsValid(so->currPos))
 	{
+		/* Transfer killed items from the batch to the regular array. */
+		_hash_kill_batch(scan);
+
 		/* Before leaving current page, deal with any killed items */
 		if (so->numKilled > 0)
 			_hash_kill_items(scan);
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 0d99d6abc86..c11ae847a3b 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -64,6 +64,9 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
 	{
 		if (++so->currPos.itemIndex > so->currPos.lastItem)
 		{
+			/* Transfer killed items from the batch to the regular array. */
+			_hash_kill_batch(scan);
+
 			if (so->numKilled > 0)
 				_hash_kill_items(scan);
 
@@ -82,6 +85,9 @@ _hash_next(IndexScanDesc scan, ScanDirection dir)
 	{
 		if (--so->currPos.itemIndex < so->currPos.firstItem)
 		{
+			/* Transfer killed items from the batch to the regular array. */
+			_hash_kill_batch(scan);
+
 			if (so->numKilled > 0)
 				_hash_kill_items(scan);
 
@@ -476,6 +482,9 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 			if (itemIndex != 0)
 				break;
 
+			/* Transfer killed items from the batch to the regular array. */
+			_hash_kill_batch(scan);
+
 			/*
 			 * Could not find any matching tuples in the current page, move to
 			 * the next page. Before leaving the current page, deal with any
@@ -535,6 +544,9 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 			if (itemIndex != MaxIndexTuplesPerPage)
 				break;
 
+			/* Transfer killed items from the batch to the regular array. */
+			_hash_kill_batch(scan);
+
 			/*
 			 * Could not find any matching tuples in the current page, move to
 			 * the previous page. Before leaving the current page, deal with
@@ -713,3 +725,274 @@ _hash_saveitem(HashScanOpaque so, int itemIndex,
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 }
+
+static void
+AssertCheckHashBatchInfo(HashScanOpaque so)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* should be valid items (with respect to the current leaf page) */
+	Assert(so->currPos.firstItem <= so->batch.firstIndex);
+	Assert(so->batch.firstIndex <= so->batch.lastIndex);
+	Assert(so->batch.lastIndex <= so->currPos.lastItem);
+#endif
+}
+
+/*
+ *	_hash_first_batch() -- Find the first batch in a scan.
+ *
+ * A batch variant of _hash_first(). Most of the comments for that function
+ * apply here too.
+ *
+ * XXX This only populates the batch, it does not set any other fields like
+ * scan->xs_heaptid or scan->xs_itup. That happens in getnext_tid() calls.
+ *
+ * XXX I'm not sure it works to mix batched and non-batches calls, e.g. get
+ * a TID and then a batch of TIDs. It probably should work as long as we
+ * update itemIndex correctly, but we need to be careful about killed items
+ * (right now the two places use different ways to communicate which items
+ * should be killed).
+ */
+bool
+_hash_first_batch(IndexScanDesc scan, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+
+	Assert(scan->xs_batch->nheaptids == 0);
+
+	/*
+	 * Mark the batch as empty.
+	 *
+	 * This might seems a bit strange, because surely the batch should be
+	 * empty before reading the first batch. So why not an assert? But we can
+	 * get here in different ways - not just after beginscan/rescan, but also
+	 * when iterating over ScalarArrayOps - in which case we'll see the last
+	 * batch of the preceding scan.
+	 */
+	so->batch.firstIndex = -1;
+	so->batch.lastIndex = -1;
+
+	/* we haven't visited any leaf pages yet, so proceed to reading one */
+	if (_hash_first(scan, dir))
+	{
+		/* range of the leaf to copy into the batch */
+		int			start,
+					end;
+
+		/* determine which part of the leaf page to extract */
+		if (ScanDirectionIsForward(dir))
+		{
+			start = so->currPos.firstItem;
+			end = Min(start + (scan->xs_batch->currSize - 1), so->currPos.lastItem);
+			so->currPos.itemIndex = (end + 1);
+		}
+		else
+		{
+			end = so->currPos.lastItem;
+			start = Max(end - (scan->xs_batch->currSize - 1),
+						so->currPos.firstItem);
+			so->currPos.itemIndex = (start - 1);
+		}
+
+		/*
+		 * Copy the selected range of items into the batch, set the batch
+		 * current index properly (before first / after last item, depending
+		 * on scan direction.
+		 */
+		_hash_copy_batch(scan, dir, so, start, end);
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ *	_hash_next_batch() -- Get the next batch of items in a scan.
+ *
+ * A batch variant of _hash_next(). Most of the comments for that function
+ * apply here too.
+ *
+ * We should only get here only when the current batch has no more items
+ * in the given direction. We don't get here with empty batches, that's
+ * handled by _hash_fist_batch().
+ *
+ * XXX See also the comments at _hash_first_batch() about returning a single
+ * batch for the page, etc.
+ *
+ * FIXME There's a lot of redundant (almost the same) code here - handling
+ * the current and new leaf page is very similar, and it's also similar to
+ * _hash_first_batch(). We should try to reduce this a bit.
+ */
+bool
+_hash_next_batch(IndexScanDesc scan, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+
+	int			start,
+				end;
+
+	Assert(scan->xs_batch->nheaptids == 0);
+
+	AssertCheckHashBatchInfo(so);
+
+	/*
+	 * Check if we still have some items on the current leaf page. If yes,
+	 * load them into a batch and return.
+	 *
+	 * XXX try combining that with the next block, the inner while loop is
+	 * exactly the same.
+	 */
+	if (ScanDirectionIsForward(dir))
+	{
+		start = so->batch.lastIndex + 1;
+		end = Min(start + (scan->xs_batch->currSize - 1), so->currPos.lastItem);
+		so->currPos.itemIndex = (end + 1);
+	}
+	else
+	{
+		end = so->batch.firstIndex - 1;
+		start = Max(end - (scan->xs_batch->currSize - 1),
+					so->currPos.firstItem);
+		so->currPos.itemIndex = (start - 1);
+	}
+
+	/*
+	 * We have more items on the current leaf page.
+	 */
+	if (start <= end)
+	{
+		_hash_copy_batch(scan, dir, so, start, end);
+		return true;
+	}
+
+	/*
+	 * We've consumed all items from the current leaf page, so try reading the
+	 * next one, and process it.
+	 */
+	if (_hash_next(scan, dir))
+	{
+		/*
+		 * Check if we still have some items on the current leaf page. If yes,
+		 * load them into a batch and return.
+		 *
+		 * XXX try combining that with the next block, the inner while loop is
+		 * exactly the same.
+		 */
+		if (ScanDirectionIsForward(dir))
+		{
+			start = so->currPos.firstItem;
+			end = Min(start + (scan->xs_batch->currSize - 1), so->currPos.lastItem);
+			so->currPos.itemIndex = (end + 1);
+		}
+		else
+		{
+			end = so->currPos.lastItem;
+			start = Max(end - (scan->xs_batch->currSize - 1),
+						so->currPos.firstItem);
+			so->currPos.itemIndex = (start - 1);
+		}
+
+		_hash_copy_batch(scan, dir, so, start, end);
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ *	_hash_kill_batch() -- remember the items-to-be-killed from the current batch
+ *
+ * We simply translate the bitmap into the "regular" killedItems array, and let
+ * that to drive which items are killed.
+ */
+void
+_hash_kill_batch(IndexScanDesc scan)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+
+	/* bail out if batching not enabled */
+	if (!scan->xs_batch)
+		return;
+
+	for (int i = 0; i < scan->xs_batch->nKilledItems; i++)
+	{
+		int			index = (so->batch.firstIndex + scan->xs_batch->killedItems[i]);
+
+		/* make sure we have a valid index (in the current leaf page) */
+		Assert((so->batch.firstIndex <= index) &&
+			   (index <= so->batch.lastIndex));
+
+		/*
+		 * Yes, remember it for later. (We'll deal with all such tuples at
+		 * once right before leaving the index page.)  The test for numKilled
+		 * overrun is not just paranoia: if the caller reverses direction in
+		 * the indexscan then the same item might get entered multiple times.
+		 * It's not worth trying to optimize that, so we don't detect it, but
+		 * instead just forget any excess entries.
+		 */
+		if (so->killedItems == NULL)
+			so->killedItems = (int *)
+				palloc(MaxIndexTuplesPerPage * sizeof(int));
+		if (so->numKilled < MaxIndexTuplesPerPage)
+			so->killedItems[so->numKilled++] = index;
+	}
+
+	/* now reset the number of killed items */
+	scan->xs_batch->nKilledItems = 0;
+}
+
+void
+_hash_copy_batch(IndexScanDesc scan, ScanDirection dir, HashScanOpaque so,
+				 int start, int end)
+{
+	/*
+	 * We're reading the first batch, and there should always be at least one
+	 * item (otherwise _bt_first would return false). So we should never get
+	 * into situation with empty start/end range. In the worst case, there is
+	 * just a single item, in which case (start == end).
+	 */
+	Assert(start <= end);
+
+	/* The range of items should fit into the current batch size. */
+	Assert((end - start + 1) <= scan->xs_batch->currSize);
+
+	so->batch.firstIndex = start;
+	so->batch.lastIndex = end;
+
+	AssertCheckHashBatchInfo(so);
+
+	/*
+	 * Walk through the range of index tuples, copy them into the batch. If
+	 * requested, set the index tuple too.
+	 *
+	 * We don't know if the batch is full already - we just try to add it, and
+	 * bail out if it fails.
+	 *
+	 * FIXME This seems wrong, actually. We use currSize when calculating the
+	 * start/end range, so the add should always succeed.
+	 */
+	while (start <= end)
+	{
+		HashScanPosItem *currItem = &so->currPos.items[start];
+
+		/* try to add it to batch, if there's space */
+		if (!index_batch_add(scan, currItem->heapTid, false, NULL, NULL))
+			break;
+
+		start++;
+	}
+
+	/*
+	 * set the starting point
+	 *
+	 * XXX might be better done in indexam.c
+	 */
+	if (ScanDirectionIsForward(dir))
+		scan->xs_batch->currIndex = -1;
+	else
+		scan->xs_batch->currIndex = scan->xs_batch->nheaptids;
+
+	/* shouldn't be possible to end here with an empty batch */
+	Assert(scan->xs_batch->nheaptids > 0);
+}
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9c7d81525b4..b710121f0ee 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -152,6 +152,14 @@ typedef struct HashScanPosData
 		(scanpos).itemIndex = 0; \
 	} while (0)
 
+/* Information about the current batch (in batched index scans) */
+typedef struct HashBatchInfo
+{
+	/* Current range of items in a batch (if used). */
+	int			firstIndex;
+	int			lastIndex;
+} HashBatchInfo;
+
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
  */
@@ -182,6 +190,9 @@ typedef struct HashScanOpaqueData
 	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
 
+	/* info about current batch */
+	HashBatchInfo batch;
+
 	/*
 	 * Identify all the matching items on a page and save them in
 	 * HashScanPosData
@@ -369,6 +380,7 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
 					   bool indexUnchanged,
 					   struct IndexInfo *indexInfo);
 extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
+extern bool hashgetbatch(IndexScanDesc scan, ScanDirection dir);
 extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
@@ -444,6 +456,11 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
+extern bool _hash_next_batch(IndexScanDesc scan, ScanDirection dir);
+extern bool _hash_first_batch(IndexScanDesc scan, ScanDirection dir);
+extern void _hash_copy_batch(IndexScanDesc scan, ScanDirection dir,
+							 HashScanOpaque so, int start, int end);
+extern void _hash_kill_batch(IndexScanDesc scan);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index be89be987a2..b642aa3cb25 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1117,6 +1117,7 @@ Hash
 HashAggBatch
 HashAggSpill
 HashAllocFunc
+HashBatchInfo
 HashBuildState
 HashCompareFunc
 HashCopyFunc
-- 
2.46.1



  [text/x-patch] v20240930-0005-WIP-batching-for-gist-indexes.patch (11.7K, 6-v20240930-0005-WIP-batching-for-gist-indexes.patch)
  download | inline diff:
From 9167d1e7bcc04914970cd6b7e6c47c6a8e3c3e2f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 30 Sep 2024 22:48:57 +0200
Subject: [PATCH v20240930 5/6] WIP: batching for gist indexes

---
 src/backend/access/gist/gist.c    |   1 +
 src/backend/access/gist/gistget.c | 295 ++++++++++++++++++++++++++++++
 src/include/access/gist_private.h |  16 ++
 src/tools/pgindent/typedefs.list  |   1 +
 4 files changed, 313 insertions(+)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2d7a0687d4a..5000af98a11 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -100,6 +100,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambeginscan = gistbeginscan;
 	amroutine->amrescan = gistrescan;
 	amroutine->amgettuple = gistgettuple;
+	amroutine->amgetbatch = gistgetbatch;
 	amroutine->amgetbitmap = gistgetbitmap;
 	amroutine->amendscan = gistendscan;
 	amroutine->ammarkpos = NULL;
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index b35b8a97577..70e32f19366 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -25,6 +25,10 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+static void _gist_kill_batch(IndexScanDesc scan);
+static void _gist_copy_batch(IndexScanDesc scan, GISTScanOpaque so,
+							 int start, int end);
+
 /*
  * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
  * told us were killed.
@@ -709,6 +713,14 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
 			{
 				GISTSearchItem *item;
 
+				/*
+				 * XXX Probably not needed, we can't mix gettuple and
+				 * getbatch, so we should not get any killed tuples for a
+				 * batch. Maybe replace this with an assert that batch has no
+				 * killed items? Or simply that (xs_batch == NULL)?
+				 */
+				_gist_kill_batch(scan);
+
 				if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
 					gistkillitems(scan);
 
@@ -736,6 +748,164 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
 	}
 }
 
+/*
+ * gistgetbatch() -- Get the next batch of tuples in the scan
+ */
+bool
+gistgetbatch(IndexScanDesc scan, ScanDirection dir)
+{
+	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+
+	if (dir != ForwardScanDirection)
+		elog(ERROR, "GiST only supports forward scan direction");
+
+	if (!so->qual_ok)
+		return false;
+
+	if (so->firstCall)
+	{
+		/* Begin the scan by processing the root page */
+		GISTSearchItem fakeItem;
+
+		pgstat_count_index_scan(scan->indexRelation);
+
+		so->firstCall = false;
+		so->curPageData = so->nPageData = 0;
+		scan->xs_hitup = NULL;
+		if (so->pageDataCxt)
+			MemoryContextReset(so->pageDataCxt);
+
+		fakeItem.blkno = GIST_ROOT_BLKNO;
+		memset(&fakeItem.data.parentlsn, 0, sizeof(GistNSN));
+		gistScanPage(scan, &fakeItem, NULL, NULL, NULL);
+
+		/*
+		 * Mark the batch as empty.
+		 *
+		 * XXX Is this necessary / the right place to do this? Maybe it should
+		 * be done in beginscan/rescan? The one problem with that is we only
+		 * initialize the batch after, so those places don't know if batching
+		 * is to be used.
+		 */
+		so->batch.firstIndex = -1;
+		so->batch.lastIndex = -1;
+	}
+
+	/*
+	 * With order-by clauses, we simply get enough nearest items to fill the
+	 * batch. GIST does not handle killed items in this case for non-batched
+	 * scans, so we don't do that here either. The only issue is that the heap
+	 * tuple returned in xs_hitup is freed on the next getNextNearest() call,
+	 * so we make a copy in the batch memory context.
+	 *
+	 * For scans without order-by clauses, we simply copy a batch of items
+	 * from the next leaf page, and proceed to the next page when needed.
+	 */
+	if (scan->numberOfOrderBys > 0)
+	{
+		int			nitems = 0;
+
+		/* Must fetch tuples in strict distance order */
+		while (nitems < scan->xs_batch->currSize)
+		{
+			HeapTuple	htup;
+			MemoryContext oldctx;
+
+			if (!getNextNearest(scan))
+				break;
+
+			/*
+			 * We need to copy the tuple into the batch context, as it may go
+			 * away on the next getNextNearest call. Free the original tuple,
+			 * otherwise we'd leak the last tuple in a batch.
+			 */
+			oldctx = MemoryContextSwitchTo(scan->xs_batch->ctx);
+			htup = heap_copytuple(scan->xs_hitup);
+
+			pfree(scan->xs_hitup);
+			scan->xs_hitup = NULL;
+
+			MemoryContextSwitchTo(oldctx);
+
+			index_batch_add(scan, scan->xs_heaptid, scan->xs_recheck, NULL, htup);
+
+			nitems++;
+		}
+
+		/* did we find more tuples? */
+		return (scan->xs_batch->nheaptids > 0);
+	}
+	else
+	{
+		/* Fetch the next batch of tuples */
+		for (;;)
+		{
+			int			start,
+						end;
+
+			/* forward directions only, easy to calculate next batch */
+			start = so->batch.lastIndex + 1;
+			end = Min(start + (scan->xs_batch->currSize - 1),
+					  so->nPageData - 1);	/* index of last item */
+			so->curPageData = (end + 1);
+
+			/* if we found more items on the current page, we're done */
+			if (start <= end)
+			{
+				_gist_copy_batch(scan, so, start, end);
+				return true;
+			}
+
+			/* find and process the next index page */
+			do
+			{
+				GISTSearchItem *item;
+
+				_gist_kill_batch(scan);
+
+				if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+					gistkillitems(scan);
+
+				item = getNextGISTSearchItem(so);
+
+				if (!item)
+					return false;
+
+				CHECK_FOR_INTERRUPTS();
+
+				/* save current item BlockNumber for next gistkillitems() call */
+				so->curBlkno = item->blkno;
+
+				/*
+				 * While scanning a leaf page, ItemPointers of matching heap
+				 * tuples are stored in so->pageData.  If there are any on
+				 * this page, we fall out of the inner "do" and loop around to
+				 * return them.
+				 */
+				gistScanPage(scan, item, item->distances, NULL, NULL);
+
+				/*
+				 * If we found any items on the page, copy them into a batch
+				 * and we're done.
+				 */
+
+				/* forward direction only, so get the first chunk of items */
+				start = 0;
+				end = Min(start + (scan->xs_batch->currSize - 1),
+						  so->nPageData - 1);	/* index of last item */
+
+				if (start <= end)
+				{
+					_gist_copy_batch(scan, so, start, end);
+					return true;
+				}
+
+				pfree(item);
+			} while (so->nPageData == 0);
+		}
+	}
+}
+
 /*
  * gistgetbitmap() -- Get a bitmap of all heap tuple locations
  */
@@ -799,3 +969,128 @@ gistcanreturn(Relation index, int attno)
 	else
 		return false;
 }
+
+static void
+AssertCheckGISTBatchInfo(GISTScanOpaque so)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* should be valid items (with respect to the current leaf page) */
+	Assert(0 <= so->batch.firstIndex);
+	Assert(so->batch.firstIndex <= so->batch.lastIndex);
+	Assert(so->batch.lastIndex <= so->nPageData);
+#endif
+}
+
+/*
+ *	_gist_kill_batch() -- remember the items-to-be-killed from the current batch
+ *
+ * We simply translate the bitmap into the "regular" killedItems array, and let
+ * that to drive which items are killed.
+ */
+static void
+_gist_kill_batch(IndexScanDesc scan)
+{
+	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+
+	/* bail out if batching not enabled */
+	if (!scan->xs_batch)
+		return;
+
+	for (int i = 0; i < scan->xs_batch->nKilledItems; i++)
+	{
+		int			index = (so->batch.firstIndex + scan->xs_batch->killedItems[i]);
+
+		/* make sure we have a valid index (in the current leaf page) */
+		Assert((so->batch.firstIndex <= index) &&
+			   (index <= so->batch.lastIndex));
+
+		/*
+		 * Yes, remember it for later. (We'll deal with all such tuples at
+		 * once right before leaving the index page.)  The test for numKilled
+		 * overrun is not just paranoia: if the caller reverses direction in
+		 * the indexscan then the same item might get entered multiple times.
+		 * It's not worth trying to optimize that, so we don't detect it, but
+		 * instead just forget any excess entries.
+		 */
+		if (so->killedItems == NULL)
+		{
+			MemoryContext oldCxt =
+				MemoryContextSwitchTo(so->giststate->scanCxt);
+
+			so->killedItems =
+				(OffsetNumber *) palloc(MaxIndexTuplesPerPage
+										* sizeof(OffsetNumber));
+
+			MemoryContextSwitchTo(oldCxt);
+		}
+		if (so->numKilled < MaxIndexTuplesPerPage)
+			so->killedItems[so->numKilled++] = index;
+	}
+
+	/* now reset the number of killed items */
+	scan->xs_batch->nKilledItems = 0;
+}
+
+/*
+ * FIXME Does this need to worry about recheck/recheckDistances flags in
+ * GISTScanOpaque? Probably yes.
+ *
+ * FIXME Definitely should return recontup for IOS, but that needs changes
+ * to index_batch_add.
+ */
+static void
+_gist_copy_batch(IndexScanDesc scan, GISTScanOpaque so,
+				 int start, int end)
+{
+	/*
+	 * We're reading the first batch, and there should always be at least one
+	 * item (otherwise _bt_first would return false). So we should never get
+	 * into situation with empty start/end range. In the worst case, there is
+	 * just a single item, in which case (start == end).
+	 */
+	Assert(start <= end);
+
+	/* The range of items should fit into the current batch size. */
+	Assert((end - start + 1) <= scan->xs_batch->currSize);
+
+	so->batch.firstIndex = start;
+	so->batch.lastIndex = end;
+
+	AssertCheckGISTBatchInfo(so);
+
+	/*
+	 * Walk through the range of index tuples, copy them into the batch. If
+	 * requested, set the index tuple too.
+	 *
+	 * We don't know if the batch is full already - we just try to add it, and
+	 * bail out if it fails.
+	 *
+	 * FIXME This seems wrong, actually. We use currSize when calculating the
+	 * start/end range, so the add should always succeed.
+	 */
+	while (start <= end)
+	{
+		GISTSearchHeapItem *item = &so->pageData[start];
+
+		HeapTuple	htup = NULL;
+
+		if (scan->xs_want_itup)
+			htup = item->recontup;
+
+		/* try to add it to batch, if there's space */
+		if (!index_batch_add(scan, item->heapPtr, item->recheck, NULL, htup))
+			break;
+
+		start++;
+	}
+
+	/*
+	 * set the starting point
+	 *
+	 * XXX might be better done in indexam.c
+	 */
+	scan->xs_batch->currIndex = -1;
+
+	/* shouldn't be possible to end here with an empty batch */
+	Assert(scan->xs_batch->nheaptids > 0);
+}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 7b8749c8db0..2f653354da2 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -148,6 +148,19 @@ typedef struct GISTSearchItem
 	(offsetof(GISTSearchItem, distances) + \
 	 sizeof(IndexOrderByDistance) * (n_distances))
 
+/*
+ * Information about the current batch (in batched index scans)
+ *
+ * XXX Probably not needed, as spgist supports just forward scans, so we
+ * could simply the iPtr (no problem after change of scan direction).
+ */
+typedef struct GISTBatchInfo
+{
+	/* Current range of items in a batch (if used). */
+	int			firstIndex;
+	int			lastIndex;
+} GISTBatchInfo;
+
 /*
  * GISTScanOpaqueData: private state for a scan of a GiST index
  */
@@ -176,6 +189,8 @@ typedef struct GISTScanOpaqueData
 	OffsetNumber curPageData;	/* next item to return */
 	MemoryContext pageDataCxt;	/* context holding the fetched tuples, for
 								 * index-only scans */
+
+	GISTBatchInfo batch;		/* batch loaded from the index */
 } GISTScanOpaqueData;
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
@@ -461,6 +476,7 @@ extern XLogRecPtr gistXLogAssignLSN(void);
 
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
+extern bool gistgetbatch(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern bool gistcanreturn(Relation index, int attno);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b642aa3cb25..7313530c291 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -968,6 +968,7 @@ GBT_NUMKEY_R
 GBT_VARKEY
 GBT_VARKEY_R
 GENERAL_NAME
+GISTBatchInfo
 GISTBuildBuffers
 GISTBuildState
 GISTDeletedPageContents
-- 
2.46.1



  [text/x-patch] v20240930-0006-WIP-batching-for-sp-gist-indexes.patch (7.0K, 7-v20240930-0006-WIP-batching-for-sp-gist-indexes.patch)
  download | inline diff:
From f0e0547d45f0a115844a19da9fe124c20e6a1682 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 30 Sep 2024 22:49:04 +0200
Subject: [PATCH v20240930 6/6] WIP: batching for sp-gist indexes

---
 src/backend/access/spgist/spgscan.c  | 142 +++++++++++++++++++++++++++
 src/backend/access/spgist/spgutils.c |   1 +
 src/include/access/spgist.h          |   1 +
 src/include/access/spgist_private.h  |  15 +++
 src/tools/pgindent/typedefs.list     |   1 +
 5 files changed, 160 insertions(+)

diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 3017861859f..4b96852c28c 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -1077,6 +1077,148 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
 	return false;
 }
 
+static void
+AssertCheckSpGistBatchInfo(SpGistScanOpaque so)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* should be valid items (with respect to the current leaf page) */
+	Assert(0 <= so->batch.firstIndex);
+	Assert(so->batch.firstIndex <= so->batch.lastIndex);
+	Assert(so->batch.lastIndex <= so->nPtrs);
+#endif
+}
+
+/*
+ * FIXME Does this need to worry about recheck/recheckDistances flags in
+ * GISTScanOpaque? Probably yes.
+ *
+ * FIXME Definitely should return recontup for IOS, but that needs changes
+ * to index_batch_add.
+ */
+static void
+_spgist_copy_batch(IndexScanDesc scan, SpGistScanOpaque so,
+				   int start, int end)
+{
+	/*
+	 * We're reading the first batch, and there should always be at least one
+	 * item (otherwise _bt_first would return false). So we should never get
+	 * into situation with empty start/end range. In the worst case, there is
+	 * just a single item, in which case (start == end).
+	 */
+	Assert(start <= end);
+
+	/* The range of items should fit into the current batch size. */
+	Assert((end - start + 1) <= scan->xs_batch->currSize);
+
+	so->batch.firstIndex = start;
+	so->batch.lastIndex = end;
+
+	AssertCheckSpGistBatchInfo(so);
+
+	/*
+	 * Walk through the range of index tuples, copy them into the batch. If
+	 * requested, set the index tuple too.
+	 *
+	 * We don't know if the batch is full already - we just try to add it, and
+	 * bail out if it fails.
+	 *
+	 * FIXME This seems wrong, actually. We use currSize when calculating the
+	 * start/end range, so the add should always succeed.
+	 */
+	while (start <= end)
+	{
+		bool		recheck = so->recheck[start];
+		HeapTuple	htup = NULL;
+
+		if (so->want_itup)
+			htup = so->reconTups[start];
+
+		if (so->numberOfOrderBys > 0)
+			index_store_float8_orderby_distances(scan, so->orderByTypes,
+												 so->distances[so->iPtr],
+												 so->recheckDistances[so->iPtr]);
+
+		/* try to add it to batch, if there's space */
+		if (!index_batch_add(scan, so->heapPtrs[start], recheck, NULL, htup))
+			break;
+
+		start++;
+	}
+
+	/*
+	 * set the starting point
+	 *
+	 * XXX might be better done in indexam.c
+	 */
+	scan->xs_batch->currIndex = -1;
+
+	/* shouldn't be possible to end here with an empty batch */
+	Assert(scan->xs_batch->nheaptids > 0);
+}
+
+
+bool
+spggetbatch(IndexScanDesc scan, ScanDirection dir)
+{
+	SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+
+	if (dir != ForwardScanDirection)
+		elog(ERROR, "SP-GiST only supports forward scan direction");
+
+	/* Copy want_itup to *so so we don't need to pass it around separately */
+	so->want_itup = scan->xs_want_itup;
+
+	for (;;)
+	{
+		int			start,
+					end;
+
+		/* forward directions only, easy to calculate next batch */
+		start = so->batch.lastIndex + 1;
+		end = Min(start + (scan->xs_batch->currSize - 1),
+				  so->nPtrs - 1);	/* index of last item */
+		so->iPtr = (end + 1);
+
+		/* if we found more items on the current page, we're done */
+		if (start <= end)
+		{
+			_spgist_copy_batch(scan, so, start, end);
+			return true;
+		}
+
+		if (so->numberOfOrderBys > 0)
+		{
+			/* Must pfree distances to avoid memory leak */
+			int			i;
+
+			for (i = 0; i < so->nPtrs; i++)
+				if (so->distances[i])
+					pfree(so->distances[i]);
+		}
+
+		if (so->want_itup)
+		{
+			/* Must pfree reconstructed tuples to avoid memory leak */
+			int			i;
+
+			for (i = 0; i < so->nPtrs; i++)
+				pfree(so->reconTups[i]);
+		}
+		so->iPtr = so->nPtrs = 0;
+
+		spgWalk(scan->indexRelation, so, false, storeGettuple);
+
+		if (so->nPtrs == 0)
+			break;				/* must have completed scan */
+
+		/* reset before loading data from batch */
+		so->batch.firstIndex = -1;
+		so->batch.lastIndex = -1;
+	}
+
+	return false;
+}
+
 bool
 spgcanreturn(Relation index, int attno)
 {
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 72b7661971f..25b363d380d 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -85,6 +85,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->ambeginscan = spgbeginscan;
 	amroutine->amrescan = spgrescan;
 	amroutine->amgettuple = spggettuple;
+	amroutine->amgetbatch = spggetbatch;
 	amroutine->amgetbitmap = spggetbitmap;
 	amroutine->amendscan = spgendscan;
 	amroutine->ammarkpos = NULL;
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index d6a49531200..f879843b3bb 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -209,6 +209,7 @@ extern void spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					  ScanKey orderbys, int norderbys);
 extern int64 spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern bool spggettuple(IndexScanDesc scan, ScanDirection dir);
+extern bool spggetbatch(IndexScanDesc scan, ScanDirection dir);
 extern bool spgcanreturn(Relation index, int attno);
 
 /* spgvacuum.c */
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index e7cbe10a89b..15a5e77c5d3 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -183,6 +183,19 @@ typedef struct SpGistSearchItem
 #define SizeOfSpGistSearchItem(n_distances) \
 	(offsetof(SpGistSearchItem, distances) + sizeof(double) * (n_distances))
 
+/*
+ * Information about the current batch (in batched index scans)
+ *
+ * XXX Probably not needed, as spgist supports just forward scans, so we
+ * could simply the iPtr (no problem after change of scan direction).
+ */
+typedef struct SpGistBatchInfo
+{
+	/* Current range of items in a batch (if used). */
+	int			firstIndex;
+	int			lastIndex;
+} SpGistBatchInfo;
+
 /*
  * Private state of an index scan
  */
@@ -235,6 +248,8 @@ typedef struct SpGistScanOpaqueData
 	/* distances (for recheck) */
 	IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
 
+	SpGistBatchInfo batch;		/* batch loaded from the index */
+
 	/*
 	 * Note: using MaxIndexTuplesPerPage above is a bit hokey since
 	 * SpGistLeafTuples aren't exactly IndexTuples; however, they are larger,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7313530c291..88be745d9ea 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2702,6 +2702,7 @@ SortSupportData
 SortTuple
 SortTupleComparator
 SortedPoint
+SpGistBatchInfo
 SpGistBuildState
 SpGistCache
 SpGistDeadTuple
-- 
2.46.1

view thread (276+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: index prefetching
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox