Re: index prefetching

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: index prefetching
2+ messages / 1 participants
[nested] [flat]

* Re: index prefetching
@ 2026-01-13 20:36  Peter Geoghegan <[email protected]>
  0 siblings, 1 reply; 2+ messages in thread

From: Peter Geoghegan @ 2026-01-13 20:36 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Andres Freund <[email protected]>; Thomas Munro <[email protected]>; Nazir Bilal Yavuz <[email protected]>; Robert Haas <[email protected]>; Melanie Plageman <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On Wed, Jan 7, 2026 at 1:50 PM Peter Geoghegan <[email protected]> wrote:
> v6 focusses on simplifying the batch management code in
> heapam_batch_getnext_tid. Importantly, heapam_batch_getnext_tid no
> longer uses a loop to process items from the currently loaded batch/to
> load the next batch. The control flow in heapam_batch_getnext_tid is a
> lot simpler in general compared to v5.

The batch stopped applying again. Attached is v7.

Changes:

* Much improved read stream callback code, with comments that explain
exactly what's going on at each step.

* We once again support prefetching with index-only scans (for any
heap fetches that might be required).

We must cache visibility information at the level of whole batches for
this. Otherwise, the read stream have a different idea about which
heap pages are considered all visible from other code. The
corresponding code that actually reads heap tuples expects to be able
to get buffers from the read stream that precisely match what it
believes are required for any heap fetches. If they don't both agree
about visibility  info, chaos ensues.

We avoid per-tuple visibility lookups in v7, preferring to do
everything (every VM lookup for every TID) up front for each batch.
This is simpler, and I suspect it's somewhat faster on average.

* Fixed several bugs involving scrollable cursors + index prefetching.

> I still haven't had time to produce an implementation of the "heap
> buffer locking minimization" optimization that's clean enough to
> present to the list.

Still haven't done this. Our new thinking on this is that it'd be best
to get prefetching in better shape before proceeding with the heap
buffer locking optimization.

-- 
Peter Geoghegan


Attachments:

  [application/octet-stream] v7-0004-Make-hash-index-AM-use-amgetbatch-interface.patch (37.1K, 2-v7-0004-Make-hash-index-AM-use-amgetbatch-interface.patch)
  download | inline diff:
From 7d69161eb80df8cea1ee74357bf43e0622e2534b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 25 Nov 2025 18:03:15 -0500
Subject: [PATCH v7 4/4] Make hash index AM use amgetbatch interface.

Replace hashgettuple with hashgetbatch, a function that implements the
new amgetbatch interface.  Plain index scans of hash indexes now return
matching items in batches consisting of all of the matches from a given
bucket or overflow page.  This gives the core executor the ability to
perform optimizations like index prefetching during hash index scans.

Note that this makes hash index scans use the dropPin optimization,
since that is required to use the amgetbatch interface.  This won't
avoid making hash index vacuuming wait for a cleanup lock when an index
scan holds onto a conflicting pin, since such index scans still need to
hold onto a conflicting bucket page pin for the duration of the scan.
In other words, use of the dropPin optimization won't benefit hash index
scans in the same way that it benefits nbtree index scans following
commit 2ed5b87f.  However, there is still some value in keeping the
number of buffer pins held at any one time to a minimum.

Index prefetching tends to hold open as many as several dozen batches
with certain workloads (workloads where the stream position has to get
quite far ahead of the read position in order to maintain the
appropriate prefetch distance on the heapam side).  Guaranteeing that
open batches won't hold buffer pins on index pages (at least in the
common case where the dropPin optimization is safe to use) thereby
simplifies resource management during index prefetching.

Also add Valgrind buffer lock instrumentation to hash, bringing it in
line with nbtree following commit 4a70f829.  This is another requirement
when using the amgetbatch interface.

Author: Peter Geoghegan <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com
---
 src/include/access/hash.h            |  73 +-----
 src/backend/access/hash/hash.c       | 122 ++++------
 src/backend/access/hash/hashpage.c   |  26 +--
 src/backend/access/hash/hashsearch.c | 331 +++++++++++----------------
 src/backend/access/hash/hashutil.c   | 100 ++++----
 src/tools/pgindent/typedefs.list     |   2 -
 6 files changed, 256 insertions(+), 398 deletions(-)

diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index a8702f0e5..1b80fd7ed 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -100,58 +100,6 @@ typedef HashPageOpaqueData *HashPageOpaque;
  */
 #define HASHO_PAGE_ID		0xFF80
 
-typedef struct HashScanPosItem	/* what we remember about each match */
-{
-	ItemPointerData heapTid;	/* TID of referenced heap item */
-	OffsetNumber indexOffset;	/* index item's location within page */
-} HashScanPosItem;
-
-typedef struct HashScanPosData
-{
-	Buffer		buf;			/* if valid, the buffer is pinned */
-	BlockNumber currPage;		/* current hash index page */
-	BlockNumber nextPage;		/* next overflow page */
-	BlockNumber prevPage;		/* prev overflow or bucket page */
-
-	/*
-	 * The items array is always ordered in index order (ie, increasing
-	 * indexoffset).  When scanning backwards it is convenient to fill the
-	 * array back-to-front, so we start at the last slot and fill downwards.
-	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
-	 * itemIndex is a cursor showing which entry was last returned to caller.
-	 */
-	int			firstItem;		/* first valid index in items[] */
-	int			lastItem;		/* last valid index in items[] */
-	int			itemIndex;		/* current index in items[] */
-
-	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
-} HashScanPosData;
-
-#define HashScanPosIsPinned(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BufferIsValid((scanpos).buf) \
-)
-
-#define HashScanPosIsValid(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BlockNumberIsValid((scanpos).currPage) \
-)
-
-#define HashScanPosInvalidate(scanpos) \
-	do { \
-		(scanpos).buf = InvalidBuffer; \
-		(scanpos).currPage = InvalidBlockNumber; \
-		(scanpos).nextPage = InvalidBlockNumber; \
-		(scanpos).prevPage = InvalidBlockNumber; \
-		(scanpos).firstItem = 0; \
-		(scanpos).lastItem = 0; \
-		(scanpos).itemIndex = 0; \
-	} while (0)
-
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
  */
@@ -178,15 +126,6 @@ typedef struct HashScanOpaqueData
 	 * referred only when hashso_buc_populated is true.
 	 */
 	bool		hashso_buc_split;
-	/* info about killed items if any (killedItems is NULL if never used) */
-	int		   *killedItems;	/* currPos.items indexes of killed items */
-	int			numKilled;		/* number of currently stored items */
-
-	/*
-	 * Identify all the matching items on a page and save them in
-	 * HashScanPosData
-	 */
-	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -368,11 +307,14 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
 					   IndexUniqueCheck checkUnique,
 					   bool indexUnchanged,
 					   struct IndexInfo *indexInfo);
-extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
+extern BatchIndexScan hashgetbatch(IndexScanDesc scan,
+								   BatchIndexScan priorbatch,
+								   ScanDirection dir);
 extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
+extern void hashfreebatch(IndexScanDesc scan, BatchIndexScan batch);
 extern void hashendscan(IndexScanDesc scan);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
@@ -445,8 +387,9 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 							   uint32 lowmask);
 
 /* hashsearch.c */
-extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
+extern BatchIndexScan _hash_next(IndexScanDesc scan, ScanDirection dir,
+								 BatchIndexScan priorbatch);
+extern BatchIndexScan _hash_first(IndexScanDesc scan, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
@@ -476,7 +419,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
 extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
 extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 												 uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan);
+extern void _hash_kill_items(IndexScanDesc scan, BatchIndexScan batch);
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6a20b67f6..361357428 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -101,9 +101,9 @@ hashhandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = hashadjustmembers,
 		.ambeginscan = hashbeginscan,
 		.amrescan = hashrescan,
-		.amgettuple = hashgettuple,
-		.amgetbatch = NULL,
-		.amfreebatch = NULL,
+		.amgettuple = NULL,
+		.amgetbatch = hashgetbatch,
+		.amfreebatch = hashfreebatch,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
 		.amposreset = NULL,
@@ -286,53 +286,27 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
 
 
 /*
- *	hashgettuple() -- Get the next tuple in the scan.
+ *	hashgetbatch() -- Get the first or next batch of tuples in the scan
  */
-bool
-hashgettuple(IndexScanDesc scan, ScanDirection dir)
+BatchIndexScan
+hashgetbatch(IndexScanDesc scan, BatchIndexScan priorbatch, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	bool		res;
+	Relation	rel = scan->indexRelation;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
-	/*
-	 * If we've already initialized this scan, we can just advance it in the
-	 * appropriate direction.  If we haven't done so yet, we call a routine to
-	 * get the first item in the scan.
-	 */
-	if (!HashScanPosIsValid(so->currPos))
-		res = _hash_first(scan, dir);
-	else
+	if (priorbatch == NULL)
 	{
-		/*
-		 * Check to see if we should kill the previously-fetched tuple.
-		 */
-		if (scan->kill_prior_tuple)
-		{
-			/*
-			 * Yes, so remember it for later. (We'll deal with all such tuples
-			 * at once right after leaving the index page or at end of scan.)
-			 * In case if caller reverses the indexscan direction it is quite
-			 * possible that the same item might get entered multiple times.
-			 * But, we don't detect that; instead, we just forget any excess
-			 * entries.
-			 */
-			if (so->killedItems == NULL)
-				so->killedItems = palloc_array(int, MaxIndexTuplesPerPage);
+		_hash_dropscanbuf(rel, so);
 
-			if (so->numKilled < MaxIndexTuplesPerPage)
-				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
-		}
-
-		/*
-		 * Now continue the scan.
-		 */
-		res = _hash_next(scan, dir);
+		/* Initialize the scan, and return first batch of matching items */
+		return _hash_first(scan, dir);
 	}
 
-	return res;
+	/* Return batch positioned after caller's batch (in direction 'dir') */
+	return _hash_next(scan, dir, priorbatch);
 }
 
 
@@ -342,26 +316,23 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 int64
 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	bool		res;
+	BatchIndexScan batch;
 	int64		ntids = 0;
-	HashScanPosItem *currItem;
+	int			itemIndex;
 
-	res = _hash_first(scan, ForwardScanDirection);
+	batch = _hash_first(scan, ForwardScanDirection);
 
-	while (res)
+	while (batch != NULL)
 	{
-		currItem = &so->currPos.items[so->currPos.itemIndex];
+		for (itemIndex = batch->firstItem;
+			 itemIndex <= batch->lastItem;
+			 itemIndex++)
+		{
+			tbm_add_tuples(tbm, &batch->items[itemIndex].heapTid, 1, true);
+			ntids++;
+		}
 
-		/*
-		 * _hash_first and _hash_next handle eliminate dead index entries
-		 * whenever scan->ignore_killed_tuples is true.  Therefore, there's
-		 * nothing to do here except add the results to the TIDBitmap.
-		 */
-		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
-		ntids++;
-
-		res = _hash_next(scan, ForwardScanDirection);
+		batch = _hash_next(scan, ForwardScanDirection, batch);
 	}
 
 	return ntids;
@@ -383,17 +354,14 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc_object(HashScanOpaqueData);
-	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
 
-	so->killedItems = NULL;
-	so->numKilled = 0;
-
 	scan->opaque = so;
+	scan->maxitemsbatch = MaxIndexTuplesPerPage;
 
 	return scan;
 }
@@ -408,18 +376,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	if (HashScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_hash_kill_items(scan);
-	}
-
 	_hash_dropscanbuf(rel, so);
 
-	/* set position invalid (this will cause _hash_first call) */
-	HashScanPosInvalidate(so->currPos);
-
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
@@ -428,6 +386,25 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	so->hashso_buc_split = false;
 }
 
+/*
+ *	hashfreebatch() -- Free batch resources, including its buffer pin
+ */
+void
+hashfreebatch(IndexScanDesc scan, BatchIndexScan batch)
+{
+	if (batch->numKilled > 0)
+		_hash_kill_items(scan, batch);
+
+	if (!scan->dropPin)
+	{
+		/* indexam_util_batch_unlock didn't unpin page earlier, do it now */
+		ReleaseBuffer(batch->buf);
+		batch->buf = InvalidBuffer;
+	}
+
+	indexam_util_batch_release(scan, batch);
+}
+
 /*
  *	hashendscan() -- close down a scan
  */
@@ -437,17 +414,8 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	if (HashScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_hash_kill_items(scan);
-	}
-
 	_hash_dropscanbuf(rel, so);
 
-	if (so->killedItems != NULL)
-		pfree(so->killedItems);
 	pfree(so);
 	scan->opaque = NULL;
 }
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 8e220a3ae..9b6911905 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -35,6 +35,7 @@
 #include "port/pg_bitutils.h"
 #include "storage/predicate.h"
 #include "storage/smgr.h"
+#include "utils/memdebug.h"
 #include "utils/rel.h"
 
 static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
@@ -79,6 +80,9 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 	if (access != HASH_NOLOCK)
 		LockBuffer(buf, access);
 
+	if (!RelationUsesLocalBuffers(rel))
+		VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
+
 	/* ref count and lock type are correct */
 
 	_hash_checkpage(rel, buf, flags);
@@ -108,6 +112,9 @@ _hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
 		return InvalidBuffer;
 	}
 
+	if (!RelationUsesLocalBuffers(rel))
+		VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
+
 	/* ref count and lock type are correct */
 
 	_hash_checkpage(rel, buf, flags);
@@ -280,31 +287,24 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
- *	_hash_dropscanbuf() -- release buffers used in scan.
+ *	_hash_dropscanbuf() -- release buffers owned by scan.
  *
- * This routine unpins the buffers used during scan on which we
- * hold no lock.
+ * This routine unpins the buffers for the primary bucket page and for the
+ * bucket page of a bucket being split as needed.
  */
 void
 _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
-	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->currPos.buf)
+	if (BufferIsValid(so->hashso_bucket_buf))
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
-	/* release pin we hold on primary bucket page  of bucket being split */
-	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->currPos.buf)
+	/* release pin held on primary bucket page of bucket being split */
+	if (BufferIsValid(so->hashso_split_bucket_buf))
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->currPos.buf))
-		_hash_dropbuf(rel, so->currPos.buf);
-	so->currPos.buf = InvalidBuffer;
-
 	/* reset split scan */
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 89d1c5bc6..c72705dc3 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -22,105 +22,83 @@
 #include "storage/predicate.h"
 #include "utils/rel.h"
 
-static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
-						   ScanDirection dir);
+static bool _hash_readpage(IndexScanDesc scan, Buffer buf, ScanDirection dir,
+						   BatchIndexScan batch);
 static int	_hash_load_qualified_items(IndexScanDesc scan, Page page,
-									   OffsetNumber offnum, ScanDirection dir);
-static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ScanDirection dir,
+									   BatchIndexScan batch);
+static inline void _hash_saveitem(BatchIndexScan batch, int itemIndex,
 								  OffsetNumber offnum, IndexTuple itup);
 static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
 						   Page *pagep, HashPageOpaque *opaquep);
 
 /*
- *	_hash_next() -- Get the next item in a scan.
+ *	_hash_next() -- Get the next batch of items in a scan.
  *
- *		On entry, so->currPos describes the current page, which may
- *		be pinned but not locked, and so->currPos.itemIndex identifies
- *		which item was previously returned.
+ *		On entry, priorbatch describes the current page batch with items
+ *		already returned.
  *
- *		On successful exit, scan->xs_heaptid is set to the TID of the next
- *		heap tuple.  so->currPos is updated as needed.
+ *		On successful exit, returns a batch containing matching items from
+ *		next page.  Otherwise returns NULL, indicating that there are no
+ *		further matches.  No locks are ever held when we return.
  *
- *		On failure exit (no more tuples), we return false with pin
- *		held on bucket page but no pins or locks held on overflow
- *		page.
+ *		Retains pins according to the same rules as _hash_first.
  */
-bool
-_hash_next(IndexScanDesc scan, ScanDirection dir)
+BatchIndexScan
+_hash_next(IndexScanDesc scan, ScanDirection dir, BatchIndexScan priorbatch)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	HashScanPosItem *currItem;
 	BlockNumber blkno;
 	Buffer		buf;
-	bool		end_of_scan = false;
+	BatchIndexScan batch;
 
 	/*
-	 * Advance to the next tuple on the current page; or if done, try to read
-	 * data from the next or previous page based on the scan direction. Before
-	 * moving to the next or previous page make sure that we deal with all the
-	 * killed items.
+	 * Determine which page to read next based on scan direction and details
+	 * taken from the prior batch
 	 */
 	if (ScanDirectionIsForward(dir))
 	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
-		{
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			blkno = so->currPos.nextPage;
-			if (BlockNumberIsValid(blkno))
-			{
-				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
-				if (!_hash_readpage(scan, &buf, dir))
-					end_of_scan = true;
-			}
-			else
-				end_of_scan = true;
-		}
+		blkno = priorbatch->nextPage;
+		if (!BlockNumberIsValid(blkno) || !priorbatch->moreRight)
+			return NULL;
 	}
 	else
 	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
-		{
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			blkno = so->currPos.prevPage;
-			if (BlockNumberIsValid(blkno))
-			{
-				buf = _hash_getbuf(rel, blkno, HASH_READ,
-								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-
-				/*
-				 * We always maintain the pin on bucket page for whole scan
-				 * operation, so releasing the additional pin we have acquired
-				 * here.
-				 */
-				if (buf == so->hashso_bucket_buf ||
-					buf == so->hashso_split_bucket_buf)
-					_hash_dropbuf(rel, buf);
-
-				if (!_hash_readpage(scan, &buf, dir))
-					end_of_scan = true;
-			}
-			else
-				end_of_scan = true;
-		}
+		blkno = priorbatch->prevPage;
+		if (!BlockNumberIsValid(blkno) || !priorbatch->moreLeft)
+			return NULL;
 	}
 
-	if (end_of_scan)
+	/* Allocate space for next batch */
+	batch = indexam_util_batch_alloc(scan);
+
+	/* Get the buffer for next batch */
+	if (ScanDirectionIsForward(dir))
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+	else
 	{
-		_hash_dropscanbuf(rel, so);
-		HashScanPosInvalidate(so->currPos);
-		return false;
+		buf = _hash_getbuf(rel, blkno, HASH_READ,
+						   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+
+		/*
+		 * We always maintain the pin on bucket page for whole scan operation,
+		 * so releasing the additional pin we have acquired here.
+		 */
+		if (buf == so->hashso_bucket_buf ||
+			buf == so->hashso_split_bucket_buf)
+			_hash_dropbuf(rel, buf);
 	}
 
-	/* OK, itemIndex says what to return */
-	currItem = &so->currPos.items[so->currPos.itemIndex];
-	scan->xs_heaptid = currItem->heapTid;
+	/* Read the next page and load items into allocated batch */
+	if (!_hash_readpage(scan, buf, dir, batch))
+	{
+		indexam_util_batch_release(scan, batch);
+		return NULL;
+	}
 
-	return true;
+	/* Return the batch containing matched items from next page */
+	return batch;
 }
 
 /*
@@ -270,22 +248,21 @@ _hash_readprev(IndexScanDesc scan,
 }
 
 /*
- *	_hash_first() -- Find the first item in a scan.
+ *	_hash_first() -- Find the first batch of items in a scan.
  *
- *		We find the first item (or, if backward scan, the last item) in the
- *		index that satisfies the qualification associated with the scan
- *		descriptor.
+ *		We find the first batch of items (or, if backward scan, the last
+ *		batch) in the index that satisfies the qualification associated with
+ *		the scan descriptor.
  *
- *		On successful exit, if the page containing current index tuple is an
- *		overflow page, both pin and lock are released whereas if it is a bucket
- *		page then it is pinned but not locked and data about the matching
- *		tuple(s) on the page has been loaded into so->currPos,
- *		scan->xs_heaptid is set to the heap TID of the current tuple.
+ *		On successful exit, returns a batch containing matching items.
+ *		Otherwise returns NULL, indicating that there are no further matches.
+ *		No locks are ever held when we return.
  *
- *		On failure exit (no more tuples), we return false, with pin held on
- *		bucket page but no pins or locks held on overflow page.
+ *		We always retain our own pin on the bucket page.  When we return a
+ *		batch with a bucket page, it will retain its own reference pin iff
+ *		indexam_util_batch_release determined that table AM requires one.
  */
-bool
+BatchIndexScan
 _hash_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
@@ -296,7 +273,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	HashScanPosItem *currItem;
+	BatchIndexScan batch;
 
 	pgstat_count_index_scan(rel);
 	if (scan->instrument)
@@ -326,7 +303,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	 * items in the index.
 	 */
 	if (cur->sk_flags & SK_ISNULL)
-		return false;
+		return NULL;
 
 	/*
 	 * Okay to compute the hash key.  We want to do this before acquiring any
@@ -419,191 +396,159 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* remember which buffer we have pinned, if any */
-	Assert(BufferIsInvalid(so->currPos.buf));
-	so->currPos.buf = buf;
+	/* Allocate space for first batch */
+	batch = indexam_util_batch_alloc(scan);
 
-	/* Now find all the tuples satisfying the qualification from a page */
-	if (!_hash_readpage(scan, &buf, dir))
-		return false;
+	/* Read the first page and load items into allocated batch */
+	if (!_hash_readpage(scan, buf, dir, batch))
+	{
+		indexam_util_batch_release(scan, batch);
+		return NULL;
+	}
 
-	/* OK, itemIndex says what to return */
-	currItem = &so->currPos.items[so->currPos.itemIndex];
-	scan->xs_heaptid = currItem->heapTid;
-
-	/* if we're here, _hash_readpage found a valid tuples */
-	return true;
+	/* Return the batch containing matched items */
+	return batch;
 }
 
 /*
- *	_hash_readpage() -- Load data from current index page into so->currPos
+ *	_hash_readpage() -- Load data from current index page into batch
  *
  *	We scan all the items in the current index page and save them into
- *	so->currPos if it satisfies the qualification. If no matching items
+ *	the batch if they satisfy the qualification. If no matching items
  *	are found in the current page, we move to the next or previous page
  *	in a bucket chain as indicated by the direction.
  *
  *	Return true if any matching items are found else return false.
  */
 static bool
-_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+_hash_readpage(IndexScanDesc scan, Buffer buf, ScanDirection dir,
+			   BatchIndexScan batch)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum;
 	uint16		itemIndex;
 
-	buf = *bufP;
 	Assert(BufferIsValid(buf));
 	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 	page = BufferGetPage(buf);
 	opaque = HashPageGetOpaque(page);
 
-	so->currPos.buf = buf;
-	so->currPos.currPage = BufferGetBlockNumber(buf);
+	batch->buf = buf;
+	batch->currPage = BufferGetBlockNumber(buf);
+	batch->dir = dir;
 
 	if (ScanDirectionIsForward(dir))
 	{
-		BlockNumber prev_blkno = InvalidBlockNumber;
-
 		for (;;)
 		{
 			/* new page, locate starting position by binary search */
 			offnum = _hash_binsearch(page, so->hashso_sk_hash);
 
-			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir,
+												   batch);
 
 			if (itemIndex != 0)
 				break;
 
 			/*
-			 * Could not find any matching tuples in the current page, move to
-			 * the next page. Before leaving the current page, deal with any
-			 * killed items.
+			 * Could not find any matching tuples in the current page, try to
+			 * move to the next page
 			 */
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			/*
-			 * If this is a primary bucket page, hasho_prevblkno is not a real
-			 * block number.
-			 */
-			if (so->currPos.buf == so->hashso_bucket_buf ||
-				so->currPos.buf == so->hashso_split_bucket_buf)
-				prev_blkno = InvalidBlockNumber;
-			else
-				prev_blkno = opaque->hasho_prevblkno;
-
 			_hash_readnext(scan, &buf, &page, &opaque);
-			if (BufferIsValid(buf))
+			if (!BufferIsValid(buf))
 			{
-				so->currPos.buf = buf;
-				so->currPos.currPage = BufferGetBlockNumber(buf);
-			}
-			else
-			{
-				/*
-				 * Remember next and previous block numbers for scrollable
-				 * cursors to know the start position and return false
-				 * indicating that no more matching tuples were found. Also,
-				 * don't reset currPage or lsn, because we expect
-				 * _hash_kill_items to be called for the old page after this
-				 * function returns.
-				 */
-				so->currPos.prevPage = prev_blkno;
-				so->currPos.nextPage = InvalidBlockNumber;
-				so->currPos.buf = buf;
+				batch->buf = InvalidBuffer;
 				return false;
 			}
+
+			batch->buf = buf;
+			batch->currPage = BufferGetBlockNumber(buf);
 		}
 
-		so->currPos.firstItem = 0;
-		so->currPos.lastItem = itemIndex - 1;
-		so->currPos.itemIndex = 0;
+		batch->firstItem = 0;
+		batch->lastItem = itemIndex - 1;
 	}
 	else
 	{
-		BlockNumber next_blkno = InvalidBlockNumber;
-
 		for (;;)
 		{
 			/* new page, locate starting position by binary search */
 			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 
-			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir,
+												   batch);
 
 			if (itemIndex != MaxIndexTuplesPerPage)
 				break;
 
 			/*
-			 * Could not find any matching tuples in the current page, move to
-			 * the previous page. Before leaving the current page, deal with
-			 * any killed items.
+			 * Could not find any matching tuples in the current page, try to
+			 * move to the previous page
 			 */
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			if (so->currPos.buf == so->hashso_bucket_buf ||
-				so->currPos.buf == so->hashso_split_bucket_buf)
-				next_blkno = opaque->hasho_nextblkno;
-
 			_hash_readprev(scan, &buf, &page, &opaque);
-			if (BufferIsValid(buf))
+			if (!BufferIsValid(buf))
 			{
-				so->currPos.buf = buf;
-				so->currPos.currPage = BufferGetBlockNumber(buf);
-			}
-			else
-			{
-				/*
-				 * Remember next and previous block numbers for scrollable
-				 * cursors to know the start position and return false
-				 * indicating that no more matching tuples were found. Also,
-				 * don't reset currPage or lsn, because we expect
-				 * _hash_kill_items to be called for the old page after this
-				 * function returns.
-				 */
-				so->currPos.prevPage = InvalidBlockNumber;
-				so->currPos.nextPage = next_blkno;
-				so->currPos.buf = buf;
+				batch->buf = InvalidBuffer;
 				return false;
 			}
+
+			batch->buf = buf;
+			batch->currPage = BufferGetBlockNumber(buf);
 		}
 
-		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		batch->firstItem = itemIndex;
+		batch->lastItem = MaxIndexTuplesPerPage - 1;
 	}
 
-	if (so->currPos.buf == so->hashso_bucket_buf ||
-		so->currPos.buf == so->hashso_split_bucket_buf)
+	/*
+	 * Saved at least one match in batch.items[].  Prepare for hashgetbatch to
+	 * return it by initializing remaining uninitialized fields.
+	 */
+	if (batch->buf == so->hashso_bucket_buf ||
+		batch->buf == so->hashso_split_bucket_buf)
 	{
-		so->currPos.prevPage = InvalidBlockNumber;
-		so->currPos.nextPage = opaque->hasho_nextblkno;
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		/*
+		 * Batch's buffer is either the primary bucket, or a bucket being
+		 * populated due to a split.
+		 *
+		 * Increment local reference count so that batch gets an independent
+		 * buffer reference that can be released (by hashfreebatch) before the
+		 * hashso_bucket_buf/hashso_split_bucket_buf references are released.
+		 */
+		IncrBufferRefCount(batch->buf);
+
+		/* Can only use opaque->hasho_nextblkno */
+		batch->prevPage = InvalidBlockNumber;
+		batch->nextPage = opaque->hasho_nextblkno;
 	}
 	else
 	{
-		so->currPos.prevPage = opaque->hasho_prevblkno;
-		so->currPos.nextPage = opaque->hasho_nextblkno;
-		_hash_relbuf(rel, so->currPos.buf);
-		so->currPos.buf = InvalidBuffer;
+		/* Can use opaque->hasho_prevblkno and opaque->hasho_nextblkno */
+		batch->prevPage = opaque->hasho_prevblkno;
+		batch->nextPage = opaque->hasho_nextblkno;
 	}
 
-	Assert(so->currPos.firstItem <= so->currPos.lastItem);
+	batch->moreLeft = BlockNumberIsValid(batch->prevPage);
+	batch->moreRight = BlockNumberIsValid(batch->nextPage);
+
+	/* Unlock (and likely unpin) buffer, per amgetbatch contract */
+	indexam_util_batch_unlock(scan, batch);
+
+	Assert(batch->firstItem <= batch->lastItem);
 	return true;
 }
 
 /*
  * Load all the qualified items from a current index page
- * into so->currPos. Helper function for _hash_readpage.
+ * into batch. Helper function for _hash_readpage.
  */
 static int
 _hash_load_qualified_items(IndexScanDesc scan, Page page,
-						   OffsetNumber offnum, ScanDirection dir)
+						   OffsetNumber offnum, ScanDirection dir,
+						   BatchIndexScan batch)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	IndexTuple	itup;
@@ -640,7 +585,7 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 				_hash_checkqual(scan, itup))
 			{
 				/* tuple is qualified, so remember it */
-				_hash_saveitem(so, itemIndex, offnum, itup);
+				_hash_saveitem(batch, itemIndex, offnum, itup);
 				itemIndex++;
 			}
 			else
@@ -687,7 +632,7 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 			{
 				itemIndex--;
 				/* tuple is qualified, so remember it */
-				_hash_saveitem(so, itemIndex, offnum, itup);
+				_hash_saveitem(batch, itemIndex, offnum, itup);
 			}
 			else
 			{
@@ -706,13 +651,15 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 	}
 }
 
-/* Save an index item into so->currPos.items[itemIndex] */
+/* Save an index item into batch->items[itemIndex] */
 static inline void
-_hash_saveitem(HashScanOpaque so, int itemIndex,
+_hash_saveitem(BatchIndexScan batch, int itemIndex,
 			   OffsetNumber offnum, IndexTuple itup)
 {
-	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *currItem = &batch->items[itemIndex];
 
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
+	currItem->tupleOffset = 0;
+	currItem->allVisible = false;
 }
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index cf7f0b901..f99105d3b 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -510,81 +510,84 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
  * told us were killed.
  *
- * scan->opaque, referenced locally through so, contains information about the
- * current page and killed tuples thereon (generally, this should only be
- * called if so->numKilled > 0).
+ * The batch parameter contains information about the current page and killed
+ * tuples thereon (this should only be called if batch->numKilled > 0).
  *
- * The caller does not have a lock on the page and may or may not have the
- * page pinned in a buffer.  Note that read-lock is sufficient for setting
- * LP_DEAD status (which is only a hint).
+ * Caller should not have a lock on the batch position's page, but must hold a
+ * buffer pin when !dropPin.  When we return, it still won't be locked.  It'll
+ * continue to hold whatever pins were held before calling here.
  *
- * The caller must have pin on bucket buffer, but may or may not have pin
- * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
- *
- * We match items by heap TID before assuming they are the right ones to
- * delete.
- *
- * There are never any scans active in a bucket at the time VACUUM begins,
- * because VACUUM takes a cleanup lock on the primary bucket page and scans
- * hold a pin.  A scan can begin after VACUUM leaves the primary bucket page
- * but before it finishes the entire bucket, but it can never pass VACUUM,
- * because VACUUM always locks the next page before releasing the lock on
- * the previous one.  Therefore, we don't have to worry about accidentally
- * killing a TID that has been reused for an unrelated tuple.
+ * We match items by heap TID before assuming they are the right ones to set
+ * LP_DEAD.  If the scan is one that holds a buffer pin on the target page
+ * continuously from initially reading the items until applying this function
+ * (if it is a !dropPin scan), VACUUM cannot have deleted any items on the
+ * page, so the page's TIDs can't have been recycled by now.  There's no risk
+ * that we'll confuse a new index tuple that happens to use a recycled TID
+ * with a now-removed tuple with the same TID (that used to be on this same
+ * page).  We can't rely on that during scans that drop buffer pins eagerly
+ * (i.e. dropPin scans), though, so we must condition setting LP_DEAD bits on
+ * the page LSN having not changed since back when _hash_readpage saw the page.
+ * We totally give up on setting LP_DEAD bits when the page LSN changed.
  */
 void
-_hash_kill_items(IndexScanDesc scan)
+_hash_kill_items(IndexScanDesc scan, BatchIndexScan batch)
 {
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
-	BlockNumber blkno;
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
 				maxoff;
-	int			numKilled = so->numKilled;
-	int			i;
 	bool		killedsomething = false;
-	bool		havePin = false;
 
-	Assert(so->numKilled > 0);
-	Assert(so->killedItems != NULL);
-	Assert(HashScanPosIsValid(so->currPos));
+	Assert(batch->numKilled > 0);
+	Assert(batch->killedItems != NULL);
+	Assert(BlockNumberIsValid(batch->currPage));
 
-	/*
-	 * Always reset the scan state, so we don't look for same items on other
-	 * pages.
-	 */
-	so->numKilled = 0;
-
-	blkno = so->currPos.currPage;
-	if (HashScanPosIsPinned(so->currPos))
+	if (!scan->dropPin)
 	{
 		/*
-		 * We already have pin on this buffer, so, all we need to do is
-		 * acquire lock on it.
+		 * We have held the pin on this page since we read the index tuples,
+		 * so all we need to do is lock it.  The pin will have prevented
+		 * concurrent VACUUMs from recycling any of the TIDs on the page.
 		 */
-		havePin = true;
-		buf = so->currPos.buf;
+		buf = batch->buf;
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 	}
 	else
-		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+	{
+		XLogRecPtr	latestlsn;
+
+		Assert(RelationNeedsWAL(rel));
+		buf = _hash_getbuf(rel, batch->currPage, HASH_READ,
+						   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+
+		latestlsn = BufferGetLSNAtomic(buf);
+		Assert(batch->lsn <= latestlsn);
+		if (batch->lsn != latestlsn)
+		{
+			/* Modified, give up on hinting */
+			_hash_relbuf(rel, buf);
+			return;
+		}
+
+		/* Unmodified, hinting is safe */
+	}
 
 	page = BufferGetPage(buf);
 	opaque = HashPageGetOpaque(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
-	for (i = 0; i < numKilled; i++)
+	/* Iterate through batch->killedItems[] in index page order */
+	for (int i = 0; i < batch->numKilled; i++)
 	{
-		int			itemIndex = so->killedItems[i];
-		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+		int			itemIndex = batch->killedItems[i];
+		BatchMatchingItem *currItem = &batch->items[itemIndex];
 
 		offnum = currItem->indexOffset;
 
-		Assert(itemIndex >= so->currPos.firstItem &&
-			   itemIndex <= so->currPos.lastItem);
+		Assert(itemIndex >= batch->firstItem &&
+			   itemIndex <= batch->lastItem);
 
 		while (offnum <= maxoff)
 		{
@@ -613,9 +616,8 @@ _hash_kill_items(IndexScanDesc scan)
 		MarkBufferDirtyHint(buf, true);
 	}
 
-	if (so->hashso_bucket_buf == so->currPos.buf ||
-		havePin)
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	if (!scan->dropPin)
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	else
 		_hash_relbuf(rel, buf);
 }
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d25ba3412..3173b3330 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1195,8 +1195,6 @@ HashPageStat
 HashPath
 HashScanOpaque
 HashScanOpaqueData
-HashScanPosData
-HashScanPosItem
 HashSkewBucket
 HashState
 HashValueFunc
-- 
2.51.0



  [application/octet-stream] v7-0002-Add-prefetching-to-index-scans-using-batch-interf.patch (27.5K, 3-v7-0002-Add-prefetching-to-index-scans-using-batch-interf.patch)
  download | inline diff:
From 02ab6053d9ea64b46a69a55fe197526d0c95956e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sat, 15 Nov 2025 14:03:58 -0500
Subject: [PATCH v7 2/4] Add prefetching to index scans using batch interfaces.

This commit implements I/O prefetching for index scans, made possible by
the recent addition of batching interfaces to both the table AM and
index AM APIs.

The amgetbatch index AM interface provides batches of TIDs (rather than
one at a time) from a single index leaf page, and allows multiple
batches to be held in memory/pinned simultaneously.  This gives the
table AM the freedom to readahead within an index scan, which is crucial
for I/O prefetching with certain workloads (workloads that would
otherwise be unable to keep a sufficiently high prefetch distance for
heap block I/O).  Prefetching is implemented using a read stream under
the control of the table AM.

XXX When the batch ring buffer reaches capacity, the stream pauses until
the scan catches up and frees some batches.  We need a more principled
approach here.  Essentially, we need infrastructure that allows a read
stream call back to tell the read stream to temporarily yield without it
fully ending/resetting the read stream.

Author: Tomas Vondra <[email protected]>
Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/include/access/relscan.h                  |  41 ++-
 src/include/optimizer/cost.h                  |   1 +
 src/backend/access/heap/heapam_handler.c      | 330 +++++++++++++++++-
 src/backend/access/index/indexam.c            |  10 +-
 src/backend/access/index/indexbatch.c         |  17 +-
 src/backend/optimizer/path/costsize.c         |   1 +
 src/backend/storage/aio/read_stream.c         |  14 +-
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/test/regress/expected/sysviews.out        |   3 +-
 10 files changed, 418 insertions(+), 7 deletions(-)

diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ca1207be6..a517abe08 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,7 @@
 #include "nodes/tidbitmap.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
+#include "storage/read_stream.h"
 #include "storage/relfilelocator.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
@@ -124,6 +125,7 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 typedef struct IndexFetchTableData
 {
 	Relation	rel;
+	ReadStream *rs;
 } IndexFetchTableData;
 
 /*
@@ -221,8 +223,14 @@ typedef struct BatchIndexScanData *BatchIndexScan;
  * Maximum number of batches (leaf pages) we can keep in memory.  We need a
  * minimum of two, since we'll only consider releasing one batch when another
  * is read.
+ *
+ * The choice of 64 batches is arbitrary.  It's about 1MB of data with 8KB
+ * pages (512kB for pages, and then a bit of overhead). We should not really
+ * need this many batches in most cases, though. The read stream looks ahead
+ * just enough to queue enough IOs, adjusting the distance (TIDs, but
+ * ultimately the number of future batches) to meet that.
  */
-#define INDEX_SCAN_MAX_BATCHES		2
+#define INDEX_SCAN_MAX_BATCHES		64
 #define INDEX_SCAN_CACHE_BATCHES	2
 #define INDEX_SCAN_BATCH_COUNT(scan) \
 	((scan)->ringbuf->nextBatch - (scan)->ringbuf->headBatch)
@@ -270,15 +278,46 @@ typedef struct BatchIndexScanData *BatchIndexScan;
  * matches in.  However, table AMs are free to fetch table tuples in whatever
  * order is most convenient/efficient -- provided that such reordering cannot
  * affect the order that table_index_getnext_slot later returns tuples in.
+ *
+ * This data structure also provides table AMs with a way to read ahead of the
+ * current read position by _multiple_ batches/index pages.  The further out
+ * the table AM reads ahead like this, the further it can see into the future.
+ * That way the table AM is able to reorder work as aggressively as desired.
+ * For example, index scans sometimes need to readahead by as many as a few
+ * dozen amgetbatch batches in order to maintain an optimal I/O prefetch
+ * distance (distance for reading table blocks/fetching table tuples).
  */
 typedef struct BatchRingBuffer
 {
+	bool		reset;
+
+	/*
+	 * Did we disable prefetching/use of a read stream because it didn't pay
+	 * for itself?
+	 */
+	bool		prefetchingLockedIn;
+	bool		disabled;
+
+	/*
+	 * During prefetching, currentPrefetchBlock is the table AM block number
+	 * that was returned by our read stream callback most recently.  Used to
+	 * suppress duplicate successive read stream block requests.
+	 *
+	 * Prefetching can still perform non-successive requests for the same
+	 * block number (in general we're prefetching in exactly the same order
+	 * that the scan will return table AM TIDs in).  We need to avoid
+	 * duplicate successive requests because table AMs expect to be able to
+	 * hang on to buffer pins across table_index_fetch_tuple calls.
+	 */
+	BlockNumber currentPrefetchBlock;
+
 	/* Current scan direction, for the currently loaded batches */
 	ScanDirection direction;
 
 	/* current positions in batches[] for scan */
 	BatchRingItemPos scanPos;	/* scan's read position */
 	BatchRingItemPos markPos;	/* mark/restore position */
+	BatchRingItemPos prefetchPos;	/* prefetching position */
 
 	BatchIndexScan markBatch;
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 07b8bfa63..31236ceac 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -51,6 +51,7 @@ extern PGDLLIMPORT Cost disable_cost;
 extern PGDLLIMPORT int max_parallel_workers_per_gather;
 extern PGDLLIMPORT bool enable_seqscan;
 extern PGDLLIMPORT bool enable_indexscan;
+extern PGDLLIMPORT bool enable_indexscan_prefetch;
 extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 03b4fada9..deef73069 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -37,6 +37,7 @@
 #include "commands/progress.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
@@ -60,6 +61,9 @@ static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
 static bool BitmapHeapScanNextBlock(TableScanDesc scan,
 									bool *recheck,
 									uint64 *lossy_pages, uint64 *exact_pages);
+static BlockNumber heapam_getnext_stream(ReadStream *stream,
+										 void *callback_private_data,
+										 void *per_buffer_data);
 
 
 /* ------------------------------------------------------------------------
@@ -85,6 +89,7 @@ heapam_index_fetch_begin(Relation rel)
 	IndexFetchHeapData *hscan = palloc_object(IndexFetchHeapData);
 
 	hscan->xs_base.rel = rel;
+	hscan->xs_base.rs = NULL;
 	hscan->xs_cbuf = InvalidBuffer;
 	hscan->xs_blk = InvalidBlockNumber;
 	hscan->vmbuf = InvalidBuffer;
@@ -97,6 +102,9 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
 
+	if (scan->rs)
+		read_stream_reset(scan->rs);
+
 	/* deliberately don't drop VM buffer pin here */
 	if (BufferIsValid(hscan->xs_cbuf))
 	{
@@ -113,6 +121,9 @@ heapam_index_fetch_end(IndexFetchTableData *scan)
 
 	heapam_index_fetch_reset(scan);
 
+	if (scan->rs)
+		read_stream_end(scan->rs);
+
 	if (hscan->vmbuf != InvalidBuffer)
 	{
 		ReleaseBuffer(hscan->vmbuf);
@@ -150,7 +161,13 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		 * When using a read stream, the stream will already know which block
 		 * number comes next (though an assertion will verify a match below)
 		 */
-		hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
+		if (scan->rs)
+			hscan->xs_cbuf = read_stream_next_buffer(scan->rs, NULL);
+		else
+			hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
+
+		Assert(BufferIsValid(hscan->xs_cbuf));
+		Assert(BufferGetBlockNumber(hscan->xs_cbuf) == ItemPointerGetBlockNumber(tid));
 
 		/*
 		 * Prune page when it is pinned for the first time
@@ -248,6 +265,15 @@ heapam_batch_return_tid(IndexScanDesc scan, BatchIndexScan scanBatch,
 /*
  * heap_batch_resolve_visibility
  *		Obtain visibility information for every TID from caller's batch.
+ *
+ * heapam_batch_getnext_tid must reliably agree with heapam_getnext_stream
+ * about which heap blocks/TIDs will require a heap fetch (and which TIDs
+ * won't due to pointing to an all-visible heap page).  Otherwise we risk
+ * allowing the read stream to return unexpected heap buffers/pages.
+ *
+ * Caching visibility information up front avoids that problem.  If a VM bit
+ * is concurrently set (or unset), it can't matter, since everybody will have
+ * works off of this immutable local cache.
  */
 static void
 heap_batch_resolve_visibility(IndexScanDesc scan, BatchIndexScan batch)
@@ -377,6 +403,19 @@ heap_batch_getnext(IndexScanDesc scan, BatchIndexScan priorbatch,
 
 		DEBUG_LOG("batch_getnext headBatch %d nextBatch %d batch %p",
 				  ringbuf->headBatch, ringbuf->nextBatch, batch);
+
+		/* Delay initializing stream until reading from scan's second batch */
+		if (!scan->xs_heapfetch->rs && !ringbuf->disabled && priorbatch &&
+			enable_indexscan_prefetch)
+		{
+			Assert(INDEX_SCAN_POS_INVALID(&ringbuf->prefetchPos));
+			Assert(ringbuf->currentPrefetchBlock == InvalidBlockNumber);
+
+			scan->xs_heapfetch->rs =
+				read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
+										   scan->heapRelation, MAIN_FORKNUM,
+										   heapam_getnext_stream, scan, 0);
+		}
 	}
 
 	/* xs_hitup is not supported by amgetbatch scans */
@@ -411,9 +450,53 @@ heapam_batch_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	/* Initialize direction on first call */
 	if (ringbuf->direction == NoMovementScanDirection)
 		ringbuf->direction = direction;
+	else if (unlikely(ringbuf->disabled && scan->xs_heapfetch->rs))
+	{
+		/*
+		 * Handle cancelling the use of the read stream for prefetching
+		 */
+		batch_reset_pos(&ringbuf->prefetchPos);
+		ringbuf->currentPrefetchBlock = InvalidBlockNumber;
 
+		read_stream_reset(scan->xs_heapfetch->rs);
+		scan->xs_heapfetch->rs = NULL;
+	}
+	else if (unlikely(ringbuf->reset))
+	{
+		ringbuf->reset = false;
+
+		/*
+		 * Need to reset the stream position, it might be too far behind.
+		 * Ultimately we want to set it to scanPos, but we can't do that yet -
+		 * scanPos still point sat the old batch, so just reset it and we'll
+		 * init it to scanPos later in the callback.
+		 */
+		batch_reset_pos(&ringbuf->prefetchPos);
+		ringbuf->currentPrefetchBlock = InvalidBlockNumber;
+
+		if (scan->xs_heapfetch->rs)
+			read_stream_reset(scan->xs_heapfetch->rs);
+	}
+
+	/*
+	 * XXX Shouldn't this also update the ringbuf->direction? If we get to the
+	 * next block hangling direction change, then we will remember it (because
+	 * heapam_batch_rewind will store it). But if we return in the next block,
+	 * won't we forget about it?
+	 *
+	 * XXX It's a bit weird we handle the direction change in two places.
+	 * Would be good to explain why that's necessary.
+	 *
+	 * XXX How come this doesn't need to do heapam_batch_rewind too? Could
+	 * there be some future batches already loaded?
+	 */
 	if (unlikely(ringbuf->direction != direction))
 	{
+		if (scan->xs_heapfetch->rs)
+			read_stream_reset(scan->xs_heapfetch->rs);
+		batch_reset_pos(&ringbuf->prefetchPos);
+		ringbuf->currentPrefetchBlock = InvalidBlockNumber;
+
 		/* We may change direction after reading the last batch. */
 		scan->finished = false;
 	}
@@ -497,6 +580,251 @@ heapam_batch_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	return heapam_batch_return_tid(scan, scanBatch, scanPos);
 }
 
+/*
+ * Controls when we cancel use of a read stream to do prefetching
+ */
+#define INDEX_SCAN_MIN_DISTANCE_NBATCHES	20
+#define INDEX_SCAN_MIN_TUPLE_DISTANCE		7
+
+/*
+ * heapam_getnext_stream
+ *		return the next block to pass to the read stream
+ *
+ * The initial batch is always loaded by heapam_batch_getnext_tid.  We don't
+ * get called until the first read_stream_next_buffer() call, when a heap
+ * block is requested from the scan's stream for the first time.
+ *
+ * The position of the read_stream is stored in prefetchPos.  It is typical for
+ * prefetchPos to consistently stay ahead of the scanPos position that's used to
+ * track the next TID to be returned to the scan by heapam_batch_getnext_tid
+ * after the first time we get called.  However, that isn't a precondition.
+ * There is a strict postcondition, though: when we return we'll always leave
+ * scanPos <= prefetchPos (except in cases where we return InvalidBlockNumber).
+ */
+static BlockNumber
+heapam_getnext_stream(ReadStream *stream, void *callback_private_data,
+					  void *per_buffer_data)
+{
+	IndexScanDesc scan = (IndexScanDesc) callback_private_data;
+	BatchRingBuffer *ringbuf = scan->ringbuf;
+	BatchRingItemPos *scanPos = &ringbuf->scanPos;
+	BatchRingItemPos *prefetchPos = &ringbuf->prefetchPos;
+	ScanDirection direction = ringbuf->direction;
+	BatchIndexScan prefetchBatch;
+	bool		fromReadPos = false;
+
+	Assert(!scan->finished && !ringbuf->disabled);
+
+	/*
+	 * scanPos must always be valid when we're called -- there has to be at
+	 * least one batch, loaded, for scanBatch.  prefetchPos might not yet be
+	 * valid, in which case it'll be initialized using scanPos.
+	 */
+	Assert(INDEX_SCAN_BATCH_COUNT(scan) > 0);
+	batch_assert_pos_valid(scan, scanPos);
+
+	/*
+	 * It is possible for the scan's direction to change, but that's handled
+	 * elsewhere.  We don't know how to deal with any variation in scan
+	 * direction here.  We assume that all loaded and newly requested batches
+	 * must use the same scan direction.
+	 */
+	Assert(direction != NoMovementScanDirection);
+
+	/*
+	 * If the stream position has not been initialized yet, initialize it
+	 * using the current read position.
+	 *
+	 * We do this now (rather than doing it when the read stream is created)
+	 * to avoid incorrectly returning the scan's IndexFetchHeapData.xs_blk as
+	 * it was at the time of read stream creation.  Note that the scan might
+	 * have to hold onto its existing not-managed-by-read-stream buffer pin
+	 * after the read stream is created; there'll often be a few more heap
+	 * TIDs that point to the same pinned heap page from before.
+	 */
+	if (INDEX_SCAN_POS_INVALID(prefetchPos))
+	{
+		Assert(ringbuf->currentPrefetchBlock == InvalidBlockNumber);
+
+		*prefetchPos = *scanPos;
+		fromReadPos = true;
+	}
+
+	prefetchBatch = INDEX_SCAN_BATCH(scan, prefetchPos->batch);
+	for (;;)
+	{
+		BatchMatchingItem *item;
+		ItemPointer tid;
+
+		if (fromReadPos)
+		{
+			/*
+			 * Don't increment item when prefetchPos was just initialized
+			 * using scanPos.  We return the scanPos item's heap block
+			 * directly on the first call here.
+			 */
+			fromReadPos = false;
+		}
+		else if (!heap_batchpos_advance(prefetchBatch, prefetchPos, direction))
+		{
+			/*
+			 * Ran out of items from prefetchBatch.  Try to advance it to next
+			 * batch.
+			 */
+			if (INDEX_SCAN_BATCH_LOADED(scan, prefetchPos->batch + 1))
+			{
+				/*
+				 * The next batch was already loaded for us.
+				 *
+				 * Typically, prefetchPos is ahead of scanPos for the entire
+				 * duration of the scan (at least after we're first called).
+				 * However, prefetchPos can sometimes fall behind scanPos.
+				 * That's why we need to handle already-loaded batches here.
+				 *
+				 * This happens when some blocks are skipped and not returned
+				 * to the read_stream.  An example is an index scan on a
+				 * correlated index, with many duplicate blocks are skipped,
+				 * or an IOS where all-visible blocks are skipped.
+				 */
+				prefetchBatch = INDEX_SCAN_BATCH(scan, prefetchPos->batch + 1);
+			}
+			else
+			{
+				/*
+				 * If we already used the maximum number of batch slots
+				 * available, it's pointless to try loading another one. This
+				 * can happen for various reasons, e.g. for index-only scans
+				 * on all-visible table, or skipping duplicate blocks on
+				 * perfectly correlated indexes, etc.
+				 *
+				 * We could enlarge the array to allow more batches, but
+				 * that's futile, we can always construct a case using more
+				 * memory. Not only it would risk OOM, it'd also be
+				 * inefficient because this happens early in the scan (so it'd
+				 * interfere with LIMIT queries).
+				 */
+				if (INDEX_SCAN_BATCH_FULL(scan))
+				{
+					DEBUG_LOG("batch_getnext: ran out of space for batches");
+					scan->ringbuf->reset = true;
+					break;
+				}
+
+				prefetchBatch = heap_batch_getnext(scan, prefetchBatch, direction);
+				if (!prefetchBatch)
+				{
+					/*
+					 * Failed to load next batch, so all the batches that the
+					 * scan will ever require (barring a change in scan
+					 * direction) are now loaded
+					 */
+					scan->finished = true;
+					break;
+				}
+
+				/*
+				 * Consider disabling prefetching when we can't keep a
+				 * sufficiently large "index tuple distance" between scanPos
+				 * and prefetchPos.
+				 *
+				 * Only consider doing this when we're not on the scan's
+				 * initial batch, when scanPos and prefetchPos share the same
+				 * batch.
+				 */
+				if (!ringbuf->prefetchingLockedIn)
+				{
+					int			itemdiff;
+
+					if (prefetchPos->batch <= INDEX_SCAN_MIN_DISTANCE_NBATCHES)
+					{
+						/*
+						 * Too early to check if prefetching should be
+						 * disabled
+						 */
+					}
+					else if (scanPos->batch == prefetchPos->batch)
+					{
+						if (ScanDirectionIsForward(direction))
+							itemdiff = prefetchPos->item - scanPos->item;
+						else
+						{
+							BatchIndexScan scanBatch =
+								INDEX_SCAN_BATCH(scan, scanPos->batch);
+
+							itemdiff = (scanPos->item - scanBatch->firstItem) -
+								(prefetchPos->item - scanBatch->firstItem);
+						}
+
+						if (itemdiff < INDEX_SCAN_MIN_TUPLE_DISTANCE)
+						{
+							ringbuf->disabled = true;
+							return InvalidBlockNumber;
+						}
+						else
+						{
+							ringbuf->prefetchingLockedIn = true;
+						}
+					}
+					else
+						ringbuf->prefetchingLockedIn = true;
+				}
+			}
+
+			/* Position prefetchPos to the start of new prefetchBatch */
+			heap_batchpos_newbatch(prefetchBatch, prefetchPos, direction);
+			Assert(INDEX_SCAN_BATCH(scan, prefetchPos->batch) == prefetchBatch);
+		}
+
+		/*
+		 * We advanced the position.  Either return the block for the TID, or
+		 * skip it (and then try advancing again).
+		 */
+		Assert(prefetchBatch->dir == direction);
+		Assert(scanPos->batch < prefetchPos->batch ||
+			   (scanPos->batch == prefetchPos->batch &&
+				ScanDirectionIsForward(direction) ?
+				scanPos->item <= prefetchPos->item :
+				scanPos->item >= prefetchPos->item));
+
+		/*
+		 * The block may be "skipped" for two reasons. First, the caller may
+		 * define a "prefetch" callback that tells us to skip items (IOS does
+		 * this to skip all-visible pages). Second, currentPrefetchBlock is
+		 * used to skip duplicate block numbers (a sequence of TIDS for the
+		 * same block).
+		 */
+		batch_assert_pos_valid(scan, prefetchPos);
+		item = &prefetchBatch->items[prefetchPos->item];
+		tid = &item->heapTid;
+
+		DEBUG_LOG("heapam_getnext_stream: item %d, TID (%u,%u)",
+				  prefetchPos->item,
+				  ItemPointerGetBlockNumber(tid),
+				  ItemPointerGetOffsetNumber(tid));
+
+		/*
+		 * For index-only scans, determine if the page is all-visible now.  If
+		 * it is, we won't need the block and can skip it too.
+		 */
+		if (scan->xs_want_itup && item->allVisible)
+			continue;
+
+		/* same block as before, don't need to read it */
+		if (ringbuf->currentPrefetchBlock == ItemPointerGetBlockNumber(tid))
+		{
+			DEBUG_LOG("heapam_getnext_stream: skip block (currentPrefetchBlock)");
+			continue;
+		}
+
+		ringbuf->currentPrefetchBlock = ItemPointerGetBlockNumber(tid);
+
+		return ringbuf->currentPrefetchBlock;
+	}
+
+	/* no more items in this scan */
+	return InvalidBlockNumber;
+}
+
 /* ----------------
  *		index_fetch_heap - get the scan's next heap tuple
  *
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 9eaecd943..d15a2ee81 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -467,7 +467,15 @@ index_restrpos(IndexScanDesc scan)
 	CHECK_SCAN_PROCEDURE(amgetbatch);
 	CHECK_SCAN_PROCEDURE(amposreset);
 
-	/* release resources (like buffer pins) from table accesses */
+	/*
+	 * release resources (like buffer pins) from table accesses
+	 *
+	 * XXX: Currently, the distance is always remembered across any
+	 * read_stream_reset calls (to work around the scan->ringbuf->reset
+	 * behavior of resetting the stream to deal with running out of batches).
+	 * We probably _should_ be forgetting the distance when we reset the
+	 * stream here (through our table_index_fetch_reset call), though.
+	 */
 	if (scan->xs_heapfetch)
 		table_index_fetch_reset(scan->xs_heapfetch);
 
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
index 86a1c6f56..4df92e7af 100644
--- a/src/backend/access/index/indexbatch.c
+++ b/src/backend/access/index/indexbatch.c
@@ -74,11 +74,16 @@ index_batch_init(IndexScanDesc scan)
 		(!scan->xs_want_itup && IsMVCCSnapshot(scan->xs_snapshot) &&
 		 RelationNeedsWAL(scan->indexRelation));
 	scan->finished = false;
+	scan->ringbuf->reset = false;
+	scan->ringbuf->prefetchingLockedIn = false;
+	scan->ringbuf->disabled = false;
+	scan->ringbuf->currentPrefetchBlock = InvalidBlockNumber;
 	scan->ringbuf->direction = NoMovementScanDirection;
 
 	/* positions in the ring buffer of batches */
 	batch_reset_pos(&scan->ringbuf->scanPos);
 	batch_reset_pos(&scan->ringbuf->markPos);
+	batch_reset_pos(&scan->ringbuf->prefetchPos);
 
 	scan->ringbuf->markBatch = NULL;
 	scan->ringbuf->headBatch = 0;	/* initial head batch */
@@ -107,9 +112,12 @@ index_batch_reset(IndexScanDesc scan, bool complete)
 	batch_assert_batches_valid(scan);
 	batch_debug_print_batches("index_batch_reset", scan);
 	Assert(scan->xs_heapfetch);
+	if (scan->xs_heapfetch->rs)
+		read_stream_reset(scan->xs_heapfetch->rs);
 
 	/* reset the positions */
 	batch_reset_pos(&ringbuf->scanPos);
+	batch_reset_pos(&ringbuf->prefetchPos);
 
 	/*
 	 * With "complete" reset, make sure to also free the marked batch, either
@@ -153,6 +161,8 @@ index_batch_reset(IndexScanDesc scan, bool complete)
 	ringbuf->nextBatch = 0;		/* initial batch is empty */
 
 	scan->finished = false;
+	ringbuf->reset = false;
+	ringbuf->currentPrefetchBlock = InvalidBlockNumber;
 
 	batch_assert_batches_valid(scan);
 }
@@ -213,9 +223,13 @@ index_batch_restore_pos(IndexScanDesc scan)
 {
 	BatchRingBuffer *ringbuf = scan->ringbuf;
 	BatchRingItemPos *markPos = &ringbuf->markPos;
-	BatchRingItemPos *scanPos = &ringbuf->scanPos ;
 	BatchIndexScan markBatch = ringbuf->markBatch;
 
+	/*
+	 * XXX Disable this optimization when I/O prefetching is in use, at least
+	 * until the possible interactions with prefetchPos are fully understood.
+	 */
+#if 0
 	if (scanPos->batch == markPos->batch &&
 		scanPos->batch == ringbuf->headBatch)
 	{
@@ -226,6 +240,7 @@ index_batch_restore_pos(IndexScanDesc scan)
 		scanPos->item = markPos->item;
 		return;
 	}
+#endif
 
 	/*
 	 * Call amposreset to let index AM know to invalidate any private state
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 16bf1f61a..23e7c0a2f 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -144,6 +144,7 @@ int			max_parallel_workers_per_gather = 2;
 
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
+bool		enable_indexscan_prefetch = true;
 bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 88717c2ff..7463651e0 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -99,6 +99,7 @@ struct ReadStream
 	int16		forwarded_buffers;
 	int16		pinned_buffers;
 	int16		distance;
+	int16		distance_old;
 	int16		initialized_buffers;
 	int			read_buffers_flags;
 	bool		sync_mode;		/* using io_method=sync */
@@ -464,6 +465,7 @@ read_stream_look_ahead(ReadStream *stream)
 		if (blocknum == InvalidBlockNumber)
 		{
 			/* End of stream. */
+			stream->distance_old = stream->distance;
 			stream->distance = 0;
 			break;
 		}
@@ -862,6 +864,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No more blocks, end of stream. */
+			stream->distance_old = stream->distance;
 			stream->distance = 0;
 			stream->oldest_buffer_index = stream->next_buffer_index;
 			stream->pinned_buffers = 0;
@@ -1046,6 +1049,9 @@ read_stream_reset(ReadStream *stream)
 	int16		index;
 	Buffer		buffer;
 
+	/* remember the old distance (if we reset before end of the stream) */
+	stream->distance_old = Max(stream->distance, stream->distance_old);
+
 	/* Stop looking ahead. */
 	stream->distance = 0;
 
@@ -1078,8 +1084,12 @@ read_stream_reset(ReadStream *stream)
 	Assert(stream->pinned_buffers == 0);
 	Assert(stream->ios_in_progress == 0);
 
-	/* Start off assuming data is cached. */
-	stream->distance = 1;
+	/*
+	 * Restore the old distance, if we have one. Otherwise start assuming data
+	 * is cached.
+	 */
+	stream->distance = Max(1, stream->distance_old);
+	stream->distance_old = 0;
 }
 
 /*
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 7c60b1255..a99aa41db 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -891,6 +891,13 @@
   boot_val => 'true',
 },
 
+{ name => 'enable_indexscan_prefetch', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
+  short_desc => 'Enables prefetching for index scans and index-only-scans.',
+  flags => 'GUC_EXPLAIN',
+  variable => 'enable_indexscan_prefetch',
+  boot_val => 'true',
+},
+
 { name => 'enable_material', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
   short_desc => 'Enables the planner\'s use of materialization.',
   flags => 'GUC_EXPLAIN',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dc9e2255f..da50ae15f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -412,6 +412,7 @@
 #enable_incremental_sort = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
+#enable_indexscan_prefetch = on
 #enable_material = on
 #enable_memoize = on
 #enable_mergejoin = on
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 3dd63fd88..b5628736b 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -159,6 +159,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_incremental_sort        | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
+ enable_indexscan_prefetch      | on
  enable_material                | on
  enable_memoize                 | on
  enable_mergejoin               | on
@@ -173,7 +174,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(25 rows)
+(26 rows)
 
 -- There are always wait event descriptions for various types.  InjectionPoint
 -- may be present or absent, depending on history since last postmaster start.
-- 
2.51.0



  [application/octet-stream] v7-0003-bufmgr-aio-Prototype-for-not-waiting-for-already-.patch (6.9K, 4-v7-0003-bufmgr-aio-Prototype-for-not-waiting-for-already-.patch)
  download | inline diff:
From fe184933383163d89d859fe7cf759704626c1154 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Fri, 15 Aug 2025 11:01:52 -0400
Subject: [PATCH v7 3/4] bufmgr: aio: Prototype for not waiting for
 already-in-progress IO

Author:
Reviewed-by:
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
Backpatch:
---
 src/include/storage/bufmgr.h        |   1 +
 src/backend/storage/buffer/bufmgr.c | 150 ++++++++++++++++++++++++++--
 2 files changed, 142 insertions(+), 9 deletions(-)

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 715ae96f0..0b6fa848a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -147,6 +147,7 @@ struct ReadBuffersOperation
 	int			flags;
 	int16		nblocks;
 	int16		nblocks_done;
+	bool		foreign_io;
 	PgAioWaitRef io_wref;
 	PgAioReturn io_return;
 };
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a036c2aa2..d3dd64808 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1632,6 +1632,46 @@ ReadBuffersCanStartIOOnce(Buffer buffer, bool nowait)
 		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
 }
 
+/*
+ * Check if the buffer is already undergoing read AIO. If it is, assign the
+ * IO's wait reference to operation->io_wref, thereby allowing the caller to
+ * wait for that IO.
+ */
+static inline bool
+ReadBuffersIOAlreadyInProgress(ReadBuffersOperation *operation, Buffer buffer)
+{
+	BufferDesc *desc;
+	uint32		buf_state;
+	PgAioWaitRef iow;
+
+	pgaio_wref_clear(&iow);
+
+	if (BufferIsLocal(buffer))
+	{
+		desc = GetLocalBufferDescriptor(-buffer - 1);
+		buf_state = pg_atomic_read_u32(&desc->state);
+		if ((buf_state & BM_IO_IN_PROGRESS) && !(buf_state & BM_VALID))
+			iow = desc->io_wref;
+	}
+	else
+	{
+		desc = GetBufferDescriptor(buffer - 1);
+		buf_state = LockBufHdr(desc);
+
+		if ((buf_state & BM_IO_IN_PROGRESS) && !(buf_state & BM_VALID))
+			iow = desc->io_wref;
+		UnlockBufHdr(desc);
+	}
+
+	if (pgaio_wref_valid(&iow))
+	{
+		operation->io_wref = iow;
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Helper for AsyncReadBuffers that tries to get the buffer ready for IO.
  */
@@ -1764,7 +1804,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			 *
 			 * we first check if we already know the IO is complete.
 			 */
-			if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
+			if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
 				!pgaio_wref_check_done(&operation->io_wref))
 			{
 				instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
@@ -1783,11 +1823,66 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				Assert(pgaio_wref_check_done(&operation->io_wref));
 			}
 
-			/*
-			 * We now are sure the IO completed. Check the results. This
-			 * includes reporting on errors if there were any.
-			 */
-			ProcessReadBuffersResult(operation);
+			if (unlikely(operation->foreign_io))
+			{
+				Buffer		buffer = operation->buffers[operation->nblocks_done];
+				BufferDesc *desc;
+				uint32		buf_state;
+
+				if (BufferIsLocal(buffer))
+				{
+					desc = GetLocalBufferDescriptor(-buffer - 1);
+					buf_state = pg_atomic_read_u32(&desc->state);
+				}
+				else
+				{
+					desc = GetBufferDescriptor(buffer - 1);
+					buf_state = LockBufHdr(desc);
+					UnlockBufHdr(desc);
+				}
+
+				if (buf_state & BM_VALID)
+				{
+					operation->nblocks_done += 1;
+					Assert(operation->nblocks_done <= operation->nblocks);
+
+					/*
+					 * Report and track this as a 'hit' for this backend, even
+					 * though it must have started out as a miss in
+					 * PinBufferForBlock(). The other backend (or ourselves,
+					 * as part of a read started earlier) will track this as a
+					 * 'read'.
+					 */
+					TRACE_POSTGRESQL_BUFFER_READ_DONE(operation->forknum,
+													  operation->blocknum + operation->nblocks_done,
+													  operation->smgr->smgr_rlocator.locator.spcOid,
+													  operation->smgr->smgr_rlocator.locator.dbOid,
+													  operation->smgr->smgr_rlocator.locator.relNumber,
+													  operation->smgr->smgr_rlocator.backend,
+													  true);
+
+					if (BufferIsLocal(buffer))
+						pgBufferUsage.local_blks_hit += 1;
+					else
+						pgBufferUsage.shared_blks_hit += 1;
+
+					if (operation->rel)
+						pgstat_count_buffer_hit(operation->rel);
+
+					pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
+
+					if (VacuumCostActive)
+						VacuumCostBalance += VacuumCostPageHit;
+				}
+			}
+			else
+			{
+				/*
+				 * We now are sure the IO completed. Check the results. This
+				 * includes reporting on errors if there were any.
+				 */
+				ProcessReadBuffersResult(operation);
+			}
 		}
 
 		/*
@@ -1873,6 +1968,43 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		io_object = IOOBJECT_RELATION;
 	}
 
+	/*
+	 * If AIO is in progress, be it in this backend or another backend, we
+	 * just associate the wait reference with the operation and wait in
+	 * WaitReadBuffers(). This turns out to be important for performance in
+	 * two workloads:
+	 *
+	 * 1) A read stream that has to read the same block multiple times within
+	 * the readahead distance. This can happen e.g. for the table accesses of
+	 * an index scan.
+	 *
+	 * 2) Concurrent scans by multiple backends on the same relation.
+	 *
+	 * If we were to synchronously wait for the in-progress IO, we'd not be
+	 * able to keep enough I/O in flight.
+	 *
+	 * If we do find there is ongoing I/O for the buffer, we set up a 1-block
+	 * ReadBuffersOperation that WaitReadBuffers then can wait on.
+	 *
+	 * It's possible that another backend starts IO on the buffer between this
+	 * check and the ReadBuffersCanStartIO(nowait = false) below. In that case
+	 * we will synchronously wait for the IO below, but the window for that is
+	 * small enough that it won't happen often enough to have a significant
+	 * performance impact.
+	 */
+	if (ReadBuffersIOAlreadyInProgress(operation, buffers[nblocks_done]))
+	{
+		*nblocks_progress = 1;
+		operation->foreign_io = true;
+
+		CheckReadBuffersOperation(operation, false);
+
+
+		return true;
+	}
+
+	operation->foreign_io = false;
+
 	/*
 	 * If zero_damaged_pages is enabled, add the READ_BUFFERS_ZERO_ON_ERROR
 	 * flag. The reason for that is that, hopefully, zero_damaged_pages isn't
@@ -1930,9 +2062,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	/*
 	 * Check if we can start IO on the first to-be-read buffer.
 	 *
-	 * If an I/O is already in progress in another backend, we want to wait
-	 * for the outcome: either done, or something went wrong and we will
-	 * retry.
+	 * If a synchronous I/O is in progress in another backend (it can't be
+	 * this backend), we want to wait for the outcome: either done, or
+	 * something went wrong and we will retry.
 	 */
 	if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
 	{
-- 
2.51.0



  [application/octet-stream] v7-0001-Add-batching-interfaces-used-by-heapam-and-nbtree.patch (201.6K, 5-v7-0001-Add-batching-interfaces-used-by-heapam-and-nbtree.patch)
  download | inline diff:
From a62617bf8ef00aeb70d193d820128a86f763f651 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 9 Sep 2025 19:50:03 -0400
Subject: [PATCH v7 1/4] Add batching interfaces used by heapam and nbtree.

Add a new amgetbatch index AM interface that allows index access methods
to implement plain/ordered index scans that return index entries in
per-leaf-page batches, rather than one at a time.  This enables a
variety of optimizations on the table AM side, most notably I/O
prefetching of heap tuples during ordered index scans.  It will also
enable an optimization that has heapam avoid repeatedly locking and
unlocking the same heap page's buffer.

Index access methods that support plain index scans must now implement
either the amgetbatch interface OR the amgettuple interface.  The
amgettuple interface will still be used by index AMs that require direct
control over the progress of index scans (e.g., GiST with KNN ordered
scans).

This commit also adds a new table AM interface callback, called by the
core executor through the new table_index_getnext_slot shim function.
This allows the table AM to directly manage the progress of index scans
rather than having individual TIDs passed in by the caller. The
amgetbatch interface is tightly coupled with the new approach to ordered
index scans added to the table AM.  The table AM can apply knowledge of
which TIDs will be returned to scan in the near future to optimize and
batch table AM block accesses, and to perform I/O prefetching.  These
optimizations are left as work for later commits.

Batches returned from amgetbatch are guaranteed to be associated with an
index page containing at least one matching tuple.  The amgetbatch
interface may hold buffer pins as interlocks against concurrent TID
recycling by VACUUM.  This extends/generalizes the mechanism added to
nbtree by commit 2ed5b87f to all index AMs that add support for the new
amgetbatch interface.

Author: Tomas Vondra <[email protected]>
Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/[email protected]
Discussion: https://postgr.es/m/efac3238-6f34-41ea-a393-26cc0441b506%40vondra.me
Discussion: https://postgr.es/m/CAH2-Wzk9%3Dx%3Da2TbcqYcX%2BXXmDHQr5%3D1v9m4Z_v8a-KwF1Zoz0A%40mail.gmail.com
---
 src/include/access/amapi.h                    |  22 +-
 src/include/access/genam.h                    |  24 +-
 src/include/access/heapam.h                   |   3 +
 src/include/access/nbtree.h                   | 176 ++----
 src/include/access/relscan.h                  | 252 +++++++-
 src/include/access/tableam.h                  |  41 ++
 src/include/executor/instrument_node.h        |   6 +
 src/include/nodes/execnodes.h                 |   2 -
 src/include/nodes/pathnodes.h                 |   2 +-
 src/backend/access/brin/brin.c                |   5 +-
 src/backend/access/gin/ginget.c               |   6 +-
 src/backend/access/gin/ginutil.c              |   5 +-
 src/backend/access/gist/gist.c                |   5 +-
 src/backend/access/hash/hash.c                |   5 +-
 src/backend/access/heap/heapam_handler.c      | 563 ++++++++++++++++-
 src/backend/access/index/Makefile             |   3 +-
 src/backend/access/index/genam.c              |  15 +-
 src/backend/access/index/indexam.c            | 135 ++--
 src/backend/access/index/indexbatch.c         | 589 ++++++++++++++++++
 src/backend/access/index/meson.build          |   1 +
 src/backend/access/nbtree/README              |   6 +-
 src/backend/access/nbtree/nbtpage.c           |   3 +
 src/backend/access/nbtree/nbtreadpage.c       | 195 +++---
 src/backend/access/nbtree/nbtree.c            | 299 +++------
 src/backend/access/nbtree/nbtsearch.c         | 510 ++++++---------
 src/backend/access/nbtree/nbtutils.c          |  93 +--
 src/backend/access/spgist/spgutils.c          |   5 +-
 src/backend/commands/explain.c                |  23 +-
 src/backend/commands/indexcmds.c              |   2 +-
 src/backend/executor/execAmi.c                |   2 +-
 src/backend/executor/execIndexing.c           |   6 +-
 src/backend/executor/execReplication.c        |   8 +-
 src/backend/executor/nodeBitmapIndexscan.c    |   1 +
 src/backend/executor/nodeIndexonlyscan.c      | 108 +---
 src/backend/executor/nodeIndexscan.c          |  13 +-
 src/backend/optimizer/path/indxpath.c         |   2 +-
 src/backend/optimizer/util/plancat.c          |   6 +-
 src/backend/replication/logical/relation.c    |   3 +-
 src/backend/utils/adt/amutils.c               |   8 +-
 src/backend/utils/adt/selfuncs.c              |  68 +-
 contrib/bloom/blutils.c                       |   5 +-
 doc/src/sgml/indexam.sgml                     | 311 +++++++--
 doc/src/sgml/ref/create_table.sgml            |  13 +-
 .../modules/dummy_index_am/dummy_index_am.c   |   5 +-
 src/tools/pgindent/typedefs.list              |   4 -
 45 files changed, 2319 insertions(+), 1240 deletions(-)
 create mode 100644 src/backend/access/index/indexbatch.c

diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index ecfbd017d..962c70f43 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -198,6 +198,15 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 									 ScanDirection direction);
 
+/* next batch of valid tuples */
+typedef BatchIndexScan (*amgetbatch_function) (IndexScanDesc scan,
+											   BatchIndexScan priorbatch,
+											   ScanDirection direction);
+
+/* release batch of valid tuples */
+typedef void (*amfreebatch_function) (IndexScanDesc scan,
+									  BatchIndexScan batch);
+
 /* fetch all valid tuples */
 typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 									   TIDBitmap *tbm);
@@ -205,11 +214,9 @@ typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 /* end index scan */
 typedef void (*amendscan_function) (IndexScanDesc scan);
 
-/* mark current scan position */
-typedef void (*ammarkpos_function) (IndexScanDesc scan);
-
-/* restore marked scan position */
-typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* invalidate index AM state that independently tracks scan's position */
+typedef void (*amposreset_function) (IndexScanDesc scan,
+									 BatchIndexScan batch);
 
 /*
  * Callback function signatures - for parallel index scans.
@@ -309,10 +316,11 @@ typedef struct IndexAmRoutine
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
 	amgettuple_function amgettuple; /* can be NULL */
+	amgetbatch_function amgetbatch; /* can be NULL */
+	amfreebatch_function amfreebatch;	/* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
-	ammarkpos_function ammarkpos;	/* can be NULL */
-	amrestrpos_function amrestrpos; /* can be NULL */
+	amposreset_function amposreset; /* can be NULL */
 
 	/* interface functions to support parallel index scans */
 	amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 4c0429cc6..01f059fc4 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -94,6 +94,7 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 
 /* struct definitions appear in relscan.h */
 typedef struct IndexScanDescData *IndexScanDesc;
+typedef struct BatchIndexScanData *BatchIndexScan;
 typedef struct SysScanDescData *SysScanDesc;
 
 typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
@@ -154,6 +155,7 @@ extern void index_insert_cleanup(Relation indexRelation,
 
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
+									 bool xs_want_itup,
 									 Snapshot snapshot,
 									 IndexScanInstrumentation *instrument,
 									 int nkeys, int norderbys);
@@ -180,14 +182,12 @@ extern void index_parallelscan_initialize(Relation heapRelation,
 extern void index_parallelrescan(IndexScanDesc scan);
 extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel,
+											  bool xs_want_itup,
 											  IndexScanInstrumentation *instrument,
 											  int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
-extern bool index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot);
-extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-							   TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
@@ -251,4 +251,22 @@ extern void systable_inplace_update_begin(Relation relation,
 extern void systable_inplace_update_finish(void *state, HeapTuple tuple);
 extern void systable_inplace_update_cancel(void *state);
 
+/*
+ * amgetbatch utilities called by indexam.c (in indexbatch.c)
+ */
+extern void index_batch_init(IndexScanDesc scan);
+extern void batch_free(IndexScanDesc scan, BatchIndexScan batch);
+extern void index_batch_reset(IndexScanDesc scan, bool complete);
+extern void index_batch_mark_pos(IndexScanDesc scan);
+extern void index_batch_restore_pos(IndexScanDesc scan);
+extern void index_batch_kill_item(IndexScanDesc scan);
+extern void index_batch_end(IndexScanDesc scan);
+
+/*
+ * amgetbatch utilities called by index AMs (in indexbatch.c)
+ */
+extern void indexam_util_batch_unlock(IndexScanDesc scan, BatchIndexScan batch);
+extern BatchIndexScan indexam_util_batch_alloc(IndexScanDesc scan);
+extern void indexam_util_batch_release(IndexScanDesc scan, BatchIndexScan batch);
+
 #endif							/* GENAM_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3c0961ab3..494f36ecc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -117,7 +117,10 @@ typedef struct IndexFetchHeapData
 	IndexFetchTableData xs_base;	/* AM independent part of the descriptor */
 
 	Buffer		xs_cbuf;		/* current heap buffer in scan, if any */
+	BlockNumber xs_blk;			/* xs_cbuf's block number, if any */
 	/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+
+	Buffer		vmbuf;			/* visibility map buffer */
 } IndexFetchHeapData;
 
 /* Result codes for HeapTupleSatisfiesVacuum */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 772248596..e2f210f07 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -924,112 +924,6 @@ typedef struct BTVacuumPostingData
 
 typedef BTVacuumPostingData *BTVacuumPosting;
 
-/*
- * BTScanOpaqueData is the btree-private state needed for an indexscan.
- * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
- * details of the preprocessing), information about the current location
- * of the scan, and information about the marked location, if any.  (We use
- * BTScanPosData to represent the data needed for each of current and marked
- * locations.)	In addition we can remember some known-killed index entries
- * that must be marked before we can move off the current page.
- *
- * Index scans work a page at a time: we pin and read-lock the page, identify
- * all the matching items on the page and save them in BTScanPosData, then
- * release the read-lock while returning the items to the caller for
- * processing.  This approach minimizes lock/unlock traffic.  We must always
- * drop the lock to make it okay for caller to process the returned items.
- * Whether or not we can also release the pin during this window will vary.
- * We drop the pin (when so->dropPin) to avoid blocking progress by VACUUM
- * (see nbtree/README section about making concurrent TID recycling safe).
- * We'll always release both the lock and the pin on the current page before
- * moving on to its sibling page.
- *
- * If we are doing an index-only scan, we save the entire IndexTuple for each
- * matched item, otherwise only its heap TID and offset.  The IndexTuples go
- * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.  Posting list tuples store a "base" tuple once,
- * allowing the same key to be returned for each TID in the posting list
- * tuple.
- */
-
-typedef struct BTScanPosItem	/* what we remember about each match */
-{
-	ItemPointerData heapTid;	/* TID of referenced heap item */
-	OffsetNumber indexOffset;	/* index item's location within page */
-	LocationIndex tupleOffset;	/* IndexTuple's offset in workspace, if any */
-} BTScanPosItem;
-
-typedef struct BTScanPosData
-{
-	Buffer		buf;			/* currPage buf (invalid means unpinned) */
-
-	/* page details as of the saved position's call to _bt_readpage */
-	BlockNumber currPage;		/* page referenced by items array */
-	BlockNumber prevPage;		/* currPage's left link */
-	BlockNumber nextPage;		/* currPage's right link */
-	XLogRecPtr	lsn;			/* currPage's LSN (when so->dropPin) */
-
-	/* scan direction for the saved position's call to _bt_readpage */
-	ScanDirection dir;
-
-	/*
-	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
-	 */
-	int			nextTupleOffset;
-
-	/*
-	 * moreLeft and moreRight track whether we think there may be matching
-	 * index entries to the left and right of the current page, respectively.
-	 */
-	bool		moreLeft;
-	bool		moreRight;
-
-	/*
-	 * The items array is always ordered in index order (ie, increasing
-	 * indexoffset).  When scanning backwards it is convenient to fill the
-	 * array back-to-front, so we start at the last slot and fill downwards.
-	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
-	 * itemIndex is a cursor showing which entry was last returned to caller.
-	 */
-	int			firstItem;		/* first valid index in items[] */
-	int			lastItem;		/* last valid index in items[] */
-	int			itemIndex;		/* current index in items[] */
-
-	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
-} BTScanPosData;
-
-typedef BTScanPosData *BTScanPos;
-
-#define BTScanPosIsPinned(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BufferIsValid((scanpos).buf) \
-)
-#define BTScanPosUnpin(scanpos) \
-	do { \
-		ReleaseBuffer((scanpos).buf); \
-		(scanpos).buf = InvalidBuffer; \
-	} while (0)
-#define BTScanPosUnpinIfPinned(scanpos) \
-	do { \
-		if (BTScanPosIsPinned(scanpos)) \
-			BTScanPosUnpin(scanpos); \
-	} while (0)
-
-#define BTScanPosIsValid(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BlockNumberIsValid((scanpos).currPage) \
-)
-#define BTScanPosInvalidate(scanpos) \
-	do { \
-		(scanpos).buf = InvalidBuffer; \
-		(scanpos).currPage = InvalidBlockNumber; \
-	} while (0)
-
 /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
 typedef struct BTArrayKeyInfo
 {
@@ -1050,6 +944,30 @@ typedef struct BTArrayKeyInfo
 	ScanKey		high_compare;	/* array's < or <= upper bound */
 } BTArrayKeyInfo;
 
+/*
+ * BTScanOpaqueData is the btree-private state needed for an indexscan.
+ * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
+ * details of the preprocessing), and information about the current array
+ * keys.  There are assumptions about how the current array keys track the
+ * progress of the index scan through the index's key space (see _bt_readpage
+ * and _bt_advance_array_keys), but we don't actually track anything about the
+ * current scan position in this opaque struct.  That is tracked externally,
+ * by implementing a ring buffer of "batches", where each batch represents the
+ * items returned by btgetbatch within a single leaf page.
+ *
+ * Index scans work a page at a time, as required by the amgetbatch contract:
+ * we pin and read-lock the page, identify all the matching items on the page
+ * and return them in a newly allocated batch.  We then release the read-lock
+ * using amgetbatch utility routines.  This approach minimizes lock/unlock
+ * traffic. _bt_next is passed priorbatch, which contains details of which
+ * page is next in line to be read (priorbatch is provided as an argument to
+ * btgetbatch by core code).
+ *
+ * If we are doing an index-only scan, we save the entire IndexTuple for each
+ * matched item, otherwise only its heap TID and offset.  This is also per the
+ * amgetbatch contract.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each TID in the posting list.
+ */
 typedef struct BTScanOpaqueData
 {
 	/* these fields are set by _bt_preprocess_keys(): */
@@ -1066,32 +984,6 @@ typedef struct BTScanOpaqueData
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	FmgrInfo   *orderProcs;		/* ORDER procs for required equality keys */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
-
-	/* info about killed items if any (killedItems is NULL if never used) */
-	int		   *killedItems;	/* currPos.items indexes of killed items */
-	int			numKilled;		/* number of currently stored items */
-	bool		dropPin;		/* drop leaf pin before btgettuple returns? */
-
-	/*
-	 * If we are doing an index-only scan, these are the tuple storage
-	 * workspaces for the currPos and markPos respectively.  Each is of size
-	 * BLCKSZ, so it can hold as much as a full page's worth of tuples.
-	 */
-	char	   *currTuples;		/* tuple storage for currPos */
-	char	   *markTuples;		/* tuple storage for markPos */
-
-	/*
-	 * If the marked position is on the same page as current position, we
-	 * don't use markPos, but just keep the marked itemIndex in markItemIndex
-	 * (all the rest of currPos is valid for the mark position). Hence, to
-	 * determine if there is a mark, first look at markItemIndex, then at
-	 * markPos.
-	 */
-	int			markItemIndex;	/* itemIndex, or -1 if not valid */
-
-	/* keep these last in struct for efficiency */
-	BTScanPosData currPos;		/* current position data */
-	BTScanPosData markPos;		/* marked position, if any */
 } BTScanOpaqueData;
 
 typedef BTScanOpaqueData *BTScanOpaque;
@@ -1160,14 +1052,16 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
 extern Size btestimateparallelscan(Relation rel, int nkeys, int norderbys);
 extern void btinitparallelscan(void *target);
-extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
+extern BatchIndexScan btgetbatch(IndexScanDesc scan,
+								 BatchIndexScan priorbatch,
+								 ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					 ScanKey orderbys, int norderbys);
+extern void btfreebatch(IndexScanDesc scan, BatchIndexScan batch);
 extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
-extern void btmarkpos(IndexScanDesc scan);
-extern void btrestrpos(IndexScanDesc scan);
+extern void btposreset(IndexScanDesc scan, BatchIndexScan markbatch);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
@@ -1271,8 +1165,9 @@ extern void _bt_preprocess_keys(IndexScanDesc scan);
 /*
  * prototypes for functions in nbtreadpage.c
  */
-extern bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
-						 OffsetNumber offnum, bool firstpage);
+extern bool _bt_readpage(IndexScanDesc scan, BatchIndexScan newbatch,
+						 ScanDirection dir, OffsetNumber offnum,
+						 bool firstpage);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
 extern int	_bt_binsrch_array_skey(FmgrInfo *orderproc,
 								   bool cur_elem_trig, ScanDirection dir,
@@ -1287,8 +1182,9 @@ extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
-extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern BatchIndexScan _bt_first(IndexScanDesc scan, ScanDirection dir);
+extern BatchIndexScan _bt_next(IndexScanDesc scan, ScanDirection dir,
+							   BatchIndexScan priorbatch);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
 
 /*
@@ -1296,7 +1192,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
  */
 extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
-extern void _bt_killitems(IndexScanDesc scan);
+extern void _bt_killitems(IndexScanDesc scan, BatchIndexScan batch);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
 extern void _bt_end_vacuum(Relation rel);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce340c076..ca1207be6 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,8 +16,10 @@
 
 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "access/sdir.h"
 #include "nodes/tidbitmap.h"
 #include "port/atomics.h"
+#include "storage/buf.h"
 #include "storage/relfilelocator.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
@@ -124,6 +126,179 @@ typedef struct IndexFetchTableData
 	Relation	rel;
 } IndexFetchTableData;
 
+/*
+ * Location of a BatchMatchingItem that appears in a BatchIndexScan returned
+ * by (and subsequently passed to) an amgetbatch routine
+ */
+typedef struct BatchRingItemPos
+{
+	/* BatchRingBuffer.batches[]-wise index to relevant BatchIndexScan */
+	int			batch;
+
+	/* BatchIndexScan.items[]-wise index to relevant BatchMatchingItem */
+	int			item;
+} BatchRingItemPos;
+
+static inline void
+batch_reset_pos(BatchRingItemPos *pos)
+{
+	pos->batch = -1;
+	pos->item = -1;
+}
+
+/*
+ * Matching item returned by amgetbatch (in returned BatchIndexScan) during an
+ * index scan.  Used by table AM to locate relevant matching table tuple.
+ */
+typedef struct BatchMatchingItem
+{
+	ItemPointerData heapTid;	/* TID of referenced heap item */
+	OffsetNumber indexOffset;	/* index item's location within page */
+	LocationIndex tupleOffset;	/* IndexTuple's offset in workspace, if any */
+	bool		allVisible;		/* TID points to all-visible page */
+} BatchMatchingItem;
+
+/*
+ * Data about one batch of items returned by (and passed to) amgetbatch during
+ * index scans
+ */
+typedef struct BatchIndexScanData
+{
+	/*
+	 * Information output by amgetbatch index AMs upon returning a batch with
+	 * one or more matching items, describing details of the index page where
+	 * matches were located.
+	 *
+	 * Used in the next amgetbatch call to determine which index page to read
+	 * next (or to determine if there's no further matches in current scan
+	 * direction).
+	 */
+	BlockNumber currPage;		/* Index page with matching items */
+	BlockNumber prevPage;		/* currPage's left link */
+	BlockNumber nextPage;		/* currPage's right link */
+
+	Buffer		buf;			/* currPage buf (invalid means unpinned) */
+	XLogRecPtr	lsn;			/* currPage's LSN (when dropPin) */
+
+	/* scan direction when the index page was read */
+	ScanDirection dir;
+
+	/*
+	 * moreLeft and moreRight track whether we think there may be matching
+	 * index entries to the left and right of the current page, respectively
+	 */
+	bool		moreLeft;
+	bool		moreRight;
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+
+	/* info about killed items if any (killedItems is NULL if never used) */
+	int		   *killedItems;	/* indexes of killed items */
+	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * Matching items state for this batch.
+	 *
+	 * If we are doing an index-only scan, these are the tuple storage
+	 * workspaces for the matching tuples (tuples referenced by items[]). Each
+	 * is of size BLCKSZ, so it can hold as much as a full page's worth of
+	 * tuples.
+	 */
+	char	   *currTuples;		/* tuple storage for items[] */
+	BatchMatchingItem items[FLEXIBLE_ARRAY_MEMBER];
+} BatchIndexScanData;
+
+typedef struct BatchIndexScanData *BatchIndexScan;
+
+/*
+ * Maximum number of batches (leaf pages) we can keep in memory.  We need a
+ * minimum of two, since we'll only consider releasing one batch when another
+ * is read.
+ */
+#define INDEX_SCAN_MAX_BATCHES		2
+#define INDEX_SCAN_CACHE_BATCHES	2
+#define INDEX_SCAN_BATCH_COUNT(scan) \
+	((scan)->ringbuf->nextBatch - (scan)->ringbuf->headBatch)
+
+/* Did we already load batch with the requested index? */
+#define INDEX_SCAN_BATCH_LOADED(scan, idx) \
+	((idx) >= (scan)->ringbuf->headBatch && (idx) < (scan)->ringbuf->nextBatch)
+
+/* Have we loaded the maximum number of batches? */
+#define INDEX_SCAN_BATCH_FULL(scan) \
+	(INDEX_SCAN_BATCH_COUNT(scan) == INDEX_SCAN_MAX_BATCHES)
+
+/* Return batch for the provided index */
+#define INDEX_SCAN_BATCH(scan, idx)	\
+( \
+	AssertMacro(INDEX_SCAN_BATCH_LOADED(scan, idx)), \
+	((scan)->ringbuf->batches[(idx) % INDEX_SCAN_MAX_BATCHES]) \
+)
+
+/* Append given batch to scan's batch ring buffer */
+#define INDEX_SCAN_BATCH_APPEND(scan, batch) \
+	do { \
+		BatchRingBuffer *mringbuf = (scan)->ringbuf;	\
+		int				nextBatch = mringbuf->nextBatch; \
+		mringbuf->batches[nextBatch % INDEX_SCAN_MAX_BATCHES] = (batch); \
+		mringbuf->nextBatch++; \
+	} while(0)
+
+/* Is the position invalid/undefined? */
+#define INDEX_SCAN_POS_INVALID(pos) \
+		(((pos)->batch == -1) && ((pos)->item == -1))
+
+#ifdef INDEXAM_DEBUG
+#define DEBUG_LOG(...) elog(AmRegularBackendProcess() ? NOTICE : DEBUG2, __VA_ARGS__)
+#else
+#define DEBUG_LOG(...)
+#endif
+
+/*
+ * State used by table AMs to manage an index scan that uses the amgetbatch
+ * interface.  Scans use a ring buffer of batches returned by amgetbatch.
+ *
+ * Batches are kept in the order that they were returned in by amgetbatch,
+ * since that is the same order that table_index_getnext_slot will return
+ * matches in.  However, table AMs are free to fetch table tuples in whatever
+ * order is most convenient/efficient -- provided that such reordering cannot
+ * affect the order that table_index_getnext_slot later returns tuples in.
+ */
+typedef struct BatchRingBuffer
+{
+	/* Current scan direction, for the currently loaded batches */
+	ScanDirection direction;
+
+	/* current positions in batches[] for scan */
+	BatchRingItemPos scanPos;	/* scan's read position */
+	BatchRingItemPos markPos;	/* mark/restore position */
+
+	BatchIndexScan markBatch;
+
+	/*
+	 * Array of batches returned by the AM. The array has a capacity (but can
+	 * be resized if needed). The headBatch is an index of the batch we're
+	 * currently reading from (this needs to be translated by modulo
+	 * INDEX_SCAN_MAX_BATCHES into index in the batches array).
+	 */
+	int			headBatch;		/* head batch slot */
+	int			nextBatch;		/* next empty batch slot */
+
+	/* Array of pointers to cached recyclable batches */
+	BatchIndexScan cache[INDEX_SCAN_CACHE_BATCHES];
+
+	/* Array of pointers to ring buffer batches */
+	BatchIndexScan batches[INDEX_SCAN_MAX_BATCHES];
+
+} BatchRingBuffer;
+
 struct IndexScanInstrumentation;
 
 /*
@@ -141,6 +316,13 @@ typedef struct IndexScanDescData
 	int			numberOfOrderBys;	/* number of ordering operators */
 	struct ScanKeyData *keyData;	/* array of index qualifier descriptors */
 	struct ScanKeyData *orderByData;	/* array of ordering op descriptors */
+
+	/* index access method's private state */
+	void	   *opaque;			/* access-method-specific info */
+
+	/* table access method's private amgetbatch state */
+	BatchRingBuffer *ringbuf;	/* amgetbatch related state */
+
 	bool		xs_want_itup;	/* caller requests index tuples */
 	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
 
@@ -149,9 +331,13 @@ typedef struct IndexScanDescData
 	bool		ignore_killed_tuples;	/* do not return killed entries */
 	bool		xactStartedInRecovery;	/* prevents killing/seeing killed
 										 * tuples */
+	/* amgetbatch can safely drop pins on returned batch's index page? */
+	bool		dropPin;
 
-	/* index access method's private state */
-	void	   *opaque;			/* access-method-specific info */
+	/*
+	 * Did we read the final batch in this scan direction?
+	 */
+	bool		finished;
 
 	/*
 	 * Instrumentation counters maintained by all index AMs during both
@@ -176,6 +362,8 @@ typedef struct IndexScanDescData
 	IndexFetchTableData *xs_heapfetch;
 
 	bool		xs_recheck;		/* T means scan keys must be rechecked */
+	bool		xs_visible;		/* T means the heap page is all-visible */
+	uint16		maxitemsbatch;	/* set by ambeginscan when amgetbatch used */
 
 	/*
 	 * When fetching with an ordering operator, the values of the ORDER BY
@@ -215,4 +403,64 @@ typedef struct SysScanDescData
 	struct TupleTableSlot *slot;
 } SysScanDescData;
 
+/*
+ * Check that a position (batch,item) is valid with respect to the batches we
+ * have currently loaded.
+ */
+static inline void
+batch_assert_pos_valid(IndexScanDescData *scan, BatchRingItemPos *pos)
+{
+#ifdef USE_ASSERT_CHECKING
+	BatchRingBuffer *ringbuf = scan->ringbuf;
+	BatchIndexScan batch = INDEX_SCAN_BATCH(scan, pos->batch);
+
+	/* make sure the position is valid for currently loaded batches */
+	Assert(pos->batch >= ringbuf->headBatch);
+	Assert(pos->batch < ringbuf->nextBatch);
+	Assert(pos->item >= batch->firstItem);
+	Assert(pos->item <= batch->lastItem);
+#endif
+}
+
+/*
+ * Check a single batch is valid.
+ */
+static inline void
+batch_assert_batch_valid(IndexScanDescData *scan, BatchIndexScan batch)
+{
+	/* batch must have one or more matching items returned by index AM */
+	Assert(batch->firstItem >= 0 && batch->firstItem <= batch->lastItem);
+	Assert(batch->items != NULL);
+
+	/*
+	 * The number of killed items must be valid, and there must be an array of
+	 * indexes if there are items.
+	 */
+	Assert(batch->numKilled >= 0);
+	Assert(!(batch->numKilled > 0 && batch->killedItems == NULL));
+}
+
+static inline void
+batch_assert_batches_valid(IndexScanDescData *scan)
+{
+#ifdef USE_ASSERT_CHECKING
+	BatchRingBuffer *ringbuf = scan->ringbuf;
+
+	/* we should have batches initialized */
+	Assert(ringbuf != NULL);
+
+	/* The head/next indexes should define a valid range */
+	Assert(ringbuf->headBatch >= 0 &&
+		   ringbuf->headBatch <= ringbuf->nextBatch);
+
+	/* Check all current batches */
+	for (int i = ringbuf->headBatch; i < ringbuf->nextBatch; i++)
+	{
+		BatchIndexScan batch = INDEX_SCAN_BATCH(scan, i);
+
+		batch_assert_batch_valid(scan, batch);
+	}
+#endif
+}
+
 #endif							/* RELSCAN_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e2ec5289d..98e337972 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -433,11 +433,29 @@ typedef struct TableAmRoutine
 	 */
 	void		(*index_fetch_end) (struct IndexFetchTableData *data);
 
+	/*
+	 * Fetch the next tuple from an index scan into slot, scanning in the
+	 * specified direction, and return true if a tuple was found, false
+	 * otherwise.
+	 *
+	 * This callback allows the table AM to directly manage the scan process,
+	 * including interfacing with the index AM. The caller simply specifies
+	 * the direction of the scan; the table AM takes care of retrieving TIDs
+	 * from the index, performing visibility checks, and returning tuples in
+	 * the slot.
+	 */
+	bool		(*index_getnext_slot) (IndexScanDesc scan,
+									   ScanDirection direction,
+									   TupleTableSlot *slot);
+
 	/*
 	 * Fetch tuple at `tid` into `slot`, after doing a visibility test
 	 * according to `snapshot`. If a tuple was found and passed the visibility
 	 * test, return true, false otherwise.
 	 *
+	 * This is a lower-level callback that takes a TID from the caller.
+	 * Callers should favor the index_getnext_slot callback whenever possible.
+	 *
 	 * Note that AMs that do not necessarily update indexes when indexed
 	 * columns do not change, need to return the current/correct version of
 	 * the tuple that is visible to the snapshot, even if the tid points to an
@@ -1188,6 +1206,26 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
 	scan->rel->rd_tableam->index_fetch_end(scan);
 }
 
+/*
+ * Fetch the next tuple from an index scan into `slot`, scanning in the
+ * specified direction. Returns true if a tuple was found, false otherwise.
+ *
+ * The index scan should have been started via table_index_fetch_begin().
+ * Callers must check scan->xs_recheck and recheck scan keys if required.
+ *
+ * Index-only scan callers (that pass xs_want_itup=true to index_beginscan)
+ * can consume index tuple results by examining IndexScanDescData fields such
+ * as xs_itup and xs_hitup.  The table AM won't usually fetch a heap tuple
+ * into the provided slot in the case of xs_want_itup=true callers.
+ */
+static inline bool
+table_index_getnext_slot(IndexScanDesc iscan, ScanDirection direction,
+						 TupleTableSlot *slot)
+{
+	return iscan->heapRelation->rd_tableam->index_getnext_slot(iscan,
+															   direction, slot);
+}
+
 /*
  * Fetches, as part of an index scan, tuple at `tid` into `slot`, after doing
  * a visibility test according to `snapshot`. If a tuple was found and passed
@@ -1211,6 +1249,9 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
  * entry (like heap's HOT). Whereas table_tuple_fetch_row_version() only
  * evaluates the tuple exactly at `tid`. Outside of index entry ->table tuple
  * lookups, table_tuple_fetch_row_version() is what's usually needed.
+ *
+ * This is a lower-level interface that takes a TID from the caller.  Callers
+ * should favor the table_index_getnext_slot interface whenever possible.
  */
 static inline bool
 table_index_fetch_tuple(struct IndexFetchTableData *scan,
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 8847d7f94..b5b8f509a 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -48,6 +48,12 @@ typedef struct IndexScanInstrumentation
 {
 	/* Index search count (incremented with pgstat_count_index_scan call) */
 	uint64		nsearches;
+
+	/*
+	 * heap blocks fetched counts (incremented by index_getnext_slot calls
+	 * within table AMs, though only during index-only scans)
+	 */
+	uint64		nheapfetches;
 } IndexScanInstrumentation;
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f8053d9e5..793b1a3c6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1753,7 +1753,6 @@ typedef struct IndexScanState
  *		Instrument		   local index scan instrumentation
  *		SharedInfo		   parallel worker instrumentation (no leader entry)
  *		TableSlot		   slot for holding tuples fetched from the table
- *		VMBuffer		   buffer in use for visibility map testing, if any
  *		PscanLen		   size of parallel index-only scan descriptor
  *		NameCStringAttNums attnums of name typed columns to pad to NAMEDATALEN
  *		NameCStringCount   number of elements in the NameCStringAttNums array
@@ -1776,7 +1775,6 @@ typedef struct IndexOnlyScanState
 	IndexScanInstrumentation ioss_Instrument;
 	SharedIndexScanInstrumentation *ioss_SharedInfo;
 	TupleTableSlot *ioss_TableSlot;
-	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
 	AttrNumber *ioss_NameCStringAttNums;
 	int			ioss_NameCStringCount;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 449885b93..3c81602b4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1344,7 +1344,7 @@ typedef struct IndexOptInfo
 	/* does AM have amgetbitmap interface? */
 	bool		amhasgetbitmap;
 	bool		amcanparallel;
-	/* does AM have ammarkpos interface? */
+	/* is AM prepared for us to restore a mark? */
 	bool		amcanmarkpos;
 	/* AM's cost estimator */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 6887e4214..d5d01b877 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -294,10 +294,11 @@ brinhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = brinbeginscan,
 		.amrescan = brinrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = bringetbitmap,
 		.amendscan = brinendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 6b148e69a..8f7033d62 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -1953,9 +1953,9 @@ gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	 * into the main index, and so we might visit it a second time during the
 	 * main scan.  This is okay because we'll just re-set the same bit in the
 	 * bitmap.  (The possibility of duplicate visits is a major reason why GIN
-	 * can't support the amgettuple API, however.) Note that it would not do
-	 * to scan the main index before the pending list, since concurrent
-	 * cleanup could then make us miss entries entirely.
+	 * can't support either the amgettuple or amgetbatch API.) Note that it
+	 * would not do to scan the main index before the pending list, since
+	 * concurrent cleanup could then make us miss entries entirely.
 	 */
 	scanPendingInsert(scan, tbm, &ntids);
 
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d205093e2..1263dc180 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -82,10 +82,11 @@ ginhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = ginbeginscan,
 		.amrescan = ginrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = gingetbitmap,
 		.amendscan = ginendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index d5944205d..9c219cadf 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -103,10 +103,11 @@ gisthandler(PG_FUNCTION_ARGS)
 		.ambeginscan = gistbeginscan,
 		.amrescan = gistrescan,
 		.amgettuple = gistgettuple,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = gistgetbitmap,
 		.amendscan = gistendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e88ddb32a..6a20b67f6 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -102,10 +102,11 @@ hashhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = hashbeginscan,
 		.amrescan = hashrescan,
 		.amgettuple = hashgettuple,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cbef73e5d..03b4fada9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -19,6 +19,7 @@
  */
 #include "postgres.h"
 
+#include "access/amapi.h"
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/heaptoast.h"
@@ -81,10 +82,12 @@ heapam_slot_callbacks(Relation relation)
 static IndexFetchTableData *
 heapam_index_fetch_begin(Relation rel)
 {
-	IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);
+	IndexFetchHeapData *hscan = palloc_object(IndexFetchHeapData);
 
 	hscan->xs_base.rel = rel;
 	hscan->xs_cbuf = InvalidBuffer;
+	hscan->xs_blk = InvalidBlockNumber;
+	hscan->vmbuf = InvalidBuffer;
 
 	return &hscan->xs_base;
 }
@@ -94,10 +97,12 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
 
+	/* deliberately don't drop VM buffer pin here */
 	if (BufferIsValid(hscan->xs_cbuf))
 	{
 		ReleaseBuffer(hscan->xs_cbuf);
 		hscan->xs_cbuf = InvalidBuffer;
+		hscan->xs_blk = InvalidBlockNumber;
 	}
 }
 
@@ -108,6 +113,12 @@ heapam_index_fetch_end(IndexFetchTableData *scan)
 
 	heapam_index_fetch_reset(scan);
 
+	if (hscan->vmbuf != InvalidBuffer)
+	{
+		ReleaseBuffer(hscan->vmbuf);
+		hscan->vmbuf = InvalidBuffer;
+	}
+
 	pfree(hscan);
 }
 
@@ -125,22 +136,32 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
 	/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
-	if (!*call_again)
+	if (hscan->xs_blk != ItemPointerGetBlockNumber(tid))
 	{
-		/* Switch to correct buffer if we don't have it already */
-		Buffer		prev_buf = hscan->xs_cbuf;
+		Assert(!*call_again);
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
+		/* Remember this buffer's block number for next time */
+		hscan->xs_blk = ItemPointerGetBlockNumber(tid);
+
+		if (BufferIsValid(hscan->xs_cbuf))
+			ReleaseBuffer(hscan->xs_cbuf);
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * When using a read stream, the stream will already know which block
+		 * number comes next (though an assertion will verify a match below)
 		 */
-		if (prev_buf != hscan->xs_cbuf)
-			heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf);
+		hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
+
+		/*
+		 * Prune page when it is pinned for the first time
+		 */
+		heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf);
 	}
 
+	/* Assert that the TID's block number's buffer is now pinned */
+	Assert(BufferIsValid(hscan->xs_cbuf));
+	Assert(BufferGetBlockNumber(hscan->xs_cbuf) == hscan->xs_blk);
+
 	/* Obtain share-lock on the buffer so we can examine visibility */
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
 	got_heap_tuple = heap_hot_search_buffer(tid,
@@ -173,6 +194,521 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 	return got_heap_tuple;
 }
 
+static pg_noinline void
+heapam_batch_rewind(IndexScanDesc scan, BatchRingBuffer *ringbuf,
+					ScanDirection direction)
+{
+	/*
+	 * Handle a change in the scan's direction.
+	 *
+	 * Release future batches properly, to make it look like the current batch
+	 * is the only one we loaded.
+	 */
+	while (ringbuf->nextBatch > ringbuf->headBatch + 1)
+	{
+		/* release "later" batches in reverse order */
+		BatchIndexScan fbatch;
+
+		fbatch = INDEX_SCAN_BATCH(scan, ringbuf->nextBatch - 1);
+		batch_free(scan, fbatch);
+		ringbuf->nextBatch--;
+	}
+
+	/*
+	 * Remember the new direction, and make sure the scan is not marked as
+	 * "finished" (we might have already read the last batch, but now we need
+	 * to start over).
+	 */
+	ringbuf->direction = direction;
+	scan->finished = false;
+}
+
+static inline ItemPointer
+heapam_batch_return_tid(IndexScanDesc scan, BatchIndexScan scanBatch,
+						BatchRingItemPos *scanPos)
+{
+	batch_assert_pos_valid(scan, scanPos);
+
+	pgstat_count_index_tuples(scan->indexRelation, 1);
+
+	/* set the TID / itup for the scan */
+	scan->xs_heaptid = scanBatch->items[scanPos->item].heapTid;
+
+	/* plain index scans will have flags left set to 0 */
+	scan->xs_visible = scanBatch->items[scanPos->item].allVisible;
+
+	if (scan->xs_want_itup)
+		scan->xs_itup =
+			(IndexTuple) (scanBatch->currTuples +
+						  scanBatch->items[scanPos->item].tupleOffset);
+
+	return &scan->xs_heaptid;
+}
+
+/*
+ * heap_batch_resolve_visibility
+ *		Obtain visibility information for every TID from caller's batch.
+ */
+static void
+heap_batch_resolve_visibility(IndexScanDesc scan, BatchIndexScan batch)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+
+	for (int i = batch->firstItem; i <= batch->lastItem; i++)
+	{
+		BatchMatchingItem *item = &batch->items[i];
+		ItemPointer tid = &item->heapTid;
+
+		if (VM_ALL_VISIBLE(scan->heapRelation,
+						   ItemPointerGetBlockNumber(tid),
+						   &hscan->vmbuf))
+		{
+			item->allVisible = true;
+		}
+	}
+}
+
+/*
+ * heap_batchpos_advance
+ *		Advance position to its next item in the batch.
+ *
+ * Advance to the next item within the provided batch (or to the previous item,
+ * when scanning backwards).
+ *
+ * Returns true if the position could be advanced.  Returns false when there
+ * are no more items in the batch in the given direction.
+ */
+static inline bool
+heap_batchpos_advance(BatchIndexScan batch, BatchRingItemPos *pos,
+					  ScanDirection direction)
+{
+	Assert(!INDEX_SCAN_POS_INVALID(pos));
+
+	if (ScanDirectionIsForward(direction))
+	{
+		if (++pos->item > batch->lastItem)
+			return false;
+	}
+	else						/* ScanDirectionIsBackward */
+	{
+		if (--pos->item < batch->firstItem)
+			return false;
+	}
+
+	/* Advanced within batch */
+	return true;
+}
+
+/*
+ * heap_batchpos_newbatch
+ *		Advance batch position start of its new batch.
+ *
+ * Sets the given position to the fist item in the given scan direction (or to
+ * the last item, when scanning backwards).   Also advances/increments batch
+ * offset from position such that it points to newBatchForPos.
+ */
+static inline void
+heap_batchpos_newbatch(BatchIndexScan newBatchForPos, BatchRingItemPos *pos,
+					   ScanDirection direction)
+{
+	Assert(newBatchForPos->dir == direction);
+
+	/* Next batch successfully loaded */
+	pos->batch++;
+	if (ScanDirectionIsForward(direction))
+		pos->item = newBatchForPos->firstItem;
+	else
+		pos->item = newBatchForPos->lastItem;
+
+	Assert(!INDEX_SCAN_POS_INVALID(pos));
+}
+
+/* ----------------
+ *		heap_batch_getnext - get the next batch of TIDs from a scan
+ *
+ * Called when we need to load the next batch of index entries to process in
+ * the given direction.
+ *
+ * Returns the next batch to be processed by the index scan, or NULL when
+ * there are no more matches in the given scan direction.  Also appends the
+ * returned batch to the end of the scan's batch ring buffer.
+ * ----------------
+ */
+static BatchIndexScan
+heap_batch_getnext(IndexScanDesc scan, BatchIndexScan priorbatch,
+				   ScanDirection direction)
+{
+	BatchRingBuffer *ringbuf = scan->ringbuf;
+	BatchIndexScan batch = NULL;
+
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
+
+	/*
+	 * When caller provides a priorbatch it had better be for the last valid
+	 * batch currently in the batch ring buffer (otherwise appending a new
+	 * batch would result batches that aren't in scan order, which is wrong).
+	 */
+	Assert(!INDEX_SCAN_BATCH_FULL(scan));
+	Assert(!priorbatch ||
+		   (INDEX_SCAN_BATCH_COUNT(scan) > 0 &&
+			INDEX_SCAN_BATCH(scan, ringbuf->nextBatch - 1) == priorbatch));
+
+	if (scan->finished)
+		return NULL;
+
+	batch = scan->indexRelation->rd_indam->amgetbatch(scan, priorbatch,
+													  direction);
+	if (batch != NULL)
+	{
+		/* We got the batch from the AM -- append it */
+		Assert(batch->dir == direction);
+
+		if (scan->xs_want_itup)
+		{
+			/*
+			 * Index-only scan.  Eagerly fetch visibility info from visibility
+			 * map for all batch item TIDs.
+			 */
+			heap_batch_resolve_visibility(scan, batch);
+		}
+
+		INDEX_SCAN_BATCH_APPEND(scan, batch);
+
+		DEBUG_LOG("batch_getnext headBatch %d nextBatch %d batch %p",
+				  ringbuf->headBatch, ringbuf->nextBatch, batch);
+	}
+
+	/* xs_hitup is not supported by amgetbatch scans */
+	Assert(!scan->xs_hitup);
+
+	batch_assert_batches_valid(scan);
+
+	return batch;
+}
+
+/* ----------------
+ *		heapam_batch_getnext_tid - get next TID from batch ring buffer
+ *
+ * This function implements heapam's version of getting the next TID from an
+ * index scan that uses the amgetbatch interface.  It is implemented using
+ * various indexbatch.c utility routines.
+ * ----------------
+ */
+static ItemPointer
+heapam_batch_getnext_tid(IndexScanDesc scan, ScanDirection direction)
+{
+	BatchRingBuffer *ringbuf = scan->ringbuf;
+	BatchRingItemPos *scanPos = &ringbuf->scanPos;
+	BatchIndexScan scanBatch = NULL;
+
+	/* shouldn't get here without batching */
+	batch_assert_batches_valid(scan);
+
+	/* xs_hitup is not supported by amgetbatch scans */
+	Assert(!scan->xs_hitup);
+
+	/* Initialize direction on first call */
+	if (ringbuf->direction == NoMovementScanDirection)
+		ringbuf->direction = direction;
+
+	if (unlikely(ringbuf->direction != direction))
+	{
+		/* We may change direction after reading the last batch. */
+		scan->finished = false;
+	}
+
+	/*
+	 * Try advancing the position in the current batch. If that doesn't
+	 * succeed, it means we don't have more items in it, and we need to
+	 * advance to the next one (in the new scan direction).
+	 */
+	if (INDEX_SCAN_BATCH_LOADED(scan, scanPos->batch))
+	{
+		scanBatch = INDEX_SCAN_BATCH(scan, scanPos->batch);
+
+		if (heap_batchpos_advance(scanBatch, scanPos, direction))
+			return heapam_batch_return_tid(scan, scanBatch, scanPos);
+	}
+
+	if (unlikely(ringbuf->direction != direction))
+	{
+		/* XXX shouldn't heapam_batch_rewind update the scanPos too? */
+		heapam_batch_rewind(scan, ringbuf, direction);
+
+		/*
+		 * XXX It seems a bit weird to update just the batch part of the
+		 * scanPos. Doesn't it make it rather wrong, with the item still set
+		 * from the original batch? The next code block sets item too. So it
+		 * seems we're doing this only to "fake" the batch, and then the next
+		 * block will advance batch and reset the item. It's confusing, worth
+		 * documenting. Maybe we should set item=-1?
+		 */
+		scanPos->batch = ringbuf->nextBatch - 1;
+	}
+
+	/*
+	 * Ran out of items from scanBatch.  Try to advance it to next batch.
+	 */
+	if (INDEX_SCAN_BATCH_LOADED(scan, scanPos->batch + 1))
+	{
+		/* Next batch already loaded by heapam_getnext_stream */
+		scanBatch = INDEX_SCAN_BATCH(scan, scanPos->batch + 1);
+	}
+	else if ((scanBatch = heap_batch_getnext(scan, scanBatch, direction)) != NULL)
+	{
+		/* Called amgetbatch again, loading new scanBatch into ring buffer */
+	}
+	else
+	{
+		/*
+		 * There are no more batches to be loaded in the current scan
+		 * direction.  Defensively reset the read position.
+		 */
+		batch_reset_pos(scanPos);
+		scan->finished = true;
+
+		return NULL;
+	}
+
+	/* Position scanPos to the start of new scanBatch */
+	heap_batchpos_newbatch(scanBatch, scanPos, direction);
+	Assert(INDEX_SCAN_BATCH(scan, scanPos->batch) == scanBatch);
+
+	/* Free now-unneeded older batch/prior scanBatch */
+	if (scanPos->batch != ringbuf->headBatch)
+	{
+		BatchIndexScan headBatch = INDEX_SCAN_BATCH(scan,
+													ringbuf->headBatch);
+
+		/* Free the head batch (except when it's markBatch) */
+		batch_free(scan, headBatch);
+
+		/*
+		 * In any case, remove the batch from the ring buffer, even if we kept
+		 * it for mark/restore
+		 */
+		ringbuf->headBatch++;
+
+		/* we can't skip any batches */
+		Assert(ringbuf->headBatch == scanPos->batch);
+	}
+
+	return heapam_batch_return_tid(scan, scanBatch, scanPos);
+}
+
+/* ----------------
+ *		index_fetch_heap - get the scan's next heap tuple
+ *
+ * The result is a visible heap tuple associated with the index TID most
+ * recently fetched by our caller in scan->xs_heaptid, or NULL if no more
+ * matching tuples exist.  (There can be more than one matching tuple because
+ * of HOT chains, although when using an MVCC snapshot it should be impossible
+ * for more than one such tuple to exist.)
+ *
+ * On success, the buffer containing the heap tup is pinned.  The pin must be
+ * dropped elsewhere.
+ * ----------------
+ */
+static bool
+index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
+{
+	bool		all_dead = false;
+	bool		found;
+
+	found = heapam_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
+									 scan->xs_snapshot, slot,
+									 &scan->xs_heap_continue, &all_dead);
+
+	if (found)
+		pgstat_count_heap_fetch(scan->indexRelation);
+
+	/*
+	 * If we scanned a whole HOT chain and found only dead tuples, remember it
+	 * for later.  We do not do this when in recovery because it may violate
+	 * MVCC to do so.  See comments in RelationGetIndexScan().
+	 */
+	if (!scan->xactStartedInRecovery)
+	{
+		if (scan->ringbuf)
+		{
+			if (all_dead)
+				index_batch_kill_item(scan);
+		}
+		else
+		{
+			/*
+			 * Tell amgettuple-based index AM to kill its entry for that TID
+			 * (this will take effect in the next call, in index_getnext_tid)
+			 */
+			scan->kill_prior_tuple = all_dead;
+		}
+	}
+
+	return found;
+}
+
+/* ----------------
+ *		heapam_index_getnext_slot - get the next tuple from a scan
+ *
+ * The result is true if a tuple satisfying the scan keys and the snapshot was
+ * found, false otherwise.  The tuple is stored in the specified slot.
+ *
+ * On success, resources (like buffer pins) are likely to be held, and will be
+ * dropped by a future call here (or by a later call to index_endscan).
+ *
+ * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * scan keys if required.  We do not do that here because we don't have
+ * enough information to do it efficiently in the general case.
+ * ----------------
+ */
+static bool
+heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
+						  TupleTableSlot *slot)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+	ItemPointer tid = NULL;
+
+	for (;;)
+	{
+		if (!scan->xs_heap_continue)
+		{
+			/*
+			 * Scans that use an amgetbatch index AM are managed by heapam's
+			 * index scan manager.  This gives heapam the ability to read heap
+			 * tuples in a flexible order that is attuned to both costs and
+			 * benefits on the heapam and table AM side.
+			 *
+			 * Scans that use an amgettuple index AM simply call through to
+			 * index_getnext_tid to get the next TID returned by index AM. The
+			 * progress of the scan will be under the control of index AM (we
+			 * just pass it through a direction to get the next tuple in), so
+			 * we cannot reorder any work.
+			 */
+			if (scan->ringbuf != NULL)
+				tid = heapam_batch_getnext_tid(scan, direction);
+			else
+			{
+				tid = index_getnext_tid(scan, direction);
+
+				/*
+				 * make sure to set the xs_visible flag, just like it's done
+				 * in heapam_batch_getnext_tid
+				 */
+				if ((tid != NULL) && (scan->xs_want_itup))
+					scan->xs_visible = VM_ALL_VISIBLE(scan->heapRelation,
+													  ItemPointerGetBlockNumber(tid),
+													  &hscan->vmbuf);
+			}
+
+			/* If we're out of index entries, we're done */
+			if (tid == NULL)
+				break;
+		}
+
+		/*
+		 * Fetch the next (or only) visible heap tuple for this index entry.
+		 * If we don't find anything, loop around and grab the next TID from
+		 * the index.
+		 */
+		Assert(ItemPointerIsValid(&scan->xs_heaptid));
+		if (!scan->xs_want_itup)
+		{
+			/* Plain index scan */
+			if (index_fetch_heap(scan, slot))
+				return true;
+		}
+		else
+		{
+			/*
+			 * Index-only scan.
+			 *
+			 * We can skip the heap fetch if the TID references a heap page on
+			 * which all tuples are known visible to everybody.  In any case,
+			 * we'll use the index tuple not the heap tuple as the data
+			 * source.
+			 *
+			 * Note on Memory Ordering Effects: visibilitymap_get_status does
+			 * not lock the visibility map buffer, and therefore the result we
+			 * read here could be slightly stale.  However, it can't be stale
+			 * enough to matter.
+			 *
+			 * We need to detect clearing a VM bit due to an insert right
+			 * away, because the tuple is present in the index page but not
+			 * visible. The reading of the TID by this scan (using a shared
+			 * lock on the index buffer) is serialized with the insert of the
+			 * TID into the index (using an exclusive lock on the index
+			 * buffer). Because the VM bit is cleared before updating the
+			 * index, and locking/unlocking of the index page acts as a full
+			 * memory barrier, we are sure to see the cleared bit if we see a
+			 * recently-inserted TID.
+			 *
+			 * Deletes do not update the index page (only VACUUM will clear
+			 * out the TID), so the clearing of the VM bit by a delete is not
+			 * serialized with this test below, and we may see a value that is
+			 * significantly stale. However, we don't care about the delete
+			 * right away, because the tuple is still visible until the
+			 * deleting transaction commits or the statement ends (if it's our
+			 * transaction). In either case, the lock on the VM buffer will
+			 * have been released (acting as a write barrier) after clearing
+			 * the bit. And for us to have a snapshot that includes the
+			 * deleting transaction (making the tuple invisible), we must have
+			 * acquired ProcArrayLock after that time, acting as a read
+			 * barrier.
+			 *
+			 * It's worth going through this complexity to avoid needing to
+			 * lock the VM buffer, which could cause significant contention.
+			 */
+			if (!scan->xs_visible)
+			{
+				/*
+				 * Rats, we have to visit the heap to check visibility.
+				 */
+				if (scan->instrument)
+					scan->instrument->nheapfetches++;
+
+				if (!index_fetch_heap(scan, slot))
+					continue;	/* no visible tuple, try next index entry */
+
+				ExecClearTuple(slot);
+
+				/*
+				 * Only MVCC snapshots are supported with standard index-only
+				 * scans, so there should be no need to keep following the HOT
+				 * chain once a visible entry has been found.  Other callers
+				 * (currently only selfuncs.c) use SnapshotNonVacuumable, and
+				 * want us to assume that just having one visible tuple in the
+				 * hot chain is always good enough.
+				 */
+				Assert(!(scan->xs_heap_continue &&
+						 IsMVCCSnapshot(scan->xs_snapshot)));
+
+				/*
+				 * Note: at this point we are holding a pin on the heap page,
+				 * as recorded in IndexFetchHeapData.xs_cbuf.  We could
+				 * release that pin now, but it's not clear whether it's a win
+				 * to do so.  The next index entry might require a visit to
+				 * the same heap page.
+				 */
+			}
+			else
+			{
+				/*
+				 * We didn't access the heap, so we'll need to take a
+				 * predicate lock explicitly, as if we had.  For now we do
+				 * that at page level.
+				 */
+				PredicateLockPage(hscan->xs_base.rel,
+								  ItemPointerGetBlockNumber(tid),
+								  scan->xs_snapshot);
+			}
+
+			return true;
+		}
+	}
+
+	return false;
+}
 
 /* ------------------------------------------------------------------------
  * Callbacks for non-modifying operations on individual tuples for heap AM
@@ -753,7 +1289,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, false, SnapshotAny,
+									NULL, 0, 0);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
@@ -790,7 +1327,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		if (indexScan != NULL)
 		{
-			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot))
+			if (!heapam_index_getnext_slot(indexScan, ForwardScanDirection,
+										   slot))
 				break;
 
 			/* Since we used no scan keys, should never need to recheck */
@@ -2647,6 +3185,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_begin = heapam_index_fetch_begin,
 	.index_fetch_reset = heapam_index_fetch_reset,
 	.index_fetch_end = heapam_index_fetch_end,
+	.index_getnext_slot = heapam_index_getnext_slot,
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
diff --git a/src/backend/access/index/Makefile b/src/backend/access/index/Makefile
index 6f2e3061a..e6d681b40 100644
--- a/src/backend/access/index/Makefile
+++ b/src/backend/access/index/Makefile
@@ -16,6 +16,7 @@ OBJS = \
 	amapi.o \
 	amvalidate.o \
 	genam.o \
-	indexam.o
+	indexam.o \
+	indexbatch.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index a29be6f46..095761a8f 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -89,6 +89,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_snapshot = InvalidSnapshot;	/* caller must initialize this */
 	scan->numberOfKeys = nkeys;
 	scan->numberOfOrderBys = norderbys;
+	scan->ringbuf = NULL;		/* set later for amgetbatch callers */
+	scan->xs_want_itup = false; /* caller must initialize this */
 
 	/*
 	 * We allocate key workspace here, but it won't get filled until amrescan.
@@ -102,8 +104,6 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	else
 		scan->orderByData = NULL;
 
-	scan->xs_want_itup = false; /* may be set later */
-
 	/*
 	 * During recovery we ignore killed tuples and don't bother to kill them
 	 * either. We do this because the xmin on the primary node could easily be
@@ -115,6 +115,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	 * should not be altered by index AMs.
 	 */
 	scan->kill_prior_tuple = false;
+	scan->dropPin = true;		/* for now */
+	scan->finished = false;
 	scan->xactStartedInRecovery = TransactionStartedDuringRecovery();
 	scan->ignore_killed_tuples = !scan->xactStartedInRecovery;
 
@@ -446,7 +448,7 @@ systable_beginscan(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
-		sysscan->iscan = index_beginscan(heapRelation, irel,
+		sysscan->iscan = index_beginscan(heapRelation, irel, false,
 										 snapshot, NULL, nkeys, 0);
 		index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 		sysscan->scan = NULL;
@@ -517,7 +519,8 @@ systable_getnext(SysScanDesc sysscan)
 
 	if (sysscan->irel)
 	{
-		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot))
+		if (table_index_getnext_slot(sysscan->iscan, ForwardScanDirection,
+									 sysscan->slot))
 		{
 			bool		shouldFree;
 
@@ -707,7 +710,7 @@ systable_beginscan_ordered(Relation heapRelation,
 			elog(ERROR, "column is not in index");
 	}
 
-	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
+	sysscan->iscan = index_beginscan(heapRelation, indexRelation, false,
 									 snapshot, NULL, nkeys, 0);
 	index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 	sysscan->scan = NULL;
@@ -734,7 +737,7 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	HeapTuple	htup = NULL;
 
 	Assert(sysscan->irel);
-	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
+	if (table_index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
 		htup = ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
 
 	/* See notes in systable_getnext */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4ed0508c6..9eaecd943 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -24,9 +24,7 @@
  *		index_parallelscan_initialize - initialize parallel scan
  *		index_parallelrescan  - (re)start a parallel scan of an index
  *		index_beginscan_parallel - join parallel index scan
- *		index_getnext_tid	- get the next TID from a scan
- *		index_fetch_heap		- get the scan's next heap tuple
- *		index_getnext_slot	- get the next tuple from a scan
+ *		index_getnext_tid	- amgettuple table AM helper routine
  *		index_getbitmap - get all tuples from a scan
  *		index_bulk_delete	- bulk deletion of index tuples
  *		index_vacuum_cleanup	- post-deletion cleanup of an index
@@ -255,6 +253,7 @@ index_insert_cleanup(Relation indexRelation,
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
+				bool xs_want_itup,
 				Snapshot snapshot,
 				IndexScanInstrumentation *instrument,
 				int nkeys, int norderbys)
@@ -282,6 +281,10 @@ index_beginscan(Relation heapRelation,
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
 	scan->instrument = instrument;
+	scan->xs_want_itup = xs_want_itup;
+
+	if (indexRelation->rd_indam->amgetbatch != NULL)
+		index_batch_init(scan);
 
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
@@ -380,6 +383,15 @@ index_rescan(IndexScanDesc scan,
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
 
+	/*
+	 * ringbuf shouldn't be marked finished (must make sure that
+	 * index_batch_reset doesn't see this, since indexam_util_batch_release
+	 * will be affected)
+	 */
+	scan->finished = false;
+
+	index_batch_reset(scan, true);
+
 	scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
 											orderbys, norderbys);
 }
@@ -394,6 +406,9 @@ index_endscan(IndexScanDesc scan)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amendscan);
 
+	/* Cleanup batching, so that the AM can release pins and so on. */
+	index_batch_end(scan);
+
 	/* Release resources (like buffer pins) from table accesses */
 	if (scan->xs_heapfetch)
 	{
@@ -422,9 +437,10 @@ void
 index_markpos(IndexScanDesc scan)
 {
 	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(ammarkpos);
+	CHECK_SCAN_PROCEDURE(amposreset);
 
-	scan->indexRelation->rd_indam->ammarkpos(scan);
+	/* Only amgetbatch index AMs support mark and restore */
+	index_batch_mark_pos(scan);
 }
 
 /* ----------------
@@ -448,7 +464,8 @@ index_restrpos(IndexScanDesc scan)
 	Assert(IsMVCCSnapshot(scan->xs_snapshot));
 
 	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(amrestrpos);
+	CHECK_SCAN_PROCEDURE(amgetbatch);
+	CHECK_SCAN_PROCEDURE(amposreset);
 
 	/* release resources (like buffer pins) from table accesses */
 	if (scan->xs_heapfetch)
@@ -457,7 +474,7 @@ index_restrpos(IndexScanDesc scan)
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
 
-	scan->indexRelation->rd_indam->amrestrpos(scan);
+	index_batch_restore_pos(scan);
 }
 
 /*
@@ -579,6 +596,8 @@ index_parallelrescan(IndexScanDesc scan)
 	if (scan->xs_heapfetch)
 		table_index_fetch_reset(scan->xs_heapfetch);
 
+	index_batch_reset(scan, true);
+
 	/* amparallelrescan is optional; assume no-op if not provided by AM */
 	if (scan->indexRelation->rd_indam->amparallelrescan != NULL)
 		scan->indexRelation->rd_indam->amparallelrescan(scan);
@@ -591,6 +610,7 @@ index_parallelrescan(IndexScanDesc scan)
  */
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel,
+						 bool xs_want_itup,
 						 IndexScanInstrumentation *instrument,
 						 int nkeys, int norderbys,
 						 ParallelIndexScanDesc pscan)
@@ -613,6 +633,10 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 	scan->heapRelation = heaprel;
 	scan->xs_snapshot = snapshot;
 	scan->instrument = instrument;
+	scan->xs_want_itup = xs_want_itup;
+
+	if (indexrel->rd_indam->amgetbatch != NULL)
+		index_batch_init(scan);
 
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heaprel);
@@ -621,10 +645,14 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 }
 
 /* ----------------
- * index_getnext_tid - get the next TID from a scan
+ * index_getnext_tid - amgettuple interface
  *
  * The result is the next TID satisfying the scan keys,
  * or NULL if no more matching tuples exist.
+ *
+ * This should only be called by table AM's index_getnext_slot implementation,
+ * and only given an index AM that supports the single-tuple amgettuple
+ * interface.
  * ----------------
  */
 ItemPointer
@@ -667,97 +695,6 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	return &scan->xs_heaptid;
 }
 
-/* ----------------
- *		index_fetch_heap - get the scan's next heap tuple
- *
- * The result is a visible heap tuple associated with the index TID most
- * recently fetched by index_getnext_tid, or NULL if no more matching tuples
- * exist.  (There can be more than one matching tuple because of HOT chains,
- * although when using an MVCC snapshot it should be impossible for more than
- * one such tuple to exist.)
- *
- * On success, the buffer containing the heap tup is pinned (the pin will be
- * dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
- * call).
- *
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
- * scan keys if required.  We do not do that here because we don't have
- * enough information to do it efficiently in the general case.
- * ----------------
- */
-bool
-index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
-{
-	bool		all_dead = false;
-	bool		found;
-
-	found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
-									scan->xs_snapshot, slot,
-									&scan->xs_heap_continue, &all_dead);
-
-	if (found)
-		pgstat_count_heap_fetch(scan->indexRelation);
-
-	/*
-	 * If we scanned a whole HOT chain and found only dead tuples, tell index
-	 * AM to kill its entry for that TID (this will take effect in the next
-	 * amgettuple call, in index_getnext_tid).  We do not do this when in
-	 * recovery because it may violate MVCC to do so.  See comments in
-	 * RelationGetIndexScan().
-	 */
-	if (!scan->xactStartedInRecovery)
-		scan->kill_prior_tuple = all_dead;
-
-	return found;
-}
-
-/* ----------------
- *		index_getnext_slot - get the next tuple from a scan
- *
- * The result is true if a tuple satisfying the scan keys and the snapshot was
- * found, false otherwise.  The tuple is stored in the specified slot.
- *
- * On success, resources (like buffer pins) are likely to be held, and will be
- * dropped by a future index_getnext_tid, index_fetch_heap or index_endscan
- * call).
- *
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
- * scan keys if required.  We do not do that here because we don't have
- * enough information to do it efficiently in the general case.
- * ----------------
- */
-bool
-index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
-{
-	for (;;)
-	{
-		if (!scan->xs_heap_continue)
-		{
-			ItemPointer tid;
-
-			/* Time to fetch the next TID from the index */
-			tid = index_getnext_tid(scan, direction);
-
-			/* If we're out of index entries, we're done */
-			if (tid == NULL)
-				break;
-
-			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
-		}
-
-		/*
-		 * Fetch the next (or only) visible heap tuple for this index entry.
-		 * If we don't find anything, loop around and grab the next TID from
-		 * the index.
-		 */
-		Assert(ItemPointerIsValid(&scan->xs_heaptid));
-		if (index_fetch_heap(scan, slot))
-			return true;
-	}
-
-	return false;
-}
-
 /* ----------------
  *		index_getbitmap - get all tuples at once from an index scan
  *
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
new file mode 100644
index 000000000..86a1c6f56
--- /dev/null
+++ b/src/backend/access/index/indexbatch.c
@@ -0,0 +1,589 @@
+/*-------------------------------------------------------------------------
+ *
+ * indexbatch.c
+ *	  amgetbatch implementation routines
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/index/indexbatch.c
+ *
+ * INTERFACE ROUTINES
+ *		index_batch_init - Initialize fields needed by batching
+ *		index_batch_reset - reset a batch
+ *		index_batch_mark_pos - set a mark from current batch position
+ *		index_batch_restore_pos - restore mark to current batch position
+ *		index_batch_kill_item - record dead index tuple
+ *		index_batch_end - end batch
+ *
+ *		indexam_util_batch_unlock - unlock batch's buffer lock
+ *		indexam_util_batch_alloc - allocate another batch
+ *		indexam_util_batch_release - release allocated batch
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/amapi.h"
+#include "access/tableam.h"
+#include "common/int.h"
+#include "lib/qunique.h"
+#include "pgstat.h"
+#include "utils/memdebug.h"
+
+static int	batch_compare_int(const void *va, const void *vb);
+static void batch_debug_print_batches(const char *label, IndexScanDesc scan);
+
+/*
+ * index_batch_init
+ *		Initialize various fields and arrays needed by batching.
+ *
+ * Sets up the batch ring buffer structure and its initial read position.
+ * Also determines whether the scan will eagerly drop index page pins.
+ *
+ * Only call here when all of the index related fields in 'scan' were already
+ * initialized.
+ */
+void
+index_batch_init(IndexScanDesc scan)
+{
+	/* Both amgetbatch and amfreebatch must be present together */
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+	Assert(scan->indexRelation->rd_indam->amfreebatch != NULL);
+
+	scan->ringbuf = palloc_object(BatchRingBuffer);
+
+	/*
+	 * We prefer to eagerly drop leaf page pins before amgetbatch returns.
+	 * This avoids making VACUUM wait to acquire a cleanup lock on the page.
+	 *
+	 * We cannot safely drop leaf page pins during index-only scans due to a
+	 * race condition involving VACUUM setting pages all-visible in the VM.
+	 * It's also unsafe for plain index scans that use a non-MVCC snapshot.
+	 *
+	 * When we drop pins eagerly, the mechanism that marks index tuples as
+	 * LP_DEAD has to deal with concurrent TID recycling races.  The scheme
+	 * used to detect unsafe TID recycling won't work when scanning unlogged
+	 * relations (since it involves saving an affected page's LSN).  Opt out
+	 * of eager pin dropping during unlogged relation scans for now.
+	 */
+	scan->dropPin =
+		(!scan->xs_want_itup && IsMVCCSnapshot(scan->xs_snapshot) &&
+		 RelationNeedsWAL(scan->indexRelation));
+	scan->finished = false;
+	scan->ringbuf->direction = NoMovementScanDirection;
+
+	/* positions in the ring buffer of batches */
+	batch_reset_pos(&scan->ringbuf->scanPos);
+	batch_reset_pos(&scan->ringbuf->markPos);
+
+	scan->ringbuf->markBatch = NULL;
+	scan->ringbuf->headBatch = 0;	/* initial head batch */
+	scan->ringbuf->nextBatch = 0;	/* initial batch starts empty */
+	memset(&scan->ringbuf->cache, 0, sizeof(scan->ringbuf->cache));
+}
+
+/* ----------------
+ *		index_batch_reset - reset batch ring buffer and read position
+ *
+ * Resets all loaded batches in the ring buffer, and resets the read position
+ * to the initial state (or just initialize ring buffer state).  When
+ * 'complete' is true, also frees the scan's marked batch (if any), which is
+ * useful when ending an amgetbatch-based index scan.
+ * ----------------
+ */
+void
+index_batch_reset(IndexScanDesc scan, bool complete)
+{
+	BatchRingBuffer *ringbuf = scan->ringbuf;
+
+	/* bail out if batching not enabled */
+	if (!ringbuf)
+		return;
+
+	batch_assert_batches_valid(scan);
+	batch_debug_print_batches("index_batch_reset", scan);
+	Assert(scan->xs_heapfetch);
+
+	/* reset the positions */
+	batch_reset_pos(&ringbuf->scanPos);
+
+	/*
+	 * With "complete" reset, make sure to also free the marked batch, either
+	 * by just forgetting it (if it's still in the ring buffer), or by
+	 * explicitly freeing it.
+	 */
+	if (complete && unlikely(ringbuf->markBatch != NULL))
+	{
+		BatchRingItemPos *markPos = &ringbuf->markPos;
+		BatchIndexScan markBatch = ringbuf->markBatch;
+
+		/* always reset the position, forget the marked batch */
+		ringbuf->markBatch = NULL;
+
+		/*
+		 * If we've already moved past the marked batch (it's not loaded into
+		 * the ring buffer), free it explicitly now.  Otherwise, it'll be
+		 * freed along with the other loaded batches.
+		 */
+		if (!INDEX_SCAN_BATCH_LOADED(scan, markPos->batch))
+			batch_free(scan, markBatch);
+
+		batch_reset_pos(&ringbuf->markPos);
+	}
+
+	/* now release all other currently loaded batches */
+	while (ringbuf->headBatch < ringbuf->nextBatch)
+	{
+		BatchIndexScan batch = INDEX_SCAN_BATCH(scan, ringbuf->headBatch);
+
+		DEBUG_LOG("freeing batch %d %p", ringbuf->headBatch, batch);
+
+		batch_free(scan, batch);
+
+		/* update the valid range, so that asserts / debugging works */
+		ringbuf->headBatch++;
+	}
+
+	/* reset relevant batch state fields */
+	ringbuf->headBatch = 0;		/* initial batch */
+	ringbuf->nextBatch = 0;		/* initial batch is empty */
+
+	scan->finished = false;
+
+	batch_assert_batches_valid(scan);
+}
+
+/* ----------------
+ *		index_batch_mark_pos - mark current position in scan for restoration
+ *
+ * Saves the current read position and associated batch so that the scan can
+ * be restored to this point later, via a call to index_batch_restore_pos.
+ * The marked batch is retained and not freed until a new mark is set or the
+ * scan ends (or until the mark is restored).
+ * ----------------
+ */
+void
+index_batch_mark_pos(IndexScanDesc scan)
+{
+	BatchRingBuffer *ringbuf = scan->ringbuf;
+	BatchRingItemPos *markPos = &ringbuf->markPos;
+	BatchIndexScan markBatch = ringbuf->markBatch;
+
+	/*
+	 * Free the previous mark batch (if any), but only if the batch is no
+	 * longer loaded into the ring buffer
+	 */
+	if (markBatch && !INDEX_SCAN_BATCH_LOADED(scan, markPos->batch))
+	{
+		ringbuf->markBatch = NULL;
+		batch_free(scan, markBatch);
+	}
+
+	/* copy the scan's position */
+	ringbuf->markPos = ringbuf->scanPos;
+	ringbuf->markBatch = INDEX_SCAN_BATCH(scan, ringbuf->markPos.batch);
+
+	/* scanPos/markPos must be valid */
+	batch_assert_pos_valid(scan, &ringbuf->markPos);
+}
+
+/* ----------------
+ *		index_batch_restore_pos - restore scan to a previously marked position
+ *
+ * Restores the scan to a position previously saved by index_batch_mark_pos.
+ * The marked batch is restored as the current batch, allowing the scan to
+ * resume from the marked position.  Also notifies the index AM via a call to
+ * its amposreset routine, which allows it to invalidate any private state
+ * that independently tracks scan progress (such as array key state)
+ *
+ * Function currently just discards most batch ring buffer state.  It might
+ * make sense to teach it to hold on to other nearby batches (still-held
+ * batches that are likely to be needed once the scan finishes returning
+ * matching items from the restored batch) as an optimization.  Such a scheme
+ * would have the benefit of avoiding repeat calls to amgetbatch/repeatedly
+ * reading the same index pages.
+ * ----------------
+ */
+void
+index_batch_restore_pos(IndexScanDesc scan)
+{
+	BatchRingBuffer *ringbuf = scan->ringbuf;
+	BatchRingItemPos *markPos = &ringbuf->markPos;
+	BatchRingItemPos *scanPos = &ringbuf->scanPos ;
+	BatchIndexScan markBatch = ringbuf->markBatch;
+
+	if (scanPos->batch == markPos->batch &&
+		scanPos->batch == ringbuf->headBatch)
+	{
+		/*
+		 * We don't have to discard the scan's state after all, since the
+		 * current headBatch is also the batch that we're restoring to
+		 */
+		scanPos->item = markPos->item;
+		return;
+	}
+
+	/*
+	 * Call amposreset to let index AM know to invalidate any private state
+	 * that independently tracks the scan's progress
+	 */
+	scan->indexRelation->rd_indam->amposreset(scan, markBatch);
+
+	/*
+	 * Reset the batching state, except for the marked batch, and make it look
+	 * like we have a single batch -- the marked one.
+	 */
+	index_batch_reset(scan, false);
+
+	ringbuf->scanPos = *markPos;
+	ringbuf->nextBatch = ringbuf->headBatch = markPos->batch;
+
+	INDEX_SCAN_BATCH_APPEND(scan, markBatch);
+}
+
+/*
+ * batch_free
+ *		Release resources associated with a batch returned by the index AM.
+ *
+ * Called by table AM's ordered index scan implementation when it is finished
+ * with a batch and wishes to release its resources.
+ *
+ * This calls the index AM's amfreebatch callback to release AM-specific
+ * resources, and to set LP_DEAD bits on the batch's index page.  It isn't
+ * safe for table AMs to fetch table tuples using TIDs saved from a batch that
+ * was already freed: 'dropPin' scans need the index AM to retain a pin on the
+ * TID's index page, as an interlock against concurrent TID recycling.
+ */
+void
+batch_free(IndexScanDesc scan, BatchIndexScan batch)
+{
+	batch_assert_batch_valid(scan, batch);
+
+	/* don't free the batch that is marked */
+	if (batch == scan->ringbuf->markBatch)
+		return;
+
+	/*
+	 * killedItems[] is now in whatever order the scan returned items in.
+	 * Scrollable cursor scans might have even saved the same item/TID twice.
+	 *
+	 * Sort and unique-ify killedItems[].  That way the index AM can safely
+	 * assume that items will always be in their original index page order.
+	 */
+	if (batch->numKilled > 1)
+	{
+		qsort(batch->killedItems, batch->numKilled, sizeof(int),
+			  batch_compare_int);
+		batch->numKilled = qunique(batch->killedItems, batch->numKilled,
+								   sizeof(int), batch_compare_int);
+	}
+
+	scan->indexRelation->rd_indam->amfreebatch(scan, batch);
+}
+
+/* ----------------
+ *		index_batch_kill_item - record item for deferred LP_DEAD marking
+ *
+ * Records the item index of the currently-read tuple in scanBatch's
+ * killedItems array. The items' index tuples will later be marked LP_DEAD
+ * when current scanBatch is freed by amfreebatch routine (see batch_free).
+ * ----------------
+ */
+void
+index_batch_kill_item(IndexScanDesc scan)
+{
+	BatchRingItemPos *scanPos = &scan->ringbuf->scanPos;
+	BatchIndexScan scanBatch = INDEX_SCAN_BATCH(scan, scanPos->batch);
+
+	batch_assert_pos_valid(scan, scanPos);
+
+	if (scanBatch->killedItems == NULL)
+		scanBatch->killedItems = palloc_array(int, scan->maxitemsbatch);
+	if (scanBatch->numKilled < scan->maxitemsbatch)
+		scanBatch->killedItems[scanBatch->numKilled++] = scanPos->item;
+}
+
+/* ----------------
+ *		index_batch_end - end a batch scan and free all resources
+ *
+ * Called when an index scan is being ended, right before the owning scan
+ * descriptor goes away.  Cleans up all batch related resources.
+ * ----------------
+ */
+void
+index_batch_end(IndexScanDesc scan)
+{
+	index_batch_reset(scan, true);
+
+	/* bail out without batching */
+	if (!scan->ringbuf)
+		return;
+
+	for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+	{
+		BatchIndexScan cached = scan->ringbuf->cache[i];
+
+		if (cached == NULL)
+			continue;
+
+		if (cached->killedItems)
+			pfree(cached->killedItems);
+		if (cached->currTuples)
+			pfree(cached->currTuples);
+		pfree(cached);
+	}
+
+	pfree(scan->ringbuf);
+}
+
+/* ----------------------------------------------------------------
+ *			utility functions called by amgetbatch index AMs
+ *
+ * These functions manage batch allocation, unlock/pin management, and batch
+ * resource recycling.  Index AMs implementing amgetbatch should use these
+ * rather than managing buffers directly.
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * indexam_util_batch_unlock - Drop lock and conditionally drop pin on batch page
+ *
+ * Unlocks caller's batch->buf in preparation for amgetbatch returning items
+ * saved in that batch.  Manages the details of dropping the lock and possibly
+ * the pin for index AM caller (dropping the pin prevents VACUUM from blocking
+ * on acquiring a cleanup lock, but isn't always safe).
+ *
+ * Only call here when a batch has one or more matching items to return using
+ * amgetbatch (or for amgetbitmap to load into its bitmap of matching TIDs).
+ * When an index page has no matches, it's always safe for index AMs to drop
+ * both the lock and the pin for themselves.
+ *
+ * Note: It is convenient for index AMs that implement both amgetbitmap and
+ * amgetbitmap to consistently use the same batch management approach, since
+ * that avoids introducing special cases to lower-level code.  We always drop
+ * both the lock and the pin on batch's page on behalf of amgetbitmap callers.
+ * Such amgetbitmap callers must be careful to free all batches with matching
+ * items once they're done saving the matching TIDs (there will never be any
+ * calls to amfreebatch, so amgetbitmap must call indexam_util_batch_release
+ * directly, in lieu of a deferred call to amfreebatch from core code).
+ */
+void
+indexam_util_batch_unlock(IndexScanDesc scan, BatchIndexScan batch)
+{
+	Relation	rel = scan->indexRelation;
+	bool		dropPin = scan->dropPin;
+
+	/* batch must have one or more matching items returned by index AM */
+	Assert(batch->firstItem >= 0 && batch->firstItem <= batch->lastItem);
+
+	if (!dropPin)
+	{
+		if (!RelationUsesLocalBuffers(rel))
+			VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(batch->buf), BLCKSZ);
+
+		/* Just drop the lock (not the pin) */
+		LockBuffer(batch->buf, BUFFER_LOCK_UNLOCK);
+		return;
+	}
+
+	if (scan->ringbuf)
+	{
+		/* amgetbatch (not amgetbitmap) caller */
+		Assert(scan->heapRelation != NULL);
+
+		/*
+		 * Have to set batch->lsn so that amfreebatch has a way to detect when
+		 * concurrent heap TID recycling by VACUUM might have taken place.
+		 * It'll only be safe to set any index tuple LP_DEAD bits when the
+		 * page LSN hasn't advanced.
+		 */
+		Assert(RelationNeedsWAL(rel));
+		Assert(!scan->xs_want_itup);
+		batch->lsn = BufferGetLSNAtomic(batch->buf);
+	}
+
+	/* Drop both the lock and the pin */
+	LockBuffer(batch->buf, BUFFER_LOCK_UNLOCK);
+	if (!RelationUsesLocalBuffers(rel))
+		VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(batch->buf), BLCKSZ);
+	ReleaseBuffer(batch->buf);
+	batch->buf = InvalidBuffer;
+}
+
+/*
+ * indexam_util_batch_alloc
+ *		Allocate a batch during a amgetbatch (or amgetbitmap) index scan.
+ *
+ * Returns BatchIndexScan with space to fit scan->maxitemsbatch-many
+ * BatchMatchingItem entries.  This will either be a newly allocated batch, or
+ * a batch recycled from the cache managed by indexam_util_batch_release.  See
+ * comments above indexam_util_batch_release.
+ *
+ * Index AMs that use batches should call this from either their amgetbatch or
+ * amgetbitmap routines only.  Note in particular that it cannot safely be
+ * called from a amfreebatch routine.
+ */
+BatchIndexScan
+indexam_util_batch_alloc(IndexScanDesc scan)
+{
+	BatchIndexScan batch = NULL;
+
+	/* First look for an existing batch from ring buffer */
+	if (scan->ringbuf != NULL)
+	{
+		for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+		{
+			if (scan->ringbuf->cache[i] != NULL)
+			{
+				/* Return cached unreferenced batch */
+				batch = scan->ringbuf->cache[i];
+				scan->ringbuf->cache[i] = NULL;
+				break;
+			}
+		}
+	}
+
+	if (!batch)
+	{
+		batch = palloc(offsetof(BatchIndexScanData, items) +
+					   sizeof(BatchMatchingItem) * scan->maxitemsbatch);
+
+		/*
+		 * If we are doing an index-only scan, we need a tuple storage
+		 * workspace. We allocate BLCKSZ for this, which should always give
+		 * the index AM enough space to fit a full page's worth of tuples.
+		 */
+		batch->currTuples = NULL;
+		if (scan->xs_want_itup)
+			batch->currTuples = palloc(BLCKSZ);
+
+		/*
+		 * Batches allocate killedItems lazily (though note that cached
+		 * batches keep their killedItems allocation when recycled)
+		 */
+		batch->killedItems = NULL;
+	}
+
+	/* xs_want_itup scans must get a currTuples space */
+	Assert(!(scan->xs_want_itup && (batch->currTuples == NULL)));
+
+	/* shared initialization */
+	batch->buf = InvalidBuffer;
+	batch->firstItem = -1;
+	batch->lastItem = -1;
+	batch->numKilled = 0;
+
+	return batch;
+}
+
+/*
+ * indexam_util_batch_release
+ *		Either stash the batch in a small cache for reuse, or free it.
+ *
+ * This function is called by index AMs to release a batch allocated by
+ * indexam_util_batch_alloc.  Batches are cached here for reuse (when scan
+ * hasn't already finished) to reduce palloc/pfree overhead.
+ *
+ * It's safe to release a batch immediately when it was used to read a page
+ * that returned no matches to the scan.  Batches actually returned by index
+ * AM's amgetbatch routine (i.e. batches for pages with one or more matches)
+ * must be released by calling here at the end of their amfreebatch routine.
+ * Index AMs that uses batches should call here to release a batch from any of
+ * their amgetbatch, amgetbitmap, and amfreebatch routines.
+ */
+void
+indexam_util_batch_release(IndexScanDesc scan, BatchIndexScan batch)
+{
+	Assert(batch->buf == InvalidBuffer);
+
+	if (scan->ringbuf)
+	{
+		/* amgetbatch scan caller */
+		Assert(scan->heapRelation != NULL);
+
+		if (scan->finished)
+		{
+			/* Don't bother using cache when scan is ending */
+		}
+		else
+		{
+			/*
+			 * Use cache.  This is generally only beneficial when there are
+			 * many small rescans of an index.
+			 */
+			for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+			{
+				if (scan->ringbuf->cache[i] == NULL)
+				{
+					/* found empty slot, we're done */
+					scan->ringbuf->cache[i] = batch;
+					return;
+				}
+			}
+		}
+
+		/*
+		 * Failed to find a free slot for this batch.  We'll just free it
+		 * ourselves.  This isn't really expected; it's just defensive.
+		 */
+		if (batch->killedItems)
+			pfree(batch->killedItems);
+		if (batch->currTuples)
+			pfree(batch->currTuples);
+	}
+	else
+	{
+		/* amgetbitmap scan caller */
+		Assert(scan->heapRelation == NULL);
+		Assert(batch->killedItems == NULL);
+		Assert(batch->currTuples == NULL);
+	}
+
+	/* no free slot to save this batch (expected with amgetbitmap callers) */
+	pfree(batch);
+}
+
+/*
+ * qsort comparison function for int arrays
+ */
+static int
+batch_compare_int(const void *va, const void *vb)
+{
+	int			a = *((const int *) va);
+	int			b = *((const int *) vb);
+
+	return pg_cmp_s32(a, b);
+}
+
+static void
+batch_debug_print_batches(const char *label, IndexScanDesc scan)
+{
+#ifdef INDEXAM_DEBUG
+	BatchRingBuffer *ringbuf = scan->ringbuf;
+
+	if (!ringbuf)
+		return;
+
+	if (!AmRegularBackendProcess())
+		return;
+	if (IsCatalogRelation(scan->indexRelation))
+		return;
+
+	DEBUG_LOG("%s: batches headBatch %d nextBatch %d",
+			  label,
+			  ringbuf->headBatch, ringbuf->nextBatch);
+
+	for (int i = ringbuf->headBatch; i < ringbuf->nextBatch; i++)
+	{
+		BatchIndexScan batch = INDEX_SCAN_BATCH(scan, i);
+
+		DEBUG_LOG("    batch %d currPage %u %p firstItem %d lastItem %d killed %d",
+				  i, batch->currPage, batch, batch->firstItem,
+				  batch->lastItem, batch->numKilled);
+	}
+#endif
+}
diff --git a/src/backend/access/index/meson.build b/src/backend/access/index/meson.build
index da64cb595..83dfa3f2b 100644
--- a/src/backend/access/index/meson.build
+++ b/src/backend/access/index/meson.build
@@ -5,4 +5,5 @@ backend_sources += files(
   'amvalidate.c',
   'genam.c',
   'indexam.c',
+  'indexbatch.c',
 )
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc..231da20e4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -471,7 +471,7 @@ proper.  A plain index scan will even recognize LP_UNUSED items in the
 heap (items that could be recycled but haven't been just yet) as "not
 visible" -- even when the heap page is generally considered all-visible.
 
-LP_DEAD setting of index tuples by the kill_prior_tuple optimization
+Opportunistic LP_DEAD setting of known-dead index tuples during index scans
 (described in full in simple deletion, below) is also more complicated for
 index scans that drop their leaf page pins.  We must be careful to avoid
 LP_DEAD-marking any new index tuple that looks like a known-dead index
@@ -481,7 +481,7 @@ new, unrelated index tuple, on the same leaf page, which has the same
 original TID.  It would be totally wrong to LP_DEAD-set this new,
 unrelated index tuple.
 
-We handle this kill_prior_tuple race condition by having affected index
+We handle this LP_DEAD setting race condition by having affected index
 scans conservatively assume that any change to the leaf page at all
 implies that it was reached by btbulkdelete in the interim period when no
 buffer pin was held.  This is implemented by not setting any LP_DEAD bits
@@ -735,7 +735,7 @@ of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
 During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
+and we never LP_DEAD-mark tuples. We do this because the oldest xmin
 on the standby server can be older than the oldest xmin on the primary
 server, which means tuples can be marked LP_DEAD even when they are
 still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4125c185e..21bb80168 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1033,6 +1033,9 @@ _bt_relbuf(Relation rel, Buffer buf)
  * Lock is acquired without acquiring another pin.  This is like a raw
  * LockBuffer() call, but performs extra steps needed by Valgrind.
  *
+ * Note: indexam_util_batch_unlock has similar Valgrind buffer lock
+ * instrumentation, which we rely on here.
+ *
  * Note: Caller may need to call _bt_checkpage() with buf when pin on buf
  * wasn't originally acquired in _bt_getbuf() or _bt_relandgetbuf().
  */
diff --git a/src/backend/access/nbtree/nbtreadpage.c b/src/backend/access/nbtree/nbtreadpage.c
index 2ba1ca660..69cbe4784 100644
--- a/src/backend/access/nbtree/nbtreadpage.c
+++ b/src/backend/access/nbtree/nbtreadpage.c
@@ -32,6 +32,7 @@ typedef struct BTReadPageState
 {
 	/* Input parameters, set by _bt_readpage for _bt_checkkeys */
 	ScanDirection dir;			/* current scan direction */
+	BlockNumber currpage;		/* current page being read */
 	OffsetNumber minoff;		/* Lowest non-pivot tuple's offset */
 	OffsetNumber maxoff;		/* Highest non-pivot tuple's offset */
 	IndexTuple	finaltup;		/* Needed by scans with array keys */
@@ -63,14 +64,13 @@ static bool _bt_scanbehind_checkkeys(IndexScanDesc scan, ScanDirection dir,
 									 IndexTuple finaltup);
 static bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
 								  IndexTuple finaltup);
-static void _bt_saveitem(BTScanOpaque so, int itemIndex,
-						 OffsetNumber offnum, IndexTuple itup);
-static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+static void _bt_saveitem(BatchIndexScan newbatch, int itemIndex, OffsetNumber offnum,
+						 IndexTuple itup, int *tupleOffset);
+static int	_bt_setuppostingitems(BatchIndexScan newbatch, int itemIndex,
 								  OffsetNumber offnum, const ItemPointerData *heapTid,
-								  IndexTuple itup);
-static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
-									   OffsetNumber offnum,
-									   ItemPointer heapTid, int tupleOffset);
+								  IndexTuple itup, int *tupleOffset);
+static inline void _bt_savepostingitem(BatchIndexScan newbatch, int itemIndex, OffsetNumber offnum,
+									   ItemPointer heapTid, int baseOffset);
 static bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
 						  IndexTuple tuple, int tupnatts);
 static bool _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
@@ -111,15 +111,15 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
 
 
 /*
- *	_bt_readpage() -- Load data from current index page into so->currPos
+ *	_bt_readpage() -- Load data from current index page into newbatch.
  *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here.  Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate.  All other fields of so->currPos are
+ * Caller must have pinned and read-locked newbatch.buf; the buffer's state is
+ * not changed here.  Also, newbatch's moreLeft and moreRight must be valid;
+ * they are updated as appropriate.  All other fields of newbatch are
  * initialized from scratch here.
  *
  * We scan the current page starting at offnum and moving in the indicated
- * direction.  All items matching the scan keys are loaded into currPos.items.
+ * direction.  All items matching the scan keys are saved in newbatch.items.
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction
  * (could just be for the current primitive index scan when scan has arrays).
@@ -131,8 +131,8 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
  * Returns true if any matching items found on the page, false if none.
  */
 bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
-			 bool firstpage)
+_bt_readpage(IndexScanDesc scan, BatchIndexScan newbatch, ScanDirection dir,
+			 OffsetNumber offnum, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -144,23 +144,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	bool		arrayKeys,
 				ignore_killed_tuples = scan->ignore_killed_tuples;
 	int			itemIndex,
+				tupleOffset = 0,
 				indnatts;
 
 	/* save the page/buffer block number, along with its sibling links */
-	page = BufferGetPage(so->currPos.buf);
+	page = BufferGetPage(newbatch->buf);
 	opaque = BTPageGetOpaque(page);
-	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-	so->currPos.prevPage = opaque->btpo_prev;
-	so->currPos.nextPage = opaque->btpo_next;
-	/* delay setting so->currPos.lsn until _bt_drop_lock_and_maybe_pin */
-	pstate.dir = so->currPos.dir = dir;
-	so->currPos.nextTupleOffset = 0;
+	pstate.currpage = newbatch->currPage = BufferGetBlockNumber(newbatch->buf);
+	newbatch->prevPage = opaque->btpo_prev;
+	newbatch->nextPage = opaque->btpo_next;
+	pstate.dir = newbatch->dir = dir;
 
 	/* either moreRight or moreLeft should be set now (may be unset later) */
-	Assert(ScanDirectionIsForward(dir) ? so->currPos.moreRight :
-		   so->currPos.moreLeft);
+	Assert(ScanDirectionIsForward(dir) ? newbatch->moreRight : newbatch->moreLeft);
 	Assert(!P_IGNORE(opaque));
-	Assert(BTScanPosIsPinned(so->currPos));
 	Assert(!so->needPrimScan);
 
 	/* initialize local variables */
@@ -188,14 +185,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	{
 		/* allow next/prev page to be read by other worker without delay */
 		if (ScanDirectionIsForward(dir))
-			_bt_parallel_release(scan, so->currPos.nextPage,
-								 so->currPos.currPage);
+			_bt_parallel_release(scan, newbatch->nextPage, newbatch->currPage);
 		else
-			_bt_parallel_release(scan, so->currPos.prevPage,
-								 so->currPos.currPage);
+			_bt_parallel_release(scan, newbatch->prevPage, newbatch->currPage);
 	}
 
-	PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);
+	PredicateLockPage(rel, pstate.currpage, scan->xs_snapshot);
 
 	if (ScanDirectionIsForward(dir))
 	{
@@ -212,11 +207,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 					!_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup))
 				{
 					/* Schedule another primitive index scan after all */
-					so->currPos.moreRight = false;
+					newbatch->moreRight = false;
 					so->needPrimScan = true;
 					if (scan->parallel_scan)
-						_bt_parallel_primscan_schedule(scan,
-													   so->currPos.currPage);
+						_bt_parallel_primscan_schedule(scan, newbatch->currPage);
 					return false;
 				}
 			}
@@ -280,26 +274,26 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				if (!BTreeTupleIsPosting(itup))
 				{
 					/* Remember it */
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset);
 					itemIndex++;
 				}
 				else
 				{
-					int			tupleOffset;
+					int			baseOffset;
 
 					/* Set up posting list state (and remember first TID) */
-					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+					baseOffset =
+						_bt_setuppostingitems(newbatch, itemIndex, offnum,
 											  BTreeTupleGetPostingN(itup, 0),
-											  itup);
+											  itup, &tupleOffset);
 					itemIndex++;
 
 					/* Remember all later TIDs (must be at least one) */
 					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
 					{
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(newbatch, itemIndex, offnum,
 											BTreeTupleGetPostingN(itup, i),
-											tupleOffset);
+											baseOffset);
 						itemIndex++;
 					}
 				}
@@ -339,12 +333,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		}
 
 		if (!pstate.continuescan)
-			so->currPos.moreRight = false;
+			newbatch->moreRight = false;
 
 		Assert(itemIndex <= MaxTIDsPerBTreePage);
-		so->currPos.firstItem = 0;
-		so->currPos.lastItem = itemIndex - 1;
-		so->currPos.itemIndex = 0;
+		newbatch->firstItem = 0;
+		newbatch->lastItem = itemIndex - 1;
 	}
 	else
 	{
@@ -361,11 +354,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 					!_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup))
 				{
 					/* Schedule another primitive index scan after all */
-					so->currPos.moreLeft = false;
+					newbatch->moreLeft = false;
 					so->needPrimScan = true;
 					if (scan->parallel_scan)
-						_bt_parallel_primscan_schedule(scan,
-													   so->currPos.currPage);
+						_bt_parallel_primscan_schedule(scan, newbatch->currPage);
 					return false;
 				}
 			}
@@ -466,27 +458,27 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				{
 					/* Remember it */
 					itemIndex--;
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset);
 				}
 				else
 				{
 					uint16		nitems = BTreeTupleGetNPosting(itup);
-					int			tupleOffset;
+					int			baseOffset;
 
 					/* Set up posting list state (and remember last TID) */
 					itemIndex--;
-					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+					baseOffset =
+						_bt_setuppostingitems(newbatch, itemIndex, offnum,
 											  BTreeTupleGetPostingN(itup, nitems - 1),
-											  itup);
+											  itup, &tupleOffset);
 
 					/* Remember all prior TIDs (must be at least one) */
 					for (int i = nitems - 2; i >= 0; i--)
 					{
 						itemIndex--;
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(newbatch, itemIndex, offnum,
 											BTreeTupleGetPostingN(itup, i),
-											tupleOffset);
+											baseOffset);
 					}
 				}
 			}
@@ -502,12 +494,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		 * be found there
 		 */
 		if (!pstate.continuescan)
-			so->currPos.moreLeft = false;
+			newbatch->moreLeft = false;
 
 		Assert(itemIndex >= 0);
-		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
-		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+		newbatch->firstItem = itemIndex;
+		newbatch->lastItem = MaxTIDsPerBTreePage - 1;
 	}
 
 	/*
@@ -524,7 +515,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	 */
 	Assert(!pstate.forcenonrequired);
 
-	return (so->currPos.firstItem <= so->currPos.lastItem);
+	return (newbatch->firstItem <= newbatch->lastItem);
 }
 
 /*
@@ -1027,90 +1018,96 @@ _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
 	return true;
 }
 
-/* Save an index item into so->currPos.items[itemIndex] */
+/* Save an index item into newbatch.items[itemIndex] */
 static void
-_bt_saveitem(BTScanOpaque so, int itemIndex,
-			 OffsetNumber offnum, IndexTuple itup)
+_bt_saveitem(BatchIndexScan newbatch, int itemIndex, OffsetNumber offnum,
+			 IndexTuple itup, int *tupleOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
-
 	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
 
-	currItem->heapTid = itup->t_tid;
-	currItem->indexOffset = offnum;
-	if (so->currTuples)
+	/* copy the populated part of the items array */
+	newbatch->items[itemIndex].heapTid = itup->t_tid;
+	newbatch->items[itemIndex].indexOffset = offnum;
+	newbatch->items[itemIndex].allVisible = false;
+
+	if (newbatch->currTuples)
 	{
 		Size		itupsz = IndexTupleSize(itup);
 
-		currItem->tupleOffset = so->currPos.nextTupleOffset;
-		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
-		so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+		newbatch->items[itemIndex].tupleOffset = *tupleOffset;
+		memcpy(newbatch->currTuples + *tupleOffset, itup, itupsz);
+		*tupleOffset += MAXALIGN(itupsz);
 	}
 }
 
 /*
  * Setup state to save TIDs/items from a single posting list tuple.
  *
- * Saves an index item into so->currPos.items[itemIndex] for TID that is
- * returned to scan first.  Second or subsequent TIDs for posting list should
- * be saved by calling _bt_savepostingitem().
+ * Saves an index item into newbatch.items[itemIndex] for TID that is returned
+ * to scan first.  Second or subsequent TIDs for posting list should be saved
+ * by calling _bt_savepostingitem().
  *
- * Returns an offset into tuple storage space that main tuple is stored at if
- * needed.
+ * Returns baseOffset, an offset into tuple storage space that main tuple is
+ * stored at if needed.
  */
 static int
-_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
-					  const ItemPointerData *heapTid, IndexTuple itup)
+_bt_setuppostingitems(BatchIndexScan newbatch, int itemIndex,
+					  OffsetNumber offnum, const ItemPointerData *heapTid,
+					  IndexTuple itup, int *tupleOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *item = &newbatch->items[itemIndex];
 
 	Assert(BTreeTupleIsPosting(itup));
 
-	currItem->heapTid = *heapTid;
-	currItem->indexOffset = offnum;
-	if (so->currTuples)
+	/* copy the populated part of the items array */
+	item->heapTid = *heapTid;
+	item->indexOffset = offnum;
+	item->allVisible = false;
+
+	if (newbatch->currTuples)
 	{
 		/* Save base IndexTuple (truncate posting list) */
 		IndexTuple	base;
 		Size		itupsz = BTreeTupleGetPostingOffset(itup);
 
 		itupsz = MAXALIGN(itupsz);
-		currItem->tupleOffset = so->currPos.nextTupleOffset;
-		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		item->tupleOffset = *tupleOffset;
+		base = (IndexTuple) (newbatch->currTuples + *tupleOffset);
 		memcpy(base, itup, itupsz);
 		/* Defensively reduce work area index tuple header size */
 		base->t_info &= ~INDEX_SIZE_MASK;
 		base->t_info |= itupsz;
-		so->currPos.nextTupleOffset += itupsz;
+		*tupleOffset += itupsz;
 
-		return currItem->tupleOffset;
+		return item->tupleOffset;
 	}
 
 	return 0;
 }
 
 /*
- * Save an index item into so->currPos.items[itemIndex] for current posting
+ * Save an index item into newbatch.items[itemIndex] for current posting
  * tuple.
  *
  * Assumes that _bt_setuppostingitems() has already been called for current
- * posting list tuple.  Caller passes its return value as tupleOffset.
+ * posting list tuple.  Caller passes its return value as baseOffset.
  */
 static inline void
-_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
-					ItemPointer heapTid, int tupleOffset)
+_bt_savepostingitem(BatchIndexScan newbatch, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int baseOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *item = &newbatch->items[itemIndex];
 
-	currItem->heapTid = *heapTid;
-	currItem->indexOffset = offnum;
+	item->heapTid = *heapTid;
+	item->indexOffset = offnum;
+	item->allVisible = false;
 
 	/*
 	 * Have index-only scans return the same base IndexTuple for every TID
 	 * that originates from the same posting list
 	 */
-	if (so->currTuples)
-		currItem->tupleOffset = tupleOffset;
+	if (newbatch->currTuples)
+		item->tupleOffset = baseOffset;
 }
 
 #define LOOK_AHEAD_REQUIRED_RECHECKS 	3
@@ -2822,13 +2819,13 @@ new_prim_scan:
 	 * Note: We make a soft assumption that the current scan direction will
 	 * also be used within _bt_next, when it is asked to step off this page.
 	 * It is up to _bt_next to cancel this scheduled primitive index scan
-	 * whenever it steps to a page in the direction opposite currPos.dir.
+	 * whenever it steps to a page in the direction opposite pstate->dir.
 	 */
 	pstate->continuescan = false;	/* Tell _bt_readpage we're done... */
 	so->needPrimScan = true;	/* ...but call _bt_first again */
 
 	if (scan->parallel_scan)
-		_bt_parallel_primscan_schedule(scan, so->currPos.currPage);
+		_bt_parallel_primscan_schedule(scan, pstate->currpage);
 
 	/* Caller's tuple doesn't match the new qual */
 	return false;
@@ -2913,14 +2910,6 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
 	 * Restore the array keys to the state they were in immediately before we
 	 * were called.  This ensures that the arrays only ever ratchet in the
 	 * current scan direction.
-	 *
-	 * Without this, scans could overlook matching tuples when the scan
-	 * direction gets reversed just before btgettuple runs out of items to
-	 * return, but just after _bt_readpage prepares all the items from the
-	 * scan's final page in so->currPos.  When we're on the final page it is
-	 * typical for so->currPos to get invalidated once btgettuple finally
-	 * returns false, which'll effectively invalidate the scan's array keys.
-	 * That hasn't happened yet, though -- and in general it may never happen.
 	 */
 	_bt_start_array_keys(scan, -dir);
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3dec1ee65..4877ccb48 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -159,11 +159,12 @@ bthandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = btadjustmembers,
 		.ambeginscan = btbeginscan,
 		.amrescan = btrescan,
-		.amgettuple = btgettuple,
+		.amgettuple = NULL,
+		.amgetbatch = btgetbatch,
+		.amfreebatch = btfreebatch,
 		.amgetbitmap = btgetbitmap,
 		.amendscan = btendscan,
-		.ammarkpos = btmarkpos,
-		.amrestrpos = btrestrpos,
+		.amposreset = btposreset,
 		.amestimateparallelscan = btestimateparallelscan,
 		.aminitparallelscan = btinitparallelscan,
 		.amparallelrescan = btparallelrescan,
@@ -222,13 +223,13 @@ btinsert(Relation rel, Datum *values, bool *isnull,
 }
 
 /*
- *	btgettuple() -- Get the next tuple in the scan.
+ *	btgetbatch() -- Get the first or next batch of tuples in the scan
  */
-bool
-btgettuple(IndexScanDesc scan, ScanDirection dir)
+BatchIndexScan
+btgetbatch(IndexScanDesc scan, BatchIndexScan priorbatch, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	bool		res;
+	BatchIndexScan batch = priorbatch;
 
 	Assert(scan->heapRelation != NULL);
 
@@ -241,45 +242,20 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * If we've already initialized this scan, we can just advance it in
 		 * the appropriate direction.  If we haven't done so yet, we call
-		 * _bt_first() to get the first item in the scan.
+		 * _bt_first() to get the first batch in the scan.
 		 */
-		if (!BTScanPosIsValid(so->currPos))
-			res = _bt_first(scan, dir);
+		if (batch == NULL)
+			batch = _bt_first(scan, dir);
 		else
-		{
-			/*
-			 * Check to see if we should kill the previously-fetched tuple.
-			 */
-			if (scan->kill_prior_tuple)
-			{
-				/*
-				 * Yes, remember it for later. (We'll deal with all such
-				 * tuples at once right before leaving the index page.)  The
-				 * test for numKilled overrun is not just paranoia: if the
-				 * caller reverses direction in the indexscan then the same
-				 * item might get entered multiple times. It's not worth
-				 * trying to optimize that, so we don't detect it, but instead
-				 * just forget any excess entries.
-				 */
-				if (so->killedItems == NULL)
-					so->killedItems = palloc_array(int, MaxTIDsPerBTreePage);
-				if (so->numKilled < MaxTIDsPerBTreePage)
-					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
-			}
+			batch = _bt_next(scan, dir, batch);
 
-			/*
-			 * Now continue the scan.
-			 */
-			res = _bt_next(scan, dir);
-		}
-
-		/* If we have a tuple, return it ... */
-		if (res)
+		/* If we have a batch, return it ... */
+		if (batch)
 			break;
 		/* ... otherwise see if we need another primitive index scan */
 	} while (so->numArrayKeys && _bt_start_prim_scan(scan));
 
-	return res;
+	return batch;
 }
 
 /*
@@ -289,6 +265,7 @@ int64
 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BatchIndexScan batch;
 	int64		ntids = 0;
 	ItemPointer heapTid;
 
@@ -297,29 +274,29 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	/* Each loop iteration performs another primitive index scan */
 	do
 	{
-		/* Fetch the first page & tuple */
-		if (_bt_first(scan, ForwardScanDirection))
+		/* Fetch the first batch */
+		if ((batch = _bt_first(scan, ForwardScanDirection)))
 		{
-			/* Save tuple ID, and continue scanning */
-			heapTid = &scan->xs_heaptid;
+			int			itemIndex = 0;
+
+			/* Save first tuple's TID */
+			heapTid = &batch->items[itemIndex].heapTid;
 			tbm_add_tuples(tbm, heapTid, 1, false);
 			ntids++;
 
 			for (;;)
 			{
-				/*
-				 * Advance to next tuple within page.  This is the same as the
-				 * easy case in _bt_next().
-				 */
-				if (++so->currPos.itemIndex > so->currPos.lastItem)
+				/* Advance to next TID within page-sized batch */
+				if (++itemIndex > batch->lastItem)
 				{
 					/* let _bt_next do the heavy lifting */
-					if (!_bt_next(scan, ForwardScanDirection))
+					itemIndex = 0;
+					batch = _bt_next(scan, ForwardScanDirection, batch);
+					if (!batch)
 						break;
 				}
 
-				/* Save tuple ID, and continue scanning */
-				heapTid = &so->currPos.items[so->currPos.itemIndex].heapTid;
+				heapTid = &batch->items[itemIndex].heapTid;
 				tbm_add_tuples(tbm, heapTid, 1, false);
 				ntids++;
 			}
@@ -347,8 +324,6 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 
 	/* allocate private workspace */
 	so = palloc_object(BTScanOpaqueData);
-	BTScanPosInvalidate(so->currPos);
-	BTScanPosInvalidate(so->markPos);
 	if (scan->numberOfKeys > 0)
 		so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
 	else
@@ -362,19 +337,9 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->orderProcs = NULL;
 	so->arrayContext = NULL;
 
-	so->killedItems = NULL;		/* until needed */
-	so->numKilled = 0;
-
-	/*
-	 * We don't know yet whether the scan will be index-only, so we do not
-	 * allocate the tuple workspace arrays until btrescan.  However, we set up
-	 * scan->xs_itupdesc whether we'll need it or not, since that's so cheap.
-	 */
-	so->currTuples = so->markTuples = NULL;
-
-	scan->xs_itupdesc = RelationGetDescr(rel);
-
 	scan->opaque = so;
+	scan->xs_itupdesc = RelationGetDescr(rel);
+	scan->maxitemsbatch = MaxTIDsPerBTreePage;
 
 	return scan;
 }
@@ -388,72 +353,37 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pins */
-	if (BTScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_bt_killitems(scan);
-		BTScanPosUnpinIfPinned(so->currPos);
-		BTScanPosInvalidate(so->currPos);
-	}
-
-	/*
-	 * We prefer to eagerly drop leaf page pins before btgettuple returns.
-	 * This avoids making VACUUM wait to acquire a cleanup lock on the page.
-	 *
-	 * We cannot safely drop leaf page pins during index-only scans due to a
-	 * race condition involving VACUUM setting pages all-visible in the VM.
-	 * It's also unsafe for plain index scans that use a non-MVCC snapshot.
-	 *
-	 * When we drop pins eagerly, the mechanism that marks so->killedItems[]
-	 * index tuples LP_DEAD has to deal with concurrent TID recycling races.
-	 * The scheme used to detect unsafe TID recycling won't work when scanning
-	 * unlogged relations (since it involves saving an affected page's LSN).
-	 * Opt out of eager pin dropping during unlogged relation scans for now
-	 * (this is preferable to opting out of kill_prior_tuple LP_DEAD setting).
-	 *
-	 * Also opt out of dropping leaf page pins eagerly during bitmap scans.
-	 * Pins cannot be held for more than an instant during bitmap scans either
-	 * way, so we might as well avoid wasting cycles on acquiring page LSNs.
-	 *
-	 * See nbtree/README section on making concurrent TID recycling safe.
-	 *
-	 * Note: so->dropPin should never change across rescans.
-	 */
-	so->dropPin = (!scan->xs_want_itup &&
-				   IsMVCCSnapshot(scan->xs_snapshot) &&
-				   RelationNeedsWAL(scan->indexRelation) &&
-				   scan->heapRelation != NULL);
-
-	so->markItemIndex = -1;
-	so->needPrimScan = false;
-	so->scanBehind = false;
-	so->oppositeDirCheck = false;
-	BTScanPosUnpinIfPinned(so->markPos);
-	BTScanPosInvalidate(so->markPos);
-
-	/*
-	 * Allocate tuple workspace arrays, if needed for an index-only scan and
-	 * not already done in a previous rescan call.  To save on palloc
-	 * overhead, both workspaces are allocated as one palloc block; only this
-	 * function and btendscan know that.
-	 */
-	if (scan->xs_want_itup && so->currTuples == NULL)
-	{
-		so->currTuples = (char *) palloc(BLCKSZ * 2);
-		so->markTuples = so->currTuples + BLCKSZ;
-	}
-
 	/*
 	 * Reset the scan keys
 	 */
 	if (scankey && scan->numberOfKeys > 0)
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
+	so->needPrimScan = false;
+	so->scanBehind = false;
+	so->oppositeDirCheck = false;
 	so->numberOfKeys = 0;		/* until _bt_preprocess_keys sets it */
 	so->numArrayKeys = 0;		/* ditto */
 }
 
+/*
+ *	btfreebatch() -- Free batch resources, including its buffer pin
+ */
+void
+btfreebatch(IndexScanDesc scan, BatchIndexScan batch)
+{
+	if (batch->numKilled > 0)
+		_bt_killitems(scan, batch);
+
+	if (!scan->dropPin)
+	{
+		/* indexam_util_batch_unlock didn't unpin page earlier, do it now */
+		ReleaseBuffer(batch->buf);
+		batch->buf = InvalidBuffer;
+	}
+
+	indexam_util_batch_release(scan, batch);
+}
+
 /*
  *	btendscan() -- close down a scan
  */
@@ -462,116 +392,48 @@ btendscan(IndexScanDesc scan)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pins */
-	if (BTScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_bt_killitems(scan);
-		BTScanPosUnpinIfPinned(so->currPos);
-	}
-
-	so->markItemIndex = -1;
-	BTScanPosUnpinIfPinned(so->markPos);
-
-	/* No need to invalidate positions, the RAM is about to be freed. */
-
 	/* Release storage */
 	if (so->keyData != NULL)
 		pfree(so->keyData);
 	/* so->arrayKeys and so->orderProcs are in arrayContext */
 	if (so->arrayContext != NULL)
 		MemoryContextDelete(so->arrayContext);
-	if (so->killedItems != NULL)
-		pfree(so->killedItems);
-	if (so->currTuples != NULL)
-		pfree(so->currTuples);
-	/* so->markTuples should not be pfree'd, see btrescan */
 	pfree(so);
 }
 
 /*
- *	btmarkpos() -- save current scan position
+ *	btposreset() -- invalidate scan's array keys
  */
 void
-btmarkpos(IndexScanDesc scan)
+btposreset(IndexScanDesc scan, BatchIndexScan markbatch)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* There may be an old mark with a pin (but no lock). */
-	BTScanPosUnpinIfPinned(so->markPos);
+	if (!so->numArrayKeys)
+		return;
 
 	/*
-	 * Just record the current itemIndex.  If we later step to next page
-	 * before releasing the marked position, _bt_steppage makes a full copy of
-	 * the currPos struct in markPos.  If (as often happens) the mark is moved
-	 * before we leave the page, we don't have to do that work.
+	 * Core system is about to restore a mark associated with a previously
+	 * returned batch.  Reset the scan's arrays to make all this safe.
 	 */
-	if (BTScanPosIsValid(so->currPos))
-		so->markItemIndex = so->currPos.itemIndex;
+	_bt_start_array_keys(scan, markbatch->dir);
+
+	/*
+	 * Core system will invalidate all other batches.
+	 *
+	 * Deal with this by unsetting needPrimScan as well as moreRight (or as
+	 * well as moreLeft, when scanning backwards).  That way, the next time
+	 * _bt_next is called it will step to the right (or to the left).  At that
+	 * point _bt_readpage will restore the scan's arrays to elements that
+	 * correctly track the next page's position in the index's key space.
+	 */
+	if (ScanDirectionIsForward(markbatch->dir))
+		markbatch->moreRight = true;
 	else
-	{
-		BTScanPosInvalidate(so->markPos);
-		so->markItemIndex = -1;
-	}
-}
-
-/*
- *	btrestrpos() -- restore scan to last saved position
- */
-void
-btrestrpos(IndexScanDesc scan)
-{
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
-	if (so->markItemIndex >= 0)
-	{
-		/*
-		 * The scan has never moved to a new page since the last mark.  Just
-		 * restore the itemIndex.
-		 *
-		 * NB: In this case we can't count on anything in so->markPos to be
-		 * accurate.
-		 */
-		so->currPos.itemIndex = so->markItemIndex;
-	}
-	else
-	{
-		/*
-		 * The scan moved to a new page after last mark or restore, and we are
-		 * now restoring to the marked page.  We aren't holding any read
-		 * locks, but if we're still holding the pin for the current position,
-		 * we must drop it.
-		 */
-		if (BTScanPosIsValid(so->currPos))
-		{
-			/* Before leaving current page, deal with any killed items */
-			if (so->numKilled > 0)
-				_bt_killitems(scan);
-			BTScanPosUnpinIfPinned(so->currPos);
-		}
-
-		if (BTScanPosIsValid(so->markPos))
-		{
-			/* bump pin on mark buffer for assignment to current buffer */
-			if (BTScanPosIsPinned(so->markPos))
-				IncrBufferRefCount(so->markPos.buf);
-			memcpy(&so->currPos, &so->markPos,
-				   offsetof(BTScanPosData, items[1]) +
-				   so->markPos.lastItem * sizeof(BTScanPosItem));
-			if (so->currTuples)
-				memcpy(so->currTuples, so->markTuples,
-					   so->markPos.nextTupleOffset);
-			/* Reset the scan's array keys (see _bt_steppage for why) */
-			if (so->numArrayKeys)
-			{
-				_bt_start_array_keys(scan, so->currPos.dir);
-				so->needPrimScan = false;
-			}
-		}
-		else
-			BTScanPosInvalidate(so->currPos);
-	}
+		markbatch->moreLeft = true;
+	so->needPrimScan = false;
+	so->scanBehind = false;
+	so->oppositeDirCheck = false;
 }
 
 /*
@@ -887,15 +749,6 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
 	*next_scan_page = InvalidBlockNumber;
 	*last_curr_page = InvalidBlockNumber;
 
-	/*
-	 * Reset so->currPos, and initialize moreLeft/moreRight such that the next
-	 * call to _bt_readnextpage treats this backend similarly to a serial
-	 * backend that steps from *last_curr_page to *next_scan_page (unless this
-	 * backend's so->currPos is initialized by _bt_readfirstpage before then).
-	 */
-	BTScanPosInvalidate(so->currPos);
-	so->currPos.moreLeft = so->currPos.moreRight = true;
-
 	if (first)
 	{
 		/*
@@ -1045,8 +898,6 @@ _bt_parallel_done(IndexScanDesc scan)
 	BTParallelScanDesc btscan;
 	bool		status_changed = false;
 
-	Assert(!BTScanPosIsValid(so->currPos));
-
 	/* Do nothing, for non-parallel scans */
 	if (parallel_scan == NULL)
 		return;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 32ae0bda8..ee1c96861 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,53 +26,23 @@
 #include "utils/rel.h"
 
 
-static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so);
 static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
 							Buffer buf, bool forupdate, BTStack stack,
 							int access);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
-static inline void _bt_returnitem(IndexScanDesc scan, BTScanOpaque so);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum,
-							  ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
-							 BlockNumber lastcurrblkno, ScanDirection dir,
-							 bool seized);
+static BatchIndexScan _bt_readfirstpage(IndexScanDesc scan, BatchIndexScan firstbatch,
+										OffsetNumber offnum, ScanDirection dir);
+static BatchIndexScan _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+									   BlockNumber lastcurrblkno,
+									   ScanDirection dir, bool firstpage);
 static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
 										 BlockNumber lastcurrblkno);
-static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
+static BatchIndexScan _bt_endpoint(IndexScanDesc scan, ScanDirection dir,
+								   BatchIndexScan firstbatch);
 
 
-/*
- *	_bt_drop_lock_and_maybe_pin()
- *
- * Unlock so->currPos.buf.  If scan is so->dropPin, drop the pin, too.
- * Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
- */
-static inline void
-_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
-{
-	if (!so->dropPin)
-	{
-		/* Just drop the lock (not the pin) */
-		_bt_unlockbuf(rel, so->currPos.buf);
-		return;
-	}
-
-	/*
-	 * Drop both the lock and the pin.
-	 *
-	 * Have to set so->currPos.lsn so that _bt_killitems has a way to detect
-	 * when concurrent heap TID recycling by VACUUM might have taken place.
-	 */
-	Assert(RelationNeedsWAL(rel));
-	so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-	_bt_relbuf(rel, so->currPos.buf);
-	so->currPos.buf = InvalidBuffer;
-}
-
 /*
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
@@ -861,20 +831,16 @@ _bt_compare(Relation rel,
  *		conditions, and the tree ordering.  We find the first item (or,
  *		if backwards scan, the last item) in the tree that satisfies the
  *		qualifications in the scan key.  On success exit, data about the
- *		matching tuple(s) on the page has been loaded into so->currPos.  We'll
- *		drop all locks and hold onto a pin on page's buffer, except during
- *		so->dropPin scans, when we drop both the lock and the pin.
- *		_bt_returnitem sets the next item to return to scan on success exit.
+ *		matching tuple(s) on the page has been loaded into the returned batch.
  *
- * If there are no matching items in the index, we return false, with no
- * pins or locks held.  so->currPos will remain invalid.
+ * If there are no matching items in the index, we just return NULL.
  *
  * Note that scan->keyData[], and the so->keyData[] scankey built from it,
  * are both search-type scankeys (see nbtree/README for more about this).
  * Within this routine, we build a temporary insertion-type scankey to use
  * in locating the scan start position.
  */
-bool
+BatchIndexScan
 _bt_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
@@ -888,8 +854,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat_total = InvalidStrategy;
 	BlockNumber blkno = InvalidBlockNumber,
 				lastcurrblkno;
+	BatchIndexScan firstbatch;
 
-	Assert(!BTScanPosIsValid(so->currPos));
+	/* Allocate space for first batch */
+	firstbatch = indexam_util_batch_alloc(scan);
 
 	/*
 	 * Examine the scan keys and eliminate any redundant keys; also mark the
@@ -905,6 +873,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		Assert(!so->needPrimScan);
 		_bt_parallel_done(scan);
+		indexam_util_batch_release(scan, firstbatch);
 		return false;
 	}
 
@@ -914,7 +883,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 */
 	if (scan->parallel_scan != NULL &&
 		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, true))
-		return false;
+	{
+		indexam_util_batch_release(scan, firstbatch);
+		return false;			/* definitely done (so->needPrimScan is unset) */
+	}
 
 	/*
 	 * Initialize the scan's arrays (if any) for the current scan direction
@@ -931,14 +903,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * _bt_readnextpage releases the scan for us (not _bt_readfirstpage).
 		 */
 		Assert(scan->parallel_scan != NULL);
-		Assert(!so->needPrimScan);
-		Assert(blkno != P_NONE);
 
-		if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir, true))
-			return false;
+		indexam_util_batch_release(scan, firstbatch);
 
-		_bt_returnitem(scan, so);
-		return true;
+		return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, true);
 	}
 
 	/*
@@ -1238,7 +1206,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * Note: calls _bt_readfirstpage for us, which releases the parallel scan.
 	 */
 	if (keysz == 0)
-		return _bt_endpoint(scan, dir);
+		return _bt_endpoint(scan, dir, firstbatch);
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -1506,12 +1474,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * position ourselves on the target leaf page.
 	 */
 	Assert(ScanDirectionIsBackward(dir) == inskey.backward);
-	stack = _bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ);
+	stack = _bt_search(rel, NULL, &inskey, &firstbatch->buf, BT_READ);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
 
-	if (!BufferIsValid(so->currPos.buf))
+	if (unlikely(!BufferIsValid(firstbatch->buf)))
 	{
 		Assert(!so->needPrimScan);
 
@@ -1527,23 +1495,24 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		if (IsolationIsSerializable())
 		{
 			PredicateLockRelation(rel, scan->xs_snapshot);
-			stack = _bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ);
+			stack = _bt_search(rel, NULL, &inskey, &firstbatch->buf, BT_READ);
 			_bt_freestack(stack);
 		}
 
-		if (!BufferIsValid(so->currPos.buf))
+		if (!BufferIsValid(firstbatch->buf))
 		{
 			_bt_parallel_done(scan);
+			indexam_util_batch_release(scan, firstbatch);
 			return false;
 		}
 	}
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, &inskey, so->currPos.buf);
+	offnum = _bt_binsrch(rel, &inskey, firstbatch->buf);
 
 	/*
 	 * Now load data from the first page of the scan (usually the page
-	 * currently in so->currPos.buf).
+	 * currently in firstbatch.buf).
 	 *
 	 * If inskey.nextkey = false and inskey.backward = false, offnum is
 	 * positioned at the first non-pivot tuple >= inskey.scankeys.
@@ -1561,168 +1530,69 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * for the page.  For example, when inskey is both < the leaf page's high
 	 * key and > all of its non-pivot tuples, offnum will be "maxoff + 1".
 	 */
-	if (!_bt_readfirstpage(scan, offnum, dir))
-		return false;
-
-	_bt_returnitem(scan, so);
-	return true;
+	return _bt_readfirstpage(scan, firstbatch, offnum, dir);
 }
 
 /*
  *	_bt_next() -- Get the next item in a scan.
  *
- *		On entry, so->currPos describes the current page, which may be pinned
- *		but is not locked, and so->currPos.itemIndex identifies which item was
- *		previously returned.
+ *		On entry, priorbatch describes the batch that was last returned by
+ *		btgetbatch.  We'll use the prior batch's positioning information to
+ *		decide which page to read next.
  *
- *		On success exit, so->currPos is updated as needed, and _bt_returnitem
- *		sets the next item to return to the scan.  so->currPos remains valid.
+ *		On success exit, returns the next batch.  There must be at least one
+ *		matching tuple on any returned batch (else we'd just return NULL).
  *
- *		On failure exit (no more tuples), we invalidate so->currPos.  It'll
- *		still be possible for the scan to return tuples by changing direction,
- *		though we'll need to call _bt_first anew in that other direction.
+ *		On failure exit (no more tuples), we return NULL.  It'll still be
+ *		possible for the scan to return tuples by changing direction, though
+ *		we'll need to call _bt_first anew in that other direction.
  */
-bool
-_bt_next(IndexScanDesc scan, ScanDirection dir)
-{
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
-	Assert(BTScanPosIsValid(so->currPos));
-
-	/*
-	 * Advance to next tuple on current page; or if there's no more, try to
-	 * step to the next page with data.
-	 */
-	if (ScanDirectionIsForward(dir))
-	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
-		{
-			if (!_bt_steppage(scan, dir))
-				return false;
-		}
-	}
-	else
-	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
-		{
-			if (!_bt_steppage(scan, dir))
-				return false;
-		}
-	}
-
-	_bt_returnitem(scan, so);
-	return true;
-}
-
-/*
- * Return the index item from so->currPos.items[so->currPos.itemIndex] to the
- * index scan by setting the relevant fields in caller's index scan descriptor
- */
-static inline void
-_bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
-{
-	BTScanPosItem *currItem = &so->currPos.items[so->currPos.itemIndex];
-
-	/* Most recent _bt_readpage must have succeeded */
-	Assert(BTScanPosIsValid(so->currPos));
-	Assert(so->currPos.itemIndex >= so->currPos.firstItem);
-	Assert(so->currPos.itemIndex <= so->currPos.lastItem);
-
-	/* Return next item, per amgettuple contract */
-	scan->xs_heaptid = currItem->heapTid;
-	if (so->currTuples)
-		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
-}
-
-/*
- *	_bt_steppage() -- Step to next page containing valid data for scan
- *
- * Wrapper on _bt_readnextpage that performs final steps for the current page.
- *
- * On entry, so->currPos must be valid.  Its buffer will be pinned, though
- * never locked. (Actually, when so->dropPin there won't even be a pin held,
- * though so->currPos.currPage must still be set to a valid block number.)
- */
-static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+BatchIndexScan
+_bt_next(IndexScanDesc scan, ScanDirection dir, BatchIndexScan priorbatch)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	BlockNumber blkno,
 				lastcurrblkno;
 
-	Assert(BTScanPosIsValid(so->currPos));
-
-	/* Before leaving current page, deal with any killed items */
-	if (so->numKilled > 0)
-		_bt_killitems(scan);
-
-	/*
-	 * Before we modify currPos, make a copy of the page data if there was a
-	 * mark position that needs it.
-	 */
-	if (so->markItemIndex >= 0)
-	{
-		/* bump pin on current buffer for assignment to mark buffer */
-		if (BTScanPosIsPinned(so->currPos))
-			IncrBufferRefCount(so->currPos.buf);
-		memcpy(&so->markPos, &so->currPos,
-			   offsetof(BTScanPosData, items[1]) +
-			   so->currPos.lastItem * sizeof(BTScanPosItem));
-		if (so->markTuples)
-			memcpy(so->markTuples, so->currTuples,
-				   so->currPos.nextTupleOffset);
-		so->markPos.itemIndex = so->markItemIndex;
-		so->markItemIndex = -1;
-
-		/*
-		 * If we're just about to start the next primitive index scan
-		 * (possible with a scan that has arrays keys, and needs to skip to
-		 * continue in the current scan direction), moreLeft/moreRight only
-		 * indicate the end of the current primitive index scan.  They must
-		 * never be taken to indicate that the top-level index scan has ended
-		 * (that would be wrong).
-		 *
-		 * We could handle this case by treating the current array keys as
-		 * markPos state.  But depending on the current array state like this
-		 * would add complexity.  Instead, we just unset markPos's copy of
-		 * moreRight or moreLeft (whichever might be affected), while making
-		 * btrestrpos reset the scan's arrays to their initial scan positions.
-		 * In effect, btrestrpos leaves advancing the arrays up to the first
-		 * _bt_readpage call (that takes place after it has restored markPos).
-		 */
-		if (so->needPrimScan)
-		{
-			if (ScanDirectionIsForward(so->currPos.dir))
-				so->markPos.moreRight = true;
-			else
-				so->markPos.moreLeft = true;
-		}
-
-		/* mark/restore not supported by parallel scans */
-		Assert(!scan->parallel_scan);
-	}
-
-	BTScanPosUnpinIfPinned(so->currPos);
+	Assert(BlockNumberIsValid(priorbatch->currPage));
 
 	/* Walk to the next page with data */
 	if (ScanDirectionIsForward(dir))
-		blkno = so->currPos.nextPage;
+		blkno = priorbatch->nextPage;
 	else
-		blkno = so->currPos.prevPage;
-	lastcurrblkno = so->currPos.currPage;
+		blkno = priorbatch->prevPage;
+	lastcurrblkno = priorbatch->currPage;
 
 	/*
-	 * Cancel primitive index scans that were scheduled when the call to
-	 * _bt_readpage for currPos happened to use the opposite direction to the
-	 * one that we're stepping in now.  (It's okay to leave the scan's array
-	 * keys as-is, since the next _bt_readpage will advance them.)
+	 * Cancel primitive index scans that were scheduled when priorbatch's call
+	 * to _bt_readpage happened to use the opposite direction to the one that
+	 * we're stepping in now.  (It's okay to leave the scan's array keys
+	 * as-is, since the next _bt_readpage will advance them.)
 	 */
-	if (so->currPos.dir != dir)
+	if (priorbatch->dir != dir)
 		so->needPrimScan = false;
 
+	if (blkno == P_NONE ||
+		(ScanDirectionIsForward(dir) ?
+		 !priorbatch->moreRight : !priorbatch->moreLeft))
+	{
+		/*
+		 * priorbatch _bt_readpage call ended scan in this direction (though
+		 * if so->needPrimScan was set the scan will continue in _bt_first)
+		 */
+		_bt_parallel_done(scan);
+		return NULL;
+	}
+
+	/* parallel scan must seize the scan to get next blkno */
+	if (scan->parallel_scan != NULL &&
+		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		return NULL;			/* done iff so->needPrimScan wasn't set */
+
 	return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
 }
 
+
 /*
  *	_bt_readfirstpage() -- Read first page containing valid data for _bt_first
  *
@@ -1732,73 +1602,90 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  * to stop the scan on this page by calling _bt_checkkeys against the high
  * key.  See _bt_readpage for full details.
  *
- * On entry, so->currPos must be pinned and locked (so offnum stays valid).
+ * On entry, firstbatch must be pinned and locked (so offnum stays valid).
  * Parallel scan callers must have seized the scan before calling here.
  *
- * On exit, we'll have updated so->currPos and retained locks and pins
+ * On exit, we'll have updated firstbatch and retained locks and pins
  * according to the same rules as those laid out for _bt_readnextpage exit.
- * Like _bt_readnextpage, our return value indicates if there are any matching
- * records in the given direction.
  *
  * We always release the scan for a parallel scan caller, regardless of
  * success or failure; we'll call _bt_parallel_release as soon as possible.
  */
-static bool
-_bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
+static BatchIndexScan
+_bt_readfirstpage(IndexScanDesc scan, BatchIndexScan firstbatch,
+				  OffsetNumber offnum, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BlockNumber blkno,
+				lastcurrblkno;
 
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
-
-	/* Initialize so->currPos for the first page (page in so->currPos.buf) */
+	/* Initialize firstbatch's position for the first page */
 	if (so->needPrimScan)
 	{
 		Assert(so->numArrayKeys);
 
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = true;
+		firstbatch->moreLeft = true;
+		firstbatch->moreRight = true;
 		so->needPrimScan = false;
 	}
 	else if (ScanDirectionIsForward(dir))
 	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
+		firstbatch->moreLeft = false;
+		firstbatch->moreRight = true;
 	}
 	else
 	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
+		firstbatch->moreLeft = true;
+		firstbatch->moreRight = false;
 	}
 
 	/*
 	 * Attempt to load matching tuples from the first page.
 	 *
-	 * Note that _bt_readpage will finish initializing the so->currPos fields.
+	 * Note that _bt_readpage will finish initializing the firstbatch fields.
 	 * _bt_readpage also releases parallel scan (even when it returns false).
 	 */
-	if (_bt_readpage(scan, dir, offnum, true))
+	if (_bt_readpage(scan, firstbatch, dir, offnum, true))
 	{
-		Relation	rel = scan->indexRelation;
-
-		/*
-		 * _bt_readpage succeeded.  Drop the lock (and maybe the pin) on
-		 * so->currPos.buf in preparation for btgettuple returning tuples.
-		 */
-		Assert(BTScanPosIsPinned(so->currPos));
-		_bt_drop_lock_and_maybe_pin(rel, so);
-		return true;
+		/* _bt_readpage saved one or more matches in firstbatch.items[] */
+		indexam_util_batch_unlock(scan, firstbatch);
+		return firstbatch;
 	}
 
-	/* There's no actually-matching data on the page in so->currPos.buf */
-	_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+	/* There's no actually-matching data on the page */
+	_bt_relbuf(scan->indexRelation, firstbatch->buf);
+	firstbatch->buf = InvalidBuffer;
 
-	/* Call _bt_readnextpage using its _bt_steppage wrapper function */
-	if (!_bt_steppage(scan, dir))
-		return false;
+	/* Walk to the next page with data */
+	if (ScanDirectionIsForward(dir))
+		blkno = firstbatch->nextPage;
+	else
+		blkno = firstbatch->prevPage;
+	lastcurrblkno = firstbatch->currPage;
 
-	/* _bt_readpage for a later page (now in so->currPos) succeeded */
-	return true;
+	Assert(firstbatch->dir == dir);
+
+	if (blkno == P_NONE ||
+		(ScanDirectionIsForward(dir) ?
+		 !firstbatch->moreRight : !firstbatch->moreLeft))
+	{
+		/*
+		 * firstbatch _bt_readpage call ended scan in this direction (though
+		 * if so->needPrimScan was set the scan will continue in _bt_first)
+		 */
+		indexam_util_batch_release(scan, firstbatch);
+		_bt_parallel_done(scan);
+		return NULL;
+	}
+
+	indexam_util_batch_release(scan, firstbatch);
+
+	/* parallel scan must seize the scan to get next blkno */
+	if (scan->parallel_scan != NULL &&
+		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		return NULL;			/* done iff so->needPrimScan wasn't set */
+
+	return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
 }
 
 /*
@@ -1808,102 +1695,65 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
  * previously-saved right link or left link.  lastcurrblkno is the page that
  * was current at the point where the blkno link was saved, which we use to
  * reason about concurrent page splits/page deletions during backwards scans.
- * In the common case where seized=false, blkno is either so->currPos.nextPage
- * or so->currPos.prevPage, and lastcurrblkno is so->currPos.currPage.
+ * blkno is the prior batch's nextPage or prevPage (depending on the current
+ * scan direction), and lastcurrblkno is the prior batch's currPage.
  *
- * On entry, so->currPos shouldn't be locked by caller.  so->currPos.buf must
- * be InvalidBuffer/unpinned as needed by caller (note that lastcurrblkno
- * won't need to be read again in almost all cases).  Parallel scan callers
- * that seized the scan before calling here should pass seized=true; such a
- * caller's blkno and lastcurrblkno arguments come from the seized scan.
- * seized=false callers just pass us the blkno/lastcurrblkno taken from their
- * so->currPos, which (along with so->currPos itself) can be used to end the
- * scan.  A seized=false caller's blkno can never be assumed to be the page
- * that must be read next during a parallel scan, though.  We must figure that
- * part out for ourselves by seizing the scan (the correct page to read might
- * already be beyond the seized=false caller's blkno during a parallel scan,
- * unless blkno/so->currPos.nextPage/so->currPos.prevPage is already P_NONE,
- * or unless so->currPos.moreRight/so->currPos.moreLeft is already unset).
+ * On entry, no page should be locked by caller.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page, and we return true.  We hold a pin on the buffer on
- * success exit (except during so->dropPin index scans, when we drop the pin
- * eagerly to avoid blocking VACUUM).
+ * On success exit, returns batch containing data from the next page that has
+ * at least one matching item.  If there are no more matching items in the
+ * given scan direction, we just return NULL.
  *
- * If there are no more matching records in the given direction, we invalidate
- * so->currPos (while ensuring it retains no locks or pins), and return false.
- *
- * We always release the scan for a parallel scan caller, regardless of
- * success or failure; we'll call _bt_parallel_release as soon as possible.
+ * Parallel scan callers must seize the scan before calling here.  blkno and
+ * lastcurrblkno should come from the seized scan.  We'll release the scan as
+ * soon as possible.
  */
-static bool
+static BatchIndexScan
 _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
-				 BlockNumber lastcurrblkno, ScanDirection dir, bool seized)
+				 BlockNumber lastcurrblkno, ScanDirection dir, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BatchIndexScan newbatch;
 
-	Assert(so->currPos.currPage == lastcurrblkno || seized);
-	Assert(!(blkno == P_NONE && seized));
-	Assert(!BTScanPosIsPinned(so->currPos));
+	/* Allocate space for next batch */
+	newbatch = indexam_util_batch_alloc(scan);
 
 	/*
-	 * Remember that the scan already read lastcurrblkno, a page to the left
-	 * of blkno (or remember reading a page to the right, for backwards scans)
+	 * newbatch will be the batch for lastcurrblkno, a page to the left of
+	 * blkno (or to the right, when the scan is moving backwards)
 	 */
-	if (ScanDirectionIsForward(dir))
-		so->currPos.moreLeft = true;
-	else
-		so->currPos.moreRight = true;
+	newbatch->moreLeft = true;
+	newbatch->moreRight = true;
 
 	for (;;)
 	{
 		Page		page;
 		BTPageOpaque opaque;
 
-		if (blkno == P_NONE ||
-			(ScanDirectionIsForward(dir) ?
-			 !so->currPos.moreRight : !so->currPos.moreLeft))
-		{
-			/* most recent _bt_readpage call (for lastcurrblkno) ended scan */
-			Assert(so->currPos.currPage == lastcurrblkno && !seized);
-			BTScanPosInvalidate(so->currPos);
-			_bt_parallel_done(scan);	/* iff !so->needPrimScan */
-			return false;
-		}
-
-		Assert(!so->needPrimScan);
-
-		/* parallel scan must never actually visit so->currPos blkno */
-		if (!seized && scan->parallel_scan != NULL &&
-			!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
-		{
-			/* whole scan is now done (or another primitive scan required) */
-			BTScanPosInvalidate(so->currPos);
-			return false;
-		}
+		Assert(!((BTScanOpaque) scan->opaque)->needPrimScan);
+		Assert(blkno != P_NONE && lastcurrblkno != P_NONE);
 
 		if (ScanDirectionIsForward(dir))
 		{
 			/* read blkno, but check for interrupts first */
 			CHECK_FOR_INTERRUPTS();
-			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			newbatch->buf = _bt_getbuf(rel, blkno, BT_READ);
 		}
 		else
 		{
 			/* read blkno, avoiding race (also checks for interrupts) */
-			so->currPos.buf = _bt_lock_and_validate_left(rel, &blkno,
-														 lastcurrblkno);
-			if (so->currPos.buf == InvalidBuffer)
+			newbatch->buf = _bt_lock_and_validate_left(rel, &blkno,
+													   lastcurrblkno);
+			if (newbatch->buf == InvalidBuffer)
 			{
 				/* must have been a concurrent deletion of leftmost page */
-				BTScanPosInvalidate(so->currPos);
 				_bt_parallel_done(scan);
-				return false;
+				indexam_util_batch_release(scan, newbatch);
+				return NULL;
 			}
 		}
 
-		page = BufferGetPage(so->currPos.buf);
+		page = BufferGetPage(newbatch->buf);
 		opaque = BTPageGetOpaque(page);
 		lastcurrblkno = blkno;
 		if (likely(!P_IGNORE(opaque)))
@@ -1911,17 +1761,17 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 			/* see if there are any matches on this page */
 			if (ScanDirectionIsForward(dir))
 			{
-				/* note that this will clear moreRight if we can stop */
-				if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque), seized))
+				if (_bt_readpage(scan, newbatch, dir,
+								 P_FIRSTDATAKEY(opaque), firstpage))
 					break;
-				blkno = so->currPos.nextPage;
+				blkno = newbatch->nextPage;
 			}
 			else
 			{
-				/* note that this will clear moreLeft if we can stop */
-				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), seized))
+				if (_bt_readpage(scan, newbatch, dir,
+								 PageGetMaxOffsetNumber(page), firstpage))
 					break;
-				blkno = so->currPos.prevPage;
+				blkno = newbatch->prevPage;
 			}
 		}
 		else
@@ -1936,19 +1786,39 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 		}
 
 		/* no matching tuples on this page */
-		_bt_relbuf(rel, so->currPos.buf);
-		seized = false;			/* released by _bt_readpage (or by us) */
+		_bt_relbuf(rel, newbatch->buf);
+		newbatch->buf = InvalidBuffer;
+
+		/* Continue the scan in this direction? */
+		if (blkno == P_NONE ||
+			(ScanDirectionIsForward(dir) ?
+			 !newbatch->moreRight : !newbatch->moreLeft))
+		{
+			/*
+			 * blkno _bt_readpage call ended scan in this direction (though if
+			 * so->needPrimScan was set the scan will continue in _bt_first)
+			 */
+			_bt_parallel_done(scan);
+			indexam_util_batch_release(scan, newbatch);
+			return NULL;
+		}
+
+		/* parallel scan must seize the scan to get next blkno */
+		if (scan->parallel_scan != NULL &&
+			!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		{
+			indexam_util_batch_release(scan, newbatch);
+			return NULL;		/* done iff so->needPrimScan wasn't set */
+		}
+
+		firstpage = false;		/* next page cannot be first */
 	}
 
-	/*
-	 * _bt_readpage succeeded.  Drop the lock (and maybe the pin) on
-	 * so->currPos.buf in preparation for btgettuple returning tuples.
-	 */
-	Assert(so->currPos.currPage == blkno);
-	Assert(BTScanPosIsPinned(so->currPos));
-	_bt_drop_lock_and_maybe_pin(rel, so);
+	/* _bt_readpage saved one or more matches in newbatch.items[] */
+	Assert(newbatch->currPage == blkno);
+	indexam_util_batch_unlock(scan, newbatch);
 
-	return true;
+	return newbatch;
 }
 
 /*
@@ -2174,25 +2044,23 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost)
  * Parallel scan callers must have seized the scan before calling here.
  * Exit conditions are the same as for _bt_first().
  */
-static bool
-_bt_endpoint(IndexScanDesc scan, ScanDirection dir)
+static BatchIndexScan
+_bt_endpoint(IndexScanDesc scan, ScanDirection dir, BatchIndexScan firstbatch)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber start;
 
-	Assert(!BTScanPosIsValid(so->currPos));
-	Assert(!so->needPrimScan);
+	Assert(!((BTScanOpaque) scan->opaque)->needPrimScan);
 
 	/*
 	 * Scan down to the leftmost or rightmost leaf page.  This is a simplified
 	 * version of _bt_search().
 	 */
-	so->currPos.buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
+	firstbatch->buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
 
-	if (!BufferIsValid(so->currPos.buf))
+	if (!BufferIsValid(firstbatch->buf))
 	{
 		/*
 		 * Empty index. Lock the whole relation, as nothing finer to lock
@@ -2203,7 +2071,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 		return false;
 	}
 
-	page = BufferGetPage(so->currPos.buf);
+	page = BufferGetPage(firstbatch->buf);
 	opaque = BTPageGetOpaque(page);
 	Assert(P_ISLEAF(opaque));
 
@@ -2229,9 +2097,5 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * Now load data from the first page of the scan.
 	 */
-	if (!_bt_readfirstpage(scan, start, dir))
-		return false;
-
-	_bt_returnitem(scan, so);
-	return true;
+	return _bt_readfirstpage(scan, firstbatch, start, dir);
 }
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5c50f0dd1..b505d7876 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -21,15 +21,12 @@
 #include "access/reloptions.h"
 #include "access/relscan.h"
 #include "commands/progress.h"
-#include "common/int.h"
-#include "lib/qunique.h"
 #include "miscadmin.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 
 
-static int	_bt_compare_int(const void *va, const void *vb);
 static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
 						   IndexTuple firstright, BTScanInsert itup_key);
 
@@ -160,104 +157,69 @@ _bt_freestack(BTStack stack)
 	}
 }
 
-/*
- * qsort comparison function for int arrays
- */
-static int
-_bt_compare_int(const void *va, const void *vb)
-{
-	int			a = *((const int *) va);
-	int			b = *((const int *) vb);
-
-	return pg_cmp_s32(a, b);
-}
-
 /*
  * _bt_killitems - set LP_DEAD state for items an indexscan caller has
  * told us were killed
  *
- * scan->opaque, referenced locally through so, contains information about the
- * current page and killed tuples thereon (generally, this should only be
- * called if so->numKilled > 0).
+ * The batch parameter contains information about the current page and killed
+ * tuples thereon (this should only be called if batch->numKilled > 0).
  *
- * Caller should not have a lock on the so->currPos page, but must hold a
- * buffer pin when !so->dropPin.  When we return, it still won't be locked.
- * It'll continue to hold whatever pins were held before calling here.
+ * Caller should not have a lock on the batch position's page, but must hold a
+ * buffer pin when !dropPin.  When we return, it still won't be locked.  It'll
+ * continue to hold whatever pins were held before calling here.
  *
  * We match items by heap TID before assuming they are the right ones to set
  * LP_DEAD.  If the scan is one that holds a buffer pin on the target page
  * continuously from initially reading the items until applying this function
- * (if it is a !so->dropPin scan), VACUUM cannot have deleted any items on the
+ * (if it is a !dropPin scan), VACUUM cannot have deleted any items on the
  * page, so the page's TIDs can't have been recycled by now.  There's no risk
  * that we'll confuse a new index tuple that happens to use a recycled TID
  * with a now-removed tuple with the same TID (that used to be on this same
  * page).  We can't rely on that during scans that drop buffer pins eagerly
- * (so->dropPin scans), though, so we must condition setting LP_DEAD bits on
+ * (i.e. dropPin scans), though, so we must condition setting LP_DEAD bits on
  * the page LSN having not changed since back when _bt_readpage saw the page.
  * We totally give up on setting LP_DEAD bits when the page LSN changed.
  *
- * We give up much less often during !so->dropPin scans, but it still happens.
+ * We tend to give up less often during !dropPin scans, but it still happens.
  * We cope with cases where items have moved right due to insertions.  If an
  * item has moved off the current page due to a split, we'll fail to find it
  * and just give up on it.
  */
 void
-_bt_killitems(IndexScanDesc scan)
+_bt_killitems(IndexScanDesc scan, BatchIndexScan batch)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber minoff;
 	OffsetNumber maxoff;
-	int			numKilled = so->numKilled;
 	bool		killedsomething = false;
 	Buffer		buf;
 
-	Assert(numKilled > 0);
-	Assert(BTScanPosIsValid(so->currPos));
+	Assert(batch->numKilled > 0);
+	Assert(BlockNumberIsValid(batch->currPage));
 	Assert(scan->heapRelation != NULL); /* can't be a bitmap index scan */
 
-	/* Always invalidate so->killedItems[] before leaving so->currPos */
-	so->numKilled = 0;
-
-	/*
-	 * We need to iterate through so->killedItems[] in leaf page order; the
-	 * loop below expects this (when marking posting list tuples, at least).
-	 * so->killedItems[] is now in whatever order the scan returned items in.
-	 * Scrollable cursor scans might have even saved the same item/TID twice.
-	 *
-	 * Sort and unique-ify so->killedItems[] to deal with all this.
-	 */
-	if (numKilled > 1)
-	{
-		qsort(so->killedItems, numKilled, sizeof(int), _bt_compare_int);
-		numKilled = qunique(so->killedItems, numKilled, sizeof(int),
-							_bt_compare_int);
-	}
-
-	if (!so->dropPin)
+	if (!scan->dropPin)
 	{
 		/*
 		 * We have held the pin on this page since we read the index tuples,
 		 * so all we need to do is lock it.  The pin will have prevented
 		 * concurrent VACUUMs from recycling any of the TIDs on the page.
 		 */
-		Assert(BTScanPosIsPinned(so->currPos));
-		buf = so->currPos.buf;
+		buf = batch->buf;
 		_bt_lockbuf(rel, buf, BT_READ);
 	}
 	else
 	{
 		XLogRecPtr	latestlsn;
 
-		Assert(!BTScanPosIsPinned(so->currPos));
 		Assert(RelationNeedsWAL(rel));
-		buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
+		buf = _bt_getbuf(rel, batch->currPage, BT_READ);
 
 		latestlsn = BufferGetLSNAtomic(buf);
-		Assert(so->currPos.lsn <= latestlsn);
-		if (so->currPos.lsn != latestlsn)
+		Assert(batch->lsn <= latestlsn);
+		if (batch->lsn != latestlsn)
 		{
 			/* Modified, give up on hinting */
 			_bt_relbuf(rel, buf);
@@ -272,17 +234,16 @@ _bt_killitems(IndexScanDesc scan)
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
-	/* Iterate through so->killedItems[] in leaf page order */
-	for (int i = 0; i < numKilled; i++)
+	/* Iterate through batch->killedItems[] in leaf page order */
+	for (int i = 0; i < batch->numKilled; i++)
 	{
-		int			itemIndex = so->killedItems[i];
-		BTScanPosItem *kitem = &so->currPos.items[itemIndex];
+		int			itemIndex = batch->killedItems[i];
+		BatchMatchingItem *kitem = &batch->items[itemIndex];
 		OffsetNumber offnum = kitem->indexOffset;
 
-		Assert(itemIndex >= so->currPos.firstItem &&
-			   itemIndex <= so->currPos.lastItem);
+		Assert(itemIndex >= batch->firstItem && itemIndex <= batch->lastItem);
 		Assert(i == 0 ||
-			   offnum >= so->currPos.items[so->killedItems[i - 1]].indexOffset);
+			   offnum >= batch->items[batch->killedItems[i - 1]].indexOffset);
 
 		if (offnum < minoff)
 			continue;			/* pure paranoia */
@@ -300,7 +261,7 @@ _bt_killitems(IndexScanDesc scan)
 
 				/*
 				 * Note that the page may have been modified in almost any way
-				 * since we first read it (in the !so->dropPin case), so it's
+				 * since we first read it (in the !dropPin case), so it's
 				 * possible that this posting list tuple wasn't a posting list
 				 * tuple when we first encountered its heap TIDs.
 				 */
@@ -316,7 +277,7 @@ _bt_killitems(IndexScanDesc scan)
 					 * though only in the common case where the page can't
 					 * have been concurrently modified
 					 */
-					Assert(kitem->indexOffset == offnum || !so->dropPin);
+					Assert(kitem->indexOffset == offnum || !scan->dropPin);
 
 					/*
 					 * Read-ahead to later kitems here.
@@ -332,8 +293,8 @@ _bt_killitems(IndexScanDesc scan)
 					 * kitem is also the last heap TID in the last index tuple
 					 * correctly -- posting tuple still gets killed).
 					 */
-					if (pi < numKilled)
-						kitem = &so->currPos.items[so->killedItems[pi++]];
+					if (pi < batch->numKilled)
+						kitem = &batch->items[batch->killedItems[pi++]];
 				}
 
 				/*
@@ -383,7 +344,7 @@ _bt_killitems(IndexScanDesc scan)
 		MarkBufferDirtyHint(buf, true);
 	}
 
-	if (!so->dropPin)
+	if (!scan->dropPin)
 		_bt_unlockbuf(rel, buf);
 	else
 		_bt_relbuf(rel, buf);
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 9f5379b87..a18a2fa9e 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -88,10 +88,11 @@ spghandler(PG_FUNCTION_ARGS)
 		.ambeginscan = spgbeginscan,
 		.amrescan = spgrescan,
 		.amgettuple = spggettuple,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = spggetbitmap,
 		.amendscan = spgendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7bb11168..ea0add1c9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -135,7 +135,7 @@ static void show_recursive_union_info(RecursiveUnionState *rstate,
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
-static void show_indexsearches_info(PlanState *planstate, ExplainState *es);
+static void show_indexscan_info(PlanState *planstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
@@ -1972,7 +1972,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_IndexOnlyScan:
 			show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1986,15 +1986,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
-			if (es->analyze)
-				ExplainPropertyFloat("Heap Fetches", NULL,
-									 planstate->instrument->ntuples2, 0, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_BitmapIndexScan:
 			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
 						   "Index Cond", planstate, ancestors, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_BitmapHeapScan:
 			show_scan_qual(((BitmapHeapScan *) plan)->bitmapqualorig,
@@ -3858,15 +3855,16 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 }
 
 /*
- * Show the total number of index searches for a
+ * Show index scan related executor instrumentation for a
  * IndexScan/IndexOnlyScan/BitmapIndexScan node
  */
 static void
-show_indexsearches_info(PlanState *planstate, ExplainState *es)
+show_indexscan_info(PlanState *planstate, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	SharedIndexScanInstrumentation *SharedInfo = NULL;
-	uint64		nsearches = 0;
+	uint64		nsearches = 0,
+				nheapfetches = 0;
 
 	if (!es->analyze)
 		return;
@@ -3887,6 +3885,7 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 				IndexOnlyScanState *indexstate = ((IndexOnlyScanState *) planstate);
 
 				nsearches = indexstate->ioss_Instrument.nsearches;
+				nheapfetches = indexstate->ioss_Instrument.nheapfetches;
 				SharedInfo = indexstate->ioss_SharedInfo;
 				break;
 			}
@@ -3910,9 +3909,13 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 			IndexScanInstrumentation *winstrument = &SharedInfo->winstrument[i];
 
 			nsearches += winstrument->nsearches;
+			nheapfetches += winstrument->nheapfetches;
 		}
 	}
 
+	if (nodeTag(plan) == T_IndexOnlyScan)
+		ExplainPropertyUInteger("Heap Fetches", NULL, nheapfetches, es);
+
 	ExplainPropertyUInteger("Index Searches", NULL, nsearches, es);
 }
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 635679cc1..54c2403da 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -884,7 +884,7 @@ DefineIndex(ParseState *pstate,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("access method \"%s\" does not support multicolumn indexes",
 						accessMethodName)));
-	if (exclusion && amRoutine->amgettuple == NULL)
+	if (exclusion && amRoutine->amgettuple == NULL && amRoutine->amgetbatch == NULL)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("access method \"%s\" does not support exclusion constraints",
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 90a68c0d1..0dfb01337 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -428,7 +428,7 @@ ExecSupportsMarkRestore(Path *pathnode)
 		case T_IndexOnlyScan:
 
 			/*
-			 * Not all index types support mark/restore.
+			 * Not all index types support restoring a mark
 			 */
 			return castNode(IndexPath, pathnode)->indexinfo->amcanmarkpos;
 
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 6ae0f9595..3041d5c72 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -816,10 +816,12 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index, false, &DirtySnapshot, NULL,
+								 indnkeyatts, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
-	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
+	while (table_index_getnext_slot(index_scan, ForwardScanDirection,
+									existing_slot))
 	{
 		TransactionId xwait;
 		XLTW_Oper	reason_wait;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 72f2bff77..ab487b9e6 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -205,7 +205,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
 	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel, false, &snap, NULL, skey_attoff, 0);
 
 retry:
 	found = false;
@@ -213,7 +213,7 @@ retry:
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
 	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, outslot))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, outslot))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
@@ -666,12 +666,12 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid,
 	 * not yet committed or those just committed prior to the scan are
 	 * excluded in update_most_recent_deletion_info().
 	 */
-	scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel, false, SnapshotAny, NULL, skey_attoff, 0);
 
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
 	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, scanslot))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 058a59ef5..2580a0139 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -202,6 +202,7 @@ ExecEndBitmapIndexScan(BitmapIndexScanState *node)
 		 * which will have a new BitmapIndexScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->biss_Instrument.nsearches;
+		Assert(node->biss_Instrument.nheapfetches == 0);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index c2d093745..ff3e8f302 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -34,14 +34,12 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/tupdesc.h"
-#include "access/visibilitymap.h"
 #include "catalog/pg_type.h"
 #include "executor/executor.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/predicate.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
@@ -65,7 +63,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
-	ItemPointer tid;
 
 	/*
 	 * extract necessary information from index scan node
@@ -90,18 +87,14 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->ioss_RelationDesc,
+								   node->ioss_RelationDesc, true,
 								   estate->es_snapshot,
 								   &node->ioss_Instrument,
 								   node->ioss_NumScanKeys,
 								   node->ioss_NumOrderByKeys);
 
 		node->ioss_ScanDesc = scandesc;
-
-
-		/* Set it up for index-only scan */
-		node->ioss_ScanDesc->xs_want_itup = true;
-		node->ioss_VMBuffer = InvalidBuffer;
+		Assert(node->ioss_ScanDesc->xs_want_itup);
 
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
@@ -118,78 +111,10 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while (table_index_getnext_slot(scandesc, direction, node->ioss_TableSlot))
 	{
-		bool		tuple_from_heap = false;
-
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We can skip the heap fetch if the TID references a heap page on
-		 * which all tuples are known visible to everybody.  In any case,
-		 * we'll use the index tuple not the heap tuple as the data source.
-		 *
-		 * Note on Memory Ordering Effects: visibilitymap_get_status does not
-		 * lock the visibility map buffer, and therefore the result we read
-		 * here could be slightly stale.  However, it can't be stale enough to
-		 * matter.
-		 *
-		 * We need to detect clearing a VM bit due to an insert right away,
-		 * because the tuple is present in the index page but not visible. The
-		 * reading of the TID by this scan (using a shared lock on the index
-		 * buffer) is serialized with the insert of the TID into the index
-		 * (using an exclusive lock on the index buffer). Because the VM bit
-		 * is cleared before updating the index, and locking/unlocking of the
-		 * index page acts as a full memory barrier, we are sure to see the
-		 * cleared bit if we see a recently-inserted TID.
-		 *
-		 * Deletes do not update the index page (only VACUUM will clear out
-		 * the TID), so the clearing of the VM bit by a delete is not
-		 * serialized with this test below, and we may see a value that is
-		 * significantly stale. However, we don't care about the delete right
-		 * away, because the tuple is still visible until the deleting
-		 * transaction commits or the statement ends (if it's our
-		 * transaction). In either case, the lock on the VM buffer will have
-		 * been released (acting as a write barrier) after clearing the bit.
-		 * And for us to have a snapshot that includes the deleting
-		 * transaction (making the tuple invisible), we must have acquired
-		 * ProcArrayLock after that time, acting as a read barrier.
-		 *
-		 * It's worth going through this complexity to avoid needing to lock
-		 * the VM buffer, which could cause significant contention.
-		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
-							ItemPointerGetBlockNumber(tid),
-							&node->ioss_VMBuffer))
-		{
-			/*
-			 * Rats, we have to visit the heap to check visibility.
-			 */
-			InstrCountTuples2(node, 1);
-			if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
-				continue;		/* no visible tuple, try next index entry */
-
-			ExecClearTuple(node->ioss_TableSlot);
-
-			/*
-			 * Only MVCC snapshots are supported here, so there should be no
-			 * need to keep following the HOT chain once a visible entry has
-			 * been found.  If we did want to allow that, we'd need to keep
-			 * more state to remember not to call index_getnext_tid next time.
-			 */
-			if (scandesc->xs_heap_continue)
-				elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
-
-			/*
-			 * Note: at this point we are holding a pin on the heap page, as
-			 * recorded in scandesc->xs_cbuf.  We could release that pin now,
-			 * but it's not clear whether it's a win to do so.  The next index
-			 * entry might require a visit to the same heap page.
-			 */
-
-			tuple_from_heap = true;
-		}
-
 		/*
 		 * Fill the scan tuple slot with data from the index.  This might be
 		 * provided in either HeapTuple or IndexTuple format.  Conceivably an
@@ -238,16 +163,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("lossy distance functions are not supported in index-only scans")));
-
-		/*
-		 * If we didn't access the heap, then we'll need to take a predicate
-		 * lock explicitly, as if we had.  For now we do that at page level.
-		 */
-		if (!tuple_from_heap)
-			PredicateLockPage(scandesc->heapRelation,
-							  ItemPointerGetBlockNumber(tid),
-							  estate->es_snapshot);
-
 		return slot;
 	}
 
@@ -407,13 +322,6 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 	indexRelationDesc = node->ioss_RelationDesc;
 	indexScanDesc = node->ioss_ScanDesc;
 
-	/* Release VM buffer pin, if any. */
-	if (node->ioss_VMBuffer != InvalidBuffer)
-	{
-		ReleaseBuffer(node->ioss_VMBuffer);
-		node->ioss_VMBuffer = InvalidBuffer;
-	}
-
 	/*
 	 * When ending a parallel worker, copy the statistics gathered by the
 	 * worker back into shared memory so that it can be picked up by the main
@@ -433,6 +341,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->ioss_Instrument.nsearches;
+		winstrument->nheapfetches += node->ioss_Instrument.nheapfetches;
 	}
 
 	/*
@@ -784,13 +693,12 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->ioss_RelationDesc,
+								 node->ioss_RelationDesc, true,
 								 &node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan);
-	node->ioss_ScanDesc->xs_want_itup = true;
-	node->ioss_VMBuffer = InvalidBuffer;
+	Assert(node->ioss_ScanDesc->xs_want_itup);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -850,12 +758,12 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->ioss_RelationDesc,
+								 node->ioss_RelationDesc, true,
 								 &node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan);
-	node->ioss_ScanDesc->xs_want_itup = true;
+	Assert(node->ioss_ScanDesc->xs_want_itup);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 84823f0b6..c34a13a87 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -107,7 +107,7 @@ IndexNext(IndexScanState *node)
 		 * serially executing an index scan that was planned to be parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->iss_RelationDesc,
+								   node->iss_RelationDesc, false,
 								   estate->es_snapshot,
 								   &node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -128,7 +128,7 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while (table_index_getnext_slot(scandesc, direction, slot))
 	{
 		CHECK_FOR_INTERRUPTS();
 
@@ -203,7 +203,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * serially executing an index scan that was planned to be parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->iss_RelationDesc,
+								   node->iss_RelationDesc, false,
 								   estate->es_snapshot,
 								   &node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -260,7 +260,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * Fetch next tuple from the index.
 		 */
 next_indextuple:
-		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+		if (!table_index_getnext_slot(scandesc, ForwardScanDirection, slot))
 		{
 			/*
 			 * No more tuples from the index.  But we still need to drain any
@@ -812,6 +812,7 @@ ExecEndIndexScan(IndexScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->iss_Instrument.nsearches;
+		Assert(node->iss_Instrument.nheapfetches == 0);
 	}
 
 	/*
@@ -1719,7 +1720,7 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->iss_RelationDesc,
+								 node->iss_RelationDesc, false,
 								 &node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
@@ -1783,7 +1784,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->iss_RelationDesc,
+								 node->iss_RelationDesc, false,
 								 &node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 51b9d6677..03598b50d 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -45,7 +45,7 @@
 /* Whether we are looking for plain indexscan, bitmap scan, or either */
 typedef enum
 {
-	ST_INDEXSCAN,				/* must support amgettuple */
+	ST_INDEXSCAN,				/* must support amgettuple or amgetbatch */
 	ST_BITMAPSCAN,				/* must support amgetbitmap */
 	ST_ANYSCAN,					/* either is okay */
 } ScanTypeControl;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index a90e1c9ee..39a5ea299 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -313,11 +313,11 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 				info->amsearcharray = amroutine->amsearcharray;
 				info->amsearchnulls = amroutine->amsearchnulls;
 				info->amcanparallel = amroutine->amcanparallel;
-				info->amhasgettuple = (amroutine->amgettuple != NULL);
+				info->amhasgettuple = (amroutine->amgettuple != NULL ||
+									   amroutine->amgetbatch != NULL);
 				info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
 					relation->rd_tableam->scan_bitmap_next_tuple != NULL;
-				info->amcanmarkpos = (amroutine->ammarkpos != NULL &&
-									  amroutine->amrestrpos != NULL);
+				info->amcanmarkpos = amroutine->amposreset != NULL;
 				info->amcostestimate = amroutine->amcostestimate;
 				Assert(info->amcostestimate != NULL);
 
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 0b1d80b5b..76b0d035f 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -890,7 +890,8 @@ IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap)
 	 * The given index access method must implement "amgettuple", which will
 	 * be used later to fetch the tuples.  See RelationFindReplTupleByIndex().
 	 */
-	if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL)
+	if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL &&
+		GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgetbatch == NULL)
 		return false;
 
 	return true;
diff --git a/src/backend/utils/adt/amutils.c b/src/backend/utils/adt/amutils.c
index c81fb61a0..ddfd1b55c 100644
--- a/src/backend/utils/adt/amutils.c
+++ b/src/backend/utils/adt/amutils.c
@@ -363,10 +363,11 @@ indexam_property(FunctionCallInfo fcinfo,
 				PG_RETURN_BOOL(routine->amclusterable);
 
 			case AMPROP_INDEX_SCAN:
-				PG_RETURN_BOOL(routine->amgettuple ? true : false);
+				PG_RETURN_BOOL(routine->amgettuple != NULL ||
+							   routine->amgetbatch != NULL);
 
 			case AMPROP_BITMAP_SCAN:
-				PG_RETURN_BOOL(routine->amgetbitmap ? true : false);
+				PG_RETURN_BOOL(routine->amgetbitmap != NULL);
 
 			case AMPROP_BACKWARD_SCAN:
 				PG_RETURN_BOOL(routine->amcanbackward);
@@ -392,7 +393,8 @@ indexam_property(FunctionCallInfo fcinfo,
 			PG_RETURN_BOOL(routine->amcanmulticol);
 
 		case AMPROP_CAN_EXCLUDE:
-			PG_RETURN_BOOL(routine->amgettuple ? true : false);
+			PG_RETURN_BOOL(routine->amgettuple != NULL ||
+						   routine->amgetbatch != NULL);
 
 		case AMPROP_CAN_INCLUDE:
 			PG_RETURN_BOOL(routine->amcaninclude);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 29fec6555..a7213654a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -102,7 +102,6 @@
 #include "access/gin.h"
 #include "access/table.h"
 #include "access/tableam.h"
-#include "access/visibilitymap.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_statistic.h"
@@ -7124,13 +7123,10 @@ get_actual_variable_endpoint(Relation heapRel,
 	bool		have_data = false;
 	SnapshotData SnapshotNonVacuumable;
 	IndexScanDesc index_scan;
-	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber last_heap_block = InvalidBlockNumber;
-	int			n_visited_heap_pages = 0;
-	ItemPointer tid;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 	MemoryContext oldcontext;
+	IndexScanInstrumentation instrument;
 
 	/*
 	 * We use the index-only-scan machinery for this.  With mostly-static
@@ -7179,56 +7175,32 @@ get_actual_variable_endpoint(Relation heapRel,
 	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
 							  GlobalVisTestFor(heapRel));
 
-	index_scan = index_beginscan(heapRel, indexRel,
-								 &SnapshotNonVacuumable, NULL,
+	/*
+	 * Set it up for instrumented index-only scan.  We need the
+	 * instrumentation to monitor the number of heap fetches.
+	 */
+	memset(&instrument, 0, sizeof(instrument));
+	index_scan = index_beginscan(heapRel, indexRel, true,
+								 &SnapshotNonVacuumable, &instrument,
 								 1, 0);
-	/* Set it up for index-only scan */
-	index_scan->xs_want_itup = true;
+	Assert(index_scan->xs_want_itup);
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
+	while (table_index_getnext_slot(index_scan, indexscandir, tableslot))
 	{
-		BlockNumber block = ItemPointerGetBlockNumber(tid);
+		/* We don't actually need the heap tuple for anything */
+		ExecClearTuple(tableslot);
 
-		if (!VM_ALL_VISIBLE(heapRel,
-							block,
-							&vmbuffer))
-		{
-			/* Rats, we have to visit the heap to check visibility */
-			if (!index_fetch_heap(index_scan, tableslot))
-			{
-				/*
-				 * No visible tuple for this index entry, so we need to
-				 * advance to the next entry.  Before doing so, count heap
-				 * page fetches and give up if we've done too many.
-				 *
-				 * We don't charge a page fetch if this is the same heap page
-				 * as the previous tuple.  This is on the conservative side,
-				 * since other recently-accessed pages are probably still in
-				 * buffers too; but it's good enough for this heuristic.
-				 */
+		/*
+		 * No visible tuple for this index entry, so we need to advance to the
+		 * next entry.  Before doing so, count heap page fetches and give up
+		 * if we've done too many.
+		 */
 #define VISITED_PAGES_LIMIT 100
 
-				if (block != last_heap_block)
-				{
-					last_heap_block = block;
-					n_visited_heap_pages++;
-					if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
-						break;
-				}
-
-				continue;		/* no visible tuple, try next index entry */
-			}
-
-			/* We don't actually need the heap tuple for anything */
-			ExecClearTuple(tableslot);
-
-			/*
-			 * We don't care whether there's more than one visible tuple in
-			 * the HOT chain; if any are visible, that's good enough.
-			 */
-		}
+		if (instrument.nheapfetches > VISITED_PAGES_LIMIT)
+			break;
 
 		/*
 		 * We expect that the index will return data in IndexTuple not
@@ -7261,8 +7233,6 @@ get_actual_variable_endpoint(Relation heapRel,
 		break;
 	}
 
-	if (vmbuffer != InvalidBuffer)
-		ReleaseBuffer(vmbuffer);
 	index_endscan(index_scan);
 
 	return have_data;
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 5111cdc6d..7fd98dba6 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -146,10 +146,11 @@ blhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = blbeginscan,
 		.amrescan = blrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = blgetbitmap,
 		.amendscan = blendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index f48da3185..8df91bc38 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -167,10 +167,11 @@ typedef struct IndexAmRoutine
     ambeginscan_function ambeginscan;
     amrescan_function amrescan;
     amgettuple_function amgettuple;     /* can be NULL */
+    amgetbatch_function amgetbatch; /* can be NULL */
+    amfreebatch_function amfreebatch;	/* can be NULL */
     amgetbitmap_function amgetbitmap;   /* can be NULL */
     amendscan_function amendscan;
-    ammarkpos_function ammarkpos;       /* can be NULL */
-    amrestrpos_function amrestrpos;     /* can be NULL */
+    amposreset_function amposreset; /* can be NULL */
 
     /* interface functions to support parallel index scans */
     amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
@@ -749,6 +750,138 @@ amgettuple (IndexScanDesc scan,
    <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
    struct must be set to NULL.
   </para>
+  <note>
+   <para>
+    As of <productname>PostgreSQL</productname> version 19, position marking
+    and restoration of scans is no longer supported for the
+    <function>amgettuple</function> interface; only the
+    <function>amgetbatch</function> interface supports this feature through
+    the <function>amposreset</function> callback.
+   </para>
+  </note>
+
+  <para>
+<programlisting>
+BatchIndexScan
+amgetbatch (IndexScanDesc scan,
+            BatchIndexScan priorbatch,
+            ScanDirection direction);
+</programlisting>
+   Return the next batch of index tuples in the given scan, moving in the
+   given direction (forward or backward in the index).  Returns an instance of
+   <type>BatchIndexScan</type> with index tuples loaded, or
+   <literal>NULL</literal> if there are no more index tuples.
+  </para>
+
+  <para>
+   The <literal>priorbatch</literal> parameter passes the batch previously
+   returned by an earlier <function>amgetbatch</function> call (or
+   <literal>NULL</literal> on the first call).  The index AM uses
+   <literal>priorbatch</literal> to determine which index page to read next,
+   typically by following page links found in <literal>priorbatch</literal>.
+   The returned batch contains matching items immediately adjacent to those
+   from <literal>priorbatch</literal> in the common case where
+   <literal>priorbatch</literal> is the batch that was returned by the most
+   recent call to <function>amgetbatch</function> call (though not when the
+   most recent call used the opposite scan direction to this call, and not
+   when a mark has been restored).
+  </para>
+
+  <para>
+   A batch returned by <function>amgetbatch</function> is guaranteed to be
+   associated with an index page containing at least one matching tuple.
+   The index page associated with the batch may be retained in a buffer with
+   its pin held as an interlock against concurrent TID recycling by
+   <command>VACUUM</command>.  See <xref linkend="index-locking"/> for details
+   on buffer pin management during <quote>plain</quote> index scans.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> interface does not support index-only
+   scans that return data via the <literal>xs_hitup</literal> mechanism.
+   Index-only scans are supported through the <literal>xs_itup</literal>
+   mechanism only.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> function need only be provided if the
+   access method supports <quote>plain</quote> index scans.  If it doesn't,
+   the <function>amgetbatch</function> field in its
+   <structname>IndexAmRoutine</structname> struct must be set to NULL.
+  </para>
+
+  <para>
+   A <type>BatchIndexScan</type> that is returned by
+   <function>amgetbatch</function> is no longer managed by the access method.
+   It is up to the table AM caller to decide when it should be freed by
+   passing it to <function>amfreebatch</function>.  Note also that
+   <function>amgetbatch</function> functions must never modify the
+   <structfield>priorbatch</structfield> parameter.
+  </para>
+
+  <para>
+   The access method may provide only one of <function>amgettuple</function>
+   and <function>amgetbatch</function> callbacks, not both (XXX uncertain).
+   When the access method provides <function>amgetbatch</function>, it must
+   also provide <function>amfreebatch</function>.
+  </para>
+
+  <para>
+   The same caveats described for <function>amgettuple</function> apply here
+   too: an entry in the returned batch means only that the index contains
+   an entry that matches the scan keys, not that the tuple necessarily still
+   exists in the heap or will pass the caller's snapshot test.
+  </para>
+
+  <para>
+<programlisting>
+void
+amfreebatch (IndexScanDesc scan,
+             BatchIndexScan batch);
+</programlisting>
+   Releases a batch returned by the <function>amgetbatch</function> callback.
+   This function is called exclusively by table access methods to indicate
+   that processing of the batch is complete; it should never be called within
+   the index access method itself.
+  </para>
+
+  <para>
+   <function>amfreebatch</function> frees buffer pins held on the batch's
+   associated index page and releases related memory and resources.  These
+   buffer pins serve as an interlock against concurrent TID recycling by
+   <command>VACUUM</command>, protecting the table access method from confusion
+   about which TID corresponds to which logical row.  See <xref
+   linkend="index-locking"/> for detailed discussion of buffer pin management.
+  </para>
+
+  <para>
+   The index AM may choose to retain its own buffer pins across multiple
+   <function>amfreebatch</function> calls when this serves an internal purpose
+   (for example, maintaining a descent stack of pinned index pages for reuse
+   across <function>amgetbatch</function> calls).  However, any scheme that
+   retains buffer pins must keep the number of retained pins fixed and small,
+   to avoid exhausting the backend's buffer pin limit.
+  </para>
+
+  <para>
+   The index AM has the option of setting <literal>LP_DEAD</literal> bits in
+   the index page to mark dead tuples before releasing the buffer pin.  When
+   <literal>IndexScanDescData.dropPin</literal> is true and the buffer pin is being
+   dropped eagerly, the index AM must check <literal>BatchIndexScan.lsn</literal>
+   to verify that the page LSN has not advanced since the batch was originally
+   read before setting <literal>LP_DEAD</literal> bits, to avoid concurrent
+   TID recycling hazards.  When <literal>IndexScanDescData.dropPin</literal>
+   is false (requiring that a buffer pin be held throughout first reading the
+   index leaf page and calling <function>amfreebatch</function>),
+   <literal>LP_DEAD</literal> bits can always be set safely without an LSN
+   check.
+  </para>
+
+  <para>
+   The <function>amfreebatch</function> function need only be provided if the
+   access method provides <function>amgetbatch</function>. Otherwise it has to
+   remain set to <literal>NULL</literal>.
+  </para>
 
   <para>
 <programlisting>
@@ -768,8 +901,8 @@ amgetbitmap (IndexScanDesc scan,
    itself, and therefore callers recheck both the scan conditions and the
    partial index predicate (if any) for recheckable tuples.  That might not
    always be true, however.
-   <function>amgetbitmap</function> and
-   <function>amgettuple</function> cannot be used in the same index scan; there
+   Only one of <function>amgetbitmap</function>, <function>amgettuple</function>,
+   or <function>amgetbatch</function> can be used in any given index scan; there
    are other restrictions too when using <function>amgetbitmap</function>, as explained
    in <xref linkend="index-scanning"/>.
   </para>
@@ -795,32 +928,25 @@ amendscan (IndexScanDesc scan);
   <para>
 <programlisting>
 void
-ammarkpos (IndexScanDesc scan);
+amposreset (IndexScanDesc scan);
 </programlisting>
-   Mark current scan position.  The access method need only support one
-   remembered scan position per scan.
+   Notify index AM that core code will change the scan's position to an item
+   returned as part of an earlier batch.  The index AM must therefore
+   invalidate any state that independently tracks the scan's progress
+   (e.g., array keys used with a ScalarArrayOpExpr qual).  Called by the core
+   system when it is about to restore a mark.
   </para>
 
   <para>
-   The <function>ammarkpos</function> function need only be provided if the access
-   method supports ordered scans.  If it doesn't,
-   the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
-   struct may be set to NULL.
-  </para>
-
-  <para>
-<programlisting>
-void
-amrestrpos (IndexScanDesc scan);
-</programlisting>
-   Restore the scan to the most recently marked position.
-  </para>
-
-  <para>
-   The <function>amrestrpos</function> function need only be provided if the access
-   method supports ordered scans.  If it doesn't,
-   the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
-   struct may be set to NULL.
+   The <function>amposreset</function> function can only be provided if the
+   access method supports ordered scans through the <function>amgetbatch</function>
+   interface.  If it doesn't, the <structfield>amposreset</structfield> field
+   in its <structname>IndexAmRoutine</structname> struct should be set to
+   NULL.  Index AMs that don't have any private state that might need to be
+   invalidated might still find it useful to provide an empty
+   <structfield>amposreset</structfield> function; if <function>amposreset</function>
+   is set to NULL, the core system will assume that it is unsafe to restore a
+   marked position.
   </para>
 
   <para>
@@ -994,30 +1120,47 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
   </para>
 
   <para>
-   The <function>amgettuple</function> function has a <literal>direction</literal> argument,
+   The <function>amgettuple</function> and <function>amgetbatch</function>
+   functions have a <literal>direction</literal> argument,
    which can be either <literal>ForwardScanDirection</literal> (the normal case)
    or  <literal>BackwardScanDirection</literal>.  If the first call after
    <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
    set of matching index entries is to be scanned back-to-front rather than in
-   the normal front-to-back direction, so <function>amgettuple</function> must return
-   the last matching tuple in the index, rather than the first one as it
-   normally would.  (This will only occur for access
-   methods that set <structfield>amcanorder</structfield> to true.)  After the
-   first call, <function>amgettuple</function> must be prepared to advance the scan in
+   the normal front-to-back direction.  In this case,
+   <function>amgettuple</function> must return the last matching tuple in the
+   index, rather than the first one as it normally would.  Similarly,
+   <function>amgetbatch</function> must return the last matching batch of items
+   when either the first call after <function>amrescan</function> specifies
+   <literal>BackwardScanDirection</literal>, or a subsequent call has
+   <literal>NULL</literal> as its <structfield>priorbatch</structfield> argument
+   (indicating a backward scan restart).  (This backward-scan behavior will
+   only occur for access methods that set <structfield>amcanorder</structfield>
+   to true.)  After the first call, both <function>amgettuple</function> and
+   <function>amgetbatch</function> must be prepared to advance the scan in
    either direction from the most recently returned entry.  (But if
    <structfield>amcanbackward</structfield> is false, all subsequent
    calls will have the same direction as the first one.)
   </para>
 
   <para>
-   Access methods that support ordered scans must support <quote>marking</quote> a
-   position in a scan and later returning to the marked position.  The same
-   position might be restored multiple times.  However, only one position need
-   be remembered per scan; a new <function>ammarkpos</function> call overrides the
-   previously marked position.  An access method that does not support ordered
-   scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
-   functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
-   instead.
+   Access methods using the <function>amgetbatch</function> interface may
+   support <quote>marking</quote> a position in a scan and later returning to
+   the marked position, though this is optional.  If the same marked position
+   might be restored multiple times, the core system manages marking and
+   restoration through the <function>index_batch_mark_pos</function> and
+   <function>index_batch_restore_pos</function> internal functions.  When a
+   marked position is restored, the index AM is notified via the
+   <function>amposreset</function> callback so it can invalidate any private
+   state that independently tracks the scan's progress (such as array key
+   state).
+  </para>
+
+  <para>
+   The <function>amposreset</function> function in <structname>IndexAmRoutine</structname>
+   should be set to NULL for access methods that do not support mark/restore.
+   For access methods that do support this feature, <function>amposreset</function>
+   must be provided (though it can be a no-op function if the AM has no private
+   state to invalidate).
   </para>
 
   <para>
@@ -1186,6 +1329,94 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    reduce the frequency of such transaction cancellations.
   </para>
 
+  <sect2 id="index-locking-batches">
+   <title>Batch Scanning and Buffer Pin Management</title>
+
+   <para>
+    Index access methods that implement the <function>amgetbatch</function>
+    interface must cooperate with the core system to manage buffer pins in a
+    way that prevents concurrent <command>VACUUM</command> from creating
+    TID recycling hazards.  Unlike <function>amgettuple</function> scans,
+    which keep the index access method in control of scan progression,
+    <function>amgetbatch</function> scans give control to the table access
+    method, which may fetch table tuples in a different order than the index
+    entries were returned.  This creates the need for explicit buffer pin
+    management to ensure the table access method does not confuse a recycled
+    TID with the original row it meant to reference.
+   </para>
+
+   <para>
+    When <function>amgetbatch</function> returns a batch, the batch's
+    associated index page may be retained in a buffer with a pin held on it.
+    This pin serves as an interlock: <command>VACUUM</command> cannot recycle
+    TIDs on a pinned page.  The buffer pin protects only the table access
+    method's ability to map TIDs to rows correctly; it does not protect the
+    index structure itself.  Index access methods may use pins for other
+    purposes (for example, maintaining a descent stack of pinned pages), but
+    those uses are internal to the access method and independent of the
+    table-AM synchronization described here.
+   </para>
+
+   <para>
+    Whether a pin should be held when returning a batch is controlled by the
+    <structfield>dropPin</structfield> flag in the <type>BatchRingBuffer</type>
+    structure. When <literal>dropPin</literal> is true, the index access method
+    drops the pin before returning the batch, which avoids blocking
+    <command>VACUUM</command>. When <literal>dropPin</literal> is false, the
+    index access method must hold the pin until the batch is freed via
+    <function>amfreebatch</function>.  The core system sets the
+    <literal>dropPin</literal> flag based on scan type: it is true for
+    MVCC-compliant snapshots on logged relations (unless index-only scans are
+    in use), and false otherwise.
+   </para>
+
+   <para>
+    When <literal>dropPin</literal> is true and the index access method is
+    eager about dropping pins, it must save the page's LSN in the batch before
+    returning. Later, when <function>amfreebatch</function> is called and the
+    access method wishes to set <literal>LP_DEAD</literal> bits to mark dead
+    tuples, it must verify that the page's LSN has not changed since the batch
+    was read. If the LSN has changed, the page may have been modified by
+    concurrent activity and it is unsafe to set <literal>LP_DEAD</literal> bits.
+    This LSN-based validation scheme protects against TID recycling races when
+    pins have been dropped.  When <literal>dropPin</literal> is false, the pin
+    prevents unsafe concurrent removal of table TID references by
+    <command>VACUUM</command>, so no LSN check is necessary.
+   </para>
+
+   <para>
+    The core system provides three utility functions for managing batch
+    resources:
+    <function>indexam_util_batch_alloc</function> allocates a new batch or
+    reuses a cached one,
+    <function>indexam_util_batch_unlock</function> drops the lock and
+    conditionally drops the pin on a batch's index page (based on the
+    <literal>dropPin</literal> setting), and
+    <function>indexam_util_batch_release</function> frees or caches a batch.
+    Index access methods should use these utilities rather than managing
+    buffers directly.  The <filename>src/backend/access/nbtree/</filename>
+    implementation provides a reference example of correct usage.
+   </para>
+
+   <para>
+    Note that <function>amfreebatch</function> is called only by the core code
+    and table access method, never by the index access method itself. The
+    index AM must not assume that a call to <function>amfreebatch</function>
+    will take place before another call to <function>amgetbatch</function>
+    (for the same index scan) takes place.
+   </para>
+
+   <para>
+    The index AM must also avoid relying on the core code calling
+    <function>amfreebatch</function> with batches that are in any particular
+    order.  For example, it is not okay for an index AM to assume that calls
+    to <function>amfreebatch</function> will take place in the same order as
+    the <function>amgetbatch</function> calls that initially
+    allocated/populated/returned each batch.
+   </para>
+
+  </sect2>
+
  </sect1>
 
  <sect1 id="index-unique-checks">
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 77c5a763d..55b7222e9 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1152,12 +1152,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      </para>
 
      <para>
-      The access method must support <literal>amgettuple</literal> (see <xref
-      linkend="indexam"/>); at present this means <acronym>GIN</acronym>
-      cannot be used.  Although it's allowed, there is little point in using
-      B-tree or hash indexes with an exclusion constraint, because this
-      does nothing that an ordinary unique constraint doesn't do better.
-      So in practice the access method will always be <acronym>GiST</acronym> or
+      The access method must support either <literal>amgettuple</literal>
+      or <literal>amgetbatch</literal> (see <xref linkend="indexam"/>); at
+      present this means <acronym>GIN</acronym> cannot be used.  Although
+      it's allowed, there is little point in using B-tree or hash indexes
+      with an exclusion constraint, because this does nothing that an
+      ordinary unique constraint doesn't do better.  So in practice the
+      access method will always be <acronym>GiST</acronym> or
       <acronym>SP-GiST</acronym>.
      </para>
 
diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c
index 9eb8f0a6c..9d4fddec4 100644
--- a/src/test/modules/dummy_index_am/dummy_index_am.c
+++ b/src/test/modules/dummy_index_am/dummy_index_am.c
@@ -317,10 +317,11 @@ dihandler(PG_FUNCTION_ARGS)
 		.ambeginscan = dibeginscan,
 		.amrescan = direscan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = NULL,
 		.amendscan = diendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 14dec2d49..d25ba3412 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -226,8 +226,6 @@ BTScanInsertData
 BTScanKeyPreproc
 BTScanOpaque
 BTScanOpaqueData
-BTScanPosData
-BTScanPosItem
 BTShared
 BTSortArrayContext
 BTSpool
@@ -3471,12 +3469,10 @@ amgettuple_function
 aminitparallelscan_function
 aminsert_function
 aminsertcleanup_function
-ammarkpos_function
 amoptions_function
 amparallelrescan_function
 amproperty_function
 amrescan_function
-amrestrpos_function
 amtranslate_cmptype_function
 amtranslate_strategy_function
 amvacuumcleanup_function
-- 
2.51.0



^ permalink  raw  reply  [nested|flat] 2+ messages in thread

* Re: index prefetching
@ 2026-01-18 23:51  Peter Geoghegan <[email protected]>
  parent: Peter Geoghegan <[email protected]>
  0 siblings, 0 replies; 2+ messages in thread

From: Peter Geoghegan @ 2026-01-18 23:51 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Andres Freund <[email protected]>; Thomas Munro <[email protected]>; Nazir Bilal Yavuz <[email protected]>; Robert Haas <[email protected]>; Melanie Plageman <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On Tue, Jan 13, 2026 at 3:36 PM Peter Geoghegan <[email protected]> wrote:
> The batch stopped applying again. Attached is v7.

Attached is v8. A lot has changed in just the past week. We're feeling
optimistic about getting I/O prefetching into Postgres 19, and so
we've decided to fully commit ourselves to that.

All of the remaining kludges that we've relied on up until now have
either been removed, or replaced with a principled approach. Notable
changes in this v8 compared to v7:

* We no longer "reset" the read stream to deal with exhaustion of the
available slots for batches. We now rely on a patch of Thomas Munro's
[1] to pause the read stream from within the new heapam callback, in
the event of running out of batch slots.

When this happens we pause only for as long as it takes for the scan
to return enough items to make the scan need to advance scanPos to the
next batch in the ring buffer (obviously, this next batch must be
already loaded in this scenario). At that point we'll resume the read
stream, which will once again be able to call amgetbatch to store
another batch in the just-freed slot.

Running out of batch slots like this happens rarely, so it's important
to have a reasonably simple and elegant solution. Which this seems to
be. Importantly, there's no "ping ponging" behavior here.

* Removed grotty heuristics in our read stream call back to avoid regressions.

These first appeared many months ago, and no longer appear to be
necessary. There's still some regressions with pathological cases, but
they seem to be well within acceptable bounds. I saw about a 10%
remaining increase in query execution time for an adversarially
crafted query that Tomas came up with back in August [2]. That query
provided the original justification for my inventing those heuristics.

* We now have a real strategy around resource management and buffer pins.

A concern long held by Andres (and others) was that holding on to many
index leaf page buffer pins would somehow conflict with read stream's
careful management of heap page buffer pins [3]. We were particularly
concerned about hard to hit cases where the read stream is somehow
limited by recently acquired buffer pins for index pages. This was
probably the thing that made me most doubtful about our ability to get
the prefetching patch in shape for Postgres 19. But that's completely
changed in just the past week.

Dropping leaf page buffer pins during index-only scans
------------------------------------------------------

I realized (following a point made by Matthias van de Meent during a
private call) that it simply isn't necessary for index pages to hold
on to buffer pins, in general. We haven't actually done that with
nbtree plain index scans with MVCC snapshots since 2015's commit
2ed5b87f, which added what we now call nbtree's dropPin optimization.
What if we could find a way to teach *every* amgetbatch-supporting
index AM to do the same trick for *all* scans (barring non-MVCC
snapshot scans)? Including index-only scans and scans of unlogged
relations? Then we'd have zero chance of unintended interactions with
the read stream; there'd simply be no extra buffer pins that might
confuse the read stream in the first place!

Matthias told me that his patch to fix the bugs in GiST index-only
scans works by not dropping a pin on a GiST leaf page right away. It
delays dropping such pins, but only very slightly: if we cache
visibility information from the VM (which we're doing already in the
amgetbatch patch, and which Matthias' patch does too), and delay
dropping a batch's leaf page pin until after its VM cache is loaded,
it reliably avoids races of the kind that we need to be worried about
here. In short, we can eagerly drop buffer pins during index-only
scans in the same way (or virtually the same way) that we've long been
able to with nbtree plain index scans thanks to nbtree's dropPin
optimization.

The race in question involves VACUUM concurrently setting a VM page
all visible on a heap page with a TID that is also recycled by VACUUM
(as it sets its page all-visible). We can safely allow VACUUM to go
ahead with this while still dropping our pin early -- provided we build our
local cache of visibility information first. Holding on to a leaf page
pin while reading from the VM suffices. The important principle is
that our local cache of VM info is (and will remain) consistent with
what we saw on the index page when we saved its matching TIDs into a
batch. (It doesn't matter that we do heap fetches for now-all-visible
pages, because they cannot possibly be visible to the scan's MVCC
snapshot. Just like in the plain index scan dropPin case. And rather
like bitmap index scans.)

v8-0003-Add-batching-interfaces-used-by-heapam-and-nbtree.patch has a
new isolation test that demonstrates the new "drop pins eagerly during
index-only scans" behavior, which is named
index-only-scan-visibility.spec. The isolation test is a variant of
the one I posted on the GiST thread, which proved that GiST is broken
here (a problem that Matthias is working on fixing). If you attempt to run this
isolation test on master, it'll block forever; VACUUM can never acquire a
cleanup lock due to a conflicting buffer pin held by an index-only scan.
That doesn't happen with v8 of the patch, though; it completes in less than
20ms on my system (and the scan actually gives correct results!).

This still leaves non-MVCC snapshot scans. There's nothing we can do
to avoid holding on to a leaf page buffer pin while accessing the heap
there. But that's okay; now we just refuse to do I/O prefetching
during such scans.

Dropping leaf page buffer pins during scans of an unlogged relation
-------------------------------------------------------------------

Another thing that hinders nbtree's dropPin optimization (and that we
must deal with to get a guarantee that leaf page buffer pins never
really need to be kept around) is the use of an unlogged relation.
That breaks dropPin's approach to detecting unsafe concurrent TID
recycling when marking dead items LP_DEAD on index pages, since that
involves stashing a page LSN, and then checking if it has changed
later on.

We solve that problem by introducing GiST style "fake LSNs" to both
ntbree and hash. Now the same LSN trick works for unlogged relations,
too.

We now require that any other amgetbatch index AMs that might be added
in the future also use fake LSNs like this. Alternatively, such an
index AM could just not provide a _bt_killitems-like mechanism at all
-- that also works. Or, they could limit the use of such a mechanism
to logged relations. Third party table AMs don't really need to deal with
this themselves, though.

Performance impact of calling BufferGetLSNAtomic during affected scans
----------------------------------------------------------------------

Our expanded use of BufferGetLSNAtomic() during index-only scans has
the potential to cause regressions, at least when page checksums are
enabled. We're planning on relying on a patch of Andreas Karlsson's to
make BufferGetLSNAtomic use an atomic op [4], which fixes this
regression. I'm not including that here, though (I would but for the
fact that it breaks the Debian Trixie CI target due to a known
misaligned access bug that Andreas is working on fixing). Anybody that
does performance validation of either index-only scans or scans of an
unlogged relation should bear that in mind.

> > I still haven't had time to produce an implementation of the "heap
> > buffer locking minimization" optimization that's clean enough to
> > present to the list.
>
> Still haven't done this.

This idea has now been deprioritized. Quite a few things have fallen
together recently, so we're "pivoting back" to work on prefetching for
Postgres 19.

[1] https://postgr.es/m/CA%2BhUKGJLT2JvWLEiBXMbkSSc5so_Y7%3DN%2BS2ce7npjLw8QL3d5w%40mail.gmail.com
[2] https://postgr.es/m/[email protected]
[3] https://postgr.es/m/mc5w6mj52dzl7ant7nmjmwxjmvmlwekwjmf77eotrra3pghrfl@d7mq3hxvdapa
[4] https://postgr.es/m/[email protected]

--
Peter Geoghegan


Attachments:

  [application/x-patch] v8-0006-bufmgr-aio-Prototype-for-not-waiting-for-already-.patch (6.9K, 2-v8-0006-bufmgr-aio-Prototype-for-not-waiting-for-already-.patch)
  download | inline diff:
From 443b9f96bec1204491f0db556ec03b7fb731965a Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Fri, 15 Aug 2025 11:01:52 -0400
Subject: [PATCH v8 6/8] bufmgr: aio: Prototype for not waiting for
 already-in-progress IO

Author:
Reviewed-by:
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
Backpatch:
---
 src/include/storage/bufmgr.h        |   1 +
 src/backend/storage/buffer/bufmgr.c | 150 ++++++++++++++++++++++++++--
 2 files changed, 142 insertions(+), 9 deletions(-)

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a40adf6b2..1358fc7fa 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -147,6 +147,7 @@ struct ReadBuffersOperation
 	int			flags;
 	int16		nblocks;
 	int16		nblocks_done;
+	bool		foreign_io;
 	PgAioWaitRef io_wref;
 	PgAioReturn io_return;
 };
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 275cd4032..6e093850d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1657,6 +1657,46 @@ ReadBuffersCanStartIOOnce(Buffer buffer, bool nowait)
 		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
 }
 
+/*
+ * Check if the buffer is already undergoing read AIO. If it is, assign the
+ * IO's wait reference to operation->io_wref, thereby allowing the caller to
+ * wait for that IO.
+ */
+static inline bool
+ReadBuffersIOAlreadyInProgress(ReadBuffersOperation *operation, Buffer buffer)
+{
+	BufferDesc *desc;
+	uint32		buf_state;
+	PgAioWaitRef iow;
+
+	pgaio_wref_clear(&iow);
+
+	if (BufferIsLocal(buffer))
+	{
+		desc = GetLocalBufferDescriptor(-buffer - 1);
+		buf_state = pg_atomic_read_u64(&desc->state);
+		if ((buf_state & BM_IO_IN_PROGRESS) && !(buf_state & BM_VALID))
+			iow = desc->io_wref;
+	}
+	else
+	{
+		desc = GetBufferDescriptor(buffer - 1);
+		buf_state = LockBufHdr(desc);
+
+		if ((buf_state & BM_IO_IN_PROGRESS) && !(buf_state & BM_VALID))
+			iow = desc->io_wref;
+		UnlockBufHdr(desc);
+	}
+
+	if (pgaio_wref_valid(&iow))
+	{
+		operation->io_wref = iow;
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Helper for AsyncReadBuffers that tries to get the buffer ready for IO.
  */
@@ -1789,7 +1829,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			 *
 			 * we first check if we already know the IO is complete.
 			 */
-			if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
+			if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
 				!pgaio_wref_check_done(&operation->io_wref))
 			{
 				instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
@@ -1808,11 +1848,66 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				Assert(pgaio_wref_check_done(&operation->io_wref));
 			}
 
-			/*
-			 * We now are sure the IO completed. Check the results. This
-			 * includes reporting on errors if there were any.
-			 */
-			ProcessReadBuffersResult(operation);
+			if (unlikely(operation->foreign_io))
+			{
+				Buffer		buffer = operation->buffers[operation->nblocks_done];
+				BufferDesc *desc;
+				uint32		buf_state;
+
+				if (BufferIsLocal(buffer))
+				{
+					desc = GetLocalBufferDescriptor(-buffer - 1);
+					buf_state = pg_atomic_read_u64(&desc->state);
+				}
+				else
+				{
+					desc = GetBufferDescriptor(buffer - 1);
+					buf_state = LockBufHdr(desc);
+					UnlockBufHdr(desc);
+				}
+
+				if (buf_state & BM_VALID)
+				{
+					operation->nblocks_done += 1;
+					Assert(operation->nblocks_done <= operation->nblocks);
+
+					/*
+					 * Report and track this as a 'hit' for this backend, even
+					 * though it must have started out as a miss in
+					 * PinBufferForBlock(). The other backend (or ourselves,
+					 * as part of a read started earlier) will track this as a
+					 * 'read'.
+					 */
+					TRACE_POSTGRESQL_BUFFER_READ_DONE(operation->forknum,
+													  operation->blocknum + operation->nblocks_done,
+													  operation->smgr->smgr_rlocator.locator.spcOid,
+													  operation->smgr->smgr_rlocator.locator.dbOid,
+													  operation->smgr->smgr_rlocator.locator.relNumber,
+													  operation->smgr->smgr_rlocator.backend,
+													  true);
+
+					if (BufferIsLocal(buffer))
+						pgBufferUsage.local_blks_hit += 1;
+					else
+						pgBufferUsage.shared_blks_hit += 1;
+
+					if (operation->rel)
+						pgstat_count_buffer_hit(operation->rel);
+
+					pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
+
+					if (VacuumCostActive)
+						VacuumCostBalance += VacuumCostPageHit;
+				}
+			}
+			else
+			{
+				/*
+				 * We now are sure the IO completed. Check the results. This
+				 * includes reporting on errors if there were any.
+				 */
+				ProcessReadBuffersResult(operation);
+			}
 		}
 
 		/*
@@ -1898,6 +1993,43 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		io_object = IOOBJECT_RELATION;
 	}
 
+	/*
+	 * If AIO is in progress, be it in this backend or another backend, we
+	 * just associate the wait reference with the operation and wait in
+	 * WaitReadBuffers(). This turns out to be important for performance in
+	 * two workloads:
+	 *
+	 * 1) A read stream that has to read the same block multiple times within
+	 * the readahead distance. This can happen e.g. for the table accesses of
+	 * an index scan.
+	 *
+	 * 2) Concurrent scans by multiple backends on the same relation.
+	 *
+	 * If we were to synchronously wait for the in-progress IO, we'd not be
+	 * able to keep enough I/O in flight.
+	 *
+	 * If we do find there is ongoing I/O for the buffer, we set up a 1-block
+	 * ReadBuffersOperation that WaitReadBuffers then can wait on.
+	 *
+	 * It's possible that another backend starts IO on the buffer between this
+	 * check and the ReadBuffersCanStartIO(nowait = false) below. In that case
+	 * we will synchronously wait for the IO below, but the window for that is
+	 * small enough that it won't happen often enough to have a significant
+	 * performance impact.
+	 */
+	if (ReadBuffersIOAlreadyInProgress(operation, buffers[nblocks_done]))
+	{
+		*nblocks_progress = 1;
+		operation->foreign_io = true;
+
+		CheckReadBuffersOperation(operation, false);
+
+
+		return true;
+	}
+
+	operation->foreign_io = false;
+
 	/*
 	 * If zero_damaged_pages is enabled, add the READ_BUFFERS_ZERO_ON_ERROR
 	 * flag. The reason for that is that, hopefully, zero_damaged_pages isn't
@@ -1955,9 +2087,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	/*
 	 * Check if we can start IO on the first to-be-read buffer.
 	 *
-	 * If an I/O is already in progress in another backend, we want to wait
-	 * for the outcome: either done, or something went wrong and we will
-	 * retry.
+	 * If a synchronous I/O is in progress in another backend (it can't be
+	 * this backend), we want to wait for the outcome: either done, or
+	 * something went wrong and we will retry.
 	 */
 	if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
 	{
-- 
2.51.0



  [application/x-patch] v8-0008-Make-hash-index-AM-use-amgetbatch-interface.patch (36.4K, 3-v8-0008-Make-hash-index-AM-use-amgetbatch-interface.patch)
  download | inline diff:
From 6bf866debf1dbb2209bd8dda2e73583f6bf2127a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 25 Nov 2025 18:03:15 -0500
Subject: [PATCH v8 8/8] Make hash index AM use amgetbatch interface.

Replace hashgettuple with hashgetbatch, a function that implements the
new amgetbatch interface.  Plain index scans of hash indexes now return
matching items in batches consisting of all of the matches from a given
bucket or overflow page.  This gives the core executor the ability to
perform optimizations like index prefetching during hash index scans.

Note that this makes hash index scans use the dropPin optimization,
since that is required to use the amgetbatch interface.  This won't
avoid making hash index vacuuming wait for a cleanup lock when an index
scan holds onto a conflicting pin, since such index scans still need to
hold onto a conflicting bucket page pin for the duration of the scan.
In other words, use of the dropPin optimization won't benefit hash index
scans in the same way that it benefits nbtree index scans following
commit 2ed5b87f.  However, there is still some value in keeping the
number of buffer pins held at any one time to a minimum.

Index prefetching tends to hold open as many as several dozen batches
with certain workloads (workloads where the stream position has to get
quite far ahead of the read position in order to maintain the
appropriate prefetch distance on the heapam side).  Guaranteeing that
open batches won't hold buffer pins on index pages (at least in the
common case where the dropPin optimization is safe to use) thereby
simplifies resource management during index prefetching.

Also add Valgrind buffer lock instrumentation to hash, bringing it in
line with nbtree following commit 4a70f829.  This is another requirement
when using the amgetbatch interface.

Author: Peter Geoghegan <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com
---
 src/include/access/hash.h            |  73 +-----
 src/backend/access/hash/hash.c       | 124 ++++------
 src/backend/access/hash/hashpage.c   |  26 +--
 src/backend/access/hash/hashsearch.c | 331 +++++++++++----------------
 src/backend/access/hash/hashutil.c   |  83 +++----
 src/tools/pgindent/typedefs.list     |   2 -
 6 files changed, 236 insertions(+), 403 deletions(-)

diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index a8702f0e5..fc66465a2 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -100,58 +100,6 @@ typedef HashPageOpaqueData *HashPageOpaque;
  */
 #define HASHO_PAGE_ID		0xFF80
 
-typedef struct HashScanPosItem	/* what we remember about each match */
-{
-	ItemPointerData heapTid;	/* TID of referenced heap item */
-	OffsetNumber indexOffset;	/* index item's location within page */
-} HashScanPosItem;
-
-typedef struct HashScanPosData
-{
-	Buffer		buf;			/* if valid, the buffer is pinned */
-	BlockNumber currPage;		/* current hash index page */
-	BlockNumber nextPage;		/* next overflow page */
-	BlockNumber prevPage;		/* prev overflow or bucket page */
-
-	/*
-	 * The items array is always ordered in index order (ie, increasing
-	 * indexoffset).  When scanning backwards it is convenient to fill the
-	 * array back-to-front, so we start at the last slot and fill downwards.
-	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
-	 * itemIndex is a cursor showing which entry was last returned to caller.
-	 */
-	int			firstItem;		/* first valid index in items[] */
-	int			lastItem;		/* last valid index in items[] */
-	int			itemIndex;		/* current index in items[] */
-
-	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
-} HashScanPosData;
-
-#define HashScanPosIsPinned(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BufferIsValid((scanpos).buf) \
-)
-
-#define HashScanPosIsValid(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BlockNumberIsValid((scanpos).currPage) \
-)
-
-#define HashScanPosInvalidate(scanpos) \
-	do { \
-		(scanpos).buf = InvalidBuffer; \
-		(scanpos).currPage = InvalidBlockNumber; \
-		(scanpos).nextPage = InvalidBlockNumber; \
-		(scanpos).prevPage = InvalidBlockNumber; \
-		(scanpos).firstItem = 0; \
-		(scanpos).lastItem = 0; \
-		(scanpos).itemIndex = 0; \
-	} while (0)
-
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
  */
@@ -178,15 +126,6 @@ typedef struct HashScanOpaqueData
 	 * referred only when hashso_buc_populated is true.
 	 */
 	bool		hashso_buc_split;
-	/* info about killed items if any (killedItems is NULL if never used) */
-	int		   *killedItems;	/* currPos.items indexes of killed items */
-	int			numKilled;		/* number of currently stored items */
-
-	/*
-	 * Identify all the matching items on a page and save them in
-	 * HashScanPosData
-	 */
-	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -368,11 +307,14 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
 					   IndexUniqueCheck checkUnique,
 					   bool indexUnchanged,
 					   struct IndexInfo *indexInfo);
-extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch hashgetbatch(IndexScanDesc scan,
+								   IndexScanBatch priorbatch,
+								   ScanDirection dir);
 extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
+extern void hashfreebatch(IndexScanDesc scan, IndexScanBatch batch);
 extern void hashendscan(IndexScanDesc scan);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
@@ -445,8 +387,9 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 							   uint32 lowmask);
 
 /* hashsearch.c */
-extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _hash_next(IndexScanDesc scan, ScanDirection dir,
+								 IndexScanBatch priorbatch);
+extern IndexScanBatch _hash_first(IndexScanDesc scan, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
@@ -476,7 +419,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
 extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
 extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 												 uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan);
+extern void _hash_kill_items(IndexScanDesc scan, IndexScanBatch batch);
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d742dabcd..ddb84719e 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -101,9 +101,9 @@ hashhandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = hashadjustmembers,
 		.ambeginscan = hashbeginscan,
 		.amrescan = hashrescan,
-		.amgettuple = hashgettuple,
-		.amgetbatch = NULL,
-		.amfreebatch = NULL,
+		.amgettuple = NULL,
+		.amgetbatch = hashgetbatch,
+		.amfreebatch = hashfreebatch,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
 		.amposreset = NULL,
@@ -286,53 +286,27 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
 
 
 /*
- *	hashgettuple() -- Get the next tuple in the scan.
+ *	hashgetbatch() -- Get the first or next batch of tuples in the scan
  */
-bool
-hashgettuple(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+hashgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	bool		res;
+	Relation	rel = scan->indexRelation;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
-	/*
-	 * If we've already initialized this scan, we can just advance it in the
-	 * appropriate direction.  If we haven't done so yet, we call a routine to
-	 * get the first item in the scan.
-	 */
-	if (!HashScanPosIsValid(so->currPos))
-		res = _hash_first(scan, dir);
-	else
+	if (priorbatch == NULL)
 	{
-		/*
-		 * Check to see if we should kill the previously-fetched tuple.
-		 */
-		if (scan->kill_prior_tuple)
-		{
-			/*
-			 * Yes, so remember it for later. (We'll deal with all such tuples
-			 * at once right after leaving the index page or at end of scan.)
-			 * In case if caller reverses the indexscan direction it is quite
-			 * possible that the same item might get entered multiple times.
-			 * But, we don't detect that; instead, we just forget any excess
-			 * entries.
-			 */
-			if (so->killedItems == NULL)
-				so->killedItems = palloc_array(int, MaxIndexTuplesPerPage);
+		_hash_dropscanbuf(rel, so);
 
-			if (so->numKilled < MaxIndexTuplesPerPage)
-				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
-		}
-
-		/*
-		 * Now continue the scan.
-		 */
-		res = _hash_next(scan, dir);
+		/* Initialize the scan, and return first batch of matching items */
+		return _hash_first(scan, dir);
 	}
 
-	return res;
+	/* Return batch positioned after caller's batch (in direction 'dir') */
+	return _hash_next(scan, dir, priorbatch);
 }
 
 
@@ -342,26 +316,23 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 int64
 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	bool		res;
+	IndexScanBatch batch;
 	int64		ntids = 0;
-	HashScanPosItem *currItem;
+	int			itemIndex;
 
-	res = _hash_first(scan, ForwardScanDirection);
+	batch = _hash_first(scan, ForwardScanDirection);
 
-	while (res)
+	while (batch != NULL)
 	{
-		currItem = &so->currPos.items[so->currPos.itemIndex];
+		for (itemIndex = batch->firstItem;
+			 itemIndex <= batch->lastItem;
+			 itemIndex++)
+		{
+			tbm_add_tuples(tbm, &batch->items[itemIndex].heapTid, 1, true);
+			ntids++;
+		}
 
-		/*
-		 * _hash_first and _hash_next handle eliminate dead index entries
-		 * whenever scan->ignore_killed_tuples is true.  Therefore, there's
-		 * nothing to do here except add the results to the TIDBitmap.
-		 */
-		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
-		ntids++;
-
-		res = _hash_next(scan, ForwardScanDirection);
+		batch = _hash_next(scan, ForwardScanDirection, batch);
 	}
 
 	return ntids;
@@ -383,17 +354,14 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc_object(HashScanOpaqueData);
-	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
 
-	so->killedItems = NULL;
-	so->numKilled = 0;
-
 	scan->opaque = so;
+	scan->maxitemsbatch = MaxIndexTuplesPerPage;
 
 	return scan;
 }
@@ -408,18 +376,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	if (HashScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_hash_kill_items(scan);
-	}
-
 	_hash_dropscanbuf(rel, so);
 
-	/* set position invalid (this will cause _hash_first call) */
-	HashScanPosInvalidate(so->currPos);
-
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
@@ -428,6 +386,27 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	so->hashso_buc_split = false;
 }
 
+/*
+ *	hashfreebatch() -- Free batch resources, including its buffer pin
+ */
+void
+hashfreebatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	if (batch->numKilled > 0)
+		_hash_kill_items(scan, batch);
+
+	if (BufferIsValid(batch->buf))
+	{
+		/* table AM didn't unpin page earlier -- do it now */
+		Assert(!scan->MVCCScan);
+
+		ReleaseBuffer(batch->buf);
+		batch->buf = InvalidBuffer;
+	}
+
+	indexam_util_batch_release(scan, batch);
+}
+
 /*
  *	hashendscan() -- close down a scan
  */
@@ -437,17 +416,8 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	if (HashScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_hash_kill_items(scan);
-	}
-
 	_hash_dropscanbuf(rel, so);
 
-	if (so->killedItems != NULL)
-		pfree(so->killedItems);
 	pfree(so);
 	scan->opaque = NULL;
 }
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 263bc73f1..2bc8bea22 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -35,6 +35,7 @@
 #include "port/pg_bitutils.h"
 #include "storage/predicate.h"
 #include "storage/smgr.h"
+#include "utils/memdebug.h"
 #include "utils/rel.h"
 
 static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
@@ -79,6 +80,9 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
 	if (access != HASH_NOLOCK)
 		LockBuffer(buf, access);
 
+	if (!RelationUsesLocalBuffers(rel))
+		VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
+
 	/* ref count and lock type are correct */
 
 	_hash_checkpage(rel, buf, flags);
@@ -108,6 +112,9 @@ _hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
 		return InvalidBuffer;
 	}
 
+	if (!RelationUsesLocalBuffers(rel))
+		VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
+
 	/* ref count and lock type are correct */
 
 	_hash_checkpage(rel, buf, flags);
@@ -280,31 +287,24 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
- *	_hash_dropscanbuf() -- release buffers used in scan.
+ *	_hash_dropscanbuf() -- release buffers owned by scan.
  *
- * This routine unpins the buffers used during scan on which we
- * hold no lock.
+ * This routine unpins the buffers for the primary bucket page and for the
+ * bucket page of a bucket being split as needed.
  */
 void
 _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
-	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->currPos.buf)
+	if (BufferIsValid(so->hashso_bucket_buf))
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
-	/* release pin we hold on primary bucket page  of bucket being split */
-	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->currPos.buf)
+	/* release pin held on primary bucket page of bucket being split */
+	if (BufferIsValid(so->hashso_split_bucket_buf))
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->currPos.buf))
-		_hash_dropbuf(rel, so->currPos.buf);
-	so->currPos.buf = InvalidBuffer;
-
 	/* reset split scan */
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 89d1c5bc6..49ddcf3a2 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -22,105 +22,83 @@
 #include "storage/predicate.h"
 #include "utils/rel.h"
 
-static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
-						   ScanDirection dir);
+static bool _hash_readpage(IndexScanDesc scan, Buffer buf, ScanDirection dir,
+						   IndexScanBatch batch);
 static int	_hash_load_qualified_items(IndexScanDesc scan, Page page,
-									   OffsetNumber offnum, ScanDirection dir);
-static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ScanDirection dir,
+									   IndexScanBatch batch);
+static inline void _hash_saveitem(IndexScanBatch batch, int itemIndex,
 								  OffsetNumber offnum, IndexTuple itup);
 static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
 						   Page *pagep, HashPageOpaque *opaquep);
 
 /*
- *	_hash_next() -- Get the next item in a scan.
+ *	_hash_next() -- Get the next batch of items in a scan.
  *
- *		On entry, so->currPos describes the current page, which may
- *		be pinned but not locked, and so->currPos.itemIndex identifies
- *		which item was previously returned.
+ *		On entry, priorbatch describes the current page batch with items
+ *		already returned.
  *
- *		On successful exit, scan->xs_heaptid is set to the TID of the next
- *		heap tuple.  so->currPos is updated as needed.
+ *		On successful exit, returns a batch containing matching items from
+ *		next page.  Otherwise returns NULL, indicating that there are no
+ *		further matches.  No locks are ever held when we return.
  *
- *		On failure exit (no more tuples), we return false with pin
- *		held on bucket page but no pins or locks held on overflow
- *		page.
+ *		Retains pins according to the same rules as _hash_first.
  */
-bool
-_hash_next(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+_hash_next(IndexScanDesc scan, ScanDirection dir, IndexScanBatch priorbatch)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	HashScanPosItem *currItem;
 	BlockNumber blkno;
 	Buffer		buf;
-	bool		end_of_scan = false;
+	IndexScanBatch batch;
 
 	/*
-	 * Advance to the next tuple on the current page; or if done, try to read
-	 * data from the next or previous page based on the scan direction. Before
-	 * moving to the next or previous page make sure that we deal with all the
-	 * killed items.
+	 * Determine which page to read next based on scan direction and details
+	 * taken from the prior batch
 	 */
 	if (ScanDirectionIsForward(dir))
 	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
-		{
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			blkno = so->currPos.nextPage;
-			if (BlockNumberIsValid(blkno))
-			{
-				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
-				if (!_hash_readpage(scan, &buf, dir))
-					end_of_scan = true;
-			}
-			else
-				end_of_scan = true;
-		}
+		blkno = priorbatch->nextPage;
+		if (!BlockNumberIsValid(blkno) || !priorbatch->moreRight)
+			return NULL;
 	}
 	else
 	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
-		{
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			blkno = so->currPos.prevPage;
-			if (BlockNumberIsValid(blkno))
-			{
-				buf = _hash_getbuf(rel, blkno, HASH_READ,
-								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-
-				/*
-				 * We always maintain the pin on bucket page for whole scan
-				 * operation, so releasing the additional pin we have acquired
-				 * here.
-				 */
-				if (buf == so->hashso_bucket_buf ||
-					buf == so->hashso_split_bucket_buf)
-					_hash_dropbuf(rel, buf);
-
-				if (!_hash_readpage(scan, &buf, dir))
-					end_of_scan = true;
-			}
-			else
-				end_of_scan = true;
-		}
+		blkno = priorbatch->prevPage;
+		if (!BlockNumberIsValid(blkno) || !priorbatch->moreLeft)
+			return NULL;
 	}
 
-	if (end_of_scan)
+	/* Allocate space for next batch */
+	batch = indexam_util_batch_alloc(scan);
+
+	/* Get the buffer for next batch */
+	if (ScanDirectionIsForward(dir))
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+	else
 	{
-		_hash_dropscanbuf(rel, so);
-		HashScanPosInvalidate(so->currPos);
-		return false;
+		buf = _hash_getbuf(rel, blkno, HASH_READ,
+						   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+
+		/*
+		 * We always maintain the pin on bucket page for whole scan operation,
+		 * so releasing the additional pin we have acquired here.
+		 */
+		if (buf == so->hashso_bucket_buf ||
+			buf == so->hashso_split_bucket_buf)
+			_hash_dropbuf(rel, buf);
 	}
 
-	/* OK, itemIndex says what to return */
-	currItem = &so->currPos.items[so->currPos.itemIndex];
-	scan->xs_heaptid = currItem->heapTid;
+	/* Read the next page and load items into allocated batch */
+	if (!_hash_readpage(scan, buf, dir, batch))
+	{
+		indexam_util_batch_release(scan, batch);
+		return NULL;
+	}
 
-	return true;
+	/* Return the batch containing matched items from next page */
+	return batch;
 }
 
 /*
@@ -270,22 +248,21 @@ _hash_readprev(IndexScanDesc scan,
 }
 
 /*
- *	_hash_first() -- Find the first item in a scan.
+ *	_hash_first() -- Find the first batch of items in a scan.
  *
- *		We find the first item (or, if backward scan, the last item) in the
- *		index that satisfies the qualification associated with the scan
- *		descriptor.
+ *		We find the first batch of items (or, if backward scan, the last
+ *		batch) in the index that satisfies the qualification associated with
+ *		the scan descriptor.
  *
- *		On successful exit, if the page containing current index tuple is an
- *		overflow page, both pin and lock are released whereas if it is a bucket
- *		page then it is pinned but not locked and data about the matching
- *		tuple(s) on the page has been loaded into so->currPos,
- *		scan->xs_heaptid is set to the heap TID of the current tuple.
+ *		On successful exit, returns a batch containing matching items.
+ *		Otherwise returns NULL, indicating that there are no further matches.
+ *		No locks are ever held when we return.
  *
- *		On failure exit (no more tuples), we return false, with pin held on
- *		bucket page but no pins or locks held on overflow page.
+ *		We always retain our own pin on the bucket page.  When we return a
+ *		batch with a bucket page, it will retain its own reference pin iff
+ *		indexam_util_batch_release determined that table AM requires one.
  */
-bool
+IndexScanBatch
 _hash_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
@@ -296,7 +273,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	HashScanPosItem *currItem;
+	IndexScanBatch batch;
 
 	pgstat_count_index_scan(rel);
 	if (scan->instrument)
@@ -326,7 +303,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	 * items in the index.
 	 */
 	if (cur->sk_flags & SK_ISNULL)
-		return false;
+		return NULL;
 
 	/*
 	 * Okay to compute the hash key.  We want to do this before acquiring any
@@ -419,191 +396,159 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* remember which buffer we have pinned, if any */
-	Assert(BufferIsInvalid(so->currPos.buf));
-	so->currPos.buf = buf;
+	/* Allocate space for first batch */
+	batch = indexam_util_batch_alloc(scan);
 
-	/* Now find all the tuples satisfying the qualification from a page */
-	if (!_hash_readpage(scan, &buf, dir))
-		return false;
+	/* Read the first page and load items into allocated batch */
+	if (!_hash_readpage(scan, buf, dir, batch))
+	{
+		indexam_util_batch_release(scan, batch);
+		return NULL;
+	}
 
-	/* OK, itemIndex says what to return */
-	currItem = &so->currPos.items[so->currPos.itemIndex];
-	scan->xs_heaptid = currItem->heapTid;
-
-	/* if we're here, _hash_readpage found a valid tuples */
-	return true;
+	/* Return the batch containing matched items */
+	return batch;
 }
 
 /*
- *	_hash_readpage() -- Load data from current index page into so->currPos
+ *	_hash_readpage() -- Load data from current index page into batch
  *
  *	We scan all the items in the current index page and save them into
- *	so->currPos if it satisfies the qualification. If no matching items
+ *	the batch if they satisfy the qualification. If no matching items
  *	are found in the current page, we move to the next or previous page
  *	in a bucket chain as indicated by the direction.
  *
  *	Return true if any matching items are found else return false.
  */
 static bool
-_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+_hash_readpage(IndexScanDesc scan, Buffer buf, ScanDirection dir,
+			   IndexScanBatch batch)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum;
 	uint16		itemIndex;
 
-	buf = *bufP;
 	Assert(BufferIsValid(buf));
 	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 	page = BufferGetPage(buf);
 	opaque = HashPageGetOpaque(page);
 
-	so->currPos.buf = buf;
-	so->currPos.currPage = BufferGetBlockNumber(buf);
+	batch->buf = buf;
+	batch->currPage = BufferGetBlockNumber(buf);
+	batch->dir = dir;
 
 	if (ScanDirectionIsForward(dir))
 	{
-		BlockNumber prev_blkno = InvalidBlockNumber;
-
 		for (;;)
 		{
 			/* new page, locate starting position by binary search */
 			offnum = _hash_binsearch(page, so->hashso_sk_hash);
 
-			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir,
+												   batch);
 
 			if (itemIndex != 0)
 				break;
 
 			/*
-			 * Could not find any matching tuples in the current page, move to
-			 * the next page. Before leaving the current page, deal with any
-			 * killed items.
+			 * Could not find any matching tuples in the current page, try to
+			 * move to the next page
 			 */
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			/*
-			 * If this is a primary bucket page, hasho_prevblkno is not a real
-			 * block number.
-			 */
-			if (so->currPos.buf == so->hashso_bucket_buf ||
-				so->currPos.buf == so->hashso_split_bucket_buf)
-				prev_blkno = InvalidBlockNumber;
-			else
-				prev_blkno = opaque->hasho_prevblkno;
-
 			_hash_readnext(scan, &buf, &page, &opaque);
-			if (BufferIsValid(buf))
+			if (!BufferIsValid(buf))
 			{
-				so->currPos.buf = buf;
-				so->currPos.currPage = BufferGetBlockNumber(buf);
-			}
-			else
-			{
-				/*
-				 * Remember next and previous block numbers for scrollable
-				 * cursors to know the start position and return false
-				 * indicating that no more matching tuples were found. Also,
-				 * don't reset currPage or lsn, because we expect
-				 * _hash_kill_items to be called for the old page after this
-				 * function returns.
-				 */
-				so->currPos.prevPage = prev_blkno;
-				so->currPos.nextPage = InvalidBlockNumber;
-				so->currPos.buf = buf;
+				batch->buf = InvalidBuffer;
 				return false;
 			}
+
+			batch->buf = buf;
+			batch->currPage = BufferGetBlockNumber(buf);
 		}
 
-		so->currPos.firstItem = 0;
-		so->currPos.lastItem = itemIndex - 1;
-		so->currPos.itemIndex = 0;
+		batch->firstItem = 0;
+		batch->lastItem = itemIndex - 1;
 	}
 	else
 	{
-		BlockNumber next_blkno = InvalidBlockNumber;
-
 		for (;;)
 		{
 			/* new page, locate starting position by binary search */
 			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 
-			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir,
+												   batch);
 
 			if (itemIndex != MaxIndexTuplesPerPage)
 				break;
 
 			/*
-			 * Could not find any matching tuples in the current page, move to
-			 * the previous page. Before leaving the current page, deal with
-			 * any killed items.
+			 * Could not find any matching tuples in the current page, try to
+			 * move to the previous page
 			 */
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			if (so->currPos.buf == so->hashso_bucket_buf ||
-				so->currPos.buf == so->hashso_split_bucket_buf)
-				next_blkno = opaque->hasho_nextblkno;
-
 			_hash_readprev(scan, &buf, &page, &opaque);
-			if (BufferIsValid(buf))
+			if (!BufferIsValid(buf))
 			{
-				so->currPos.buf = buf;
-				so->currPos.currPage = BufferGetBlockNumber(buf);
-			}
-			else
-			{
-				/*
-				 * Remember next and previous block numbers for scrollable
-				 * cursors to know the start position and return false
-				 * indicating that no more matching tuples were found. Also,
-				 * don't reset currPage or lsn, because we expect
-				 * _hash_kill_items to be called for the old page after this
-				 * function returns.
-				 */
-				so->currPos.prevPage = InvalidBlockNumber;
-				so->currPos.nextPage = next_blkno;
-				so->currPos.buf = buf;
+				batch->buf = InvalidBuffer;
 				return false;
 			}
+
+			batch->buf = buf;
+			batch->currPage = BufferGetBlockNumber(buf);
 		}
 
-		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		batch->firstItem = itemIndex;
+		batch->lastItem = MaxIndexTuplesPerPage - 1;
 	}
 
-	if (so->currPos.buf == so->hashso_bucket_buf ||
-		so->currPos.buf == so->hashso_split_bucket_buf)
+	/*
+	 * Saved at least one match in batch.items[].  Prepare for hashgetbatch to
+	 * return it by initializing remaining uninitialized fields.
+	 */
+	if (batch->buf == so->hashso_bucket_buf ||
+		batch->buf == so->hashso_split_bucket_buf)
 	{
-		so->currPos.prevPage = InvalidBlockNumber;
-		so->currPos.nextPage = opaque->hasho_nextblkno;
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		/*
+		 * Batch's buffer is either the primary bucket, or a bucket being
+		 * populated due to a split.
+		 *
+		 * Increment local reference count so that batch gets an independent
+		 * buffer reference that can be released (by hashfreebatch) before the
+		 * hashso_bucket_buf/hashso_split_bucket_buf references are released.
+		 */
+		IncrBufferRefCount(batch->buf);
+
+		/* Can only use opaque->hasho_nextblkno */
+		batch->prevPage = InvalidBlockNumber;
+		batch->nextPage = opaque->hasho_nextblkno;
 	}
 	else
 	{
-		so->currPos.prevPage = opaque->hasho_prevblkno;
-		so->currPos.nextPage = opaque->hasho_nextblkno;
-		_hash_relbuf(rel, so->currPos.buf);
-		so->currPos.buf = InvalidBuffer;
+		/* Can use opaque->hasho_prevblkno and opaque->hasho_nextblkno */
+		batch->prevPage = opaque->hasho_prevblkno;
+		batch->nextPage = opaque->hasho_nextblkno;
 	}
 
-	Assert(so->currPos.firstItem <= so->currPos.lastItem);
+	batch->moreLeft = BlockNumberIsValid(batch->prevPage);
+	batch->moreRight = BlockNumberIsValid(batch->nextPage);
+
+	/* Unlock (and likely unpin) buffer, per amgetbatch contract */
+	indexam_util_batch_unlock(scan, batch);
+
+	Assert(batch->firstItem <= batch->lastItem);
 	return true;
 }
 
 /*
  * Load all the qualified items from a current index page
- * into so->currPos. Helper function for _hash_readpage.
+ * into batch. Helper function for _hash_readpage.
  */
 static int
 _hash_load_qualified_items(IndexScanDesc scan, Page page,
-						   OffsetNumber offnum, ScanDirection dir)
+						   OffsetNumber offnum, ScanDirection dir,
+						   IndexScanBatch batch)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	IndexTuple	itup;
@@ -640,7 +585,7 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 				_hash_checkqual(scan, itup))
 			{
 				/* tuple is qualified, so remember it */
-				_hash_saveitem(so, itemIndex, offnum, itup);
+				_hash_saveitem(batch, itemIndex, offnum, itup);
 				itemIndex++;
 			}
 			else
@@ -687,7 +632,7 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 			{
 				itemIndex--;
 				/* tuple is qualified, so remember it */
-				_hash_saveitem(so, itemIndex, offnum, itup);
+				_hash_saveitem(batch, itemIndex, offnum, itup);
 			}
 			else
 			{
@@ -706,13 +651,15 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 	}
 }
 
-/* Save an index item into so->currPos.items[itemIndex] */
+/* Save an index item into batch->items[itemIndex] */
 static inline void
-_hash_saveitem(HashScanOpaque so, int itemIndex,
+_hash_saveitem(IndexScanBatch batch, int itemIndex,
 			   OffsetNumber offnum, IndexTuple itup)
 {
-	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *currItem = &batch->items[itemIndex];
 
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
+	currItem->tupleOffset = 0;
+	currItem->allVisible = false;
 }
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index cf7f0b901..e9ba9b1f3 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -510,81 +510,59 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
  * told us were killed.
  *
- * scan->opaque, referenced locally through so, contains information about the
- * current page and killed tuples thereon (generally, this should only be
- * called if so->numKilled > 0).
+ * The batch parameter contains information about the current page and killed
+ * tuples thereon (this should only be called if batch->numKilled > 0).
  *
- * The caller does not have a lock on the page and may or may not have the
- * page pinned in a buffer.  Note that read-lock is sufficient for setting
- * LP_DEAD status (which is only a hint).
+ * Caller should not have a lock on the batch position's page.  When we
+ * return, it still won't be locked.  It'll continue to hold whatever pins
+ * were held before calling here.
  *
- * The caller must have pin on bucket buffer, but may or may not have pin
- * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
- *
- * We match items by heap TID before assuming they are the right ones to
- * delete.
- *
- * There are never any scans active in a bucket at the time VACUUM begins,
- * because VACUUM takes a cleanup lock on the primary bucket page and scans
- * hold a pin.  A scan can begin after VACUUM leaves the primary bucket page
- * but before it finishes the entire bucket, but it can never pass VACUUM,
- * because VACUUM always locks the next page before releasing the lock on
- * the previous one.  Therefore, we don't have to worry about accidentally
- * killing a TID that has been reused for an unrelated tuple.
+ * We match items by heap TID before assuming they are the right ones to set
+ * LP_DEAD.  We must condition setting LP_DEAD bits on the page LSN having not
+ * changed since back when _hash_readpage saw the page.  We totally give up on
+ * setting LP_DEAD bits when the page LSN changed.
  */
 void
-_hash_kill_items(IndexScanDesc scan)
+_hash_kill_items(IndexScanDesc scan, IndexScanBatch batch)
 {
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
-	BlockNumber blkno;
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
 				maxoff;
-	int			numKilled = so->numKilled;
-	int			i;
 	bool		killedsomething = false;
-	bool		havePin = false;
+	XLogRecPtr	latestlsn;
 
-	Assert(so->numKilled > 0);
-	Assert(so->killedItems != NULL);
-	Assert(HashScanPosIsValid(so->currPos));
+	Assert(batch->numKilled > 0);
+	Assert(BlockNumberIsValid(batch->currPage));
 
-	/*
-	 * Always reset the scan state, so we don't look for same items on other
-	 * pages.
-	 */
-	so->numKilled = 0;
+	buf = _hash_getbuf(rel, batch->currPage, HASH_READ,
+					   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 
-	blkno = so->currPos.currPage;
-	if (HashScanPosIsPinned(so->currPos))
+	latestlsn = BufferGetLSNAtomic(buf);
+	Assert(batch->lsn <= latestlsn);
+	if (batch->lsn != latestlsn)
 	{
-		/*
-		 * We already have pin on this buffer, so, all we need to do is
-		 * acquire lock on it.
-		 */
-		havePin = true;
-		buf = so->currPos.buf;
-		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		/* Modified, give up on hinting */
+		_hash_relbuf(rel, buf);
+		return;
 	}
-	else
-		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
 
 	page = BufferGetPage(buf);
 	opaque = HashPageGetOpaque(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
-	for (i = 0; i < numKilled; i++)
+	/* Iterate through batch->killedItems[] in index page order */
+	for (int i = 0; i < batch->numKilled; i++)
 	{
-		int			itemIndex = so->killedItems[i];
-		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+		int			itemIndex = batch->killedItems[i];
+		BatchMatchingItem *currItem = &batch->items[itemIndex];
 
 		offnum = currItem->indexOffset;
 
-		Assert(itemIndex >= so->currPos.firstItem &&
-			   itemIndex <= so->currPos.lastItem);
+		Assert(itemIndex >= batch->firstItem &&
+			   itemIndex <= batch->lastItem);
 
 		while (offnum <= maxoff)
 		{
@@ -594,6 +572,7 @@ _hash_kill_items(IndexScanDesc scan)
 			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
+				Assert(!currItem->allVisible);
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -613,9 +592,5 @@ _hash_kill_items(IndexScanDesc scan)
 		MarkBufferDirtyHint(buf, true);
 	}
 
-	if (so->hashso_bucket_buf == so->currPos.buf ||
-		havePin)
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
-	else
-		_hash_relbuf(rel, buf);
+	_hash_relbuf(rel, buf);
 }
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fa84abfc5..eeb7ecd86 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1201,8 +1201,6 @@ HashPageStat
 HashPath
 HashScanOpaque
 HashScanOpaqueData
-HashScanPosData
-HashScanPosItem
 HashSkewBucket
 HashState
 HashValueFunc
-- 
2.51.0



  [application/x-patch] v8-0007-Add-fake-LSN-support-to-hash-index-AM.patch (13.7K, 4-v8-0007-Add-fake-LSN-support-to-hash-index-AM.patch)
  download | inline diff:
From 483025ca5cd6adbdb87794cfcf1968becb35bd86 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sun, 18 Jan 2026 11:32:52 -0500
Subject: [PATCH v8 7/8] Add fake LSN support to hash index AM.

This is preparation for an upcoming patch that will add the amgetbatch
interface and switch hash over to it (from amgettuple).  We need fake
LSNs to make it safe to apply behavior that is equivalent to nbtree's
previous dropPin behavior that works with unlogged hash index scans.

The commit that will add hashgetbatch makes the required changes to
_hash_kill_items that actually relies on hash generating fake LSNs.

Author: Peter Geoghegan <[email protected]>
---
 src/backend/access/hash/hash.c       |  21 +++--
 src/backend/access/hash/hashinsert.c |  20 +++--
 src/backend/access/hash/hashovfl.c   | 111 ++++++++++++++++-----------
 src/backend/access/hash/hashpage.c   |  22 +++---
 4 files changed, 105 insertions(+), 69 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6a20b67f6..d742dabcd 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -476,6 +476,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	XLogRecPtr	recptr;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -615,7 +616,6 @@ loop_top:
 	if (RelationNeedsWAL(rel))
 	{
 		xl_hash_update_meta_page xlrec;
-		XLogRecPtr	recptr;
 
 		xlrec.ntuples = metap->hashm_ntuples;
 
@@ -625,8 +625,11 @@ loop_top:
 		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
-		PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(BufferGetPage(metabuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -699,6 +702,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	XLogRecPtr	recptr;
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -821,7 +825,6 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			if (RelationNeedsWAL(rel))
 			{
 				xl_hash_delete xlrec;
-				XLogRecPtr	recptr;
 
 				xlrec.clear_dead_marking = clear_dead_marking;
 				xlrec.is_primary_bucket_page = (buf == bucket_buf);
@@ -846,8 +849,11 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 									ndeletable * sizeof(OffsetNumber));
 
 				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
-				PageSetLSN(BufferGetPage(buf), recptr);
 			}
+			else
+				recptr = XLogGetFakeLSN(rel);
+
+			PageSetLSN(BufferGetPage(buf), recptr);
 
 			END_CRIT_SECTION();
 		}
@@ -906,14 +912,15 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		/* XLOG stuff */
 		if (RelationNeedsWAL(rel))
 		{
-			XLogRecPtr	recptr;
-
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
 
 			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
-			PageSetLSN(page, recptr);
 		}
+		else
+			recptr = XLogGetFakeLSN(rel);
+
+		PageSetLSN(page, recptr);
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 0cefbacc9..3395bbc13 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -50,6 +50,7 @@ _hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel, bool sorted)
 	uint32		hashkey;
 	Bucket		bucket;
 	OffsetNumber itup_off;
+	XLogRecPtr	recptr;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -216,7 +217,6 @@ restart_insert:
 	if (RelationNeedsWAL(rel))
 	{
 		xl_hash_insert xlrec;
-		XLogRecPtr	recptr;
 
 		xlrec.offnum = itup_off;
 
@@ -229,10 +229,12 @@ restart_insert:
 		XLogRegisterBufData(0, itup, IndexTupleSize(itup));
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
-
-		PageSetLSN(BufferGetPage(buf), recptr);
-		PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(BufferGetPage(buf), recptr);
+	PageSetLSN(BufferGetPage(metabuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -372,6 +374,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	Page		page = BufferGetPage(buf);
 	HashPageOpaque pageopaque;
 	HashMetaPage metap;
+	XLogRecPtr	recptr;
 
 	/* Scan each tuple in page to see if it is marked as LP_DEAD */
 	maxoff = PageGetMaxOffsetNumber(page);
@@ -424,7 +427,6 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		if (RelationNeedsWAL(rel))
 		{
 			xl_hash_vacuum_one_page xlrec;
-			XLogRecPtr	recptr;
 
 			xlrec.isCatalogRel = RelationIsAccessibleInLogicalDecoding(hrel);
 			xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
@@ -445,10 +447,12 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 			XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
 
 			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
-
-			PageSetLSN(BufferGetPage(buf), recptr);
-			PageSetLSN(BufferGetPage(metabuf), recptr);
 		}
+		else
+			recptr = XLogGetFakeLSN(rel);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
 
 		END_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 8cfb6ce75..abd1f91fa 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -132,6 +132,7 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	uint32		i,
 				j;
 	bool		page_found = false;
+	XLogRecPtr	recptr;
 
 	/*
 	 * Write-lock the tail page.  Here, we need to maintain locking order such
@@ -381,7 +382,6 @@ found:
 	/* XLOG stuff */
 	if (RelationNeedsWAL(rel))
 	{
-		XLogRecPtr	recptr;
 		xl_hash_add_ovfl_page xlrec;
 
 		xlrec.bmpage_found = page_found;
@@ -408,18 +408,20 @@ found:
 		XLogRegisterBufData(4, &metap->hashm_firstfree, sizeof(uint32));
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
-
-		PageSetLSN(BufferGetPage(ovflbuf), recptr);
-		PageSetLSN(BufferGetPage(buf), recptr);
-
-		if (BufferIsValid(mapbuf))
-			PageSetLSN(BufferGetPage(mapbuf), recptr);
-
-		if (BufferIsValid(newmapbuf))
-			PageSetLSN(BufferGetPage(newmapbuf), recptr);
-
-		PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(BufferGetPage(ovflbuf), recptr);
+	PageSetLSN(BufferGetPage(buf), recptr);
+
+	if (BufferIsValid(mapbuf))
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+	if (BufferIsValid(newmapbuf))
+		PageSetLSN(BufferGetPage(newmapbuf), recptr);
+
+	PageSetLSN(BufferGetPage(metabuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -510,7 +512,11 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Bucket		bucket PG_USED_FOR_ASSERTS_ONLY;
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
-	bool		update_metap = false;
+	bool		update_metap = false,
+				mod_wbuf,
+				is_prim_bucket_same_wrt,
+				is_prev_bucket_same_wrt;
+	XLogRecPtr	recptr;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -641,19 +647,21 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		MarkBufferDirty(metabuf);
 	}
 
+	/* Determine which pages WAL record modifies */
+	mod_wbuf = false;
+	is_prim_bucket_same_wrt = (wbuf == bucketbuf);
+	is_prev_bucket_same_wrt = (wbuf == prevbuf);
+
 	/* XLOG stuff */
 	if (RelationNeedsWAL(rel))
 	{
 		xl_hash_squeeze_page xlrec;
-		XLogRecPtr	recptr;
-		int			i;
-		bool		mod_wbuf = false;
 
 		xlrec.prevblkno = prevblkno;
 		xlrec.nextblkno = nextblkno;
 		xlrec.ntups = nitups;
-		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf);
-		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf);
+		xlrec.is_prim_bucket_same_wrt = is_prim_bucket_same_wrt;
+		xlrec.is_prev_bucket_same_wrt = is_prev_bucket_same_wrt;
 
 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, SizeOfHashSqueezePage);
@@ -662,14 +670,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		 * bucket buffer was not changed, but still needs to be registered to
 		 * ensure that we can acquire a cleanup lock on it during replay.
 		 */
-		if (!xlrec.is_prim_bucket_same_wrt)
+		if (!is_prim_bucket_same_wrt)
 		{
 			uint8		flags = REGBUF_STANDARD | REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;
 
 			XLogRegisterBuffer(0, bucketbuf, flags);
 		}
 
-		if (xlrec.ntups > 0)
+		if (nitups > 0)
 		{
 			XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 
@@ -678,10 +686,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 
 			XLogRegisterBufData(1, itup_offsets,
 								nitups * sizeof(OffsetNumber));
-			for (i = 0; i < nitups; i++)
+			for (int i = 0; i < nitups; i++)
 				XLogRegisterBufData(1, itups[i], tups_size[i]);
 		}
-		else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
+		else if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt)
 		{
 			uint8		wbuf_flags;
 
@@ -691,10 +699,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			 * if it is the same as primary bucket buffer or update the
 			 * nextblkno if it is same as the previous bucket buffer.
 			 */
-			Assert(xlrec.ntups == 0);
+			Assert(nitups == 0);
 
 			wbuf_flags = REGBUF_STANDARD;
-			if (!xlrec.is_prev_bucket_same_wrt)
+			if (!is_prev_bucket_same_wrt)
 			{
 				wbuf_flags |= REGBUF_NO_CHANGE;
 			}
@@ -714,7 +722,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		 * prevpage.  During replay, we can directly update the nextblock in
 		 * writepage.
 		 */
-		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+		if (BufferIsValid(prevbuf) && !is_prev_bucket_same_wrt)
 			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
 
 		if (BufferIsValid(nextbuf))
@@ -730,23 +738,33 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		}
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
-
-		/* Set LSN iff wbuf is modified. */
-		if (mod_wbuf)
-			PageSetLSN(BufferGetPage(wbuf), recptr);
-
-		PageSetLSN(BufferGetPage(ovflbuf), recptr);
-
-		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
-			PageSetLSN(BufferGetPage(prevbuf), recptr);
-		if (BufferIsValid(nextbuf))
-			PageSetLSN(BufferGetPage(nextbuf), recptr);
-
-		PageSetLSN(BufferGetPage(mapbuf), recptr);
-
-		if (update_metap)
-			PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
+	else						/* !RelationNeedsWAL(rel) */
+	{
+		recptr = XLogGetFakeLSN(rel);
+
+		/* Determine if wbuf is modified */
+		if (nitups > 0)
+			mod_wbuf = true;
+		else if (is_prev_bucket_same_wrt)
+			mod_wbuf = true;
+	}
+
+	/* Set LSN iff wbuf is modified. */
+	if (mod_wbuf)
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+
+	PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+	if (BufferIsValid(prevbuf) && !is_prev_bucket_same_wrt)
+		PageSetLSN(BufferGetPage(prevbuf), recptr);
+	if (BufferIsValid(nextbuf))
+		PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+	PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+	if (update_metap)
+		PageSetLSN(BufferGetPage(metabuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -959,6 +977,8 @@ readpage:
 
 				if (nitups > 0)
 				{
+					XLogRecPtr	recptr;
+
 					Assert(nitups == ndeletable);
 
 					/*
@@ -986,7 +1006,6 @@ readpage:
 					/* XLOG stuff */
 					if (RelationNeedsWAL(rel))
 					{
-						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
 
 						xlrec.ntups = nitups;
@@ -1018,10 +1037,12 @@ readpage:
 											ndeletable * sizeof(OffsetNumber));
 
 						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
-
-						PageSetLSN(BufferGetPage(wbuf), recptr);
-						PageSetLSN(BufferGetPage(rbuf), recptr);
 					}
+					else
+						recptr = XLogGetFakeLSN(rel);
+
+					PageSetLSN(BufferGetPage(wbuf), recptr);
+					PageSetLSN(BufferGetPage(rbuf), recptr);
 
 					END_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 8e220a3ae..263bc73f1 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -630,6 +630,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	uint32		lowmask;
 	bool		metap_update_masks = false;
 	bool		metap_update_splitpoint = false;
+	XLogRecPtr	recptr;
 
 restart_expand:
 
@@ -900,7 +901,6 @@ restart_expand:
 	if (RelationNeedsWAL(rel))
 	{
 		xl_hash_split_allocate_page xlrec;
-		XLogRecPtr	recptr;
 
 		xlrec.new_bucket = maxbucket;
 		xlrec.old_bucket_flag = oopaque->hasho_flag;
@@ -933,11 +933,13 @@ restart_expand:
 		XLogRegisterData(&xlrec, SizeOfHashSplitAllocPage);
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
-
-		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
-		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
-		PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+	PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+	PageSetLSN(BufferGetPage(metabuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -1092,6 +1094,7 @@ _hash_splitbucket(Relation rel,
 	Size		all_tups_size = 0;
 	int			i;
 	uint16		nitups = 0;
+	XLogRecPtr	recptr;
 
 	bucket_obuf = obuf;
 	opage = BufferGetPage(obuf);
@@ -1296,7 +1299,6 @@ _hash_splitbucket(Relation rel,
 
 	if (RelationNeedsWAL(rel))
 	{
-		XLogRecPtr	recptr;
 		xl_hash_split_complete xlrec;
 
 		xlrec.old_bucket_flag = oopaque->hasho_flag;
@@ -1310,10 +1312,12 @@ _hash_splitbucket(Relation rel,
 		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
-
-		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
-		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+	PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
 
 	END_CRIT_SECTION();
 
-- 
2.51.0



  [application/x-patch] v8-0005-Add-prefetching-to-index-scans-using-batch-interf.patch (52.8K, 5-v8-0005-Add-prefetching-to-index-scans-using-batch-interf.patch)
  download | inline diff:
From 5099cca54e8e1d0f9731f4c2151427268e3e17d5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sat, 15 Nov 2025 14:03:58 -0500
Subject: [PATCH v8 5/8] Add prefetching to index scans using batch interfaces.

This commit implements I/O prefetching for index scans, made possible by
the recent addition of batching interfaces to both the table AM and
index AM APIs.

The amgetbatch index AM interface provides batches of matching TIDs
(rather than one tuple at a time), each of which must be taken from
index tuples that appear together on a single index page.  This allows
multiple batches to be held open simultaneously.  Giving the table AM an
explicit understanding of index AM concepts/index page boundaries allows
it to consider all of the relevant costs and benefits.

Prefetching is implemented using a prefetching position under the
control of the table AM and core code.  This is closely related to the
scan position added by commit FIXME, which introduced the amgetbatch
interface.  A read stream callback advances the read stream as needed to
provide sufficiently many heap block numbers to maintain the read
stream's target prefetch distance.

An important goal of the amgetbatch design is to enable the table AM's
read stream callback to advance its prefetch position using TIDs that
appear on a leaf page that's ahead of the current scan position's leaf
page.  This is crucial with scans of indexes where each leaf page
happens to have relatively few distinct heap blocks among its matching
TIDs (as well as with scans with leaf pages that have relatively few
total matching items).  Index scans can have as many as 64 open batches,
which testing has shown to be about the maximum number that can ever be
useful.  Batches are maintained in scan order using a simple ring buffer
data structure.

In rare cases where the scan exceeds this quasi-arbitrary limit of 64,
the read stream is temporarily paused.  Prefetching (via the read
stream) is resumed only after the scan position advances beyond its
current open batch and then frees it by calling amfreebatch and removing
it from the scan's batch ring buffer.  Testing has shown that it isn't
very common for scans to hold open more than about 10 batches to get the
desired I/O prefetch distance.

Author: Tomas Vondra <[email protected]>
Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/include/access/relscan.h                  |  39 +-
 src/include/optimizer/cost.h                  |   1 +
 src/backend/access/heap/heapam_handler.c      | 284 ++++++++++++-
 src/backend/access/index/indexbatch.c         |  16 +-
 src/backend/optimizer/path/costsize.c         |   1 +
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml                      |  16 +
 doc/src/sgml/indexam.sgml                     | 372 +++++++++++++++---
 doc/src/sgml/ref/create_table.sgml            |  13 +-
 src/test/regress/expected/sysviews.out        |   3 +-
 11 files changed, 688 insertions(+), 65 deletions(-)

diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ae85e8ddc..260c3c52c 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,7 @@
 #include "nodes/tidbitmap.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
+#include "storage/read_stream.h"
 #include "storage/relfilelocator.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
@@ -124,6 +125,7 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 typedef struct IndexFetchTableData
 {
 	Relation	rel;
+	ReadStream *rs;
 } IndexFetchTableData;
 
 /*
@@ -227,8 +229,14 @@ typedef struct IndexScanBatchData *IndexScanBatch;
  * Maximum number of batches (leaf pages) we can keep in memory.  We need a
  * minimum of two, since we'll only consider releasing one batch when another
  * is read.
+ *
+ * The choice of 64 batches is arbitrary.  It's about 1MB of data with 8KB
+ * pages (512kB for pages, and then a bit of overhead). We should not really
+ * need this many batches in most cases, though. The read stream looks ahead
+ * just enough to queue enough IOs, adjusting the distance (TIDs, but
+ * ultimately the number of future batches) to meet that.
  */
-#define INDEX_SCAN_MAX_BATCHES		2
+#define INDEX_SCAN_MAX_BATCHES		64
 #define INDEX_SCAN_CACHE_BATCHES	2
 #define INDEX_SCAN_BATCH_COUNT(scan) \
 	((scan)->batchringbuf->nextBatch - (scan)->batchringbuf->headBatch)
@@ -270,6 +278,14 @@ typedef struct IndexScanBatchData *IndexScanBatch;
  * matches in.  However, table AMs are free to fetch table tuples in whatever
  * order is most convenient/efficient -- provided that such reordering cannot
  * affect the order that table_index_getnext_slot later returns tuples in.
+ *
+ * This data structure also provides table AMs with a way to read ahead of the
+ * current read position by _multiple_ batches/index pages.  The further out
+ * the table AM reads ahead like this, the further it can see into the future.
+ * That way the table AM is able to reorder work as aggressively as desired.
+ * For example, index scans sometimes need to readahead by as many as a few
+ * dozen amgetbatch batches in order to maintain an optimal I/O prefetch
+ * distance (distance for reading table blocks/fetching table tuples).
  */
 typedef struct BatchRingBuffer
 {
@@ -279,6 +295,7 @@ typedef struct BatchRingBuffer
 	/* current positions in batches[] for scan */
 	BatchRingItemPos scanPos;	/* scan's read position */
 	BatchRingItemPos markPos;	/* mark/restore position */
+	BatchRingItemPos prefetchPos;	/* prefetching position */
 
 	IndexScanBatch markBatch;
 
@@ -297,6 +314,26 @@ typedef struct BatchRingBuffer
 	/* Array of pointers to ring buffer batches */
 	IndexScanBatch batches[INDEX_SCAN_MAX_BATCHES];
 
+	/*
+	 * Prefetching related state.
+	 *
+	 * XXX Should we move this to a heapam struct, such as IndexFetchHeapData?
+	 *
+	 * currentPrefetchBlock is the table AM block number that was returned by
+	 * its read stream callback most recently.  Used to suppress duplicate
+	 * successive read stream block requests.
+	 *
+	 * Occasionally, the read stream callback will request another table block
+	 * when the scan has already stored INDEX_SCAN_MAX_BATCHES-many batches.
+	 * The paused flag can set to remember that the callback had to return
+	 * read_stream_pause() (rather than the next block in line to be read).
+	 * When the scan can subsequently consumes enough scanPos items to make it
+	 * safe to free another batch, it must check this flag.  If the flag is
+	 * set, then the scan should call read_stream_resume (and unset the flag).
+	 */
+	BlockNumber currentPrefetchBlock;
+	bool		paused;
+
 } BatchRingBuffer;
 
 struct IndexScanInstrumentation;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 07b8bfa63..60dc4ea3b 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -52,6 +52,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
 extern PGDLLIMPORT bool enable_seqscan;
 extern PGDLLIMPORT bool enable_indexscan;
 extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexscan_prefetch;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 493d6cb72..761c2375a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -37,6 +37,7 @@
 #include "commands/progress.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
@@ -60,6 +61,9 @@ static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
 static bool BitmapHeapScanNextBlock(TableScanDesc scan,
 									bool *recheck,
 									uint64 *lossy_pages, uint64 *exact_pages);
+static BlockNumber heapam_getnext_stream(ReadStream *stream,
+										 void *callback_private_data,
+										 void *per_buffer_data);
 
 
 /* ------------------------------------------------------------------------
@@ -85,6 +89,7 @@ heapam_index_fetch_begin(Relation rel)
 	IndexFetchHeapData *hscan = palloc_object(IndexFetchHeapData);
 
 	hscan->xs_base.rel = rel;
+	hscan->xs_base.rs = NULL;
 	hscan->xs_cbuf = InvalidBuffer;
 	hscan->xs_blk = InvalidBlockNumber;
 	hscan->vmbuf = InvalidBuffer;
@@ -97,6 +102,9 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
 
+	if (scan->rs)
+		read_stream_reset(scan->rs);
+
 	/* deliberately don't drop VM buffer pin here */
 	if (BufferIsValid(hscan->xs_cbuf))
 	{
@@ -113,6 +121,9 @@ heapam_index_fetch_end(IndexFetchTableData *scan)
 
 	heapam_index_fetch_reset(scan);
 
+	if (scan->rs)
+		read_stream_end(scan->rs);
+
 	if (hscan->vmbuf != InvalidBuffer)
 	{
 		ReleaseBuffer(hscan->vmbuf);
@@ -146,7 +157,17 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		if (BufferIsValid(hscan->xs_cbuf))
 			ReleaseBuffer(hscan->xs_cbuf);
 
-		hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
+		/*
+		 * When using a read stream, the stream will already know which block
+		 * number comes next (though an assertion will verify a match below)
+		 */
+		if (scan->rs)
+			hscan->xs_cbuf = read_stream_next_buffer(scan->rs, NULL);
+		else
+			hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
+
+		Assert(BufferIsValid(hscan->xs_cbuf));
+		Assert(BufferGetBlockNumber(hscan->xs_cbuf) == ItemPointerGetBlockNumber(tid));
 
 		/*
 		 * Prune page when it is pinned for the first time
@@ -244,6 +265,15 @@ heapam_batch_return_tid(IndexScanDesc scan, IndexScanBatch scanBatch,
 /*
  * heap_batch_resolve_visibility
  *		Obtain visibility information for every TID from caller's batch.
+ *
+ * heapam_batch_getnext_tid must reliably agree with heapam_getnext_stream
+ * about which heap blocks/TIDs will require a heap fetch (and which TIDs
+ * won't due to pointing to an all-visible heap page).  Otherwise we risk
+ * allowing the read stream to return unexpected heap buffers/pages.
+ *
+ * Caching visibility information up front avoids that problem.  If a VM bit
+ * is concurrently set (or unset), it can't matter, since everybody will have
+ * works off of this immutable local cache.
  */
 static void
 heap_batch_resolve_visibility(IndexScanDesc scan, IndexScanBatch batch)
@@ -304,6 +334,7 @@ heap_batch_getnext(IndexScanDesc scan, IndexScanBatch priorbatch,
 
 	/* XXX: we should assert that a snapshot is pushed or registered */
 	Assert(TransactionIdIsValid(RecentXmin));
+	Assert(!batchringbuf->paused);
 
 	/*
 	 * When caller provides a priorbatch it had better be for the last valid
@@ -352,6 +383,29 @@ heap_batch_getnext(IndexScanDesc scan, IndexScanBatch priorbatch,
 			ReleaseBuffer(batch->buf);
 			batch->buf = InvalidBuffer;
 		}
+
+		/*
+		 * Delay initializing stream until reading from scan's second batch.
+		 * This heuristic avoids wasting cycles on starting a read stream for
+		 * very selective index scans.  We can likely improve upon this, but
+		 * it works well enough for now.
+		 *
+		 * Also avoid prefetching during scans where we're unable to drop each
+		 * batch's buffer pin right away (non-MVCC snapshot scans).  We are
+		 * not prepared to sensibly limit the total number of buffer pins held
+		 * (read stream handles all pin resource management for us, and knows
+		 * nothing about pins held on index pages/within batches).
+		 */
+		if (!scan->xs_heapfetch->rs && priorbatch && scan->MVCCScan &&
+			enable_indexscan_prefetch)
+		{
+			Assert(INDEX_SCAN_POS_INVALID(&batchringbuf->prefetchPos));
+
+			scan->xs_heapfetch->rs =
+				read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
+										   scan->heapRelation, MAIN_FORKNUM,
+										   heapam_getnext_stream, scan, 0);
+		}
 	}
 
 	/* xs_hitup is not supported by amgetbatch scans */
@@ -383,14 +437,34 @@ heapam_batch_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	/* xs_hitup is not supported by amgetbatch scans */
 	Assert(!scan->xs_hitup);
 
+	/* scan should only be paused when there's no free batch slots */
+	Assert(!batchringbuf->paused || INDEX_SCAN_BATCH_FULL(scan));
+
 	/* Initialize direction on first call */
 	if (batchringbuf->direction == NoMovementScanDirection)
 		batchringbuf->direction = direction;
 
+	/*
+	 * XXX Shouldn't this also update the batchringbuf->direction? If we get
+	 * to the next block hangling direction change, then we will remember it
+	 * (because heapam_batch_rewind will store it). But if we return in the
+	 * next block, won't we forget about it?
+	 *
+	 * XXX It's a bit weird we handle the direction change in two places.
+	 * Would be good to explain why that's necessary.
+	 *
+	 * XXX How come this doesn't need to do heapam_batch_rewind too? Could
+	 * there be some future batches already loaded?
+	 */
 	if (unlikely(batchringbuf->direction != direction))
 	{
+		if (scan->xs_heapfetch->rs)
+			read_stream_reset(scan->xs_heapfetch->rs);
+		batch_reset_pos(&batchringbuf->prefetchPos);
+
 		/* We may change direction after reading the last batch. */
 		scan->finished = false;
+		batchringbuf->paused = false;
 	}
 
 	/*
@@ -427,6 +501,7 @@ heapam_batch_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	 */
 	if (INDEX_SCAN_BATCH_LOADED(scan, scanPos->batch + 1))
 	{
+		/* Next batch already loaded by heapam_getnext_stream */
 		scanBatch = INDEX_SCAN_BATCH(scan, scanPos->batch + 1);
 	}
 	else if ((scanBatch = heap_batch_getnext(scan, scanBatch, direction)) != NULL)
@@ -466,11 +541,218 @@ heapam_batch_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 
 		/* we can't skip any batches */
 		Assert(batchringbuf->headBatch == scanPos->batch);
+
+		if (batchringbuf->paused)
+		{
+			/*
+			 * The scan's read stream was paused by heapam_getnext_stream due
+			 * to exhausting all available free batch slots.  We just freed up
+			 * one such slot now, though.  Resume the read stream to re-enable
+			 * prefetching.
+			 */
+			Assert(!INDEX_SCAN_BATCH_FULL(scan));
+			read_stream_resume(scan->xs_heapfetch->rs);
+			batchringbuf->paused = false;
+		}
 	}
 
 	return heapam_batch_return_tid(scan, scanBatch, scanPos);
 }
 
+/*
+ * heapam_getnext_stream
+ *		return the next block to pass to the read stream
+ *
+ * The initial batch is always loaded by heapam_batch_getnext_tid.  We don't
+ * get called until the first read_stream_next_buffer() call, when a heap
+ * block is requested from the scan's stream for the first time.
+ *
+ * The position of the read_stream is stored in prefetchPos.  It is typical for
+ * prefetchPos to consistently stay ahead of the scanPos position that's used to
+ * track the next TID to be returned to the scan by heapam_batch_getnext_tid
+ * after the first time we get called.  However, that isn't a precondition.
+ * There is a strict postcondition, though: when we return we'll always leave
+ * scanPos <= prefetchPos (except in cases where we return InvalidBlockNumber).
+ */
+static BlockNumber
+heapam_getnext_stream(ReadStream *stream, void *callback_private_data,
+					  void *per_buffer_data)
+{
+	IndexScanDesc scan = (IndexScanDesc) callback_private_data;
+	BatchRingBuffer *batchringbuf = scan->batchringbuf;
+	BatchRingItemPos *scanPos = &batchringbuf->scanPos;
+	BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos;
+	ScanDirection direction = batchringbuf->direction;
+	IndexScanBatch prefetchBatch;
+	bool		fromScanPos = false;
+
+	/*
+	 * It is possible for the scan's direction to change, but that's handled
+	 * elsewhere.  We don't know how to deal with any variation in scan
+	 * direction here.  We assume that all loaded and newly requested batches
+	 * must use the same scan direction.
+	 */
+	Assert(direction != NoMovementScanDirection);
+	Assert(!batchringbuf->paused);
+
+	if (scan->finished)
+		return InvalidBlockNumber;
+
+	/*
+	 * scanPos must always be valid when we're called -- there has to be at
+	 * least one batch, loaded, for scanBatch.  prefetchPos might not yet be
+	 * valid, in which case it'll be initialized using scanPos.
+	 */
+	Assert(INDEX_SCAN_BATCH_COUNT(scan) > 0);
+	batch_assert_pos_valid(scan, scanPos);
+
+	/*
+	 * If prefetchPos has not been initialized yet, that typically indicates
+	 * that this is the first call here for the entire scan (barring changes
+	 * in scan direction with a scrollable cursor).  We initialize prefetchPos
+	 * using the current scanPos, since the current scanBatch item's TID
+	 * should have its block number returned by the read stream first.  It's
+	 * likely that prefetchPos will get ahead of scanPos before long, but that
+	 * hasn't happened yet.
+	 *
+	 * It's also possible for prefetchPos to "fall behind" scanPos, at least
+	 * in a trivial sense: if many adjacent items are returned that contain
+	 * TIDs that point to the same heap block, scanPos can actually overtake
+	 * prefetchPos (prefetchPos can't advance until the scan actually calls
+	 * read_stream_next_buffer).  Usually this doesn't require any special
+	 * handling; the standard currentPrefetchBlock tests in the loop below
+	 * will increment prefetchPos until it catches up with scanPos once again.
+	 * However, that can't work when prefetchPos falls so far behind that its
+	 * batch gets freed.  We handle that case here, too.
+	 *
+	 * This !INDEX_SCAN_BATCH_LOADED() case can be handled using the standard
+	 * approach to initializing prefetchPos during the scan's first call here.
+	 * In effect, this is an alternative way for prefetchPos to catch up with
+	 * scanPos -- one that doesn't rely on prefetchPos->batch staying around.
+	 */
+	if (INDEX_SCAN_POS_INVALID(prefetchPos) ||
+		!INDEX_SCAN_BATCH_LOADED(scan, prefetchPos->batch))
+	{
+		batchringbuf->currentPrefetchBlock = InvalidBlockNumber;
+		*prefetchPos = *scanPos;
+		fromScanPos = true;
+	}
+
+	prefetchBatch = INDEX_SCAN_BATCH(scan, prefetchPos->batch);
+	for (;;)
+	{
+		BatchMatchingItem *item;
+		BlockNumber prefetchBlock;
+
+		if (fromScanPos)
+		{
+			/*
+			 * Don't increment item when prefetchPos was just initialized
+			 * using scanPos.  We'll return the scanPos item's heap block
+			 * directly on the first call here.  In other words, we'll return
+			 * the heap block for the TID passed to heapam_index_fetch_tuple
+			 * at the point where it called read_stream_next_buffer for the
+			 * first time during the scan.
+			 */
+			fromScanPos = false;
+		}
+		else if (!index_batchpos_advance(prefetchBatch, prefetchPos, direction))
+		{
+			/*
+			 * Ran out of items from prefetchBatch.  Try to advance it to next
+			 * batch.
+			 */
+			if (INDEX_SCAN_BATCH_LOADED(scan, prefetchPos->batch + 1))
+			{
+				/*
+				 * The next batch was already loaded for us.
+				 *
+				 * Typically, prefetchPos is ahead of scanPos for the entire
+				 * duration of the scan (at least after we're first called).
+				 * However, prefetchPos can sometimes fall behind scanPos.
+				 * That's why we need to handle already-loaded batches here.
+				 *
+				 * This happens when some blocks are skipped and not returned
+				 * to the read_stream.  An example is an index scan on a
+				 * correlated index, with many duplicate blocks are skipped,
+				 * or an IOS where all-visible blocks are skipped.
+				 */
+				prefetchBatch = INDEX_SCAN_BATCH(scan, prefetchPos->batch + 1);
+			}
+			else
+			{
+				/*
+				 * If we already used the maximum number of batch slots
+				 * available, it's pointless to try loading another one. This
+				 * can happen for various reasons, e.g. for index-only scans
+				 * on all-visible table, or skipping duplicate blocks on
+				 * perfectly correlated indexes, etc.
+				 */
+				if (INDEX_SCAN_BATCH_FULL(scan))
+				{
+					batchringbuf->paused = true;
+					return read_stream_pause(stream);
+				}
+
+				prefetchBatch = heap_batch_getnext(scan, prefetchBatch, direction);
+				if (!prefetchBatch)
+				{
+					/*
+					 * Failed to load next batch, so all the batches that the
+					 * scan will ever require (barring a change in scan
+					 * direction) are now loaded
+					 */
+					scan->finished = true;
+					return InvalidBlockNumber;
+				}
+			}
+
+			/* Position prefetchPos to the start of new prefetchBatch */
+			index_batchpos_newbatch(prefetchBatch, prefetchPos, direction);
+		}
+
+		/*
+		 * prefetchPos now points to the next item whose TID's heap block
+		 * number might need to be prefetched
+		 */
+		batch_assert_pos_valid(scan, prefetchPos);
+		Assert(INDEX_SCAN_BATCH(scan, prefetchPos->batch) == prefetchBatch);
+		Assert(prefetchBatch->dir == direction);
+
+		/* scanPos is always <= prefetchPos when we return */
+		Assert(scanPos->batch < prefetchPos->batch ||
+			   (scanPos->batch == prefetchPos->batch &&
+				ScanDirectionIsForward(direction) ?
+				scanPos->item <= prefetchPos->item :
+				scanPos->item >= prefetchPos->item));
+
+		item = &prefetchBatch->items[prefetchPos->item];
+		prefetchBlock = ItemPointerGetBlockNumber(&item->heapTid);
+
+		if (scan->xs_want_itup && item->allVisible)
+		{
+			/* item is known to be all-visible; prefetching isn't required */
+			continue;
+		}
+
+		if (prefetchBlock == batchringbuf->currentPrefetchBlock)
+		{
+			/*
+			 * prefetchBlock matches the last prefetchPos item's TID's heap
+			 * block number; we must not return the same prefetchBlock twice
+			 * (twice in succession)
+			 */
+			continue;
+		}
+
+		/* We have a new heap block number to return to read stream */
+		batchringbuf->currentPrefetchBlock = prefetchBlock;
+		return prefetchBlock;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /* ----------------
  *		index_fetch_heap - get the scan's next heap tuple
  *
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
index d5b8ce6cd..2e0c8ed2a 100644
--- a/src/backend/access/index/indexbatch.c
+++ b/src/backend/access/index/indexbatch.c
@@ -59,11 +59,14 @@ index_batchscan_init(IndexScanDesc scan)
 	/* positions in the ring buffer of batches */
 	batch_reset_pos(&scan->batchringbuf->scanPos);
 	batch_reset_pos(&scan->batchringbuf->markPos);
+	batch_reset_pos(&scan->batchringbuf->prefetchPos);
 
 	scan->batchringbuf->markBatch = NULL;
 	scan->batchringbuf->headBatch = 0;	/* initial head batch */
 	scan->batchringbuf->nextBatch = 0;	/* initial batch starts empty */
 	memset(&scan->batchringbuf->cache, 0, sizeof(scan->batchringbuf->cache));
+	scan->batchringbuf->currentPrefetchBlock = InvalidBlockNumber;
+	scan->batchringbuf->paused = false;
 }
 
 /*
@@ -82,8 +85,12 @@ index_batchscan_reset(IndexScanDesc scan, bool complete)
 	batch_assert_batches_valid(scan);
 	Assert(scan->xs_heapfetch);
 
+	if (scan->xs_heapfetch->rs)
+		read_stream_reset(scan->xs_heapfetch->rs);
+
 	/* reset the positions */
 	batch_reset_pos(&batchringbuf->scanPos);
+	batch_reset_pos(&batchringbuf->prefetchPos);
 
 	/*
 	 * With "complete" reset, make sure to also free the marked batch, either
@@ -126,6 +133,8 @@ index_batchscan_reset(IndexScanDesc scan, bool complete)
 	batchringbuf->nextBatch = 0;	/* initial batch is empty */
 
 	scan->finished = false;
+	batchringbuf->currentPrefetchBlock = InvalidBlockNumber;
+	batchringbuf->paused = false;
 
 	batch_assert_batches_valid(scan);
 }
@@ -215,9 +224,13 @@ index_batchscan_restore_pos(IndexScanDesc scan)
 {
 	BatchRingBuffer *batchringbuf = scan->batchringbuf;
 	BatchRingItemPos *markPos = &batchringbuf->markPos;
-	BatchRingItemPos *scanPos = &batchringbuf->scanPos ;
 	IndexScanBatch markBatch = batchringbuf->markBatch;
 
+	/*
+	 * XXX Disable this optimization when I/O prefetching is in use, at least
+	 * until the possible interactions with prefetchPos are fully understood.
+	 */
+#if 0
 	if (scanPos->batch == markPos->batch &&
 		scanPos->batch == batchringbuf->headBatch)
 	{
@@ -228,6 +241,7 @@ index_batchscan_restore_pos(IndexScanDesc scan)
 		scanPos->item = markPos->item;
 		return;
 	}
+#endif
 
 	/*
 	 * Call amposreset to let index AM know to invalidate any private state
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 16bf1f61a..78f4b90d3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -145,6 +145,7 @@ int			max_parallel_workers_per_gather = 2;
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
+bool		enable_indexscan_prefetch = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 7c60b1255..a99aa41db 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -891,6 +891,13 @@
   boot_val => 'true',
 },
 
+{ name => 'enable_indexscan_prefetch', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
+  short_desc => 'Enables prefetching for index scans and index-only-scans.',
+  flags => 'GUC_EXPLAIN',
+  variable => 'enable_indexscan_prefetch',
+  boot_val => 'true',
+},
+
 { name => 'enable_material', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
   short_desc => 'Enables the planner\'s use of materialization.',
   flags => 'GUC_EXPLAIN',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dc9e2255f..da50ae15f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -412,6 +412,7 @@
 #enable_incremental_sort = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
+#enable_indexscan_prefetch = on
 #enable_material = on
 #enable_memoize = on
 #enable_mergejoin = on
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0fad34da6..05a315a55 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5631,6 +5631,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-indexscan-prefetch" xreflabel="enable_indexscan_prefetch">
+      <term><varname>enable_indexscan_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_indexscan_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables prefetching for index-scan and index-only-scan
+        plan types.  Prefetching can improve performance by reading table AM
+        pages ahead of when they are needed during index scans.  The default
+        is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-material" xreflabel="enable_material">
       <term><varname>enable_material</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index f48da3185..20ad20842 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -167,10 +167,11 @@ typedef struct IndexAmRoutine
     ambeginscan_function ambeginscan;
     amrescan_function amrescan;
     amgettuple_function amgettuple;     /* can be NULL */
+    amgetbatch_function amgetbatch; /* can be NULL */
+    amfreebatch_function amfreebatch;	/* can be NULL */
     amgetbitmap_function amgetbitmap;   /* can be NULL */
     amendscan_function amendscan;
-    ammarkpos_function ammarkpos;       /* can be NULL */
-    amrestrpos_function amrestrpos;     /* can be NULL */
+    amposreset_function amposreset; /* can be NULL */
 
     /* interface functions to support parallel index scans */
     amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
@@ -749,6 +750,184 @@ amgettuple (IndexScanDesc scan,
    <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
    struct must be set to NULL.
   </para>
+  <note>
+   <para>
+    As of <productname>PostgreSQL</productname> version 19, position marking
+    and restoration of scans is no longer supported for the
+    <function>amgettuple</function> interface; only the
+    <function>amgetbatch</function> interface supports this feature through
+    the <function>amposreset</function> callback.
+   </para>
+  </note>
+
+  <para>
+<programlisting>
+IndexScanBatch
+amgetbatch (IndexScanDesc scan,
+            IndexScanBatch priorbatch,
+            ScanDirection direction);
+</programlisting>
+   Return the next batch of index tuples in the given scan, moving in the
+   given direction (forward or backward in the index).  Returns an instance of
+   <type>IndexScanBatch</type> with index tuples loaded, or
+   <literal>NULL</literal> if there are no more index tuples in the given
+   scan direction.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> interface is an alternative to
+   <function>amgettuple</function> that returns matching index entries in batches
+   rather than one at a time. This enables the table access method to
+   optimize table block access patterns and perform I/O prefetching.
+   By returning all matching index entries from a single index page together,
+   the table AM can readahead through the index and identify which table
+   blocks will be needed, allowing prefetching of table AM pages during
+   ordered index scans.
+  </para>
+
+  <para>
+   The table AM passes the batch most recently returned by
+   <function>amgetbatch</function> for the given scan as
+   <literal>priorbatch</literal> (or <literal>NULL</literal> on the first call
+   for the scan).  The index AM uses information from <literal>priorbatch</literal>
+   to determine which index page to read next.
+  </para>
+
+  <para>
+   A batch returned by <function>amgetbatch</function> is associated with a
+   pinned index page containing at least one matching item/tuple.  The buffer
+   pin can be held onto by the table AM as an interlock against concurrent TID
+   recycling by <command>VACUUM</command>.  See <xref linkend="index-locking"/>
+   for details on buffer pin management during index scans.
+  </para>
+
+  <para>
+   A <type>IndexScanBatch</type> that is returned by
+   <function>amgetbatch</function> is no longer managed by the access method.
+   It is up to the table AM caller to decide when it should be freed by
+   passing it to <function>amfreebatch</function>.  Note also that
+   <function>amgetbatch</function> functions must never modify the
+   <structfield>priorbatch</structfield> parameter.  The core
+   <filename>src/backend/access/nbtree/</filename> and
+   <filename>src/backend/access/hash/</filename> implementations provide
+   reference examples of the <function>amgetbatch</function> interface.
+  </para>
+
+  <para>
+   The same caveats described for <function>amgettuple</function> apply here
+   too: an entry in the returned batch means only that the index contains
+   an entry that matches the scan keys, not that the tuple necessarily still
+   exists in the heap or will pass the caller's snapshot test.
+  </para>
+
+  <para>
+   Index access methods using <function>amgetbatch</function> must set
+   <literal>scan-&gt;xs_recheck</literal> to indicate whether rechecking of
+   scan keys is required, in the same way as <function>amgettuple</function>
+   does. However, <literal>scan-&gt;xs_recheck</literal> must be set consistently
+   for an entire scan rather than varying on a per-tuple basis. This is a key
+   difference from <function>amgettuple</function>, which can set
+   <literal>scan-&gt;xs_recheck</literal> independently for each tuple it returns.
+   Index access methods that require granular control over
+   <literal>scan-&gt;xs_recheck</literal> must use the <function>amgettuple</function>
+   interface instead of <function>amgetbatch</function>.
+  </para>
+
+  <para>
+   Similarly, the <function>amgetbatch</function> interface does not support
+   index-only scans that return data in the form of a
+   <structname>HeapTuple</structname> pointer.  Index-only scans work by
+   copying <structname>IndexTuple</structname> records from index pages into a
+   local buffer associated with each batch.  <literal>xs_itupdesc</literal>
+   works in the same as already described for <function>amgettuple</function>.
+   The access method must not set the <literal>scan-&gt;xs_itup</literal> or
+   <literal>scan-&gt;xs_hitup</literal> fields itself.
+  </para>
+
+  <para>
+   The index access method must provide either <function>amgettuple</function>
+   or <function>amgetbatch</function>, but not both.
+   When the access method provides <function>amgetbatch</function>, it must
+   also provide <function>amfreebatch</function>.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> function need only be provided if the
+   access method supports <quote>plain</quote> index scans.  If it doesn't,
+   the <function>amgetbatch</function> field in its
+   <structname>IndexAmRoutine</structname> struct must be set to NULL.
+  </para>
+
+  <para>
+<programlisting>
+void
+amfreebatch (IndexScanDesc scan,
+             IndexScanBatch batch);
+</programlisting>
+   Frees a batch returned by the <function>amgetbatch</function> callback.
+   The <literal>batch</literal> argument is associated with an index page,
+   which will never be locked when <function>amfreebatch</function> is called.
+   The batch might still retain a buffer pin on the page by the time that
+   <function>amfreebatch</function> is called (whether or not it does is
+   decided by the table AM).
+  </para>
+
+  <para>
+   <function>amfreebatch</function> releases the buffer pin on the batch's
+   associated index page (if the table AM has not already released it), and
+   frees related memory and resources.  It must always release the caller's
+   batch last, by passing it as an argument to
+   <function>indexam_util_batch_release</function>.
+  </para>
+
+  <para>
+   This function is called exclusively by table access methods to indicate
+   that processing of the batch is complete; it should never be called within
+   the index access method itself.  The table access method controls when the
+   buffer pin is dropped: for MVCC snapshot scans the pin is dropped eagerly
+   after the batch is returned, while for non-MVCC snapshot scans the pin is
+   retained until <function>amfreebatch</function> is called.
+  </para>
+
+  <para>
+   The index AM has the option of setting <literal>LP_DEAD</literal> bits in
+   the index page to mark dead tuples before releasing the buffer pin. While
+   this is optional, implementing it is recommended for performance, as it
+   allows future scans to skip known-dead index entries. Both core index access
+   methods that currently support <function>amgetbatch</function> (B-tree
+   and hash) implement <literal>LP_DEAD</literal> marking, though third-party
+   index access methods are free to choose whether to implement this feature.
+   The table AM may call
+   <function>tableam_util_kill_scanpositem</function> to mark dead items as
+   the scan progresses. If the batch contains any such dead items, the batch's
+   <structfield>killedItems</structfield> array will have been sorted and
+   deduplicated before <function>amfreebatch</function> is called, with item
+   offsets appearing in ascending order (that is, in index page order, which
+   is also batch order) and no offset appearing more than once. This sorting
+   makes it unnecessary for the table AM to call
+   <function>tableam_util_kill_scanpositem</function> in any particular order.
+   (Index access methods using <function>amgettuple</function> rely on the
+   <structfield>kill_prior_tuple</structfield> mechanism instead to mark dead
+   tuples; the <filename>src/backend/access/gist/</filename> implementation
+   provides a reference example.)
+  </para>
+
+  <para>
+   The index AM may choose to retain its own buffer pins when this serves an
+   internal purpose (for example, maintaining a descent stack of pinned index
+   pages for reuse across <function>amgetbatch</function> calls).  However,
+   any scheme that retains buffer pins managed by the index AM must be sure to
+   free the pins at an opportune point (for example when <function>amrescan</function>
+   and/or <function>amendscan</function> are called).  It must also keep the
+   number of retained pins fixed and small, to avoid exhausting the backend's
+   buffer pin limit.
+  </para>
+
+  <para>
+   The <function>amfreebatch</function> function need only be provided if the
+   access method provides <function>amgetbatch</function>. Otherwise it has to
+   remain set to <literal>NULL</literal>.
+  </para>
 
   <para>
 <programlisting>
@@ -768,8 +947,8 @@ amgetbitmap (IndexScanDesc scan,
    itself, and therefore callers recheck both the scan conditions and the
    partial index predicate (if any) for recheckable tuples.  That might not
    always be true, however.
-   <function>amgetbitmap</function> and
-   <function>amgettuple</function> cannot be used in the same index scan; there
+   Only one of <function>amgetbitmap</function>, <function>amgettuple</function>,
+   or <function>amgetbatch</function> can be used in any given index scan; there
    are other restrictions too when using <function>amgetbitmap</function>, as explained
    in <xref linkend="index-scanning"/>.
   </para>
@@ -795,32 +974,25 @@ amendscan (IndexScanDesc scan);
   <para>
 <programlisting>
 void
-ammarkpos (IndexScanDesc scan);
+amposreset (IndexScanDesc scan);
 </programlisting>
-   Mark current scan position.  The access method need only support one
-   remembered scan position per scan.
+   Notify index AM that core code will change the scan's position to an item
+   returned as part of an earlier batch.  The index AM must therefore
+   invalidate any state that independently tracks the scan's progress
+   (e.g., array keys used with a ScalarArrayOpExpr qual).  Called by the core
+   system when it is about to restore a mark.
   </para>
 
   <para>
-   The <function>ammarkpos</function> function need only be provided if the access
-   method supports ordered scans.  If it doesn't,
-   the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
-   struct may be set to NULL.
-  </para>
-
-  <para>
-<programlisting>
-void
-amrestrpos (IndexScanDesc scan);
-</programlisting>
-   Restore the scan to the most recently marked position.
-  </para>
-
-  <para>
-   The <function>amrestrpos</function> function need only be provided if the access
-   method supports ordered scans.  If it doesn't,
-   the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
-   struct may be set to NULL.
+   The <function>amposreset</function> function can only be provided if the
+   access method supports ordered scans through the <function>amgetbatch</function>
+   interface.  If it doesn't, the <structfield>amposreset</structfield> field
+   in its <structname>IndexAmRoutine</structname> struct should be set to
+   NULL.  Index AMs that don't have any private state that might need to be
+   invalidated might still find it useful to provide an empty
+   <structfield>amposreset</structfield> function; if <function>amposreset</function>
+   is set to NULL, the core system will assume that it is unsafe to restore a
+   marked position.
   </para>
 
   <para>
@@ -975,6 +1147,8 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
        Access methods that always return entries in the natural ordering
        of their data (such as btree) should set
        <structfield>amcanorder</structfield> to true.
+       Both <function>amgettuple</function> and <function>amgetbatch</function>
+       scans support this capability.
        Currently, such access methods must use btree-compatible strategy
        numbers for their equality and ordering operators.
       </para>
@@ -987,41 +1161,48 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
        an order satisfying <literal>ORDER BY</literal> <replaceable>index_key</replaceable>
        <replaceable>operator</replaceable> <replaceable>constant</replaceable>.  Scan modifiers
        of that form can be passed to <function>amrescan</function> as described
-       previously.
+       previously.  Note that <function>amgetbatch</function> scans do not
+       support ordering operators.
       </para>
      </listitem>
     </itemizedlist>
   </para>
 
   <para>
-   The <function>amgettuple</function> function has a <literal>direction</literal> argument,
+   The <function>amgettuple</function> and <function>amgetbatch</function>
+   functions have a <literal>direction</literal> argument,
    which can be either <literal>ForwardScanDirection</literal> (the normal case)
    or  <literal>BackwardScanDirection</literal>.  If the first call after
    <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
    set of matching index entries is to be scanned back-to-front rather than in
-   the normal front-to-back direction, so <function>amgettuple</function> must return
-   the last matching tuple in the index, rather than the first one as it
-   normally would.  (This will only occur for access
-   methods that set <structfield>amcanorder</structfield> to true.)  After the
-   first call, <function>amgettuple</function> must be prepared to advance the scan in
+   the normal front-to-back direction.  In this case,
+   <function>amgettuple</function> must return the last matching tuple in the
+   index, rather than the first one as it normally would.  Similarly,
+   <function>amgetbatch</function> must return the last matching batch of items
+   when either the first call after <function>amrescan</function> specifies
+   <literal>BackwardScanDirection</literal>, or a subsequent call has
+   <literal>NULL</literal> as its <structfield>priorbatch</structfield> argument
+   (indicating a backward scan restart).  (This backward-scan behavior will
+   only occur for access methods that set <structfield>amcanorder</structfield>
+   to true.)  After the first call, both <function>amgettuple</function> and
+   <function>amgetbatch</function> must be prepared to advance the scan in
    either direction from the most recently returned entry.  (But if
    <structfield>amcanbackward</structfield> is false, all subsequent
    calls will have the same direction as the first one.)
   </para>
 
   <para>
-   Access methods that support ordered scans must support <quote>marking</quote> a
-   position in a scan and later returning to the marked position.  The same
-   position might be restored multiple times.  However, only one position need
-   be remembered per scan; a new <function>ammarkpos</function> call overrides the
-   previously marked position.  An access method that does not support ordered
-   scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
-   functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
-   instead.
+   Access methods using the <function>amgetbatch</function> interface may
+   support <quote>marking</quote> a position in a scan and later returning to
+   the marked position, though this is optional.  When a marked position is
+   restored, the index AM is notified via the <function>amposreset</function>
+   callback so it can invalidate any private state that independently tracks
+   the scan's progress (such as array key state).  See the
+   <function>amposreset</function> function description for details.
   </para>
 
   <para>
-   Both the scan position and the mark position (if any) must be maintained
+   The scan position (if any) must be maintained by the table AM and index AM
    consistently in the face of concurrent insertions or deletions in the
    index.  It is OK if a freshly-inserted entry is not returned by a scan that
    would have found the entry if it had existed when the scan started, or for
@@ -1044,12 +1225,14 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
   </para>
 
   <para>
-   Instead of using <function>amgettuple</function>, an index scan can be done with
-   <function>amgetbitmap</function> to fetch all tuples in one call.  This can be
-   noticeably more efficient than <function>amgettuple</function> because it allows
-   avoiding lock/unlock cycles within the access method.  In principle
-   <function>amgetbitmap</function> should have the same effects as repeated
-   <function>amgettuple</function> calls, but we impose several restrictions to
+   Instead of using <function>amgettuple</function> or
+   <function>amgetbatch</function>, an index scan can be done with
+   <function>amgetbitmap</function> to fetch all tuples in one call.  This can
+   be noticeably more efficient than with an <quote>ordered</quote> scan
+   because it allows efficient sequential access to table AM pages containing
+   matches.  In principle <function>amgetbitmap</function> should have the
+   same effects as repeated <function>amgettuple</function> or
+   <function>amgetbatch</function> calls, but we impose several restrictions to
    simplify matters.  First of all, <function>amgetbitmap</function> returns all
    tuples at once and marking or restoring scan positions isn't
    supported. Secondly, the tuples are returned in a bitmap which doesn't
@@ -1066,10 +1249,64 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
 
   <para>
    Note that it is permitted for an access method to implement only
-   <function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa,
-   if its internal implementation is unsuited to one API or the other.
+   <function>amgetbitmap</function> and not <function>amgettuple</function>/<function>amgetbatch</function>,
+   or vice versa, if its internal implementation is unsuited to one API or the other.
   </para>
 
+  <sect2 id="index-scanning-batches">
+   <title>Table AM Considerations for Batch Scanning</title>
+
+   <para>
+    This section is primarily relevant to table access method authors.
+    When an index scan uses the <function>amgetbatch</function> interface,
+    the table AM is responsible for managing position state within
+    <structfield>scan-&gt;batchRingBuffer</structfield> and for controlling when
+    buffer pins on index pages are released.
+   </para>
+
+   <para>
+    The <structfield>scan-&gt;batchRingBuffer.scanPos</structfield> field tracks the
+    current read position within the ring buffer of batches.  The table AM
+    must advance <structfield>scanPos</structfield> as tuples are returned
+    by <function>table_index_getnext_slot</function>.  The core code may also
+    modify this field during operations such as mark/restore.
+   </para>
+
+   <para>
+    The <structfield>scan-&gt;batchRingBuffer.prefetchPos</structfield> field tracks
+    the position for I/O prefetching.  It is generally advanced by initializing
+    it from <structfield>readPos</structfield> within a read stream callback,
+    allowing the table AM to prefetch table blocks pointed to by items that are
+    well ahead of the current scan position.  Initially
+    <structfield>prefetchPos</structfield> starts at
+    <structfield>scanPos</structfield>, but as the read stream ramps up it can
+    get far ahead &mdash; spanning multiple index pages if necessary to
+    maintain an optimal I/O prefetch distance for table block reads.  A major goal
+    of the <function>amgetbatch</function> interface is to allow the table AM
+    to prefetch without being limited to items from the current
+    <structfield>scanPos</structfield> index leaf page.
+   </para>
+
+   <para>
+    Both <structfield>scanPos</structfield> and
+    <structfield>prefetchPos</structfield> are controlled by the table AM and
+    core code; index access methods should not access or manipulate these
+    fields.  See the <filename>src/backend/access/heap/</filename>
+    implementation for a reference example.
+   </para>
+
+   <para>
+    Buffer pins on index pages returned by <function>amgetbatch</function> are
+    managed by the table AM.  The table AM decides when to drop index page
+    pins based on whether the scan uses an MVCC-compliant snapshot.  For MVCC
+    snapshot scans, pins are dropped eagerly; for non-MVCC snapshot scans,
+    pins are retained until <function>amfreebatch</function> is called.
+    See the <function>amgetbatch</function> and <function>amfreebatch</function>
+    descriptions in <xref linkend="index-functions"/> for details.
+   </para>
+
+  </sect2>
+
  </sect1>
 
  <sect1 id="index-locking">
@@ -1123,11 +1360,13 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
      </listitem>
      <listitem>
       <para>
-       An index scan must maintain a pin
-       on the index page holding the item last returned by
-       <function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete
-       entries from pages that are pinned by other backends.  The need
-       for this rule is explained below.
+       For <function>amgettuple</function> scans, the index access method must
+       maintain a pin on the index page holding the item last returned, and
+       <function>ambulkdelete</function> cannot delete entries from pages that
+       are pinned by other backends.  For <function>amgetbatch</function> scans,
+       the table access method controls when pins are dropped (see
+       <xref linkend="index-scanning-batches"/>).  The need for this rule is
+       explained below.
       </para>
      </listitem>
     </itemizedlist>
@@ -1173,6 +1412,29 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    it is only safe to use such scans with MVCC-compliant snapshots.
   </para>
 
+  <para>
+   The <function>amgetbatch</function> interface provides a different approach:
+   the table access method receives batches of TIDs and controls when index
+   page pins are dropped.  Because <function>amgetbatch</function> reads
+   multiple index leaf pages ahead to facilitate I/O prefetching of table
+   blocks, it cannot practically hold pins on all those pages simultaneously.
+   Therefore, like <function>amgetbitmap</function>, I/O prefetching with
+   <function>amgetbatch</function> is only possible when an MVCC-compliant
+   snapshot is in use.  In practice, the heap table AM (and any table AM
+   with similar concurrency rules) drops pins eagerly for MVCC snapshot scans
+   but retains pins for non-MVCC snapshot scans.  Index access methods that
+   implement <function>amgetbatch</function> do not control when pins are
+   dropped; that decision is delegated to the table AM.
+  </para>
+
+  <para>
+   Index access methods that use <function>amgettuple</function> must implement
+   the pin-holding behavior themselves.  Such index AMs are expected to hold
+   onto the leaf page buffer pin for non-MVCC snapshot scans, replicating
+   the behavior that the heap table AM would use with
+   <function>amgetbatch</function>.
+  </para>
+
   <para>
    When the <structfield>ampredlocks</structfield> flag is not set, any scan using that
    index access method within a serializable transaction will acquire a
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 77c5a763d..55b7222e9 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1152,12 +1152,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      </para>
 
      <para>
-      The access method must support <literal>amgettuple</literal> (see <xref
-      linkend="indexam"/>); at present this means <acronym>GIN</acronym>
-      cannot be used.  Although it's allowed, there is little point in using
-      B-tree or hash indexes with an exclusion constraint, because this
-      does nothing that an ordinary unique constraint doesn't do better.
-      So in practice the access method will always be <acronym>GiST</acronym> or
+      The access method must support either <literal>amgettuple</literal>
+      or <literal>amgetbatch</literal> (see <xref linkend="indexam"/>); at
+      present this means <acronym>GIN</acronym> cannot be used.  Although
+      it's allowed, there is little point in using B-tree or hash indexes
+      with an exclusion constraint, because this does nothing that an
+      ordinary unique constraint doesn't do better.  So in practice the
+      access method will always be <acronym>GiST</acronym> or
       <acronym>SP-GiST</acronym>.
      </para>
 
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 3dd63fd88..b5628736b 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -159,6 +159,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_incremental_sort        | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
+ enable_indexscan_prefetch      | on
  enable_material                | on
  enable_memoize                 | on
  enable_mergejoin               | on
@@ -173,7 +174,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(25 rows)
+(26 rows)
 
 -- There are always wait event descriptions for various types.  InjectionPoint
 -- may be present or absent, depending on history since last postmaster start.
-- 
2.51.0



  [application/x-patch] v8-0001-Extract-fake-LSN-infrastructure-from-GiST-index-A.patch (16.8K, 6-v8-0001-Extract-fake-LSN-infrastructure-from-GiST-index-A.patch)
  download | inline diff:
From 9bb1913dd2a7047df3dba9a28d04b5a56c185a4a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sun, 18 Jan 2026 11:18:11 -0500
Subject: [PATCH v8 1/8] Extract fake LSN infrastructure from GiST index AM.

Extract utility functions used by GiST to generate fake LSNs so that
other index AMs can reuse this infrastructure to generate fake LSNs.

Preparation for an upcoming commit that will change the rules around
holding on to buffer pins on leaf pages in unlogged nbtree indexes
(actually, in all cases barring scans that use a non-MVCC snapshot).
This is the patch that will add the new amgetbatch interface.  Another
preparatory commit will add fake LSN support to nbtree ahead of the
amgetbatch commit.

Bump XLOG_PAGE_MAGIC due to XLOG_GIST_ASSIGN_LSN becoming
XLOG_ASSIGN_LSN.

Author: Peter Geoghegan <[email protected]>
---
 src/include/access/gist_private.h       |  4 --
 src/include/access/gistxlog.h           |  2 +-
 src/include/access/xlog.h               |  1 +
 src/include/access/xloginsert.h         |  2 +
 src/include/catalog/pg_control.h        |  2 +-
 src/backend/access/gist/gist.c          |  6 +--
 src/backend/access/gist/gistutil.c      | 50 -------------------
 src/backend/access/gist/gistvacuum.c    |  8 +--
 src/backend/access/gist/gistxlog.c      | 21 --------
 src/backend/access/rmgrdesc/gistdesc.c  |  6 ---
 src/backend/access/rmgrdesc/xlogdesc.c  |  7 +++
 src/backend/access/transam/xlog.c       | 29 +++++++++++
 src/backend/access/transam/xloginsert.c | 65 +++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c     | 14 +++---
 14 files changed, 120 insertions(+), 97 deletions(-)

diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 552f605c0..44514f1cb 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -457,8 +457,6 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
 								BlockNumber origrlink, GistNSN orignsn,
 								Buffer leftchildbuf, bool markfollowright);
 
-extern XLogRecPtr gistXLogAssignLSN(void);
-
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
@@ -531,8 +529,6 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 							 GISTENTRY *entry2, bool isnull2,
 							 Datum *dst, bool *dstisnull);
 
-extern XLogRecPtr gistGetFakeLSN(Relation rel);
-
 /* gistvacuum.c */
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index d3d1c6549..1c2cf6e81 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,7 +26,7 @@
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
  /* #define XLOG_GIST_CREATE_INDEX		 0x50 */	/* not used anymore */
 #define XLOG_GIST_PAGE_DELETE		0x60
-#define XLOG_GIST_ASSIGN_LSN		0x70	/* nop, assign new LSN */
+ /* #define XLOG_GIST_ASSIGN_LSN		 0x70 */	/* not used anymore */
 
 /*
  * Backup Blk 0: updated page.
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0591a885d..b05efe7c7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -257,6 +257,7 @@ extern bool CreateRestartPoint(int flags);
 extern WALAvailability GetWALAvailability(XLogRecPtr targetLSN);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr XLogRestorePoint(const char *rpName);
+extern XLogRecPtr XLogAssignLSN(void);
 extern void UpdateFullPageWrites(void);
 extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
 extern XLogRecPtr GetRedoRecPtr(void);
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 16ebc76e7..91dfbd562 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -64,6 +64,8 @@ extern void log_newpage_range(Relation rel, ForkNumber forknum,
 							  BlockNumber startblk, BlockNumber endblk, bool page_std);
 extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer, bool buffer_std);
 
+extern XLogRecPtr XLogGetFakeLSN(Relation rel);
+
 extern void InitXLogInsert(void);
 
 #endif							/* XLOGINSERT_H */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 7503db1af..77a661e81 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,7 +78,7 @@ typedef struct CheckPoint
 #define XLOG_END_OF_RECOVERY			0x90
 #define XLOG_FPI_FOR_HINT				0xA0
 #define XLOG_FPI						0xB0
-/* 0xC0 is used in Postgres 9.5-11 */
+#define XLOG_ASSIGN_LSN					0xC0
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
 #define XLOG_CHECKPOINT_REDO			0xE0
 #define XLOG_LOGICAL_DECODING_STATUS_CHANGE	0xF0
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index d5944205d..683ebca52 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -517,7 +517,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = XLogGetFakeLSN(rel);
 		}
 
 		for (ptr = dist; ptr; ptr = ptr->next)
@@ -594,7 +594,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 										leftchildbuf);
 			}
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = XLogGetFakeLSN(rel);
 		}
 		PageSetLSN(page, recptr);
 
@@ -1733,7 +1733,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			PageSetLSN(page, recptr);
 		}
 		else
-			PageSetLSN(page, gistGetFakeLSN(rel));
+			PageSetLSN(page, XLogGetFakeLSN(rel));
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 27972fad2..0f58f6187 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1007,56 +1007,6 @@ gistproperty(Oid index_oid, int attno,
 	return true;
 }
 
-/*
- * Some indexes are not WAL-logged, but we need LSNs to detect concurrent page
- * splits anyway. This function provides a fake sequence of LSNs for that
- * purpose.
- */
-XLogRecPtr
-gistGetFakeLSN(Relation rel)
-{
-	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
-	{
-		/*
-		 * Temporary relations are only accessible in our session, so a simple
-		 * backend-local counter will do.
-		 */
-		static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
-		return counter++;
-	}
-	else if (RelationIsPermanent(rel))
-	{
-		/*
-		 * WAL-logging on this relation will start after commit, so its LSNs
-		 * must be distinct numbers smaller than the LSN at the next commit.
-		 * Emit a dummy WAL record if insert-LSN hasn't advanced after the
-		 * last call.
-		 */
-		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
-		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
-
-		/* Shouldn't be called for WAL-logging relations */
-		Assert(!RelationNeedsWAL(rel));
-
-		/* No need for an actual record if we already have a distinct LSN */
-		if (XLogRecPtrIsValid(lastlsn) && lastlsn == currlsn)
-			currlsn = gistXLogAssignLSN();
-
-		lastlsn = currlsn;
-		return currlsn;
-	}
-	else
-	{
-		/*
-		 * Unlogged relations are accessible from other backends, and survive
-		 * (clean) restarts. GetFakeLSNForUnloggedRel() handles that for us.
-		 */
-		Assert(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
-		return GetFakeLSNForUnloggedRel();
-	}
-}
-
 /*
  * This is a stratnum translation support function for GiST opclasses that use
  * the RT*StrategyNumber constants.
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 9e714980d..686a04180 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,7 +16,7 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
-#include "access/transam.h"
+#include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -182,7 +182,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(rel))
 		vstate.startNSN = GetInsertRecPtr();
 	else
-		vstate.startNSN = gistGetFakeLSN(rel);
+		vstate.startNSN = XLogGetFakeLSN(rel);
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -413,7 +413,7 @@ restart:
 				PageSetLSN(page, recptr);
 			}
 			else
-				PageSetLSN(page, gistGetFakeLSN(rel));
+				PageSetLSN(page, XLogGetFakeLSN(rel));
 
 			END_CRIT_SECTION();
 
@@ -707,7 +707,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(info->index))
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
-		recptr = gistGetFakeLSN(info->index);
+		recptr = XLogGetFakeLSN(info->index);
 	PageSetLSN(parentPage, recptr);
 	PageSetLSN(leafPage, recptr);
 
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index c78383849..ae538dc81 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -421,9 +421,6 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_PAGE_DELETE:
 			gistRedoPageDelete(record);
 			break;
-		case XLOG_GIST_ASSIGN_LSN:
-			/* nop. See gistGetFakeLSN(). */
-			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -567,24 +564,6 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
 	return recptr;
 }
 
-/*
- * Write an empty XLOG record to assign a distinct LSN.
- */
-XLogRecPtr
-gistXLogAssignLSN(void)
-{
-	int			dummy = 0;
-
-	/*
-	 * Records other than XLOG_SWITCH must have content. We use an integer 0
-	 * to follow the restriction.
-	 */
-	XLogBeginInsert();
-	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
-	XLogRegisterData(&dummy, sizeof(dummy));
-	return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
-}
-
 /*
  * Write XLOG record about reuse of a deleted page.
  */
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 79a839cc2..67789e025 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,9 +80,6 @@ gist_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_GIST_PAGE_DELETE:
 			out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
 			break;
-		case XLOG_GIST_ASSIGN_LSN:
-			/* No details to write out */
-			break;
 	}
 }
 
@@ -108,9 +105,6 @@ gist_identify(uint8 info)
 		case XLOG_GIST_PAGE_DELETE:
 			id = "PAGE_DELETE";
 			break;
-		case XLOG_GIST_ASSIGN_LSN:
-			id = "ASSIGN_LSN";
-			break;
 	}
 
 	return id;
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index ff078f222..9044b9521 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -175,6 +175,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		memcpy(&enabled, rec, sizeof(bool));
 		appendStringInfoString(buf, enabled ? "true" : "false");
 	}
+	else if (info == XLOG_ASSIGN_LSN)
+	{
+		/* no further information to print */
+	}
 }
 
 const char *
@@ -229,6 +233,9 @@ xlog_identify(uint8 info)
 		case XLOG_LOGICAL_DECODING_STATUS_CHANGE:
 			id = "LOGICAL_DECODING_STATUS_CHANGE";
 			break;
+		case XLOG_ASSIGN_LSN:
+			id = "ASSIGN_LSN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 81dc86847..c8db33d27 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8228,6 +8228,31 @@ XLogRestorePoint(const char *rpName)
 	return RecPtr;
 }
 
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ *
+ * This is used by index AMs (nbtree, hash, GiST) when building indexes on
+ * permanent relations with wal_level=minimal.  In that scenario, WAL-logging
+ * will start after commit, but the index AM needs distinct LSNs to detect
+ * concurrent page modifications.  When the current WAL insert position hasn't
+ * advanced since the last call, we emit this dummy record to ensure we get a
+ * new, distinct LSN.
+ */
+XLogRecPtr
+XLogAssignLSN(void)
+{
+	int			dummy = 0;
+
+	/*
+	 * Records other than XLOG_SWITCH must have content.  We use an integer 0
+	 * to satisfy this restriction.
+	 */
+	XLogBeginInsert();
+	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+	XLogRegisterData(&dummy, sizeof(dummy));
+	return XLogInsert(RM_XLOG_ID, XLOG_ASSIGN_LSN);
+}
+
 /*
  * Check if any of the GUC parameters that are critical for hot standby
  * have changed, and update the value in pg_control file if necessary.
@@ -8595,6 +8620,10 @@ xlog_redo(XLogReaderState *record)
 	{
 		/* nothing to do here, handled in xlogrecovery.c */
 	}
+	else if (info == XLOG_ASSIGN_LSN)
+	{
+		/* nothing to do here, see XLogGetFakeLSN() */
+	}
 	else if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT)
 	{
 		/*
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 92c48e768..ff83ab5f4 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -547,6 +547,71 @@ XLogSimpleInsertInt64(RmgrId rmid, uint8 info, int64 value)
 	return XLogInsert(rmid, info);
 }
 
+/*
+ * XLogGetFakeLSN - get a fake LSN for an index page that isn't WAL-logged.
+ *
+ * Some index AMs (nbtree, hash, GiST) use LSNs to detect concurrent page
+ * modifications, but not all index pages are WAL-logged.  This function
+ * provides a sequence of fake LSNs for that purpose.
+ *
+ * The behavior depends on the relation's persistence:
+ *
+ * - For temporary relations, we use a simple backend-local counter since
+ *   temporary relations are only accessible within our session.
+ *
+ * - For permanent relations when WAL-logging is disabled (e.g., during index
+ *   creation with wal_level=minimal), we use the current WAL insert position.
+ *   If the insert position hasn't advanced since the last call, we emit a
+ *   dummy WAL record via XLogAssignLSN() to ensure we get a distinct LSN.
+ *
+ * - For unlogged relations, we use the global fake LSN counter maintained
+ *   by GetFakeLSNForUnloggedRel().
+ */
+XLogRecPtr
+XLogGetFakeLSN(Relation rel)
+{
+	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+	{
+		/*
+		 * Temporary relations are only accessible in our session, so a simple
+		 * backend-local counter will do.
+		 */
+		static XLogRecPtr counter = FirstNormalUnloggedLSN;
+
+		return counter++;
+	}
+	else if (RelationIsPermanent(rel))
+	{
+		/*
+		 * WAL-logging on this relation will start after commit, so its LSNs
+		 * must be distinct numbers smaller than the LSN at the next commit.
+		 * Emit a dummy WAL record if insert-LSN hasn't advanced after the
+		 * last call.
+		 */
+		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
+
+		/* Shouldn't be called for WAL-logging relations */
+		Assert(!RelationNeedsWAL(rel));
+
+		/* No need for an actual record if we already have a distinct LSN */
+		if (XLogRecPtrIsValid(lastlsn) && lastlsn == currlsn)
+			currlsn = XLogAssignLSN();
+
+		lastlsn = currlsn;
+		return currlsn;
+	}
+	else
+	{
+		/*
+		 * Unlogged relations are accessible from other backends, and survive
+		 * (clean) restarts.  GetFakeLSNForUnloggedRel() handles that for us.
+		 */
+		Assert(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
+		return GetFakeLSNForUnloggedRel();
+	}
+}
+
 /*
  * Assemble a WAL record from the registered data and buffers into an
  * XLogRecData chain, ready for insertion with XLogInsertRecord().
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f935648a..275cd4032 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4469,13 +4469,13 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
 	 * LSNs since we never emit WAL records for them, and therefore flushing
 	 * up through the buffer LSN would be useless, but harmless.  However,
-	 * GiST indexes use LSNs internally to track page-splits, and therefore
-	 * unlogged GiST pages bear "fake" LSNs generated by
-	 * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake
-	 * LSN counter could advance past the WAL insertion point; and if it did
-	 * happen, attempting to flush WAL through that location would fail, with
-	 * disastrous system-wide consequences.  To make sure that can't happen,
-	 * skip the flush if the buffer isn't permanent.
+	 * some index AMs (nbtree, hash, GiST) use LSNs internally to detect
+	 * concurrent page modifications, and therefore unlogged index pages bear
+	 * "fake" LSNs generated by XLogGetFakeLSN.  It is unlikely but possible
+	 * that the fake LSN counter could advance past the WAL insertion point;
+	 * and if it did happen, attempting to flush WAL through that location
+	 * would fail, with disastrous system-wide consequences.  To make sure
+	 * that can't happen, skip the flush if the buffer isn't permanent.
 	 */
 	if (buf_state & BM_PERMANENT)
 		XLogFlush(recptr);
-- 
2.51.0



  [application/x-patch] v8-0004-Introduce-read_stream_-pause-resume-yield.patch (11.8K, 7-v8-0004-Introduce-read_stream_-pause-resume-yield.patch)
  download | inline diff:
From 2f5c569f70fb83df5e7acf68d53af75871074abd Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Tue, 13 Jan 2026 20:44:14 +0100
Subject: [PATCH v8 4/8] Introduce read_stream_{pause,resume,yield}().

Author: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKGJLT2JvWLEiBXMbkSSc5so_Y7%3DN%2BS2ce7npjLw8QL3d5w%40mail.gmail.com
---
 src/include/storage/read_stream.h            |   3 +
 src/backend/storage/aio/read_stream.c        |  50 ++++-
 src/test/modules/test_aio/Makefile           |   3 +-
 src/test/modules/test_aio/meson.build        |   1 +
 src/test/modules/test_aio/t/001_aio.pl       |  30 +++
 src/test/modules/test_aio/test_aio--1.0.sql  |  13 ++
 src/test/modules/test_aio/test_read_stream.c | 181 +++++++++++++++++++
 src/tools/pgindent/typedefs.list             |   1 +
 8 files changed, 280 insertions(+), 2 deletions(-)
 create mode 100644 src/test/modules/test_aio/test_read_stream.c

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index f2a0cc79c..e3b6bb2f3 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -99,6 +99,9 @@ extern ReadStream *read_stream_begin_smgr_relation(int flags,
 												   ReadStreamBlockNumberCB callback,
 												   void *callback_private_data,
 												   size_t per_buffer_data_size);
+extern BlockNumber read_stream_pause(ReadStream *stream);
+extern void read_stream_resume(ReadStream *stream);
+extern BlockNumber read_stream_yield(ReadStream *stream);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
 
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 88717c2ff..0dbec2abb 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -100,11 +100,13 @@ struct ReadStream
 	int16		pinned_buffers;
 	int16		distance;
 	int16		initialized_buffers;
+	int16		resume_distance;
 	int			read_buffers_flags;
 	bool		sync_mode;		/* using io_method=sync */
 	bool		batch_mode;		/* READ_STREAM_USE_BATCHING */
 	bool		advice_enabled;
 	bool		temporary;
+	bool		yielded;
 
 	/*
 	 * One-block buffer to support 'ungetting' a block number, to resolve flow
@@ -879,7 +881,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 
 		/* End of stream reached?  */
 		if (stream->distance == 0)
-			return InvalidBuffer;
+		{
+			if (!stream->yielded)
+				return InvalidBuffer;
+
+			/* The callback yielded.  Resume. */
+			stream->yielded = false;
+			read_stream_resume(stream);
+			Assert(stream->distance != 0);
+		}
 
 		/*
 		 * The usual order of operations is that we look ahead at the bottom
@@ -1034,6 +1044,44 @@ read_stream_next_block(ReadStream *stream, BufferAccessStrategy *strategy)
 	return read_stream_get_block(stream, NULL);
 }
 
+/*
+ * Temporarily stop consuming block numbers from the block number callback.  If
+ * called inside the block number callback, its return value should be
+ * returned by the callback.
+ */
+BlockNumber
+read_stream_pause(ReadStream *stream)
+{
+	stream->resume_distance = stream->distance;
+	stream->distance = 0;
+	return InvalidBlockNumber;
+}
+
+/*
+ * Resume looking ahead after the block number callback reported end-of-stream.
+ * This is useful for streams of self-referential blocks, after a buffer needed
+ * to be consumed and examined to find more block numbers.
+ */
+void
+read_stream_resume(ReadStream *stream)
+{
+	stream->distance = stream->resume_distance;
+}
+
+/*
+ * Called from inside a block number callback, to return control to the caller
+ * of read_stream_next_buffer() without looking further ahead.  Its return
+ * value should be returned by the callback.  This is equivalent to pausing and
+ * resuming automatically at the next call to read_stream_next_buffer().
+ */
+BlockNumber
+read_stream_yield(ReadStream *stream)
+{
+	read_stream_pause(stream);
+	stream->yielded = true;
+	return InvalidBlockNumber;
+}
+
 /*
  * Reset a read stream by releasing any queued up buffers, allowing the stream
  * to be used again for different blocks.  This can be used to clear an
diff --git a/src/test/modules/test_aio/Makefile b/src/test/modules/test_aio/Makefile
index f53cc6467..465eb09ee 100644
--- a/src/test/modules/test_aio/Makefile
+++ b/src/test/modules/test_aio/Makefile
@@ -5,7 +5,8 @@ PGFILEDESC = "test_aio - test code for AIO"
 MODULE_big = test_aio
 OBJS = \
 	$(WIN32RES) \
-	test_aio.o
+	test_aio.o \
+	test_read_stream.o
 
 EXTENSION = test_aio
 DATA = test_aio--1.0.sql
diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
index fefa25bc5..d13fda219 100644
--- a/src/test/modules/test_aio/meson.build
+++ b/src/test/modules/test_aio/meson.build
@@ -2,6 +2,7 @@
 
 test_aio_sources = files(
   'test_aio.c',
+  'test_read_stream.c',
 )
 
 if host_system == 'windows'
diff --git a/src/test/modules/test_aio/t/001_aio.pl b/src/test/modules/test_aio/t/001_aio.pl
index 5c634ec3c..5af558bf4 100644
--- a/src/test/modules/test_aio/t/001_aio.pl
+++ b/src/test/modules/test_aio/t/001_aio.pl
@@ -1489,6 +1489,35 @@ SELECT read_rel_block_ll('tbl_cs_fail', 3, nblocks=>1, zero_on_error=>true);),
 	$psql->quit();
 }
 
+# Read stream tests
+sub test_read_stream
+{
+	my $io_method = shift;
+	my $node = shift;
+	my ($ret, $output);
+
+	my $psql = $node->background_psql('postgres', on_error_stop => 0);
+
+	$psql->query_safe(
+		qq(
+CREATE TEMPORARY TABLE tmp_read_stream(data int not null);
+INSERT INTO tmp_read_stream SELECT generate_series(1, 10000);
+SELECT test_read_stream_resume('tmp_read_stream', 0);
+DROP TABLE tmp_read_stream;
+));
+
+	$psql->query_safe(
+		qq(
+CREATE TEMPORARY TABLE tmp_read_stream(data int not null);
+INSERT INTO tmp_read_stream SELECT generate_series(1, 10000);
+SELECT test_read_stream_yield('tmp_read_stream', 0);
+DROP TABLE tmp_read_stream;
+));
+
+	$psql->quit();
+}
+
+
 
 # Run all tests that are supported for all io_methods
 sub test_generic
@@ -1525,6 +1554,7 @@ CHECKPOINT;
 	test_checksum($io_method, $node);
 	test_ignore_checksum($io_method, $node);
 	test_checksum_createdb($io_method, $node);
+	test_read_stream($io_method, $node);
 
   SKIP:
 	{
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
index e495481c4..e37810b72 100644
--- a/src/test/modules/test_aio/test_aio--1.0.sql
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -106,3 +106,16 @@ AS 'MODULE_PATHNAME' LANGUAGE C;
 CREATE FUNCTION inj_io_reopen_detach()
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+
+/*
+ * Read stream related functions
+ */
+CREATE FUNCTION test_read_stream_resume(rel regclass, blockno int4)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_read_stream_yield(rel regclass, blockno int4)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_aio/test_read_stream.c b/src/test/modules/test_aio/test_read_stream.c
new file mode 100644
index 000000000..d1d436a90
--- /dev/null
+++ b/src/test/modules/test_aio/test_read_stream.c
@@ -0,0 +1,181 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_read_stream.c
+ *		Helpers to write tests for read_stream.c
+ *
+ * Copyright (c) 2020-2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/test/modules/test_aio/test_read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "fmgr.h"
+#include "storage/bufmgr.h"
+#include "storage/read_stream.h"
+
+typedef struct
+{
+	BlockNumber blkno;
+	int			count;
+} test_read_stream_resume_state;
+
+static BlockNumber
+test_read_stream_resume_cb(ReadStream *stream,
+						   void *callback_private_data,
+						   void *per_buffer_data)
+{
+	test_read_stream_resume_state *state = callback_private_data;
+
+	/* Periodic end-of-stream. */
+	if (++state->count % 3 == 0)
+		return read_stream_pause(stream);
+
+	return state->blkno;
+}
+
+/*
+ * Test read_stream_resume(), allowing a stream to end temporarily and then
+ * continue where it left off.
+ */
+PG_FUNCTION_INFO_V1(test_read_stream_resume);
+Datum
+test_read_stream_resume(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	BlockNumber blkno = PG_GETARG_UINT32(1);
+	Relation	rel;
+	Buffer		buf;
+	ReadStream *stream;
+	test_read_stream_resume_state state = {.blkno = blkno};
+
+	rel = relation_open(relid, AccessShareLock);
+	stream = read_stream_begin_relation(READ_STREAM_DEFAULT,
+										NULL,
+										rel,
+										MAIN_FORKNUM,
+										test_read_stream_resume_cb,
+										&state,
+										0);
+
+	for (int i = 0; i < 3; ++i)
+	{
+		/* Same block twice. */
+		buf = read_stream_next_buffer(stream, NULL);
+		Assert(BufferGetBlockNumber(buf) == blkno);
+		ReleaseBuffer(buf);
+		buf = read_stream_next_buffer(stream, NULL);
+		Assert(BufferGetBlockNumber(buf) == blkno);
+		ReleaseBuffer(buf);
+
+		/* End-of-stream. */
+		buf = read_stream_next_buffer(stream, NULL);
+		Assert(buf == InvalidBuffer);
+		buf = read_stream_next_buffer(stream, NULL);
+		Assert(buf == InvalidBuffer);
+
+		/* Resume. */
+		read_stream_resume(stream);
+	}
+
+	read_stream_end(stream);
+	relation_close(rel, NoLock);
+
+	PG_RETURN_VOID();
+}
+
+typedef struct
+{
+	BlockNumber blkno;
+	int			count;
+	int			yields;
+	int			blocks;
+}			test_read_stream_yield_state;
+
+static BlockNumber
+test_read_stream_yield_cb(ReadStream *stream,
+						  void *callback_private_data,
+						  void *per_buffer_data)
+{
+	test_read_stream_yield_state *state = callback_private_data;
+
+	/* Yield every third call. */
+	if (++state->count % 3 == 2)
+	{
+		state->yields++;
+		return read_stream_yield(stream);
+	}
+
+	state->blocks++;
+	return state->blkno;
+}
+
+/*
+ * Test read_stream_yield(), allowing control to be yielded temporarily from
+ * the lookahead loop and returned to the caller of read_stream_next_buffer().
+ */
+PG_FUNCTION_INFO_V1(test_read_stream_yield);
+Datum
+test_read_stream_yield(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	BlockNumber blkno = PG_GETARG_UINT32(1);
+	Relation	rel;
+	Buffer		buf;
+	ReadStream *stream;
+	test_read_stream_yield_state state = {.blkno = blkno};
+
+	rel = relation_open(relid, AccessShareLock);
+	stream = read_stream_begin_relation(READ_STREAM_DEFAULT,
+										NULL,
+										rel,
+										MAIN_FORKNUM,
+										test_read_stream_yield_cb,
+										&state,
+										0);
+
+	buf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferGetBlockNumber(buf) == blkno);
+	ReleaseBuffer(buf);
+	Assert(state.blocks == 1);
+	Assert(state.yields == 1);
+
+	buf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferGetBlockNumber(buf) == blkno);
+	ReleaseBuffer(buf);
+	Assert(state.blocks == 3);
+	Assert(state.yields == 1);
+
+	buf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferGetBlockNumber(buf) == blkno);
+	ReleaseBuffer(buf);
+	Assert(state.blocks == 3);
+	Assert(state.yields == 2);
+
+	buf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferGetBlockNumber(buf) == blkno);
+	ReleaseBuffer(buf);
+	Assert(state.blocks == 5);
+	Assert(state.yields == 2);
+
+	buf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferGetBlockNumber(buf) == blkno);
+	ReleaseBuffer(buf);
+	Assert(state.blocks == 5);
+	Assert(state.yields == 3);
+
+	buf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferGetBlockNumber(buf) == blkno);
+	ReleaseBuffer(buf);
+	Assert(state.blocks == 7);
+	Assert(state.yields == 3);
+
+	read_stream_end(stream);
+	relation_close(rel, NoLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c3f2a9fc8..fa84abfc5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4195,6 +4195,7 @@ teSection
 temp_tablespaces_extra
 test128
 test_re_flags
+test_read_stream_resume_state
 test_regex_ctx
 test_shm_mq_header
 test_spec
-- 
2.51.0



  [application/x-patch] v8-0002-Add-fake-LSN-support-to-nbtree.patch (11.9K, 8-v8-0002-Add-fake-LSN-support-to-nbtree.patch)
  download | inline diff:
From 87d50424f56d33163120e6a654b312041e2b321c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sun, 18 Jan 2026 11:14:36 -0500
Subject: [PATCH v8 2/8] Add fake LSN support to nbtree.

This is preparation for an upcoming commit that will add the amgetbatch
interface and switch nbtree over to it (from amgettuple).  We need fake
LSNs to make it safe to apply the dropPin behavior to nbtree scans of
unlogged indexes (actually, to apply the behavior with any scan of an
unlogged index that uses the new amgetbatch interface, no matter the
index AM).

Currently, unlogged nbtree indexes need to hold on to a leaf page buffer
pin when stopped on that leaf page, purely so that the _bt_killitems
process has a way that it can be sure that there wasn't any unsafe
concurrent TID recycling by VACUUM.  The _bt_killitems' dropPin strategy
couldn't be used before now, since that works by checking if the page
LSN has changed in the period after _bt_readpage read the page's items,
but before _bt_killitems was called.  It is now possible to use the same
LSN trick with unlogged indexes (but we don't do that just yet).

The upcoming amgetbatch commit will make _bt_killitems check if the page
LSN has changed for logged and unlogged indexes alike.  In fact, it will
remove the need for index scans (including index-only scans) to hold on
to a leaf page buffer pin (assuming an index AM that uses the new
amgetbatch interface, and barring scans that use a non-MVCC snapshot).

Author: Peter Geoghegan <[email protected]>
---
 src/backend/access/nbtree/nbtdedup.c  |  8 ++-
 src/backend/access/nbtree/nbtinsert.c | 48 +++++++++-------
 src/backend/access/nbtree/nbtpage.c   | 82 +++++++++++++++------------
 3 files changed, 78 insertions(+), 60 deletions(-)

diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 95be0b179..af7affdf4 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -69,6 +69,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 	Size		pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
 	bool		singlevalstrat = false;
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	XLogRecPtr	recptr;
 
 	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
 	newitemsz += sizeof(ItemIdData);
@@ -245,7 +246,6 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 	/* XLOG stuff */
 	if (RelationNeedsWAL(rel))
 	{
-		XLogRecPtr	recptr;
 		xl_btree_dedup xlrec_dedup;
 
 		xlrec_dedup.nintervals = state->nintervals;
@@ -263,9 +263,11 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 							state->nintervals * sizeof(BTDedupInterval));
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
-		PageSetLSN(page, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(page, recptr);
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 63eda08f7..06abdf524 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1125,6 +1125,7 @@ _bt_insertonpg(Relation rel,
 	IndexTuple	oposting = NULL;
 	IndexTuple	origitup = NULL;
 	IndexTuple	nposting = NULL;
+	XLogRecPtr	recptr;
 
 	page = BufferGetPage(buf);
 	opaque = BTPageGetOpaque(page);
@@ -1322,7 +1323,6 @@ _bt_insertonpg(Relation rel,
 			xl_btree_insert xlrec;
 			xl_btree_metadata xlmeta;
 			uint8		xlinfo;
-			XLogRecPtr	recptr;
 			uint16		upostingoff;
 
 			xlrec.offnum = newitemoff;
@@ -1395,14 +1395,16 @@ _bt_insertonpg(Relation rel,
 			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
-
-			if (BufferIsValid(metabuf))
-				PageSetLSN(metapg, recptr);
-			if (!isleaf)
-				PageSetLSN(BufferGetPage(cbuf), recptr);
-
-			PageSetLSN(page, recptr);
 		}
+		else
+			recptr = XLogGetFakeLSN(rel);
+
+		if (BufferIsValid(metabuf))
+			PageSetLSN(metapg, recptr);
+		if (!isleaf)
+			PageSetLSN(BufferGetPage(cbuf), recptr);
+
+		PageSetLSN(page, recptr);
 
 		END_CRIT_SECTION();
 
@@ -1504,6 +1506,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
 	bool		newitemonleft,
 				isleaf,
 				isrightmost;
+	XLogRecPtr	recptr;
 
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
@@ -1983,7 +1986,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
 	{
 		xl_btree_split xlrec;
 		uint8		xlinfo;
-		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo_level;
 		/* See comments below on newitem, orignewitem, and posting lists */
@@ -2067,14 +2069,16 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
 
 		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
-
-		PageSetLSN(origpage, recptr);
-		PageSetLSN(rightpage, recptr);
-		if (!isrightmost)
-			PageSetLSN(spage, recptr);
-		if (!isleaf)
-			PageSetLSN(BufferGetPage(cbuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(origpage, recptr);
+	PageSetLSN(rightpage, recptr);
+	if (!isrightmost)
+		PageSetLSN(spage, recptr);
+	if (!isleaf)
+		PageSetLSN(BufferGetPage(cbuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -2476,6 +2480,7 @@ _bt_newlevel(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	XLogRecPtr	recptr;
 
 	lbkno = BufferGetBlockNumber(lbuf);
 	rbkno = BufferGetBlockNumber(rbuf);
@@ -2571,7 +2576,6 @@ _bt_newlevel(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
 	if (RelationNeedsWAL(rel))
 	{
 		xl_btree_newroot xlrec;
-		XLogRecPtr	recptr;
 		xl_btree_metadata md;
 
 		xlrec.rootblk = rootblknum;
@@ -2605,11 +2609,13 @@ _bt_newlevel(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
 							((PageHeader) rootpage)->pd_upper);
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT);
-
-		PageSetLSN(lpage, recptr);
-		PageSetLSN(rootpage, recptr);
-		PageSetLSN(metapg, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(lpage, recptr);
+	PageSetLSN(rootpage, recptr);
+	PageSetLSN(metapg, recptr);
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4125c185e..70b524dfe 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -235,6 +235,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	XLogRecPtr	recptr;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -286,7 +287,6 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	if (RelationNeedsWAL(rel))
 	{
 		xl_btree_metadata md;
-		XLogRecPtr	recptr;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
@@ -303,9 +303,11 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 		XLogRegisterBufData(0, &md, sizeof(xl_btree_metadata));
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_META_CLEANUP);
-
-		PageSetLSN(metapg, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(metapg, recptr);
 
 	END_CRIT_SECTION();
 
@@ -351,6 +353,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 	BlockNumber rootblkno;
 	uint32		rootlevel;
 	BTMetaPageData *metad;
+	XLogRecPtr	recptr;
 
 	Assert(access == BT_READ || heaprel != NULL);
 
@@ -473,7 +476,6 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 		if (RelationNeedsWAL(rel))
 		{
 			xl_btree_newroot xlrec;
-			XLogRecPtr	recptr;
 			xl_btree_metadata md;
 
 			XLogBeginInsert();
@@ -497,10 +499,12 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 			XLogRegisterData(&xlrec, SizeOfBtreeNewroot);
 
 			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT);
-
-			PageSetLSN(rootpage, recptr);
-			PageSetLSN(metapg, recptr);
 		}
+		else
+			recptr = XLogGetFakeLSN(rel);
+
+		PageSetLSN(rootpage, recptr);
+		PageSetLSN(metapg, recptr);
 
 		END_CRIT_SECTION();
 
@@ -1162,6 +1166,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	XLogRecPtr	recptr;
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1226,7 +1231,6 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	/* XLOG stuff */
 	if (needswal)
 	{
-		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
@@ -1248,9 +1252,11 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
-
-		PageSetLSN(page, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(page, recptr);
 
 	END_CRIT_SECTION();
 
@@ -1292,6 +1298,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	XLogRecPtr	recptr;
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1342,7 +1349,6 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	/* XLOG stuff */
 	if (needswal)
 	{
-		XLogRecPtr	recptr;
 		xl_btree_delete xlrec_delete;
 
 		xlrec_delete.snapshotConflictHorizon = snapshotConflictHorizon;
@@ -1366,9 +1372,11 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE);
-
-		PageSetLSN(page, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(page, recptr);
 
 	END_CRIT_SECTION();
 
@@ -2103,6 +2111,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	XLogRecPtr	recptr;
 
 	page = BufferGetPage(leafbuf);
 	opaque = BTPageGetOpaque(page);
@@ -2253,7 +2262,6 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
 	if (RelationNeedsWAL(rel))
 	{
 		xl_btree_mark_page_halfdead xlrec;
-		XLogRecPtr	recptr;
 
 		xlrec.poffset = poffset;
 		xlrec.leafblk = leafblkno;
@@ -2274,12 +2282,14 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
 		XLogRegisterData(&xlrec, SizeOfBtreeMarkPageHalfDead);
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_MARK_PAGE_HALFDEAD);
-
-		page = BufferGetPage(subtreeparent);
-		PageSetLSN(page, recptr);
-		page = BufferGetPage(leafbuf);
-		PageSetLSN(page, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	page = BufferGetPage(subtreeparent);
+	PageSetLSN(page, recptr);
+	page = BufferGetPage(leafbuf);
+	PageSetLSN(page, recptr);
 
 	END_CRIT_SECTION();
 
@@ -2337,6 +2347,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	XLogRecPtr	recptr;
 
 	page = BufferGetPage(leafbuf);
 	opaque = BTPageGetOpaque(page);
@@ -2676,7 +2687,6 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
 		uint8		xlinfo;
-		XLogRecPtr	recptr;
 
 		XLogBeginInsert();
 
@@ -2720,25 +2730,25 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 			xlinfo = XLOG_BTREE_UNLINK_PAGE;
 
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+	}
+	else
+		recptr = XLogGetFakeLSN(rel);
 
-		if (BufferIsValid(metabuf))
-		{
-			PageSetLSN(metapg, recptr);
-		}
-		page = BufferGetPage(rbuf);
+	if (BufferIsValid(metabuf))
+		PageSetLSN(metapg, recptr);
+	page = BufferGetPage(rbuf);
+	PageSetLSN(page, recptr);
+	page = BufferGetPage(buf);
+	PageSetLSN(page, recptr);
+	if (BufferIsValid(lbuf))
+	{
+		page = BufferGetPage(lbuf);
 		PageSetLSN(page, recptr);
-		page = BufferGetPage(buf);
+	}
+	if (target != leafblkno)
+	{
+		page = BufferGetPage(leafbuf);
 		PageSetLSN(page, recptr);
-		if (BufferIsValid(lbuf))
-		{
-			page = BufferGetPage(lbuf);
-			PageSetLSN(page, recptr);
-		}
-		if (target != leafblkno)
-		{
-			page = BufferGetPage(leafbuf);
-			PageSetLSN(page, recptr);
-		}
 	}
 
 	END_CRIT_SECTION();
-- 
2.51.0



  [application/x-patch] v8-0003-Add-batching-interfaces-used-by-heapam-and-nbtree.patch (189.5K, 9-v8-0003-Add-batching-interfaces-used-by-heapam-and-nbtree.patch)
  download | inline diff:
From 13ea58132e05a6fd2597a686c9963534368b7dce Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 9 Sep 2025 19:50:03 -0400
Subject: [PATCH v8 3/8] Add batching interfaces used by heapam and nbtree.

Add a new amgetbatch index AM interface that allows index access methods
to implement plain/ordered index scans that return index entries in
per-leaf-page batches, rather than one at a time.  This enables a
variety of optimizations on the table AM side, most notably I/O
prefetching of heap tuples during ordered index scans.  It will also
enable an optimization that has heapam avoid repeatedly locking and
unlocking the same heap page's buffer.

Index access methods that support plain index scans must now implement
either the amgetbatch interface OR the amgettuple interface.  The
amgettuple interface will still be used by index AMs that require direct
control over the progress of index scans (e.g., GiST with KNN ordered
scans).

This commit also adds a new table AM interface callback, called by the
core executor through the new table_index_getnext_slot shim function.
This allows the table AM to directly manage the progress of index scans
rather than having individual TIDs passed in by the caller. The
amgetbatch interface is tightly coupled with the new approach to ordered
index scans added to the table AM.  The table AM can apply knowledge of
which TIDs will be returned to scan in the near future to perform I/O
prefetching.  Prefetching will be added by an upcoming commit.

Batches returned from amgetbatch are guaranteed to be associated with an
index page containing at least one matching tuple.  The amgetbatch
interface returns batches that hold a buffer pin on an index page that
can be used by the table AM as an interlock against concurrent TID
recycling by VACUUM.  In practice heapam only actually holds on to such
a pin for an instant, except during scans that use a non-MVCC snapshot,
where we continue to need to hold the pin until all of the batch's TIDs
have been fetched from the heap.

This extends the dropPin mechanism added to nbtree by commit 2ed5b87f,
and generalizes it to work with all index AMs that support the new
amgetbatch interface.  We'll always drop index page pins eagerly with
unlogged relations and during index-only scans now, provided the scan
uses an MVCC snapshot (the nbtree dropPin feature could do neither).
We're able to do this with unlogged relations by requiring that
amgetbatch index AMs use fake LSNs (or, alternatively, not support
LP_DEAD marking dead index tuples at all).  We're able to do this during
index-only scans due to heapam now saving information from the
visibility map up front, in a per-item batch cache.  Index-only scans
still need to hold onto the index page pin for as long as it takes to
populate the visibility cache, but that is a very fast process, and one
that never has to wait for some other process to yield.  (The need to
delay dropping index page pins until the batch's VM cache can be filled
like this is one reason why table AMs now fully control when and where
index page pins associated with a batch are dropped.)

The upcoming commit that adds index prefetching will use a read stream to
read heap pages during index scans.  Read stream is careful to limit how
many things it pins, lest we get errors about having too many buffers
pinned.  Simply never holding on to index page buffer pins greatly
simplifies resource management for index prefetching; there's no risk of
unintended interactions between the read stream and index AM.  (The only
downside is that we cannot support prefetching during scans that use a
non-MVCC snapshot, which seems quite acceptable.)

Author: Tomas Vondra <[email protected]>
Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/[email protected]
Discussion: https://postgr.es/m/efac3238-6f34-41ea-a393-26cc0441b506%40vondra.me
Discussion: https://postgr.es/m/CAH2-Wzk9%3Dx%3Da2TbcqYcX%2BXXmDHQr5%3D1v9m4Z_v8a-KwF1Zoz0A%40mail.gmail.com
---
 src/include/access/amapi.h                    |  22 +-
 src/include/access/genam.h                    |  28 +-
 src/include/access/heapam.h                   |   3 +
 src/include/access/nbtree.h                   | 176 ++----
 src/include/access/relscan.h                  | 302 +++++++++-
 src/include/access/tableam.h                  |  41 ++
 src/include/executor/instrument_node.h        |   6 +
 src/include/nodes/execnodes.h                 |   2 -
 src/include/nodes/pathnodes.h                 |   2 +-
 src/backend/access/brin/brin.c                |   5 +-
 src/backend/access/gin/ginget.c               |   6 +-
 src/backend/access/gin/ginutil.c              |   5 +-
 src/backend/access/gist/gist.c                |   5 +-
 src/backend/access/hash/hash.c                |   5 +-
 src/backend/access/heap/heapam_handler.c      | 537 ++++++++++++++++-
 src/backend/access/index/Makefile             |   3 +-
 src/backend/access/index/genam.c              |  13 +-
 src/backend/access/index/indexam.c            | 144 ++---
 src/backend/access/index/indexbatch.c         | 544 ++++++++++++++++++
 src/backend/access/index/meson.build          |   1 +
 src/backend/access/nbtree/README              |   6 +-
 src/backend/access/nbtree/nbtpage.c           |   3 +
 src/backend/access/nbtree/nbtreadpage.c       | 195 +++----
 src/backend/access/nbtree/nbtree.c            | 301 +++-------
 src/backend/access/nbtree/nbtsearch.c         | 510 ++++++----------
 src/backend/access/nbtree/nbtutils.c          | 141 +----
 src/backend/access/spgist/spgutils.c          |   5 +-
 src/backend/commands/explain.c                |  23 +-
 src/backend/commands/indexcmds.c              |   2 +-
 src/backend/executor/execAmi.c                |   2 +-
 src/backend/executor/execIndexing.c           |   6 +-
 src/backend/executor/execReplication.c        |   8 +-
 src/backend/executor/nodeBitmapIndexscan.c    |   1 +
 src/backend/executor/nodeIndexonlyscan.c      | 108 +---
 src/backend/executor/nodeIndexscan.c          |  13 +-
 src/backend/optimizer/path/indxpath.c         |   2 +-
 src/backend/optimizer/util/plancat.c          |   6 +-
 src/backend/replication/logical/relation.c    |   3 +-
 src/backend/utils/adt/amutils.c               |   8 +-
 src/backend/utils/adt/selfuncs.c              |  68 +--
 contrib/bloom/blutils.c                       |   5 +-
 .../expected/index-only-scan-visibility.out   |  32 ++
 src/test/isolation/isolation_schedule         |   1 +
 .../specs/index-only-scan-visibility.spec     |  84 +++
 .../modules/dummy_index_am/dummy_index_am.c   |   5 +-
 src/tools/pgindent/typedefs.list              |  10 +-
 46 files changed, 2159 insertions(+), 1239 deletions(-)
 create mode 100644 src/backend/access/index/indexbatch.c
 create mode 100644 src/test/isolation/expected/index-only-scan-visibility.out
 create mode 100644 src/test/isolation/specs/index-only-scan-visibility.spec

diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index ecfbd017d..85d2746ec 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -198,6 +198,15 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 									 ScanDirection direction);
 
+/* next batch of valid tuples */
+typedef IndexScanBatch (*amgetbatch_function) (IndexScanDesc scan,
+											   IndexScanBatch priorbatch,
+											   ScanDirection direction);
+
+/* release batch of valid tuples */
+typedef void (*amfreebatch_function) (IndexScanDesc scan,
+									  IndexScanBatch batch);
+
 /* fetch all valid tuples */
 typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 									   TIDBitmap *tbm);
@@ -205,11 +214,9 @@ typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 /* end index scan */
 typedef void (*amendscan_function) (IndexScanDesc scan);
 
-/* mark current scan position */
-typedef void (*ammarkpos_function) (IndexScanDesc scan);
-
-/* restore marked scan position */
-typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* invalidate index AM state that independently tracks scan's position */
+typedef void (*amposreset_function) (IndexScanDesc scan,
+									 IndexScanBatch batch);
 
 /*
  * Callback function signatures - for parallel index scans.
@@ -309,10 +316,11 @@ typedef struct IndexAmRoutine
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
 	amgettuple_function amgettuple; /* can be NULL */
+	amgetbatch_function amgetbatch; /* can be NULL */
+	amfreebatch_function amfreebatch;	/* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
-	ammarkpos_function ammarkpos;	/* can be NULL */
-	amrestrpos_function amrestrpos; /* can be NULL */
+	amposreset_function amposreset; /* can be NULL */
 
 	/* interface functions to support parallel index scans */
 	amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 4c0429cc6..86690ab50 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -94,6 +94,7 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 
 /* struct definitions appear in relscan.h */
 typedef struct IndexScanDescData *IndexScanDesc;
+typedef struct IndexScanBatchData *IndexScanBatch;
 typedef struct SysScanDescData *SysScanDesc;
 
 typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
@@ -154,6 +155,7 @@ extern void index_insert_cleanup(Relation indexRelation,
 
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
+									 bool xs_want_itup,
 									 Snapshot snapshot,
 									 IndexScanInstrumentation *instrument,
 									 int nkeys, int norderbys);
@@ -180,14 +182,12 @@ extern void index_parallelscan_initialize(Relation heapRelation,
 extern void index_parallelrescan(IndexScanDesc scan);
 extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel,
+											  bool xs_want_itup,
 											  IndexScanInstrumentation *instrument,
 											  int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
-extern bool index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot);
-extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-							   TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
@@ -251,4 +251,26 @@ extern void systable_inplace_update_begin(Relation relation,
 extern void systable_inplace_update_finish(void *state, HeapTuple tuple);
 extern void systable_inplace_update_cancel(void *state);
 
+/*
+ * amgetbatch utilities called by indexam.c (in indexbatch.c)
+ */
+extern void index_batchscan_init(IndexScanDesc scan);
+extern void index_batchscan_reset(IndexScanDesc scan, bool complete);
+extern void index_batchscan_end(IndexScanDesc scan);
+extern void index_batchscan_mark_pos(IndexScanDesc scan);
+extern void index_batchscan_restore_pos(IndexScanDesc scan);
+
+/*
+ * amgetbatch utilities called by table AMs (in indexbatch.c)
+ */
+extern void tableam_util_kill_scanpositem(IndexScanDesc scan);
+extern void tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch);
+
+/*
+ * amgetbatch utilities called by index AMs (in indexbatch.c)
+ */
+extern void indexam_util_batch_unlock(IndexScanDesc scan, IndexScanBatch batch);
+extern IndexScanBatch indexam_util_batch_alloc(IndexScanDesc scan);
+extern void indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch);
+
 #endif							/* GENAM_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3c0961ab3..494f36ecc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -117,7 +117,10 @@ typedef struct IndexFetchHeapData
 	IndexFetchTableData xs_base;	/* AM independent part of the descriptor */
 
 	Buffer		xs_cbuf;		/* current heap buffer in scan, if any */
+	BlockNumber xs_blk;			/* xs_cbuf's block number, if any */
 	/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+
+	Buffer		vmbuf;			/* visibility map buffer */
 } IndexFetchHeapData;
 
 /* Result codes for HeapTupleSatisfiesVacuum */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 772248596..5e01e2e7f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -924,112 +924,6 @@ typedef struct BTVacuumPostingData
 
 typedef BTVacuumPostingData *BTVacuumPosting;
 
-/*
- * BTScanOpaqueData is the btree-private state needed for an indexscan.
- * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
- * details of the preprocessing), information about the current location
- * of the scan, and information about the marked location, if any.  (We use
- * BTScanPosData to represent the data needed for each of current and marked
- * locations.)	In addition we can remember some known-killed index entries
- * that must be marked before we can move off the current page.
- *
- * Index scans work a page at a time: we pin and read-lock the page, identify
- * all the matching items on the page and save them in BTScanPosData, then
- * release the read-lock while returning the items to the caller for
- * processing.  This approach minimizes lock/unlock traffic.  We must always
- * drop the lock to make it okay for caller to process the returned items.
- * Whether or not we can also release the pin during this window will vary.
- * We drop the pin (when so->dropPin) to avoid blocking progress by VACUUM
- * (see nbtree/README section about making concurrent TID recycling safe).
- * We'll always release both the lock and the pin on the current page before
- * moving on to its sibling page.
- *
- * If we are doing an index-only scan, we save the entire IndexTuple for each
- * matched item, otherwise only its heap TID and offset.  The IndexTuples go
- * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.  Posting list tuples store a "base" tuple once,
- * allowing the same key to be returned for each TID in the posting list
- * tuple.
- */
-
-typedef struct BTScanPosItem	/* what we remember about each match */
-{
-	ItemPointerData heapTid;	/* TID of referenced heap item */
-	OffsetNumber indexOffset;	/* index item's location within page */
-	LocationIndex tupleOffset;	/* IndexTuple's offset in workspace, if any */
-} BTScanPosItem;
-
-typedef struct BTScanPosData
-{
-	Buffer		buf;			/* currPage buf (invalid means unpinned) */
-
-	/* page details as of the saved position's call to _bt_readpage */
-	BlockNumber currPage;		/* page referenced by items array */
-	BlockNumber prevPage;		/* currPage's left link */
-	BlockNumber nextPage;		/* currPage's right link */
-	XLogRecPtr	lsn;			/* currPage's LSN (when so->dropPin) */
-
-	/* scan direction for the saved position's call to _bt_readpage */
-	ScanDirection dir;
-
-	/*
-	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
-	 */
-	int			nextTupleOffset;
-
-	/*
-	 * moreLeft and moreRight track whether we think there may be matching
-	 * index entries to the left and right of the current page, respectively.
-	 */
-	bool		moreLeft;
-	bool		moreRight;
-
-	/*
-	 * The items array is always ordered in index order (ie, increasing
-	 * indexoffset).  When scanning backwards it is convenient to fill the
-	 * array back-to-front, so we start at the last slot and fill downwards.
-	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
-	 * itemIndex is a cursor showing which entry was last returned to caller.
-	 */
-	int			firstItem;		/* first valid index in items[] */
-	int			lastItem;		/* last valid index in items[] */
-	int			itemIndex;		/* current index in items[] */
-
-	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
-} BTScanPosData;
-
-typedef BTScanPosData *BTScanPos;
-
-#define BTScanPosIsPinned(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BufferIsValid((scanpos).buf) \
-)
-#define BTScanPosUnpin(scanpos) \
-	do { \
-		ReleaseBuffer((scanpos).buf); \
-		(scanpos).buf = InvalidBuffer; \
-	} while (0)
-#define BTScanPosUnpinIfPinned(scanpos) \
-	do { \
-		if (BTScanPosIsPinned(scanpos)) \
-			BTScanPosUnpin(scanpos); \
-	} while (0)
-
-#define BTScanPosIsValid(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BlockNumberIsValid((scanpos).currPage) \
-)
-#define BTScanPosInvalidate(scanpos) \
-	do { \
-		(scanpos).buf = InvalidBuffer; \
-		(scanpos).currPage = InvalidBlockNumber; \
-	} while (0)
-
 /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
 typedef struct BTArrayKeyInfo
 {
@@ -1050,6 +944,30 @@ typedef struct BTArrayKeyInfo
 	ScanKey		high_compare;	/* array's < or <= upper bound */
 } BTArrayKeyInfo;
 
+/*
+ * BTScanOpaqueData is the btree-private state needed for an indexscan.
+ * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
+ * details of the preprocessing), and information about the current array
+ * keys.  There are assumptions about how the current array keys track the
+ * progress of the index scan through the index's key space (see _bt_readpage
+ * and _bt_advance_array_keys), but we don't actually track anything about the
+ * current scan position in this opaque struct.  That is tracked externally,
+ * by implementing a ring buffer of "batches", where each batch represents the
+ * items returned by btgetbatch within a single leaf page.
+ *
+ * Index scans work a page at a time, as required by the amgetbatch contract:
+ * we pin and read-lock the page, identify all the matching items on the page
+ * and return them in a newly allocated batch.  We then release the read-lock
+ * using amgetbatch utility routines.  This approach minimizes lock/unlock
+ * traffic. _bt_next is passed priorbatch, which contains details of which
+ * page is next in line to be read (priorbatch is provided as an argument to
+ * btgetbatch by core code).
+ *
+ * If we are doing an index-only scan, we save the entire IndexTuple for each
+ * matched item, otherwise only its heap TID and offset.  This is also per the
+ * amgetbatch contract.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each TID in the posting list.
+ */
 typedef struct BTScanOpaqueData
 {
 	/* these fields are set by _bt_preprocess_keys(): */
@@ -1066,32 +984,6 @@ typedef struct BTScanOpaqueData
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	FmgrInfo   *orderProcs;		/* ORDER procs for required equality keys */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
-
-	/* info about killed items if any (killedItems is NULL if never used) */
-	int		   *killedItems;	/* currPos.items indexes of killed items */
-	int			numKilled;		/* number of currently stored items */
-	bool		dropPin;		/* drop leaf pin before btgettuple returns? */
-
-	/*
-	 * If we are doing an index-only scan, these are the tuple storage
-	 * workspaces for the currPos and markPos respectively.  Each is of size
-	 * BLCKSZ, so it can hold as much as a full page's worth of tuples.
-	 */
-	char	   *currTuples;		/* tuple storage for currPos */
-	char	   *markTuples;		/* tuple storage for markPos */
-
-	/*
-	 * If the marked position is on the same page as current position, we
-	 * don't use markPos, but just keep the marked itemIndex in markItemIndex
-	 * (all the rest of currPos is valid for the mark position). Hence, to
-	 * determine if there is a mark, first look at markItemIndex, then at
-	 * markPos.
-	 */
-	int			markItemIndex;	/* itemIndex, or -1 if not valid */
-
-	/* keep these last in struct for efficiency */
-	BTScanPosData currPos;		/* current position data */
-	BTScanPosData markPos;		/* marked position, if any */
 } BTScanOpaqueData;
 
 typedef BTScanOpaqueData *BTScanOpaque;
@@ -1160,14 +1052,16 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
 extern Size btestimateparallelscan(Relation rel, int nkeys, int norderbys);
 extern void btinitparallelscan(void *target);
-extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch btgetbatch(IndexScanDesc scan,
+								 IndexScanBatch priorbatch,
+								 ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					 ScanKey orderbys, int norderbys);
+extern void btfreebatch(IndexScanDesc scan, IndexScanBatch batch);
 extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
-extern void btmarkpos(IndexScanDesc scan);
-extern void btrestrpos(IndexScanDesc scan);
+extern void btposreset(IndexScanDesc scan, IndexScanBatch markbatch);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
@@ -1271,8 +1165,9 @@ extern void _bt_preprocess_keys(IndexScanDesc scan);
 /*
  * prototypes for functions in nbtreadpage.c
  */
-extern bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
-						 OffsetNumber offnum, bool firstpage);
+extern bool _bt_readpage(IndexScanDesc scan, IndexScanBatch newbatch,
+						 ScanDirection dir, OffsetNumber offnum,
+						 bool firstpage);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
 extern int	_bt_binsrch_array_skey(FmgrInfo *orderproc,
 								   bool cur_elem_trig, ScanDirection dir,
@@ -1287,8 +1182,9 @@ extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
-extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _bt_first(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _bt_next(IndexScanDesc scan, ScanDirection dir,
+							   IndexScanBatch priorbatch);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
 
 /*
@@ -1296,7 +1192,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
  */
 extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
-extern void _bt_killitems(IndexScanDesc scan);
+extern void _bt_killitems(IndexScanDesc scan, IndexScanBatch batch);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
 extern void _bt_end_vacuum(Relation rel);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce340c076..ae85e8ddc 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,8 +16,10 @@
 
 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "access/sdir.h"
 #include "nodes/tidbitmap.h"
 #include "port/atomics.h"
+#include "storage/buf.h"
 #include "storage/relfilelocator.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
@@ -124,6 +126,179 @@ typedef struct IndexFetchTableData
 	Relation	rel;
 } IndexFetchTableData;
 
+/*
+ * Location of a BatchMatchingItem that appears in a IndexScanBatch returned
+ * by (and subsequently passed to) an amgetbatch routine
+ */
+typedef struct BatchRingItemPos
+{
+	/* BatchRingBuffer.batches[]-wise index to relevant IndexScanBatch */
+	int			batch;
+
+	/* IndexScanBatch.items[]-wise index to relevant BatchMatchingItem */
+	int			item;
+} BatchRingItemPos;
+
+static inline void
+batch_reset_pos(BatchRingItemPos *pos)
+{
+	pos->batch = -1;
+	pos->item = -1;
+}
+
+/*
+ * Matching item returned by amgetbatch (in returned IndexScanBatch) during an
+ * index scan.  Used by table AM to locate relevant matching table tuple.
+ */
+typedef struct BatchMatchingItem
+{
+	ItemPointerData heapTid;	/* TID of referenced heap item */
+	OffsetNumber indexOffset;	/* index item's location within page */
+	LocationIndex tupleOffset;	/* IndexTuple's offset in workspace, if any */
+	bool		allVisible;		/* TID points to all-visible page */
+} BatchMatchingItem;
+
+/*
+ * Data about one batch of items returned by (and passed to) amgetbatch during
+ * index scans
+ */
+typedef struct IndexScanBatchData
+{
+	/*
+	 * Information output by amgetbatch index AMs upon returning a batch with
+	 * one or more matching items, describing details of the index page where
+	 * matches were located.
+	 *
+	 * Used in the next amgetbatch call to determine which index page to read
+	 * next (or to determine if there's no further matches in current scan
+	 * direction).
+	 */
+	BlockNumber currPage;		/* Index page with matching items */
+	BlockNumber prevPage;		/* currPage's left link */
+	BlockNumber nextPage;		/* currPage's right link */
+
+	Buffer		buf;			/* currPage buf (invalid means unpinned) */
+	XLogRecPtr	lsn;			/* currPage's LSN */
+
+	/* scan direction when the index page was read */
+	ScanDirection dir;
+
+	/*
+	 * moreLeft and moreRight are used by index AMs to track whether there may
+	 * be matching index entries to the left and right currPage, respectively.
+	 *
+	 * Note: the exact interpretation of these fields varies across index AMs.
+	 * Table AMs must not rely on them directly; they must always call index
+	 * AM's amgetbatch routine to determine if there's no more batches in the
+	 * current scan direction.
+	 */
+	bool		moreLeft;
+	bool		moreRight;
+
+	/*
+	 * Matching items state for this batch.  Output by index AM for table AM.
+	 *
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient for index AMs
+	 * to fill the array back-to-front, so we start at the last slot and fill
+	 * downwards.  Hence they need a first-valid-entry and a last-valid-entry
+	 * counter.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+
+	/* info about killed items if any (killedItems is NULL if never used) */
+	int		   *killedItems;	/* indexes of killed items */
+	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * If we are doing an index-only scan, these are the tuple storage
+	 * workspaces for the matching tuples (tuples referenced by items[]). Each
+	 * is of size BLCKSZ, so it can hold as much as a full page's worth of
+	 * tuples.
+	 */
+	char	   *currTuples;		/* tuple storage for items[] */
+	BatchMatchingItem items[FLEXIBLE_ARRAY_MEMBER]; /* matching items */
+} IndexScanBatchData;
+
+typedef struct IndexScanBatchData *IndexScanBatch;
+
+/*
+ * Maximum number of batches (leaf pages) we can keep in memory.  We need a
+ * minimum of two, since we'll only consider releasing one batch when another
+ * is read.
+ */
+#define INDEX_SCAN_MAX_BATCHES		2
+#define INDEX_SCAN_CACHE_BATCHES	2
+#define INDEX_SCAN_BATCH_COUNT(scan) \
+	((scan)->batchringbuf->nextBatch - (scan)->batchringbuf->headBatch)
+
+/* Did we already load batch with the requested index? */
+#define INDEX_SCAN_BATCH_LOADED(scan, idx) \
+	((idx) >= (scan)->batchringbuf->headBatch && (idx) < (scan)->batchringbuf->nextBatch)
+
+/* Have we loaded the maximum number of batches? */
+#define INDEX_SCAN_BATCH_FULL(scan) \
+	(INDEX_SCAN_BATCH_COUNT(scan) == INDEX_SCAN_MAX_BATCHES)
+
+/* Return batch for the provided index */
+#define INDEX_SCAN_BATCH(scan, idx)	\
+( \
+	AssertMacro(INDEX_SCAN_BATCH_LOADED(scan, idx)), \
+	((scan)->batchringbuf->batches[(idx) % INDEX_SCAN_MAX_BATCHES]) \
+)
+
+/* Append given batch to scan's batch ring buffer */
+#define INDEX_SCAN_BATCH_APPEND(scan, batch) \
+	do { \
+		BatchRingBuffer *mringbuf = (scan)->batchringbuf;	\
+		int				nextBatch = mringbuf->nextBatch; \
+		mringbuf->batches[nextBatch % INDEX_SCAN_MAX_BATCHES] = (batch); \
+		mringbuf->nextBatch++; \
+	} while(0)
+
+/* Is the position invalid/undefined? */
+#define INDEX_SCAN_POS_INVALID(pos) \
+		(((pos)->batch == -1) && ((pos)->item == -1))
+
+/*
+ * State used by table AMs to manage an index scan that uses the amgetbatch
+ * interface.  Scans use a ring buffer of batches returned by amgetbatch.
+ *
+ * Batches are kept in the order that they were returned in by amgetbatch,
+ * since that is the same order that table_index_getnext_slot will return
+ * matches in.  However, table AMs are free to fetch table tuples in whatever
+ * order is most convenient/efficient -- provided that such reordering cannot
+ * affect the order that table_index_getnext_slot later returns tuples in.
+ */
+typedef struct BatchRingBuffer
+{
+	/* Current scan direction, for the currently loaded batches */
+	ScanDirection direction;
+
+	/* current positions in batches[] for scan */
+	BatchRingItemPos scanPos;	/* scan's read position */
+	BatchRingItemPos markPos;	/* mark/restore position */
+
+	IndexScanBatch markBatch;
+
+	/*
+	 * Array of batches returned by the AM. The array has a capacity (but can
+	 * be resized if needed). The headBatch is an index of the batch we're
+	 * currently reading from (this needs to be translated by modulo
+	 * INDEX_SCAN_MAX_BATCHES into index in the batches array).
+	 */
+	int			headBatch;		/* head batch slot */
+	int			nextBatch;		/* next empty batch slot */
+
+	/* Array of pointers to cached recyclable batches */
+	IndexScanBatch cache[INDEX_SCAN_CACHE_BATCHES];
+
+	/* Array of pointers to ring buffer batches */
+	IndexScanBatch batches[INDEX_SCAN_MAX_BATCHES];
+
+} BatchRingBuffer;
+
 struct IndexScanInstrumentation;
 
 /*
@@ -141,6 +316,13 @@ typedef struct IndexScanDescData
 	int			numberOfOrderBys;	/* number of ordering operators */
 	struct ScanKeyData *keyData;	/* array of index qualifier descriptors */
 	struct ScanKeyData *orderByData;	/* array of ordering op descriptors */
+
+	/* index access method's private state */
+	void	   *opaque;			/* access-method-specific info */
+
+	/* table access method's private amgetbatch state */
+	BatchRingBuffer *batchringbuf;	/* amgetbatch related state */
+
 	bool		xs_want_itup;	/* caller requests index tuples */
 	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
 
@@ -149,9 +331,13 @@ typedef struct IndexScanDescData
 	bool		ignore_killed_tuples;	/* do not return killed entries */
 	bool		xactStartedInRecovery;	/* prevents killing/seeing killed
 										 * tuples */
+	/* Safe to drop index page pins eagerly? */
+	bool		MVCCScan;
 
-	/* index access method's private state */
-	void	   *opaque;			/* access-method-specific info */
+	/*
+	 * Did we read the final batch in this scan direction?
+	 */
+	bool		finished;
 
 	/*
 	 * Instrumentation counters maintained by all index AMs during both
@@ -176,6 +362,8 @@ typedef struct IndexScanDescData
 	IndexFetchTableData *xs_heapfetch;
 
 	bool		xs_recheck;		/* T means scan keys must be rechecked */
+	bool		xs_visible;		/* T means the heap page is all-visible */
+	uint16		maxitemsbatch;	/* set by ambeginscan when amgetbatch used */
 
 	/*
 	 * When fetching with an ordering operator, the values of the ORDER BY
@@ -215,4 +403,114 @@ typedef struct SysScanDescData
 	struct TupleTableSlot *slot;
 } SysScanDescData;
 
+/*
+ * Advance position to its next item in the batch.
+ *
+ * Advance to the next item within the provided batch (or to the previous item,
+ * when scanning backwards).
+ *
+ * Returns true if the position could be advanced.  Returns false when there
+ * are no more items in the batch in the given direction.
+ */
+static inline bool
+index_batchpos_advance(IndexScanBatch batch, BatchRingItemPos *pos,
+					   ScanDirection direction)
+{
+	Assert(!INDEX_SCAN_POS_INVALID(pos));
+
+	if (ScanDirectionIsForward(direction))
+	{
+		if (++pos->item > batch->lastItem)
+			return false;
+	}
+	else						/* ScanDirectionIsBackward */
+	{
+		if (--pos->item < batch->firstItem)
+			return false;
+	}
+
+	/* Advanced within batch */
+	return true;
+}
+
+/*
+ * Advance batch position start of its new batch.
+ *
+ * Sets the given position to the fist item in the given scan direction (or to
+ * the last item, when scanning backwards).   Also advances/increments batch
+ * offset from position such that it points to newBatchForPos.
+ */
+static inline void
+index_batchpos_newbatch(IndexScanBatch newBatchForPos, BatchRingItemPos *pos,
+						ScanDirection direction)
+{
+	Assert(newBatchForPos->dir == direction);
+
+	/* Next batch successfully loaded */
+	pos->batch++;
+	if (ScanDirectionIsForward(direction))
+		pos->item = newBatchForPos->firstItem;
+	else
+		pos->item = newBatchForPos->lastItem;
+
+	Assert(!INDEX_SCAN_POS_INVALID(pos));
+}
+
+/*
+ * Check that a position (batch,item) is valid with respect to the batches we
+ * have currently loaded.
+ */
+static inline void
+batch_assert_pos_valid(IndexScanDescData *scan, BatchRingItemPos *pos)
+{
+#ifdef USE_ASSERT_CHECKING
+	BatchRingBuffer *batchringbuf = scan->batchringbuf;
+	IndexScanBatch batch = INDEX_SCAN_BATCH(scan, pos->batch);
+
+	/* make sure the position is valid for currently loaded batches */
+	Assert(pos->batch >= batchringbuf->headBatch);
+	Assert(pos->batch < batchringbuf->nextBatch);
+	Assert(pos->item >= batch->firstItem);
+	Assert(pos->item <= batch->lastItem);
+#endif
+}
+
+/*
+ * Check a single batch is valid.
+ */
+static inline void
+batch_assert_batch_valid(IndexScanDescData *scan, IndexScanBatch batch)
+{
+	/* batch must have one or more matching items returned by index AM */
+	Assert(batch->firstItem >= 0 && batch->firstItem <= batch->lastItem);
+	Assert(batch->items != NULL);
+
+	/*
+	 * The number of killed items must be valid, and there must be an array of
+	 * indexes if there are items.
+	 */
+	Assert(batch->numKilled >= 0);
+	Assert(!(batch->numKilled > 0 && batch->killedItems == NULL));
+}
+
+static inline void
+batch_assert_batches_valid(IndexScanDescData *scan)
+{
+#ifdef USE_ASSERT_CHECKING
+	BatchRingBuffer *batchringbuf = scan->batchringbuf;
+
+	/* The head/next indexes should define a valid range */
+	Assert(batchringbuf->headBatch >= 0 &&
+		   batchringbuf->headBatch <= batchringbuf->nextBatch);
+
+	/* Check all current batches */
+	for (int i = batchringbuf->headBatch; i < batchringbuf->nextBatch; i++)
+	{
+		IndexScanBatch batch = INDEX_SCAN_BATCH(scan, i);
+
+		batch_assert_batch_valid(scan, batch);
+	}
+#endif
+}
+
 #endif							/* RELSCAN_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e2ec5289d..98e337972 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -433,11 +433,29 @@ typedef struct TableAmRoutine
 	 */
 	void		(*index_fetch_end) (struct IndexFetchTableData *data);
 
+	/*
+	 * Fetch the next tuple from an index scan into slot, scanning in the
+	 * specified direction, and return true if a tuple was found, false
+	 * otherwise.
+	 *
+	 * This callback allows the table AM to directly manage the scan process,
+	 * including interfacing with the index AM. The caller simply specifies
+	 * the direction of the scan; the table AM takes care of retrieving TIDs
+	 * from the index, performing visibility checks, and returning tuples in
+	 * the slot.
+	 */
+	bool		(*index_getnext_slot) (IndexScanDesc scan,
+									   ScanDirection direction,
+									   TupleTableSlot *slot);
+
 	/*
 	 * Fetch tuple at `tid` into `slot`, after doing a visibility test
 	 * according to `snapshot`. If a tuple was found and passed the visibility
 	 * test, return true, false otherwise.
 	 *
+	 * This is a lower-level callback that takes a TID from the caller.
+	 * Callers should favor the index_getnext_slot callback whenever possible.
+	 *
 	 * Note that AMs that do not necessarily update indexes when indexed
 	 * columns do not change, need to return the current/correct version of
 	 * the tuple that is visible to the snapshot, even if the tid points to an
@@ -1188,6 +1206,26 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
 	scan->rel->rd_tableam->index_fetch_end(scan);
 }
 
+/*
+ * Fetch the next tuple from an index scan into `slot`, scanning in the
+ * specified direction. Returns true if a tuple was found, false otherwise.
+ *
+ * The index scan should have been started via table_index_fetch_begin().
+ * Callers must check scan->xs_recheck and recheck scan keys if required.
+ *
+ * Index-only scan callers (that pass xs_want_itup=true to index_beginscan)
+ * can consume index tuple results by examining IndexScanDescData fields such
+ * as xs_itup and xs_hitup.  The table AM won't usually fetch a heap tuple
+ * into the provided slot in the case of xs_want_itup=true callers.
+ */
+static inline bool
+table_index_getnext_slot(IndexScanDesc iscan, ScanDirection direction,
+						 TupleTableSlot *slot)
+{
+	return iscan->heapRelation->rd_tableam->index_getnext_slot(iscan,
+															   direction, slot);
+}
+
 /*
  * Fetches, as part of an index scan, tuple at `tid` into `slot`, after doing
  * a visibility test according to `snapshot`. If a tuple was found and passed
@@ -1211,6 +1249,9 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
  * entry (like heap's HOT). Whereas table_tuple_fetch_row_version() only
  * evaluates the tuple exactly at `tid`. Outside of index entry ->table tuple
  * lookups, table_tuple_fetch_row_version() is what's usually needed.
+ *
+ * This is a lower-level interface that takes a TID from the caller.  Callers
+ * should favor the table_index_getnext_slot interface whenever possible.
  */
 static inline bool
 table_index_fetch_tuple(struct IndexFetchTableData *scan,
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 8847d7f94..b5b8f509a 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -48,6 +48,12 @@ typedef struct IndexScanInstrumentation
 {
 	/* Index search count (incremented with pgstat_count_index_scan call) */
 	uint64		nsearches;
+
+	/*
+	 * heap blocks fetched counts (incremented by index_getnext_slot calls
+	 * within table AMs, though only during index-only scans)
+	 */
+	uint64		nheapfetches;
 } IndexScanInstrumentation;
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f8053d9e5..793b1a3c6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1753,7 +1753,6 @@ typedef struct IndexScanState
  *		Instrument		   local index scan instrumentation
  *		SharedInfo		   parallel worker instrumentation (no leader entry)
  *		TableSlot		   slot for holding tuples fetched from the table
- *		VMBuffer		   buffer in use for visibility map testing, if any
  *		PscanLen		   size of parallel index-only scan descriptor
  *		NameCStringAttNums attnums of name typed columns to pad to NAMEDATALEN
  *		NameCStringCount   number of elements in the NameCStringAttNums array
@@ -1776,7 +1775,6 @@ typedef struct IndexOnlyScanState
 	IndexScanInstrumentation ioss_Instrument;
 	SharedIndexScanInstrumentation *ioss_SharedInfo;
 	TupleTableSlot *ioss_TableSlot;
-	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
 	AttrNumber *ioss_NameCStringAttNums;
 	int			ioss_NameCStringCount;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 449885b93..3c81602b4 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1344,7 +1344,7 @@ typedef struct IndexOptInfo
 	/* does AM have amgetbitmap interface? */
 	bool		amhasgetbitmap;
 	bool		amcanparallel;
-	/* does AM have ammarkpos interface? */
+	/* is AM prepared for us to restore a mark? */
 	bool		amcanmarkpos;
 	/* AM's cost estimator */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 6887e4214..d5d01b877 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -294,10 +294,11 @@ brinhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = brinbeginscan,
 		.amrescan = brinrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = bringetbitmap,
 		.amendscan = brinendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 6b148e69a..8f7033d62 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -1953,9 +1953,9 @@ gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	 * into the main index, and so we might visit it a second time during the
 	 * main scan.  This is okay because we'll just re-set the same bit in the
 	 * bitmap.  (The possibility of duplicate visits is a major reason why GIN
-	 * can't support the amgettuple API, however.) Note that it would not do
-	 * to scan the main index before the pending list, since concurrent
-	 * cleanup could then make us miss entries entirely.
+	 * can't support either the amgettuple or amgetbatch API.) Note that it
+	 * would not do to scan the main index before the pending list, since
+	 * concurrent cleanup could then make us miss entries entirely.
 	 */
 	scanPendingInsert(scan, tbm, &ntids);
 
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d205093e2..1263dc180 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -82,10 +82,11 @@ ginhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = ginbeginscan,
 		.amrescan = ginrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = gingetbitmap,
 		.amendscan = ginendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 683ebca52..b0c67e3ad 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -103,10 +103,11 @@ gisthandler(PG_FUNCTION_ARGS)
 		.ambeginscan = gistbeginscan,
 		.amrescan = gistrescan,
 		.amgettuple = gistgettuple,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = gistgetbitmap,
 		.amendscan = gistendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e88ddb32a..6a20b67f6 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -102,10 +102,11 @@ hashhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = hashbeginscan,
 		.amrescan = hashrescan,
 		.amgettuple = hashgettuple,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cbef73e5d..493d6cb72 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -19,6 +19,7 @@
  */
 #include "postgres.h"
 
+#include "access/amapi.h"
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/heaptoast.h"
@@ -81,10 +82,12 @@ heapam_slot_callbacks(Relation relation)
 static IndexFetchTableData *
 heapam_index_fetch_begin(Relation rel)
 {
-	IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);
+	IndexFetchHeapData *hscan = palloc_object(IndexFetchHeapData);
 
 	hscan->xs_base.rel = rel;
 	hscan->xs_cbuf = InvalidBuffer;
+	hscan->xs_blk = InvalidBlockNumber;
+	hscan->vmbuf = InvalidBuffer;
 
 	return &hscan->xs_base;
 }
@@ -94,10 +97,12 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
 
+	/* deliberately don't drop VM buffer pin here */
 	if (BufferIsValid(hscan->xs_cbuf))
 	{
 		ReleaseBuffer(hscan->xs_cbuf);
 		hscan->xs_cbuf = InvalidBuffer;
+		hscan->xs_blk = InvalidBlockNumber;
 	}
 }
 
@@ -108,6 +113,12 @@ heapam_index_fetch_end(IndexFetchTableData *scan)
 
 	heapam_index_fetch_reset(scan);
 
+	if (hscan->vmbuf != InvalidBuffer)
+	{
+		ReleaseBuffer(hscan->vmbuf);
+		hscan->vmbuf = InvalidBuffer;
+	}
+
 	pfree(hscan);
 }
 
@@ -125,22 +136,28 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
 	/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
-	if (!*call_again)
+	if (hscan->xs_blk != ItemPointerGetBlockNumber(tid))
 	{
-		/* Switch to correct buffer if we don't have it already */
-		Buffer		prev_buf = hscan->xs_cbuf;
+		Assert(!*call_again);
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
+		/* Remember this buffer's block number for next time */
+		hscan->xs_blk = ItemPointerGetBlockNumber(tid);
+
+		if (BufferIsValid(hscan->xs_cbuf))
+			ReleaseBuffer(hscan->xs_cbuf);
+
+		hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * Prune page when it is pinned for the first time
 		 */
-		if (prev_buf != hscan->xs_cbuf)
-			heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf);
+		heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf);
 	}
 
+	/* Assert that the TID's block number's buffer is now pinned */
+	Assert(BufferIsValid(hscan->xs_cbuf));
+	Assert(BufferGetBlockNumber(hscan->xs_cbuf) == hscan->xs_blk);
+
 	/* Obtain share-lock on the buffer so we can examine visibility */
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
 	got_heap_tuple = heap_hot_search_buffer(tid,
@@ -173,6 +190,499 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 	return got_heap_tuple;
 }
 
+static pg_noinline void
+heapam_batch_rewind(IndexScanDesc scan, BatchRingBuffer *batchringbuf,
+					ScanDirection direction)
+{
+	/*
+	 * Handle a change in the scan's direction.
+	 *
+	 * Release future batches properly, to make it look like the current batch
+	 * is the only one we loaded.
+	 */
+	while (batchringbuf->nextBatch > batchringbuf->headBatch + 1)
+	{
+		/* release "later" batches in reverse order */
+		IndexScanBatch fbatch;
+
+		fbatch = INDEX_SCAN_BATCH(scan, batchringbuf->nextBatch - 1);
+		tableam_util_free_batch(scan, fbatch);
+		batchringbuf->nextBatch--;
+	}
+
+	/*
+	 * Remember the new direction, and make sure the scan is not marked as
+	 * "finished" (we might have already read the last batch, but now we need
+	 * to start over).
+	 */
+	batchringbuf->direction = direction;
+	scan->finished = false;
+}
+
+static inline ItemPointer
+heapam_batch_return_tid(IndexScanDesc scan, IndexScanBatch scanBatch,
+						BatchRingItemPos *scanPos)
+{
+	batch_assert_pos_valid(scan, scanPos);
+
+	pgstat_count_index_tuples(scan->indexRelation, 1);
+
+	/* set the TID / itup for the scan */
+	scan->xs_heaptid = scanBatch->items[scanPos->item].heapTid;
+
+	/* plain index scans will have flags left set to 0 */
+	scan->xs_visible = scanBatch->items[scanPos->item].allVisible;
+
+	if (scan->xs_want_itup)
+		scan->xs_itup =
+			(IndexTuple) (scanBatch->currTuples +
+						  scanBatch->items[scanPos->item].tupleOffset);
+
+	return &scan->xs_heaptid;
+}
+
+/*
+ * heap_batch_resolve_visibility
+ *		Obtain visibility information for every TID from caller's batch.
+ */
+static void
+heap_batch_resolve_visibility(IndexScanDesc scan, IndexScanBatch batch)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+
+	/*
+	 * Batch's buffer pin (the one held when amgetbatch returned) must still
+	 * be held when we're called.  It'll be released by our caller upon return
+	 * (our caller will definitely do this because index-only scans always use
+	 * an MVCC snapshot).
+	 *
+	 * Index vacuuming will block on acquiring a conflicting cleanup lock on
+	 * batch's index page due to our holding on to a pin on that same page.
+	 * Copying the relevant visibility map data into our local cache suffices
+	 * to prevent unsafe concurrent TID recycling: if any of these TIDs point
+	 * to dead heap tuples, VACUUM cannot possibly return from ambulkdelete
+	 * and mark the pointed-to heap pages as all-visible.  VACUUM _can_ do so
+	 * once our caller releases the batch's pin, but that's okay; we'll be
+	 * working off of cached visibility info that indicates that the dead TIDs
+	 * are NOT all-visible.  The subsequent heap fetches for these dead TIDs
+	 * will indicate that they're not dead-to-all, but that's okay; they won't
+	 * be visible to _our_ MVCC snapshot, so everything works out.
+	 */
+	Assert(BufferIsValid(batch->buf));
+
+	for (int i = batch->firstItem; i <= batch->lastItem; i++)
+	{
+		BatchMatchingItem *item = &batch->items[i];
+		ItemPointer tid = &item->heapTid;
+
+		if (VM_ALL_VISIBLE(scan->heapRelation,
+						   ItemPointerGetBlockNumber(tid),
+						   &hscan->vmbuf))
+		{
+			item->allVisible = true;
+		}
+	}
+}
+
+/* ----------------
+ *		heap_batch_getnext - get the next batch of TIDs from a scan
+ *
+ * Called when we need to load the next batch of index entries to process in
+ * the given direction.
+ *
+ * Returns the next batch to be processed by the index scan, or NULL when
+ * there are no more matches in the given scan direction.  Also appends the
+ * returned batch to the end of the scan's batch ring buffer.
+ * ----------------
+ */
+static IndexScanBatch
+heap_batch_getnext(IndexScanDesc scan, IndexScanBatch priorbatch,
+				   ScanDirection direction)
+{
+	IndexScanBatch batch = NULL;
+	BatchRingBuffer *batchringbuf PG_USED_FOR_ASSERTS_ONLY = scan->batchringbuf;
+
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
+
+	/*
+	 * When caller provides a priorbatch it had better be for the last valid
+	 * batch currently in the batch ring buffer (otherwise appending a new
+	 * batch would result batches that aren't in scan order, which is wrong).
+	 */
+	Assert(!INDEX_SCAN_BATCH_FULL(scan));
+	Assert(!priorbatch ||
+		   (INDEX_SCAN_BATCH_COUNT(scan) > 0 &&
+			INDEX_SCAN_BATCH(scan, batchringbuf->nextBatch - 1) == priorbatch));
+
+	if (scan->finished)
+		return NULL;
+
+	batch = scan->indexRelation->rd_indam->amgetbatch(scan, priorbatch,
+													  direction);
+	if (batch != NULL)
+	{
+		/* We got the batch from the AM */
+		Assert(batch->dir == direction);
+
+		if (scan->xs_want_itup)
+		{
+			/*
+			 * Index-only scan.  Eagerly fetch visibility info from visibility
+			 * map for all batch item TIDs.
+			 */
+			heap_batch_resolve_visibility(scan, batch);
+		}
+
+		/* Append batch to the end of ring buffer/write it to buffer index */
+		INDEX_SCAN_BATCH_APPEND(scan, batch);
+
+		/*
+		 * It's now safe to drop the batch's buffer pin, as we've resolved the
+		 * visibility status of all of its items (during index-only scans).
+		 * See heap_batch_resolve_visibility comments for an explanation.
+		 *
+		 * Note: We can't drop the pin here (we delay it until amfreebatch is
+		 * called for the batch) whenever the scan uses a non-MVCC snapshot.
+		 * This is explained fully in doc/src/sgml/indexam.sgml.
+		 */
+		Assert(scan->MVCCScan == IsMVCCSnapshot(scan->xs_snapshot));
+		if (scan->MVCCScan)
+		{
+			ReleaseBuffer(batch->buf);
+			batch->buf = InvalidBuffer;
+		}
+	}
+
+	/* xs_hitup is not supported by amgetbatch scans */
+	Assert(!scan->xs_hitup);
+
+	batch_assert_batches_valid(scan);
+
+	return batch;
+}
+
+/* ----------------
+ *		heapam_batch_getnext_tid - get next TID from batch ring buffer
+ *
+ * This function implements heapam's version of getting the next TID from an
+ * index scan that uses the amgetbatch interface.  It is implemented using
+ * various indexbatch.c utility routines.
+ * ----------------
+ */
+static ItemPointer
+heapam_batch_getnext_tid(IndexScanDesc scan, ScanDirection direction)
+{
+	BatchRingBuffer *batchringbuf = scan->batchringbuf;
+	BatchRingItemPos *scanPos = &batchringbuf->scanPos;
+	IndexScanBatch scanBatch = NULL;
+
+	/* shouldn't get here without batching */
+	batch_assert_batches_valid(scan);
+
+	/* xs_hitup is not supported by amgetbatch scans */
+	Assert(!scan->xs_hitup);
+
+	/* Initialize direction on first call */
+	if (batchringbuf->direction == NoMovementScanDirection)
+		batchringbuf->direction = direction;
+
+	if (unlikely(batchringbuf->direction != direction))
+	{
+		/* We may change direction after reading the last batch. */
+		scan->finished = false;
+	}
+
+	/*
+	 * Try advancing the position in the current batch. If that doesn't
+	 * succeed, it means we don't have more items in it, and we need to
+	 * advance to the next one (in the new scan direction).
+	 */
+	if (INDEX_SCAN_BATCH_LOADED(scan, scanPos->batch))
+	{
+		scanBatch = INDEX_SCAN_BATCH(scan, scanPos->batch);
+
+		if (index_batchpos_advance(scanBatch, scanPos, direction))
+			return heapam_batch_return_tid(scan, scanBatch, scanPos);
+	}
+
+	if (unlikely(batchringbuf->direction != direction))
+	{
+		/* XXX shouldn't heapam_batch_rewind update the scanPos too? */
+		heapam_batch_rewind(scan, batchringbuf, direction);
+
+		/*
+		 * XXX It seems a bit weird to update just the batch part of the
+		 * scanPos. Doesn't it make it rather wrong, with the item still set
+		 * from the original batch? The next code block sets item too. So it
+		 * seems we're doing this only to "fake" the batch, and then the next
+		 * block will advance batch and reset the item. It's confusing, worth
+		 * documenting. Maybe we should set item=-1?
+		 */
+		scanPos->batch = batchringbuf->nextBatch - 1;
+	}
+
+	/*
+	 * Ran out of items from scanBatch.  Try to advance it to next batch.
+	 */
+	if (INDEX_SCAN_BATCH_LOADED(scan, scanPos->batch + 1))
+	{
+		scanBatch = INDEX_SCAN_BATCH(scan, scanPos->batch + 1);
+	}
+	else if ((scanBatch = heap_batch_getnext(scan, scanBatch, direction)) != NULL)
+	{
+		/* Called amgetbatch again, loading new scanBatch into ring buffer */
+	}
+	else
+	{
+		/*
+		 * There are no more batches to be loaded in the current scan
+		 * direction.  Defensively reset the read position.
+		 */
+		batch_reset_pos(scanPos);
+		scan->finished = true;
+
+		return NULL;
+	}
+
+	/* Position scanPos to the start of new scanBatch */
+	index_batchpos_newbatch(scanBatch, scanPos, direction);
+	Assert(INDEX_SCAN_BATCH(scan, scanPos->batch) == scanBatch);
+
+	/* Free now-unneeded older batch/prior scanBatch */
+	if (scanPos->batch != batchringbuf->headBatch)
+	{
+		IndexScanBatch headBatch = INDEX_SCAN_BATCH(scan,
+													batchringbuf->headBatch);
+
+		/* Free the head batch (except when it's markBatch) */
+		tableam_util_free_batch(scan, headBatch);
+
+		/*
+		 * In any case, remove the batch from the ring buffer, even if we kept
+		 * it for mark/restore
+		 */
+		batchringbuf->headBatch++;
+
+		/* we can't skip any batches */
+		Assert(batchringbuf->headBatch == scanPos->batch);
+	}
+
+	return heapam_batch_return_tid(scan, scanBatch, scanPos);
+}
+
+/* ----------------
+ *		index_fetch_heap - get the scan's next heap tuple
+ *
+ * The result is a visible heap tuple associated with the index TID most
+ * recently fetched by our caller in scan->xs_heaptid, or NULL if no more
+ * matching tuples exist.  (There can be more than one matching tuple because
+ * of HOT chains, although when using an MVCC snapshot it should be impossible
+ * for more than one such tuple to exist.)
+ *
+ * On success, the buffer containing the heap tup is pinned.  The pin must be
+ * dropped elsewhere.
+ * ----------------
+ */
+static bool
+index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
+{
+	bool		all_dead = false;
+	bool		found;
+
+	found = heapam_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
+									 scan->xs_snapshot, slot,
+									 &scan->xs_heap_continue, &all_dead);
+
+	if (found)
+		pgstat_count_heap_fetch(scan->indexRelation);
+
+	/*
+	 * If we scanned a whole HOT chain and found only dead tuples, remember it
+	 * for later.  We do not do this when in recovery because it may violate
+	 * MVCC to do so.  See comments in RelationGetIndexScan().
+	 */
+	if (!scan->xactStartedInRecovery)
+	{
+		if (scan->batchringbuf)
+		{
+			if (all_dead)
+				tableam_util_kill_scanpositem(scan);
+		}
+		else
+		{
+			/*
+			 * Tell amgettuple-based index AM to kill its entry for that TID
+			 * (this will take effect in the next call, in index_getnext_tid)
+			 */
+			scan->kill_prior_tuple = all_dead;
+		}
+	}
+
+	return found;
+}
+
+/* ----------------
+ *		heapam_index_getnext_slot - get the next tuple from a scan
+ *
+ * The result is true if a tuple satisfying the scan keys and the snapshot was
+ * found, false otherwise.  The tuple is stored in the specified slot.
+ *
+ * On success, resources (like buffer pins) are likely to be held, and will be
+ * dropped by a future call here (or by a later call to index_endscan).
+ *
+ * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * scan keys if required.  We do not do that here because we don't have
+ * enough information to do it efficiently in the general case.
+ * ----------------
+ */
+static bool
+heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
+						  TupleTableSlot *slot)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+	ItemPointer tid = NULL;
+
+	for (;;)
+	{
+		if (!scan->xs_heap_continue)
+		{
+			/*
+			 * Scans that use an amgetbatch index AM are managed by heapam's
+			 * index scan manager.  This gives heapam the ability to read heap
+			 * tuples in a flexible order that is attuned to both costs and
+			 * benefits on the heapam and table AM side.
+			 *
+			 * Scans that use an amgettuple index AM simply call through to
+			 * index_getnext_tid to get the next TID returned by index AM. The
+			 * progress of the scan will be under the control of index AM (we
+			 * just pass it through a direction to get the next tuple in), so
+			 * we cannot reorder any work.
+			 */
+			if (scan->batchringbuf != NULL)
+				tid = heapam_batch_getnext_tid(scan, direction);
+			else
+			{
+				tid = index_getnext_tid(scan, direction);
+
+				/*
+				 * make sure to set the xs_visible flag, just like it's done
+				 * in heapam_batch_getnext_tid
+				 */
+				if (tid != NULL && scan->xs_want_itup)
+					scan->xs_visible = VM_ALL_VISIBLE(scan->heapRelation,
+													  ItemPointerGetBlockNumber(tid),
+													  &hscan->vmbuf);
+			}
+
+			/* If we're out of index entries, we're done */
+			if (tid == NULL)
+				break;
+		}
+
+		/*
+		 * Fetch the next (or only) visible heap tuple for this index entry.
+		 * If we don't find anything, loop around and grab the next TID from
+		 * the index.
+		 */
+		Assert(ItemPointerIsValid(&scan->xs_heaptid));
+		if (!scan->xs_want_itup)
+		{
+			/* Plain index scan */
+			if (index_fetch_heap(scan, slot))
+				return true;
+		}
+		else
+		{
+			/*
+			 * Index-only scan.
+			 *
+			 * We can skip the heap fetch if the TID references a heap page on
+			 * which all tuples are known visible to everybody.  In any case,
+			 * we'll use the index tuple not the heap tuple as the data
+			 * source.
+			 *
+			 * Note on Memory Ordering Effects: visibilitymap_get_status does
+			 * not lock the visibility map buffer, and therefore the result we
+			 * read here could be slightly stale.  However, it can't be stale
+			 * enough to matter.
+			 *
+			 * We need to detect clearing a VM bit due to an insert right
+			 * away, because the tuple is present in the index page but not
+			 * visible. The reading of the TID by this scan (using a shared
+			 * lock on the index buffer) is serialized with the insert of the
+			 * TID into the index (using an exclusive lock on the index
+			 * buffer). Because the VM bit is cleared before updating the
+			 * index, and locking/unlocking of the index page acts as a full
+			 * memory barrier, we are sure to see the cleared bit if we see a
+			 * recently-inserted TID.
+			 *
+			 * Deletes do not update the index page (only VACUUM will clear
+			 * out the TID), so the clearing of the VM bit by a delete is not
+			 * serialized with this test below, and we may see a value that is
+			 * significantly stale. However, we don't care about the delete
+			 * right away, because the tuple is still visible until the
+			 * deleting transaction commits or the statement ends (if it's our
+			 * transaction). In either case, the lock on the VM buffer will
+			 * have been released (acting as a write barrier) after clearing
+			 * the bit. And for us to have a snapshot that includes the
+			 * deleting transaction (making the tuple invisible), we must have
+			 * acquired ProcArrayLock after that time, acting as a read
+			 * barrier.
+			 *
+			 * It's worth going through this complexity to avoid needing to
+			 * lock the VM buffer, which could cause significant contention.
+			 */
+			if (!scan->xs_visible)
+			{
+				/*
+				 * Rats, we have to visit the heap to check visibility.
+				 */
+				if (scan->instrument)
+					scan->instrument->nheapfetches++;
+
+				if (!index_fetch_heap(scan, slot))
+					continue;	/* no visible tuple, try next index entry */
+
+				ExecClearTuple(slot);
+
+				/*
+				 * Only MVCC snapshots are supported with standard index-only
+				 * scans, so there should be no need to keep following the HOT
+				 * chain once a visible entry has been found.  Other callers
+				 * (currently only selfuncs.c) use SnapshotNonVacuumable, and
+				 * want us to assume that just having one visible tuple in the
+				 * hot chain is always good enough.
+				 */
+				Assert(!(scan->xs_heap_continue &&
+						 IsMVCCSnapshot(scan->xs_snapshot)));
+
+				/*
+				 * Note: at this point we are holding a pin on the heap page,
+				 * as recorded in IndexFetchHeapData.xs_cbuf.  We could
+				 * release that pin now, but it's not clear whether it's a win
+				 * to do so.  The next index entry might require a visit to
+				 * the same heap page.
+				 */
+			}
+			else
+			{
+				/*
+				 * We didn't access the heap, so we'll need to take a
+				 * predicate lock explicitly, as if we had.  For now we do
+				 * that at page level.
+				 */
+				PredicateLockPage(hscan->xs_base.rel,
+								  ItemPointerGetBlockNumber(tid),
+								  scan->xs_snapshot);
+			}
+
+			return true;
+		}
+	}
+
+	return false;
+}
 
 /* ------------------------------------------------------------------------
  * Callbacks for non-modifying operations on individual tuples for heap AM
@@ -753,7 +1263,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, false, SnapshotAny,
+									NULL, 0, 0);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
@@ -790,7 +1301,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		if (indexScan != NULL)
 		{
-			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot))
+			if (!heapam_index_getnext_slot(indexScan, ForwardScanDirection,
+										   slot))
 				break;
 
 			/* Since we used no scan keys, should never need to recheck */
@@ -2647,6 +3159,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_begin = heapam_index_fetch_begin,
 	.index_fetch_reset = heapam_index_fetch_reset,
 	.index_fetch_end = heapam_index_fetch_end,
+	.index_getnext_slot = heapam_index_getnext_slot,
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
diff --git a/src/backend/access/index/Makefile b/src/backend/access/index/Makefile
index 6f2e3061a..e6d681b40 100644
--- a/src/backend/access/index/Makefile
+++ b/src/backend/access/index/Makefile
@@ -16,6 +16,7 @@ OBJS = \
 	amapi.o \
 	amvalidate.o \
 	genam.o \
-	indexam.o
+	indexam.o \
+	indexbatch.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index a29be6f46..556a72fb3 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -89,6 +89,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_snapshot = InvalidSnapshot;	/* caller must initialize this */
 	scan->numberOfKeys = nkeys;
 	scan->numberOfOrderBys = norderbys;
+	scan->batchringbuf = NULL;	/* set later for amgetbatch callers */
+	scan->xs_want_itup = false; /* caller must initialize this */
 
 	/*
 	 * We allocate key workspace here, but it won't get filled until amrescan.
@@ -102,8 +104,6 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	else
 		scan->orderByData = NULL;
 
-	scan->xs_want_itup = false; /* may be set later */
-
 	/*
 	 * During recovery we ignore killed tuples and don't bother to kill them
 	 * either. We do this because the xmin on the primary node could easily be
@@ -446,7 +446,7 @@ systable_beginscan(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
-		sysscan->iscan = index_beginscan(heapRelation, irel,
+		sysscan->iscan = index_beginscan(heapRelation, irel, false,
 										 snapshot, NULL, nkeys, 0);
 		index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 		sysscan->scan = NULL;
@@ -517,7 +517,8 @@ systable_getnext(SysScanDesc sysscan)
 
 	if (sysscan->irel)
 	{
-		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot))
+		if (table_index_getnext_slot(sysscan->iscan, ForwardScanDirection,
+									 sysscan->slot))
 		{
 			bool		shouldFree;
 
@@ -707,7 +708,7 @@ systable_beginscan_ordered(Relation heapRelation,
 			elog(ERROR, "column is not in index");
 	}
 
-	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
+	sysscan->iscan = index_beginscan(heapRelation, indexRelation, false,
 									 snapshot, NULL, nkeys, 0);
 	index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 	sysscan->scan = NULL;
@@ -734,7 +735,7 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	HeapTuple	htup = NULL;
 
 	Assert(sysscan->irel);
-	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
+	if (table_index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
 		htup = ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
 
 	/* See notes in systable_getnext */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4ed0508c6..57da68d8a 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -24,9 +24,7 @@
  *		index_parallelscan_initialize - initialize parallel scan
  *		index_parallelrescan  - (re)start a parallel scan of an index
  *		index_beginscan_parallel - join parallel index scan
- *		index_getnext_tid	- get the next TID from a scan
- *		index_fetch_heap		- get the scan's next heap tuple
- *		index_getnext_slot	- get the next tuple from a scan
+ *		index_getnext_tid	- amgettuple table AM helper routine
  *		index_getbitmap - get all tuples from a scan
  *		index_bulk_delete	- bulk deletion of index tuples
  *		index_vacuum_cleanup	- post-deletion cleanup of an index
@@ -255,6 +253,7 @@ index_insert_cleanup(Relation indexRelation,
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
+				bool xs_want_itup,
 				Snapshot snapshot,
 				IndexScanInstrumentation *instrument,
 				int nkeys, int norderbys)
@@ -281,7 +280,13 @@ index_beginscan(Relation heapRelation,
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->MVCCScan = IsMVCCSnapshot(snapshot);
+	scan->finished = false;
 	scan->instrument = instrument;
+	scan->xs_want_itup = xs_want_itup;
+
+	if (indexRelation->rd_indam->amgetbatch != NULL)
+		index_batchscan_init(scan);
 
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
@@ -312,6 +317,8 @@ index_beginscan_bitmap(Relation indexRelation,
 	 * up by RelationGetIndexScan.
 	 */
 	scan->xs_snapshot = snapshot;
+	scan->MVCCScan = IsMVCCSnapshot(snapshot);
+	scan->finished = false;
 	scan->instrument = instrument;
 
 	return scan;
@@ -380,6 +387,16 @@ index_rescan(IndexScanDesc scan,
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
 
+	/*
+	 * batchringbuf shouldn't be marked finished (must make sure that
+	 * index_batchscan_reset doesn't see this, since if it is then
+	 * indexam_util_batch_release will be affected)
+	 */
+	scan->finished = false;
+
+	if (scan->batchringbuf)
+		index_batchscan_reset(scan, true);
+
 	scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
 											orderbys, norderbys);
 }
@@ -394,6 +411,10 @@ index_endscan(IndexScanDesc scan)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amendscan);
 
+	/* Cleanup batching, so that the AM can release pins and so on. */
+	if (scan->batchringbuf)
+		index_batchscan_end(scan);
+
 	/* Release resources (like buffer pins) from table accesses */
 	if (scan->xs_heapfetch)
 	{
@@ -422,9 +443,10 @@ void
 index_markpos(IndexScanDesc scan)
 {
 	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(ammarkpos);
+	CHECK_SCAN_PROCEDURE(amposreset);
 
-	scan->indexRelation->rd_indam->ammarkpos(scan);
+	/* Only amgetbatch index AMs support mark and restore */
+	index_batchscan_mark_pos(scan);
 }
 
 /* ----------------
@@ -448,7 +470,8 @@ index_restrpos(IndexScanDesc scan)
 	Assert(IsMVCCSnapshot(scan->xs_snapshot));
 
 	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(amrestrpos);
+	CHECK_SCAN_PROCEDURE(amgetbatch);
+	CHECK_SCAN_PROCEDURE(amposreset);
 
 	/* release resources (like buffer pins) from table accesses */
 	if (scan->xs_heapfetch)
@@ -457,7 +480,7 @@ index_restrpos(IndexScanDesc scan)
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
 
-	scan->indexRelation->rd_indam->amrestrpos(scan);
+	index_batchscan_restore_pos(scan);
 }
 
 /*
@@ -579,6 +602,9 @@ index_parallelrescan(IndexScanDesc scan)
 	if (scan->xs_heapfetch)
 		table_index_fetch_reset(scan->xs_heapfetch);
 
+	if (scan->batchringbuf)
+		index_batchscan_reset(scan, true);
+
 	/* amparallelrescan is optional; assume no-op if not provided by AM */
 	if (scan->indexRelation->rd_indam->amparallelrescan != NULL)
 		scan->indexRelation->rd_indam->amparallelrescan(scan);
@@ -591,6 +617,7 @@ index_parallelrescan(IndexScanDesc scan)
  */
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel,
+						 bool xs_want_itup,
 						 IndexScanInstrumentation *instrument,
 						 int nkeys, int norderbys,
 						 ParallelIndexScanDesc pscan)
@@ -612,7 +639,13 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 	 */
 	scan->heapRelation = heaprel;
 	scan->xs_snapshot = snapshot;
+	scan->MVCCScan = IsMVCCSnapshot(snapshot);
+	scan->finished = false;
 	scan->instrument = instrument;
+	scan->xs_want_itup = xs_want_itup;
+
+	if (indexrel->rd_indam->amgetbatch != NULL)
+		index_batchscan_init(scan);
 
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heaprel);
@@ -621,10 +654,14 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 }
 
 /* ----------------
- * index_getnext_tid - get the next TID from a scan
+ * index_getnext_tid - amgettuple interface
  *
  * The result is the next TID satisfying the scan keys,
  * or NULL if no more matching tuples exist.
+ *
+ * This should only be called by table AM's index_getnext_slot implementation,
+ * and only given an index AM that supports the single-tuple amgettuple
+ * interface.
  * ----------------
  */
 ItemPointer
@@ -667,97 +704,6 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	return &scan->xs_heaptid;
 }
 
-/* ----------------
- *		index_fetch_heap - get the scan's next heap tuple
- *
- * The result is a visible heap tuple associated with the index TID most
- * recently fetched by index_getnext_tid, or NULL if no more matching tuples
- * exist.  (There can be more than one matching tuple because of HOT chains,
- * although when using an MVCC snapshot it should be impossible for more than
- * one such tuple to exist.)
- *
- * On success, the buffer containing the heap tup is pinned (the pin will be
- * dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
- * call).
- *
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
- * scan keys if required.  We do not do that here because we don't have
- * enough information to do it efficiently in the general case.
- * ----------------
- */
-bool
-index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
-{
-	bool		all_dead = false;
-	bool		found;
-
-	found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
-									scan->xs_snapshot, slot,
-									&scan->xs_heap_continue, &all_dead);
-
-	if (found)
-		pgstat_count_heap_fetch(scan->indexRelation);
-
-	/*
-	 * If we scanned a whole HOT chain and found only dead tuples, tell index
-	 * AM to kill its entry for that TID (this will take effect in the next
-	 * amgettuple call, in index_getnext_tid).  We do not do this when in
-	 * recovery because it may violate MVCC to do so.  See comments in
-	 * RelationGetIndexScan().
-	 */
-	if (!scan->xactStartedInRecovery)
-		scan->kill_prior_tuple = all_dead;
-
-	return found;
-}
-
-/* ----------------
- *		index_getnext_slot - get the next tuple from a scan
- *
- * The result is true if a tuple satisfying the scan keys and the snapshot was
- * found, false otherwise.  The tuple is stored in the specified slot.
- *
- * On success, resources (like buffer pins) are likely to be held, and will be
- * dropped by a future index_getnext_tid, index_fetch_heap or index_endscan
- * call).
- *
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
- * scan keys if required.  We do not do that here because we don't have
- * enough information to do it efficiently in the general case.
- * ----------------
- */
-bool
-index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
-{
-	for (;;)
-	{
-		if (!scan->xs_heap_continue)
-		{
-			ItemPointer tid;
-
-			/* Time to fetch the next TID from the index */
-			tid = index_getnext_tid(scan, direction);
-
-			/* If we're out of index entries, we're done */
-			if (tid == NULL)
-				break;
-
-			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
-		}
-
-		/*
-		 * Fetch the next (or only) visible heap tuple for this index entry.
-		 * If we don't find anything, loop around and grab the next TID from
-		 * the index.
-		 */
-		Assert(ItemPointerIsValid(&scan->xs_heaptid));
-		if (index_fetch_heap(scan, slot))
-			return true;
-	}
-
-	return false;
-}
-
 /* ----------------
  *		index_getbitmap - get all tuples at once from an index scan
  *
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
new file mode 100644
index 000000000..d5b8ce6cd
--- /dev/null
+++ b/src/backend/access/index/indexbatch.c
@@ -0,0 +1,544 @@
+/*-------------------------------------------------------------------------
+ *
+ * indexbatch.c
+ *	  amgetbatch implementation routines
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/index/indexbatch.c
+ *
+ * INTERFACE ROUTINES
+ *		index_batchscan_init - initialize fields for a batch index scan
+ *		index_batchscan_reset - reset state needed by a batch index scan
+ *		index_batchscan_end - free resources at end of batch index scan
+ *		index_batchscan_mark_pos - set a mark from scanPos position
+ *		index_batchscan_restore_pos - restore mark to scanPos position
+ *		tableam_util_kill_scanpositem - record that scanPos item is dead
+ *		tableam_util_free_batch - release resources associated with a batch
+ *		indexam_util_batch_unlock - unlock batch's buffer lock
+ *		indexam_util_batch_alloc - allocate a new batch
+ *		indexam_util_batch_release - release allocated batch
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/amapi.h"
+#include "access/tableam.h"
+#include "common/int.h"
+#include "lib/qunique.h"
+#include "utils/memdebug.h"
+
+static int	batch_compare_int(const void *va, const void *vb);
+
+/*
+ * index_batchscan_init - initialize fields for a batch index scan.
+ *
+ * Sets up the batch ring buffer structure and its initial read position.
+ * Also determines whether the scan will eagerly drop index page pins.
+ *
+ * Only call here when all of the index related fields in 'scan' were already
+ * initialized.
+ */
+void
+index_batchscan_init(IndexScanDesc scan)
+{
+	/* Both amgetbatch and amfreebatch must be present together */
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+	Assert(scan->indexRelation->rd_indam->amfreebatch != NULL);
+
+	scan->batchringbuf = palloc_object(BatchRingBuffer);
+
+	/* Tracks scan direction used to return last item */
+	scan->batchringbuf->direction = NoMovementScanDirection;
+
+	/* positions in the ring buffer of batches */
+	batch_reset_pos(&scan->batchringbuf->scanPos);
+	batch_reset_pos(&scan->batchringbuf->markPos);
+
+	scan->batchringbuf->markBatch = NULL;
+	scan->batchringbuf->headBatch = 0;	/* initial head batch */
+	scan->batchringbuf->nextBatch = 0;	/* initial batch starts empty */
+	memset(&scan->batchringbuf->cache, 0, sizeof(scan->batchringbuf->cache));
+}
+
+/*
+ * index_batchscan_reset - reset state used for a batch index scan
+ *
+ * Resets all loaded batches in the ring buffer, and resets the read position
+ * to the initial state (or just initialize ring buffer state).  When
+ * 'complete' is true, also frees the scan's marked batch (if any), which is
+ * useful when ending an amgetbatch-based index scan.
+ */
+void
+index_batchscan_reset(IndexScanDesc scan, bool complete)
+{
+	BatchRingBuffer *batchringbuf = scan->batchringbuf;
+
+	batch_assert_batches_valid(scan);
+	Assert(scan->xs_heapfetch);
+
+	/* reset the positions */
+	batch_reset_pos(&batchringbuf->scanPos);
+
+	/*
+	 * With "complete" reset, make sure to also free the marked batch, either
+	 * by just forgetting it (if it's still in the ring buffer), or by
+	 * explicitly freeing it.
+	 */
+	if (complete && unlikely(batchringbuf->markBatch != NULL))
+	{
+		BatchRingItemPos *markPos = &batchringbuf->markPos;
+		IndexScanBatch markBatch = batchringbuf->markBatch;
+
+		/* always reset the position, forget the marked batch */
+		batchringbuf->markBatch = NULL;
+
+		/*
+		 * If we've already moved past the marked batch (it's not loaded into
+		 * the ring buffer), free it explicitly now.  Otherwise, it'll be
+		 * freed along with the other loaded batches.
+		 */
+		if (!INDEX_SCAN_BATCH_LOADED(scan, markPos->batch))
+			tableam_util_free_batch(scan, markBatch);
+
+		batch_reset_pos(&batchringbuf->markPos);
+	}
+
+	/* now release all other currently loaded batches */
+	while (batchringbuf->headBatch < batchringbuf->nextBatch)
+	{
+		IndexScanBatch batch = INDEX_SCAN_BATCH(scan,
+												batchringbuf->headBatch);
+
+		tableam_util_free_batch(scan, batch);
+
+		/* update the valid range, so that asserts / debugging works */
+		batchringbuf->headBatch++;
+	}
+
+	/* reset relevant batch state fields */
+	batchringbuf->headBatch = 0;	/* initial batch */
+	batchringbuf->nextBatch = 0;	/* initial batch is empty */
+
+	scan->finished = false;
+
+	batch_assert_batches_valid(scan);
+}
+
+/*
+ * index_batchscan_end - free resources at end of batch index scan
+ *
+ * Called when an index scan is being ended, right before the owning scan
+ * descriptor goes away.  Cleans up all batch related resources.
+ */
+void
+index_batchscan_end(IndexScanDesc scan)
+{
+	/* Call amfreebatch and all remaining loaded batches (even markBatch) */
+	index_batchscan_reset(scan, true);
+
+	for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+	{
+		IndexScanBatch cached = scan->batchringbuf->cache[i];
+
+		if (cached == NULL)
+			continue;
+
+		if (cached->killedItems)
+			pfree(cached->killedItems);
+		if (cached->currTuples)
+			pfree(cached->currTuples);
+		pfree(cached);
+	}
+
+	pfree(scan->batchringbuf);
+}
+
+/*
+ * index_batchscan_mark_pos - set a mark from scanPos position
+ *
+ * Saves the current read position and associated batch so that the scan can
+ * be restored to this point later, via a call to index_batchscan_restore_pos.
+ * The marked batch is retained and not freed until a new mark is set or the
+ * scan ends (or until the mark is restored).
+ */
+void
+index_batchscan_mark_pos(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = scan->batchringbuf;
+	BatchRingItemPos *markPos = &batchringbuf->markPos;
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+
+	/*
+	 * Free the previous mark batch (if any), but only if the batch is no
+	 * longer loaded into the ring buffer
+	 */
+	if (markBatch && !INDEX_SCAN_BATCH_LOADED(scan, markPos->batch))
+	{
+		batchringbuf->markBatch = NULL;
+		tableam_util_free_batch(scan, markBatch);
+	}
+
+	/* copy the scan's position */
+	batchringbuf->markPos = batchringbuf->scanPos;
+	batchringbuf->markBatch = INDEX_SCAN_BATCH(scan,
+											   batchringbuf->markPos.batch);
+
+	/* scanPos/markPos must be valid */
+	batch_assert_pos_valid(scan, &batchringbuf->markPos);
+}
+
+/*
+ * index_batchscan_restore_pos - restore mark to scanPos position
+ *
+ * Restores the scan to a position previously saved by
+ * index_batchscan_mark_pos.  The marked batch is restored as the current
+ * batch, allowing the scan to resume from the marked position.  Also notifies
+ * the index AM via a call to its amposreset routine, which allows it to
+ * invalidate any private state that independently tracks scan progress (such
+ * as array key state).
+ *
+ * Function currently just discards most batch ring buffer state.  It might
+ * make sense to teach it to hold on to other nearby batches (still-held
+ * batches that are likely to be needed once the scan finishes returning
+ * matching items from the restored batch) as an optimization.  Such a scheme
+ * would have the benefit of avoiding repeat calls to amgetbatch/repeatedly
+ * reading the same index pages.
+ */
+void
+index_batchscan_restore_pos(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = scan->batchringbuf;
+	BatchRingItemPos *markPos = &batchringbuf->markPos;
+	BatchRingItemPos *scanPos = &batchringbuf->scanPos ;
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+
+	if (scanPos->batch == markPos->batch &&
+		scanPos->batch == batchringbuf->headBatch)
+	{
+		/*
+		 * We don't have to discard the scan's state after all, since the
+		 * current headBatch is also the batch that we're restoring to
+		 */
+		scanPos->item = markPos->item;
+		return;
+	}
+
+	/*
+	 * Call amposreset to let index AM know to invalidate any private state
+	 * that independently tracks the scan's progress
+	 */
+	scan->indexRelation->rd_indam->amposreset(scan, markBatch);
+
+	/* Remove all batches from the ring buffer except for the marked batch */
+	index_batchscan_reset(scan, false);
+
+	/*
+	 * "Append" markBatch, making the ring buffer appear as if it was the
+	 * first batch ever returned by amgetbatch for the scan
+	 */
+	markPos->batch = 0;
+	batchringbuf->scanPos = *markPos;
+	batchringbuf->nextBatch = batchringbuf->headBatch = markPos->batch;
+	INDEX_SCAN_BATCH_APPEND(scan, markBatch);
+	Assert(INDEX_SCAN_BATCH(scan, batchringbuf->scanPos.batch) == markBatch);
+
+	/*
+	 * Note: markBatch.killedItems[] might already contain dead items, and
+	 * might yet have more dead items saved.  tableam_util_free_batch is
+	 * prepared for that.
+	 */
+}
+
+/* ----------------------------------------------------------------
+ *			utility functions called by table AMs
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * tableam_util_kill_scanpositem - record that scanPos item is dead
+ *
+ * Records an offset to the scanBatch item of the currently-read tuple, saving
+ * it in scanBatch's killedItems array. The items' index tuples will later be
+ * marked LP_DEAD when current scanBatch is freed by amfreebatch routine (see
+ * tableam_util_free_batch wrapper function).
+ */
+void
+tableam_util_kill_scanpositem(IndexScanDesc scan)
+{
+	BatchRingItemPos *scanPos = &scan->batchringbuf->scanPos;
+	IndexScanBatch scanBatch = INDEX_SCAN_BATCH(scan, scanPos->batch);
+
+	batch_assert_pos_valid(scan, scanPos);
+
+	if (scanBatch->killedItems == NULL)
+		scanBatch->killedItems = palloc_array(int, scan->maxitemsbatch);
+	if (scanBatch->numKilled < scan->maxitemsbatch)
+		scanBatch->killedItems[scanBatch->numKilled++] = scanPos->item;
+}
+
+/*
+ * tableam_util_free_batch - release resources associated with a batch
+ *
+ * Called by table AM's ordered index scan implementation when it is finished
+ * with a batch and wishes to release its resources.
+ *
+ * This calls the index AM's amfreebatch callback to release AM-specific
+ * resources, and to set LP_DEAD bits on the batch's index page (in index AMs
+ * that implement that optimization).  Every amfreebatch routine must recycle
+ * the underlying batch memory by passing it to indexam_util_batch_release.
+ */
+void
+tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	batch_assert_batch_valid(scan, batch);
+
+	/* don't free the batch that is marked */
+	if (batch == scan->batchringbuf->markBatch)
+		return;
+
+	/*
+	 * batch.killedItems[] is now in whatever order the scan returned items
+	 * in.  We might have even saved the same item/TID twice.
+	 *
+	 * Sort and unique-ify killedItems[].  That way the index AM can safely
+	 * assume that items will always be in their original index page order.
+	 */
+	if (batch->numKilled > 1)
+	{
+		qsort(batch->killedItems, batch->numKilled, sizeof(int),
+			  batch_compare_int);
+		batch->numKilled = qunique(batch->killedItems, batch->numKilled,
+								   sizeof(int), batch_compare_int);
+	}
+
+	scan->indexRelation->rd_indam->amfreebatch(scan, batch);
+}
+
+/* ----------------------------------------------------------------
+ *			utility functions called by amgetbatch index AMs
+ *
+ * These functions manage batch allocation, unlock/pin management, and batch
+ * resource recycling.  Index AMs implementing amgetbatch should use these
+ * rather than managing buffers directly.
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * indexam_util_batch_unlock - unlock batch's shared buffer lock
+ *
+ * Unlocks caller's batch->buf in preparation for amgetbatch returning items
+ * saved in that batch.  Performs extra steps required by amgetbatch callers
+ * in passing.
+ *
+ * Only call here when a batch has one or more matching items to return using
+ * amgetbatch (or for amgetbitmap to load into its bitmap of matching TIDs).
+ * When an index page has no matches, it's always safe for index AMs to drop
+ * both the lock and the pin for themselves.
+ *
+ * Note: It is convenient for index AMs that implement both amgetbitmap and
+ * amgetbitmap to consistently use the same batch management approach, since
+ * that avoids introducing special cases to lower-level code.  We drop both
+ * the lock and the pin on batch's page on behalf of amgetbitmap callers.
+ * Such amgetbitmap callers must be careful to free all batches with matching
+ * items once they're done saving the matching TIDs (there will never be any
+ * calls to amfreebatch, so amgetbitmap must call indexam_util_batch_release
+ * directly, in lieu of a deferred call to amfreebatch from core code).  We
+ * never drop the pin for an amgetbatch caller, though.
+ */
+void
+indexam_util_batch_unlock(IndexScanDesc scan, IndexScanBatch batch)
+{
+	/* batch must have one or more matching items returned by index AM */
+	Assert(batch->firstItem >= 0 && batch->firstItem <= batch->lastItem);
+
+	if (scan->batchringbuf)
+	{
+		/* amgetbatch (not amgetbitmap) caller */
+		Assert(scan->heapRelation != NULL);
+
+		/*
+		 * Have to set batch->lsn so that amfreebatch has a way to detect when
+		 * concurrent heap TID recycling by VACUUM might have taken place.
+		 * It'll only be safe to set any index tuple LP_DEAD bits when the
+		 * page LSN hasn't advanced.
+		 */
+		batch->lsn = BufferGetLSNAtomic(batch->buf);
+
+		/* Drop the lock */
+		LockBuffer(batch->buf, BUFFER_LOCK_UNLOCK);
+
+#ifdef USE_VALGRIND
+		if (!RelationUsesLocalBuffers(scan->indexRelation))
+			VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(batch->buf), BLCKSZ);
+#endif
+
+		/* table AM determines when it'll be safe to drop pins on batches */
+	}
+	else
+	{
+		/* amgetbitmap (not amgetbatch) caller */
+		Assert(scan->heapRelation == NULL);
+
+		/* drop both the lock and the pin */
+		LockBuffer(batch->buf, BUFFER_LOCK_UNLOCK);
+
+#ifdef USE_VALGRIND
+		if (!RelationUsesLocalBuffers(scan->indexRelation))
+			VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(batch->buf), BLCKSZ);
+#endif
+		ReleaseBuffer(batch->buf);
+		batch->buf = InvalidBuffer;
+	}
+}
+
+/*
+ * indexam_util_batch_alloc - allocate a new batch
+ *
+ * Used by index AMs that support amgetbatch interface (both during amgetbatch
+ * and amgetbitmap scans).
+ *
+ * Returns IndexScanBatch with space to fit scan->maxitemsbatch-many
+ * BatchMatchingItem entries.  This will either be a newly allocated batch, or
+ * a batch recycled from the cache managed by indexam_util_batch_release.  See
+ * comments above indexam_util_batch_release.
+ *
+ * Index AMs that use batches should call this from either their amgetbatch or
+ * amgetbitmap routines only.  Note in particular that it cannot safely be
+ * called from a amfreebatch routine.
+ */
+IndexScanBatch
+indexam_util_batch_alloc(IndexScanDesc scan)
+{
+	IndexScanBatch batch = NULL;
+
+	/* First look for an existing batch from ring buffer */
+	if (scan->batchringbuf != NULL)
+	{
+		for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+		{
+			if (scan->batchringbuf->cache[i] != NULL)
+			{
+				/* Return cached unreferenced batch */
+				batch = scan->batchringbuf->cache[i];
+				scan->batchringbuf->cache[i] = NULL;
+				break;
+			}
+		}
+	}
+
+	if (!batch)
+	{
+		batch = palloc(offsetof(IndexScanBatchData, items) +
+					   sizeof(BatchMatchingItem) * scan->maxitemsbatch);
+
+		/*
+		 * If we are doing an index-only scan, we need a tuple storage
+		 * workspace. We allocate BLCKSZ for this, which should always give
+		 * the index AM enough space to fit a full page's worth of tuples.
+		 */
+		batch->currTuples = NULL;
+		if (scan->xs_want_itup)
+			batch->currTuples = palloc(BLCKSZ);
+
+		/*
+		 * Batches allocate killedItems lazily (though note that cached
+		 * batches keep their killedItems allocation when recycled)
+		 */
+		batch->killedItems = NULL;
+	}
+
+	/* xs_want_itup scans must get a currTuples space */
+	Assert(!(scan->xs_want_itup && (batch->currTuples == NULL)));
+
+	/* shared initialization */
+	batch->buf = InvalidBuffer;
+	batch->firstItem = -1;
+	batch->lastItem = -1;
+	batch->numKilled = 0;
+
+	return batch;
+}
+
+/*
+ * indexam_util_batch_release - release allocated batch
+ *
+ * This function is called by index AMs to release a batch allocated by
+ * indexam_util_batch_alloc.  Batches are cached here for reuse (when scan
+ * hasn't already finished) to reduce palloc/pfree overhead.
+ *
+ * It's safe to release a batch immediately when it was used to read a page
+ * that returned no matches to the scan.  Batches actually returned by index
+ * AM's amgetbatch routine (i.e. batches for pages with one or more matches)
+ * must be released by calling here at the end of their amfreebatch routine.
+ * Index AMs that uses batches should call here to release a batch from any of
+ * their amgetbatch, amgetbitmap, and amfreebatch routines.
+ */
+void
+indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch)
+{
+	Assert(batch->buf == InvalidBuffer);
+
+	if (scan->batchringbuf)
+	{
+		/* amgetbatch scan caller */
+		Assert(scan->heapRelation != NULL);
+
+		if (scan->finished)
+		{
+			/* Don't bother using cache when scan is ending */
+		}
+		else
+		{
+			/*
+			 * Use cache.  This is generally only beneficial when there are
+			 * many small rescans of an index.
+			 */
+			for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+			{
+				if (scan->batchringbuf->cache[i] == NULL)
+				{
+					/* found empty slot, we're done */
+					scan->batchringbuf->cache[i] = batch;
+					return;
+				}
+			}
+		}
+
+		/*
+		 * Failed to find a free slot for this batch.  We'll just free it
+		 * ourselves.  This isn't really expected; it's just defensive.
+		 */
+		if (batch->killedItems)
+			pfree(batch->killedItems);
+		if (batch->currTuples)
+			pfree(batch->currTuples);
+	}
+	else
+	{
+		/* amgetbitmap scan caller */
+		Assert(scan->heapRelation == NULL);
+		Assert(batch->killedItems == NULL);
+		Assert(batch->currTuples == NULL);
+	}
+
+	/* no free slot to save this batch (expected with amgetbitmap callers) */
+	pfree(batch);
+}
+
+/*
+ * batch_compare_int - qsort comparison function for int arrays
+ */
+static int
+batch_compare_int(const void *va, const void *vb)
+{
+	int			a = *((const int *) va);
+	int			b = *((const int *) vb);
+
+	return pg_cmp_s32(a, b);
+}
diff --git a/src/backend/access/index/meson.build b/src/backend/access/index/meson.build
index da64cb595..83dfa3f2b 100644
--- a/src/backend/access/index/meson.build
+++ b/src/backend/access/index/meson.build
@@ -5,4 +5,5 @@ backend_sources += files(
   'amvalidate.c',
   'genam.c',
   'indexam.c',
+  'indexbatch.c',
 )
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc..231da20e4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -471,7 +471,7 @@ proper.  A plain index scan will even recognize LP_UNUSED items in the
 heap (items that could be recycled but haven't been just yet) as "not
 visible" -- even when the heap page is generally considered all-visible.
 
-LP_DEAD setting of index tuples by the kill_prior_tuple optimization
+Opportunistic LP_DEAD setting of known-dead index tuples during index scans
 (described in full in simple deletion, below) is also more complicated for
 index scans that drop their leaf page pins.  We must be careful to avoid
 LP_DEAD-marking any new index tuple that looks like a known-dead index
@@ -481,7 +481,7 @@ new, unrelated index tuple, on the same leaf page, which has the same
 original TID.  It would be totally wrong to LP_DEAD-set this new,
 unrelated index tuple.
 
-We handle this kill_prior_tuple race condition by having affected index
+We handle this LP_DEAD setting race condition by having affected index
 scans conservatively assume that any change to the leaf page at all
 implies that it was reached by btbulkdelete in the interim period when no
 buffer pin was held.  This is implemented by not setting any LP_DEAD bits
@@ -735,7 +735,7 @@ of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
 During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
+and we never LP_DEAD-mark tuples. We do this because the oldest xmin
 on the standby server can be older than the oldest xmin on the primary
 server, which means tuples can be marked LP_DEAD even when they are
 still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 70b524dfe..286f95d3e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1037,6 +1037,9 @@ _bt_relbuf(Relation rel, Buffer buf)
  * Lock is acquired without acquiring another pin.  This is like a raw
  * LockBuffer() call, but performs extra steps needed by Valgrind.
  *
+ * Note: indexam_util_batch_unlock has similar Valgrind buffer lock
+ * instrumentation, which we rely on here.
+ *
  * Note: Caller may need to call _bt_checkpage() with buf when pin on buf
  * wasn't originally acquired in _bt_getbuf() or _bt_relandgetbuf().
  */
diff --git a/src/backend/access/nbtree/nbtreadpage.c b/src/backend/access/nbtree/nbtreadpage.c
index 2ba1ca660..e24abf466 100644
--- a/src/backend/access/nbtree/nbtreadpage.c
+++ b/src/backend/access/nbtree/nbtreadpage.c
@@ -32,6 +32,7 @@ typedef struct BTReadPageState
 {
 	/* Input parameters, set by _bt_readpage for _bt_checkkeys */
 	ScanDirection dir;			/* current scan direction */
+	BlockNumber currpage;		/* current page being read */
 	OffsetNumber minoff;		/* Lowest non-pivot tuple's offset */
 	OffsetNumber maxoff;		/* Highest non-pivot tuple's offset */
 	IndexTuple	finaltup;		/* Needed by scans with array keys */
@@ -63,14 +64,13 @@ static bool _bt_scanbehind_checkkeys(IndexScanDesc scan, ScanDirection dir,
 									 IndexTuple finaltup);
 static bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
 								  IndexTuple finaltup);
-static void _bt_saveitem(BTScanOpaque so, int itemIndex,
-						 OffsetNumber offnum, IndexTuple itup);
-static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+static void _bt_saveitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+						 IndexTuple itup, int *tupleOffset);
+static int	_bt_setuppostingitems(IndexScanBatch newbatch, int itemIndex,
 								  OffsetNumber offnum, const ItemPointerData *heapTid,
-								  IndexTuple itup);
-static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
-									   OffsetNumber offnum,
-									   ItemPointer heapTid, int tupleOffset);
+								  IndexTuple itup, int *tupleOffset);
+static inline void _bt_savepostingitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+									   ItemPointer heapTid, int baseOffset);
 static bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
 						  IndexTuple tuple, int tupnatts);
 static bool _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
@@ -111,15 +111,15 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
 
 
 /*
- *	_bt_readpage() -- Load data from current index page into so->currPos
+ *	_bt_readpage() -- Load data from current index page into newbatch.
  *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here.  Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate.  All other fields of so->currPos are
+ * Caller must have pinned and read-locked newbatch.buf; the buffer's state is
+ * not changed here.  Also, newbatch's moreLeft and moreRight must be valid;
+ * they are updated as appropriate.  All other fields of newbatch are
  * initialized from scratch here.
  *
  * We scan the current page starting at offnum and moving in the indicated
- * direction.  All items matching the scan keys are loaded into currPos.items.
+ * direction.  All items matching the scan keys are saved in newbatch.items.
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction
  * (could just be for the current primitive index scan when scan has arrays).
@@ -131,8 +131,8 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
  * Returns true if any matching items found on the page, false if none.
  */
 bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
-			 bool firstpage)
+_bt_readpage(IndexScanDesc scan, IndexScanBatch newbatch, ScanDirection dir,
+			 OffsetNumber offnum, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -144,23 +144,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	bool		arrayKeys,
 				ignore_killed_tuples = scan->ignore_killed_tuples;
 	int			itemIndex,
+				tupleOffset = 0,
 				indnatts;
 
 	/* save the page/buffer block number, along with its sibling links */
-	page = BufferGetPage(so->currPos.buf);
+	page = BufferGetPage(newbatch->buf);
 	opaque = BTPageGetOpaque(page);
-	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-	so->currPos.prevPage = opaque->btpo_prev;
-	so->currPos.nextPage = opaque->btpo_next;
-	/* delay setting so->currPos.lsn until _bt_drop_lock_and_maybe_pin */
-	pstate.dir = so->currPos.dir = dir;
-	so->currPos.nextTupleOffset = 0;
+	pstate.currpage = newbatch->currPage = BufferGetBlockNumber(newbatch->buf);
+	newbatch->prevPage = opaque->btpo_prev;
+	newbatch->nextPage = opaque->btpo_next;
+	pstate.dir = newbatch->dir = dir;
 
 	/* either moreRight or moreLeft should be set now (may be unset later) */
-	Assert(ScanDirectionIsForward(dir) ? so->currPos.moreRight :
-		   so->currPos.moreLeft);
+	Assert(ScanDirectionIsForward(dir) ? newbatch->moreRight : newbatch->moreLeft);
 	Assert(!P_IGNORE(opaque));
-	Assert(BTScanPosIsPinned(so->currPos));
 	Assert(!so->needPrimScan);
 
 	/* initialize local variables */
@@ -188,14 +185,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	{
 		/* allow next/prev page to be read by other worker without delay */
 		if (ScanDirectionIsForward(dir))
-			_bt_parallel_release(scan, so->currPos.nextPage,
-								 so->currPos.currPage);
+			_bt_parallel_release(scan, newbatch->nextPage, newbatch->currPage);
 		else
-			_bt_parallel_release(scan, so->currPos.prevPage,
-								 so->currPos.currPage);
+			_bt_parallel_release(scan, newbatch->prevPage, newbatch->currPage);
 	}
 
-	PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);
+	PredicateLockPage(rel, pstate.currpage, scan->xs_snapshot);
 
 	if (ScanDirectionIsForward(dir))
 	{
@@ -212,11 +207,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 					!_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup))
 				{
 					/* Schedule another primitive index scan after all */
-					so->currPos.moreRight = false;
+					newbatch->moreRight = false;
 					so->needPrimScan = true;
 					if (scan->parallel_scan)
-						_bt_parallel_primscan_schedule(scan,
-													   so->currPos.currPage);
+						_bt_parallel_primscan_schedule(scan, newbatch->currPage);
 					return false;
 				}
 			}
@@ -280,26 +274,26 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				if (!BTreeTupleIsPosting(itup))
 				{
 					/* Remember it */
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset);
 					itemIndex++;
 				}
 				else
 				{
-					int			tupleOffset;
+					int			baseOffset;
 
 					/* Set up posting list state (and remember first TID) */
-					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+					baseOffset =
+						_bt_setuppostingitems(newbatch, itemIndex, offnum,
 											  BTreeTupleGetPostingN(itup, 0),
-											  itup);
+											  itup, &tupleOffset);
 					itemIndex++;
 
 					/* Remember all later TIDs (must be at least one) */
 					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
 					{
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(newbatch, itemIndex, offnum,
 											BTreeTupleGetPostingN(itup, i),
-											tupleOffset);
+											baseOffset);
 						itemIndex++;
 					}
 				}
@@ -339,12 +333,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		}
 
 		if (!pstate.continuescan)
-			so->currPos.moreRight = false;
+			newbatch->moreRight = false;
 
 		Assert(itemIndex <= MaxTIDsPerBTreePage);
-		so->currPos.firstItem = 0;
-		so->currPos.lastItem = itemIndex - 1;
-		so->currPos.itemIndex = 0;
+		newbatch->firstItem = 0;
+		newbatch->lastItem = itemIndex - 1;
 	}
 	else
 	{
@@ -361,11 +354,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 					!_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup))
 				{
 					/* Schedule another primitive index scan after all */
-					so->currPos.moreLeft = false;
+					newbatch->moreLeft = false;
 					so->needPrimScan = true;
 					if (scan->parallel_scan)
-						_bt_parallel_primscan_schedule(scan,
-													   so->currPos.currPage);
+						_bt_parallel_primscan_schedule(scan, newbatch->currPage);
 					return false;
 				}
 			}
@@ -466,27 +458,27 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				{
 					/* Remember it */
 					itemIndex--;
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset);
 				}
 				else
 				{
 					uint16		nitems = BTreeTupleGetNPosting(itup);
-					int			tupleOffset;
+					int			baseOffset;
 
 					/* Set up posting list state (and remember last TID) */
 					itemIndex--;
-					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+					baseOffset =
+						_bt_setuppostingitems(newbatch, itemIndex, offnum,
 											  BTreeTupleGetPostingN(itup, nitems - 1),
-											  itup);
+											  itup, &tupleOffset);
 
 					/* Remember all prior TIDs (must be at least one) */
 					for (int i = nitems - 2; i >= 0; i--)
 					{
 						itemIndex--;
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(newbatch, itemIndex, offnum,
 											BTreeTupleGetPostingN(itup, i),
-											tupleOffset);
+											baseOffset);
 					}
 				}
 			}
@@ -502,12 +494,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		 * be found there
 		 */
 		if (!pstate.continuescan)
-			so->currPos.moreLeft = false;
+			newbatch->moreLeft = false;
 
 		Assert(itemIndex >= 0);
-		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
-		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+		newbatch->firstItem = itemIndex;
+		newbatch->lastItem = MaxTIDsPerBTreePage - 1;
 	}
 
 	/*
@@ -524,7 +515,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	 */
 	Assert(!pstate.forcenonrequired);
 
-	return (so->currPos.firstItem <= so->currPos.lastItem);
+	return (newbatch->firstItem <= newbatch->lastItem);
 }
 
 /*
@@ -1027,90 +1018,96 @@ _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
 	return true;
 }
 
-/* Save an index item into so->currPos.items[itemIndex] */
+/* Save an index item into newbatch.items[itemIndex] */
 static void
-_bt_saveitem(BTScanOpaque so, int itemIndex,
-			 OffsetNumber offnum, IndexTuple itup)
+_bt_saveitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+			 IndexTuple itup, int *tupleOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
-
 	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
 
-	currItem->heapTid = itup->t_tid;
-	currItem->indexOffset = offnum;
-	if (so->currTuples)
+	/* copy the populated part of the items array */
+	newbatch->items[itemIndex].heapTid = itup->t_tid;
+	newbatch->items[itemIndex].indexOffset = offnum;
+	newbatch->items[itemIndex].allVisible = false;
+
+	if (newbatch->currTuples)
 	{
 		Size		itupsz = IndexTupleSize(itup);
 
-		currItem->tupleOffset = so->currPos.nextTupleOffset;
-		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
-		so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+		newbatch->items[itemIndex].tupleOffset = *tupleOffset;
+		memcpy(newbatch->currTuples + *tupleOffset, itup, itupsz);
+		*tupleOffset += MAXALIGN(itupsz);
 	}
 }
 
 /*
  * Setup state to save TIDs/items from a single posting list tuple.
  *
- * Saves an index item into so->currPos.items[itemIndex] for TID that is
- * returned to scan first.  Second or subsequent TIDs for posting list should
- * be saved by calling _bt_savepostingitem().
+ * Saves an index item into newbatch.items[itemIndex] for TID that is returned
+ * to scan first.  Second or subsequent TIDs for posting list should be saved
+ * by calling _bt_savepostingitem().
  *
- * Returns an offset into tuple storage space that main tuple is stored at if
- * needed.
+ * Returns baseOffset, an offset into tuple storage space that main tuple is
+ * stored at if needed.
  */
 static int
-_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
-					  const ItemPointerData *heapTid, IndexTuple itup)
+_bt_setuppostingitems(IndexScanBatch newbatch, int itemIndex,
+					  OffsetNumber offnum, const ItemPointerData *heapTid,
+					  IndexTuple itup, int *tupleOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *item = &newbatch->items[itemIndex];
 
 	Assert(BTreeTupleIsPosting(itup));
 
-	currItem->heapTid = *heapTid;
-	currItem->indexOffset = offnum;
-	if (so->currTuples)
+	/* copy the populated part of the items array */
+	item->heapTid = *heapTid;
+	item->indexOffset = offnum;
+	item->allVisible = false;
+
+	if (newbatch->currTuples)
 	{
 		/* Save base IndexTuple (truncate posting list) */
 		IndexTuple	base;
 		Size		itupsz = BTreeTupleGetPostingOffset(itup);
 
 		itupsz = MAXALIGN(itupsz);
-		currItem->tupleOffset = so->currPos.nextTupleOffset;
-		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		item->tupleOffset = *tupleOffset;
+		base = (IndexTuple) (newbatch->currTuples + *tupleOffset);
 		memcpy(base, itup, itupsz);
 		/* Defensively reduce work area index tuple header size */
 		base->t_info &= ~INDEX_SIZE_MASK;
 		base->t_info |= itupsz;
-		so->currPos.nextTupleOffset += itupsz;
+		*tupleOffset += itupsz;
 
-		return currItem->tupleOffset;
+		return item->tupleOffset;
 	}
 
 	return 0;
 }
 
 /*
- * Save an index item into so->currPos.items[itemIndex] for current posting
+ * Save an index item into newbatch.items[itemIndex] for current posting
  * tuple.
  *
  * Assumes that _bt_setuppostingitems() has already been called for current
- * posting list tuple.  Caller passes its return value as tupleOffset.
+ * posting list tuple.  Caller passes its return value as baseOffset.
  */
 static inline void
-_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
-					ItemPointer heapTid, int tupleOffset)
+_bt_savepostingitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int baseOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *item = &newbatch->items[itemIndex];
 
-	currItem->heapTid = *heapTid;
-	currItem->indexOffset = offnum;
+	item->heapTid = *heapTid;
+	item->indexOffset = offnum;
+	item->allVisible = false;
 
 	/*
 	 * Have index-only scans return the same base IndexTuple for every TID
 	 * that originates from the same posting list
 	 */
-	if (so->currTuples)
-		currItem->tupleOffset = tupleOffset;
+	if (newbatch->currTuples)
+		item->tupleOffset = baseOffset;
 }
 
 #define LOOK_AHEAD_REQUIRED_RECHECKS 	3
@@ -2822,13 +2819,13 @@ new_prim_scan:
 	 * Note: We make a soft assumption that the current scan direction will
 	 * also be used within _bt_next, when it is asked to step off this page.
 	 * It is up to _bt_next to cancel this scheduled primitive index scan
-	 * whenever it steps to a page in the direction opposite currPos.dir.
+	 * whenever it steps to a page in the direction opposite pstate->dir.
 	 */
 	pstate->continuescan = false;	/* Tell _bt_readpage we're done... */
 	so->needPrimScan = true;	/* ...but call _bt_first again */
 
 	if (scan->parallel_scan)
-		_bt_parallel_primscan_schedule(scan, so->currPos.currPage);
+		_bt_parallel_primscan_schedule(scan, pstate->currpage);
 
 	/* Caller's tuple doesn't match the new qual */
 	return false;
@@ -2913,14 +2910,6 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
 	 * Restore the array keys to the state they were in immediately before we
 	 * were called.  This ensures that the arrays only ever ratchet in the
 	 * current scan direction.
-	 *
-	 * Without this, scans could overlook matching tuples when the scan
-	 * direction gets reversed just before btgettuple runs out of items to
-	 * return, but just after _bt_readpage prepares all the items from the
-	 * scan's final page in so->currPos.  When we're on the final page it is
-	 * typical for so->currPos to get invalidated once btgettuple finally
-	 * returns false, which'll effectively invalidate the scan's array keys.
-	 * That hasn't happened yet, though -- and in general it may never happen.
 	 */
 	_bt_start_array_keys(scan, -dir);
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3dec1ee65..690f0534f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -159,11 +159,12 @@ bthandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = btadjustmembers,
 		.ambeginscan = btbeginscan,
 		.amrescan = btrescan,
-		.amgettuple = btgettuple,
+		.amgettuple = NULL,
+		.amgetbatch = btgetbatch,
+		.amfreebatch = btfreebatch,
 		.amgetbitmap = btgetbitmap,
 		.amendscan = btendscan,
-		.ammarkpos = btmarkpos,
-		.amrestrpos = btrestrpos,
+		.amposreset = btposreset,
 		.amestimateparallelscan = btestimateparallelscan,
 		.aminitparallelscan = btinitparallelscan,
 		.amparallelrescan = btparallelrescan,
@@ -222,13 +223,13 @@ btinsert(Relation rel, Datum *values, bool *isnull,
 }
 
 /*
- *	btgettuple() -- Get the next tuple in the scan.
+ *	btgetbatch() -- Get the first or next batch of tuples in the scan
  */
-bool
-btgettuple(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+btgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	bool		res;
+	IndexScanBatch batch = priorbatch;
 
 	Assert(scan->heapRelation != NULL);
 
@@ -241,45 +242,20 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * If we've already initialized this scan, we can just advance it in
 		 * the appropriate direction.  If we haven't done so yet, we call
-		 * _bt_first() to get the first item in the scan.
+		 * _bt_first() to get the first batch in the scan.
 		 */
-		if (!BTScanPosIsValid(so->currPos))
-			res = _bt_first(scan, dir);
+		if (batch == NULL)
+			batch = _bt_first(scan, dir);
 		else
-		{
-			/*
-			 * Check to see if we should kill the previously-fetched tuple.
-			 */
-			if (scan->kill_prior_tuple)
-			{
-				/*
-				 * Yes, remember it for later. (We'll deal with all such
-				 * tuples at once right before leaving the index page.)  The
-				 * test for numKilled overrun is not just paranoia: if the
-				 * caller reverses direction in the indexscan then the same
-				 * item might get entered multiple times. It's not worth
-				 * trying to optimize that, so we don't detect it, but instead
-				 * just forget any excess entries.
-				 */
-				if (so->killedItems == NULL)
-					so->killedItems = palloc_array(int, MaxTIDsPerBTreePage);
-				if (so->numKilled < MaxTIDsPerBTreePage)
-					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
-			}
+			batch = _bt_next(scan, dir, batch);
 
-			/*
-			 * Now continue the scan.
-			 */
-			res = _bt_next(scan, dir);
-		}
-
-		/* If we have a tuple, return it ... */
-		if (res)
+		/* If we have a batch, return it ... */
+		if (batch)
 			break;
 		/* ... otherwise see if we need another primitive index scan */
 	} while (so->numArrayKeys && _bt_start_prim_scan(scan));
 
-	return res;
+	return batch;
 }
 
 /*
@@ -289,6 +265,7 @@ int64
 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	IndexScanBatch batch;
 	int64		ntids = 0;
 	ItemPointer heapTid;
 
@@ -297,29 +274,29 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	/* Each loop iteration performs another primitive index scan */
 	do
 	{
-		/* Fetch the first page & tuple */
-		if (_bt_first(scan, ForwardScanDirection))
+		/* Fetch the first batch */
+		if ((batch = _bt_first(scan, ForwardScanDirection)))
 		{
-			/* Save tuple ID, and continue scanning */
-			heapTid = &scan->xs_heaptid;
+			int			itemIndex = 0;
+
+			/* Save first tuple's TID */
+			heapTid = &batch->items[itemIndex].heapTid;
 			tbm_add_tuples(tbm, heapTid, 1, false);
 			ntids++;
 
 			for (;;)
 			{
-				/*
-				 * Advance to next tuple within page.  This is the same as the
-				 * easy case in _bt_next().
-				 */
-				if (++so->currPos.itemIndex > so->currPos.lastItem)
+				/* Advance to next TID within page-sized batch */
+				if (++itemIndex > batch->lastItem)
 				{
 					/* let _bt_next do the heavy lifting */
-					if (!_bt_next(scan, ForwardScanDirection))
+					itemIndex = 0;
+					batch = _bt_next(scan, ForwardScanDirection, batch);
+					if (!batch)
 						break;
 				}
 
-				/* Save tuple ID, and continue scanning */
-				heapTid = &so->currPos.items[so->currPos.itemIndex].heapTid;
+				heapTid = &batch->items[itemIndex].heapTid;
 				tbm_add_tuples(tbm, heapTid, 1, false);
 				ntids++;
 			}
@@ -347,8 +324,6 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 
 	/* allocate private workspace */
 	so = palloc_object(BTScanOpaqueData);
-	BTScanPosInvalidate(so->currPos);
-	BTScanPosInvalidate(so->markPos);
 	if (scan->numberOfKeys > 0)
 		so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
 	else
@@ -362,19 +337,9 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->orderProcs = NULL;
 	so->arrayContext = NULL;
 
-	so->killedItems = NULL;		/* until needed */
-	so->numKilled = 0;
-
-	/*
-	 * We don't know yet whether the scan will be index-only, so we do not
-	 * allocate the tuple workspace arrays until btrescan.  However, we set up
-	 * scan->xs_itupdesc whether we'll need it or not, since that's so cheap.
-	 */
-	so->currTuples = so->markTuples = NULL;
-
-	scan->xs_itupdesc = RelationGetDescr(rel);
-
 	scan->opaque = so;
+	scan->xs_itupdesc = RelationGetDescr(rel);
+	scan->maxitemsbatch = MaxTIDsPerBTreePage;
 
 	return scan;
 }
@@ -388,72 +353,39 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pins */
-	if (BTScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_bt_killitems(scan);
-		BTScanPosUnpinIfPinned(so->currPos);
-		BTScanPosInvalidate(so->currPos);
-	}
-
-	/*
-	 * We prefer to eagerly drop leaf page pins before btgettuple returns.
-	 * This avoids making VACUUM wait to acquire a cleanup lock on the page.
-	 *
-	 * We cannot safely drop leaf page pins during index-only scans due to a
-	 * race condition involving VACUUM setting pages all-visible in the VM.
-	 * It's also unsafe for plain index scans that use a non-MVCC snapshot.
-	 *
-	 * When we drop pins eagerly, the mechanism that marks so->killedItems[]
-	 * index tuples LP_DEAD has to deal with concurrent TID recycling races.
-	 * The scheme used to detect unsafe TID recycling won't work when scanning
-	 * unlogged relations (since it involves saving an affected page's LSN).
-	 * Opt out of eager pin dropping during unlogged relation scans for now
-	 * (this is preferable to opting out of kill_prior_tuple LP_DEAD setting).
-	 *
-	 * Also opt out of dropping leaf page pins eagerly during bitmap scans.
-	 * Pins cannot be held for more than an instant during bitmap scans either
-	 * way, so we might as well avoid wasting cycles on acquiring page LSNs.
-	 *
-	 * See nbtree/README section on making concurrent TID recycling safe.
-	 *
-	 * Note: so->dropPin should never change across rescans.
-	 */
-	so->dropPin = (!scan->xs_want_itup &&
-				   IsMVCCSnapshot(scan->xs_snapshot) &&
-				   RelationNeedsWAL(scan->indexRelation) &&
-				   scan->heapRelation != NULL);
-
-	so->markItemIndex = -1;
-	so->needPrimScan = false;
-	so->scanBehind = false;
-	so->oppositeDirCheck = false;
-	BTScanPosUnpinIfPinned(so->markPos);
-	BTScanPosInvalidate(so->markPos);
-
-	/*
-	 * Allocate tuple workspace arrays, if needed for an index-only scan and
-	 * not already done in a previous rescan call.  To save on palloc
-	 * overhead, both workspaces are allocated as one palloc block; only this
-	 * function and btendscan know that.
-	 */
-	if (scan->xs_want_itup && so->currTuples == NULL)
-	{
-		so->currTuples = (char *) palloc(BLCKSZ * 2);
-		so->markTuples = so->currTuples + BLCKSZ;
-	}
-
 	/*
 	 * Reset the scan keys
 	 */
 	if (scankey && scan->numberOfKeys > 0)
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
+	so->needPrimScan = false;
+	so->scanBehind = false;
+	so->oppositeDirCheck = false;
 	so->numberOfKeys = 0;		/* until _bt_preprocess_keys sets it */
 	so->numArrayKeys = 0;		/* ditto */
 }
 
+/*
+ *	btfreebatch() -- Free batch resources, including its buffer pin
+ */
+void
+btfreebatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	if (batch->numKilled > 0)
+		_bt_killitems(scan, batch);
+
+	if (BufferIsValid(batch->buf))
+	{
+		/* table AM didn't unpin page earlier -- do it now */
+		Assert(!scan->MVCCScan);
+
+		ReleaseBuffer(batch->buf);
+		batch->buf = InvalidBuffer;
+	}
+
+	indexam_util_batch_release(scan, batch);
+}
+
 /*
  *	btendscan() -- close down a scan
  */
@@ -462,116 +394,48 @@ btendscan(IndexScanDesc scan)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pins */
-	if (BTScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_bt_killitems(scan);
-		BTScanPosUnpinIfPinned(so->currPos);
-	}
-
-	so->markItemIndex = -1;
-	BTScanPosUnpinIfPinned(so->markPos);
-
-	/* No need to invalidate positions, the RAM is about to be freed. */
-
 	/* Release storage */
 	if (so->keyData != NULL)
 		pfree(so->keyData);
 	/* so->arrayKeys and so->orderProcs are in arrayContext */
 	if (so->arrayContext != NULL)
 		MemoryContextDelete(so->arrayContext);
-	if (so->killedItems != NULL)
-		pfree(so->killedItems);
-	if (so->currTuples != NULL)
-		pfree(so->currTuples);
-	/* so->markTuples should not be pfree'd, see btrescan */
 	pfree(so);
 }
 
 /*
- *	btmarkpos() -- save current scan position
+ *	btposreset() -- invalidate scan's array keys
  */
 void
-btmarkpos(IndexScanDesc scan)
+btposreset(IndexScanDesc scan, IndexScanBatch markbatch)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* There may be an old mark with a pin (but no lock). */
-	BTScanPosUnpinIfPinned(so->markPos);
+	if (!so->numArrayKeys)
+		return;
 
 	/*
-	 * Just record the current itemIndex.  If we later step to next page
-	 * before releasing the marked position, _bt_steppage makes a full copy of
-	 * the currPos struct in markPos.  If (as often happens) the mark is moved
-	 * before we leave the page, we don't have to do that work.
+	 * Core system is about to restore a mark associated with a previously
+	 * returned batch.  Reset the scan's arrays to make all this safe.
 	 */
-	if (BTScanPosIsValid(so->currPos))
-		so->markItemIndex = so->currPos.itemIndex;
+	_bt_start_array_keys(scan, markbatch->dir);
+
+	/*
+	 * Core system will invalidate all other batches.
+	 *
+	 * Deal with this by unsetting needPrimScan as well as moreRight (or as
+	 * well as moreLeft, when scanning backwards).  That way, the next time
+	 * _bt_next is called it will step to the right (or to the left).  At that
+	 * point _bt_readpage will restore the scan's arrays to elements that
+	 * correctly track the next page's position in the index's key space.
+	 */
+	if (ScanDirectionIsForward(markbatch->dir))
+		markbatch->moreRight = true;
 	else
-	{
-		BTScanPosInvalidate(so->markPos);
-		so->markItemIndex = -1;
-	}
-}
-
-/*
- *	btrestrpos() -- restore scan to last saved position
- */
-void
-btrestrpos(IndexScanDesc scan)
-{
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
-	if (so->markItemIndex >= 0)
-	{
-		/*
-		 * The scan has never moved to a new page since the last mark.  Just
-		 * restore the itemIndex.
-		 *
-		 * NB: In this case we can't count on anything in so->markPos to be
-		 * accurate.
-		 */
-		so->currPos.itemIndex = so->markItemIndex;
-	}
-	else
-	{
-		/*
-		 * The scan moved to a new page after last mark or restore, and we are
-		 * now restoring to the marked page.  We aren't holding any read
-		 * locks, but if we're still holding the pin for the current position,
-		 * we must drop it.
-		 */
-		if (BTScanPosIsValid(so->currPos))
-		{
-			/* Before leaving current page, deal with any killed items */
-			if (so->numKilled > 0)
-				_bt_killitems(scan);
-			BTScanPosUnpinIfPinned(so->currPos);
-		}
-
-		if (BTScanPosIsValid(so->markPos))
-		{
-			/* bump pin on mark buffer for assignment to current buffer */
-			if (BTScanPosIsPinned(so->markPos))
-				IncrBufferRefCount(so->markPos.buf);
-			memcpy(&so->currPos, &so->markPos,
-				   offsetof(BTScanPosData, items[1]) +
-				   so->markPos.lastItem * sizeof(BTScanPosItem));
-			if (so->currTuples)
-				memcpy(so->currTuples, so->markTuples,
-					   so->markPos.nextTupleOffset);
-			/* Reset the scan's array keys (see _bt_steppage for why) */
-			if (so->numArrayKeys)
-			{
-				_bt_start_array_keys(scan, so->currPos.dir);
-				so->needPrimScan = false;
-			}
-		}
-		else
-			BTScanPosInvalidate(so->currPos);
-	}
+		markbatch->moreLeft = true;
+	so->needPrimScan = false;
+	so->scanBehind = false;
+	so->oppositeDirCheck = false;
 }
 
 /*
@@ -887,15 +751,6 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
 	*next_scan_page = InvalidBlockNumber;
 	*last_curr_page = InvalidBlockNumber;
 
-	/*
-	 * Reset so->currPos, and initialize moreLeft/moreRight such that the next
-	 * call to _bt_readnextpage treats this backend similarly to a serial
-	 * backend that steps from *last_curr_page to *next_scan_page (unless this
-	 * backend's so->currPos is initialized by _bt_readfirstpage before then).
-	 */
-	BTScanPosInvalidate(so->currPos);
-	so->currPos.moreLeft = so->currPos.moreRight = true;
-
 	if (first)
 	{
 		/*
@@ -1045,8 +900,6 @@ _bt_parallel_done(IndexScanDesc scan)
 	BTParallelScanDesc btscan;
 	bool		status_changed = false;
 
-	Assert(!BTScanPosIsValid(so->currPos));
-
 	/* Do nothing, for non-parallel scans */
 	if (parallel_scan == NULL)
 		return;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 32ae0bda8..03e8931d2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,53 +26,23 @@
 #include "utils/rel.h"
 
 
-static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so);
 static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
 							Buffer buf, bool forupdate, BTStack stack,
 							int access);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
-static inline void _bt_returnitem(IndexScanDesc scan, BTScanOpaque so);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum,
-							  ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
-							 BlockNumber lastcurrblkno, ScanDirection dir,
-							 bool seized);
+static IndexScanBatch _bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch,
+										OffsetNumber offnum, ScanDirection dir);
+static IndexScanBatch _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+									   BlockNumber lastcurrblkno,
+									   ScanDirection dir, bool firstpage);
 static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
 										 BlockNumber lastcurrblkno);
-static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
+static IndexScanBatch _bt_endpoint(IndexScanDesc scan, ScanDirection dir,
+								   IndexScanBatch firstbatch);
 
 
-/*
- *	_bt_drop_lock_and_maybe_pin()
- *
- * Unlock so->currPos.buf.  If scan is so->dropPin, drop the pin, too.
- * Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
- */
-static inline void
-_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
-{
-	if (!so->dropPin)
-	{
-		/* Just drop the lock (not the pin) */
-		_bt_unlockbuf(rel, so->currPos.buf);
-		return;
-	}
-
-	/*
-	 * Drop both the lock and the pin.
-	 *
-	 * Have to set so->currPos.lsn so that _bt_killitems has a way to detect
-	 * when concurrent heap TID recycling by VACUUM might have taken place.
-	 */
-	Assert(RelationNeedsWAL(rel));
-	so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-	_bt_relbuf(rel, so->currPos.buf);
-	so->currPos.buf = InvalidBuffer;
-}
-
 /*
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
@@ -861,20 +831,16 @@ _bt_compare(Relation rel,
  *		conditions, and the tree ordering.  We find the first item (or,
  *		if backwards scan, the last item) in the tree that satisfies the
  *		qualifications in the scan key.  On success exit, data about the
- *		matching tuple(s) on the page has been loaded into so->currPos.  We'll
- *		drop all locks and hold onto a pin on page's buffer, except during
- *		so->dropPin scans, when we drop both the lock and the pin.
- *		_bt_returnitem sets the next item to return to scan on success exit.
+ *		matching tuple(s) on the page has been loaded into the returned batch.
  *
- * If there are no matching items in the index, we return false, with no
- * pins or locks held.  so->currPos will remain invalid.
+ * If there are no matching items in the index, we just return NULL.
  *
  * Note that scan->keyData[], and the so->keyData[] scankey built from it,
  * are both search-type scankeys (see nbtree/README for more about this).
  * Within this routine, we build a temporary insertion-type scankey to use
  * in locating the scan start position.
  */
-bool
+IndexScanBatch
 _bt_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
@@ -888,8 +854,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat_total = InvalidStrategy;
 	BlockNumber blkno = InvalidBlockNumber,
 				lastcurrblkno;
+	IndexScanBatch firstbatch;
 
-	Assert(!BTScanPosIsValid(so->currPos));
+	/* Allocate space for first batch */
+	firstbatch = indexam_util_batch_alloc(scan);
 
 	/*
 	 * Examine the scan keys and eliminate any redundant keys; also mark the
@@ -905,6 +873,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		Assert(!so->needPrimScan);
 		_bt_parallel_done(scan);
+		indexam_util_batch_release(scan, firstbatch);
 		return false;
 	}
 
@@ -914,7 +883,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 */
 	if (scan->parallel_scan != NULL &&
 		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, true))
-		return false;
+	{
+		indexam_util_batch_release(scan, firstbatch);
+		return false;			/* definitely done (so->needPrimScan is unset) */
+	}
 
 	/*
 	 * Initialize the scan's arrays (if any) for the current scan direction
@@ -931,14 +903,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * _bt_readnextpage releases the scan for us (not _bt_readfirstpage).
 		 */
 		Assert(scan->parallel_scan != NULL);
-		Assert(!so->needPrimScan);
-		Assert(blkno != P_NONE);
 
-		if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir, true))
-			return false;
+		indexam_util_batch_release(scan, firstbatch);
 
-		_bt_returnitem(scan, so);
-		return true;
+		return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, true);
 	}
 
 	/*
@@ -1238,7 +1206,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * Note: calls _bt_readfirstpage for us, which releases the parallel scan.
 	 */
 	if (keysz == 0)
-		return _bt_endpoint(scan, dir);
+		return _bt_endpoint(scan, dir, firstbatch);
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -1506,12 +1474,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * position ourselves on the target leaf page.
 	 */
 	Assert(ScanDirectionIsBackward(dir) == inskey.backward);
-	stack = _bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ);
+	stack = _bt_search(rel, NULL, &inskey, &firstbatch->buf, BT_READ);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
 
-	if (!BufferIsValid(so->currPos.buf))
+	if (unlikely(!BufferIsValid(firstbatch->buf)))
 	{
 		Assert(!so->needPrimScan);
 
@@ -1527,23 +1495,24 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		if (IsolationIsSerializable())
 		{
 			PredicateLockRelation(rel, scan->xs_snapshot);
-			stack = _bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ);
+			stack = _bt_search(rel, NULL, &inskey, &firstbatch->buf, BT_READ);
 			_bt_freestack(stack);
 		}
 
-		if (!BufferIsValid(so->currPos.buf))
+		if (!BufferIsValid(firstbatch->buf))
 		{
 			_bt_parallel_done(scan);
+			indexam_util_batch_release(scan, firstbatch);
 			return false;
 		}
 	}
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, &inskey, so->currPos.buf);
+	offnum = _bt_binsrch(rel, &inskey, firstbatch->buf);
 
 	/*
 	 * Now load data from the first page of the scan (usually the page
-	 * currently in so->currPos.buf).
+	 * currently in firstbatch.buf).
 	 *
 	 * If inskey.nextkey = false and inskey.backward = false, offnum is
 	 * positioned at the first non-pivot tuple >= inskey.scankeys.
@@ -1561,168 +1530,69 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * for the page.  For example, when inskey is both < the leaf page's high
 	 * key and > all of its non-pivot tuples, offnum will be "maxoff + 1".
 	 */
-	if (!_bt_readfirstpage(scan, offnum, dir))
-		return false;
-
-	_bt_returnitem(scan, so);
-	return true;
+	return _bt_readfirstpage(scan, firstbatch, offnum, dir);
 }
 
 /*
  *	_bt_next() -- Get the next item in a scan.
  *
- *		On entry, so->currPos describes the current page, which may be pinned
- *		but is not locked, and so->currPos.itemIndex identifies which item was
- *		previously returned.
+ *		On entry, priorbatch describes the batch that was last returned by
+ *		btgetbatch.  We'll use the prior batch's positioning information to
+ *		decide which page to read next.
  *
- *		On success exit, so->currPos is updated as needed, and _bt_returnitem
- *		sets the next item to return to the scan.  so->currPos remains valid.
+ *		On success exit, returns the next batch.  There must be at least one
+ *		matching tuple on any returned batch (else we'd just return NULL).
  *
- *		On failure exit (no more tuples), we invalidate so->currPos.  It'll
- *		still be possible for the scan to return tuples by changing direction,
- *		though we'll need to call _bt_first anew in that other direction.
+ *		On failure exit (no more tuples), we return NULL.  It'll still be
+ *		possible for the scan to return tuples by changing direction, though
+ *		we'll need to call _bt_first anew in that other direction.
  */
-bool
-_bt_next(IndexScanDesc scan, ScanDirection dir)
-{
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
-	Assert(BTScanPosIsValid(so->currPos));
-
-	/*
-	 * Advance to next tuple on current page; or if there's no more, try to
-	 * step to the next page with data.
-	 */
-	if (ScanDirectionIsForward(dir))
-	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
-		{
-			if (!_bt_steppage(scan, dir))
-				return false;
-		}
-	}
-	else
-	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
-		{
-			if (!_bt_steppage(scan, dir))
-				return false;
-		}
-	}
-
-	_bt_returnitem(scan, so);
-	return true;
-}
-
-/*
- * Return the index item from so->currPos.items[so->currPos.itemIndex] to the
- * index scan by setting the relevant fields in caller's index scan descriptor
- */
-static inline void
-_bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
-{
-	BTScanPosItem *currItem = &so->currPos.items[so->currPos.itemIndex];
-
-	/* Most recent _bt_readpage must have succeeded */
-	Assert(BTScanPosIsValid(so->currPos));
-	Assert(so->currPos.itemIndex >= so->currPos.firstItem);
-	Assert(so->currPos.itemIndex <= so->currPos.lastItem);
-
-	/* Return next item, per amgettuple contract */
-	scan->xs_heaptid = currItem->heapTid;
-	if (so->currTuples)
-		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
-}
-
-/*
- *	_bt_steppage() -- Step to next page containing valid data for scan
- *
- * Wrapper on _bt_readnextpage that performs final steps for the current page.
- *
- * On entry, so->currPos must be valid.  Its buffer will be pinned, though
- * never locked. (Actually, when so->dropPin there won't even be a pin held,
- * though so->currPos.currPage must still be set to a valid block number.)
- */
-static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+_bt_next(IndexScanDesc scan, ScanDirection dir, IndexScanBatch priorbatch)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	BlockNumber blkno,
 				lastcurrblkno;
 
-	Assert(BTScanPosIsValid(so->currPos));
-
-	/* Before leaving current page, deal with any killed items */
-	if (so->numKilled > 0)
-		_bt_killitems(scan);
-
-	/*
-	 * Before we modify currPos, make a copy of the page data if there was a
-	 * mark position that needs it.
-	 */
-	if (so->markItemIndex >= 0)
-	{
-		/* bump pin on current buffer for assignment to mark buffer */
-		if (BTScanPosIsPinned(so->currPos))
-			IncrBufferRefCount(so->currPos.buf);
-		memcpy(&so->markPos, &so->currPos,
-			   offsetof(BTScanPosData, items[1]) +
-			   so->currPos.lastItem * sizeof(BTScanPosItem));
-		if (so->markTuples)
-			memcpy(so->markTuples, so->currTuples,
-				   so->currPos.nextTupleOffset);
-		so->markPos.itemIndex = so->markItemIndex;
-		so->markItemIndex = -1;
-
-		/*
-		 * If we're just about to start the next primitive index scan
-		 * (possible with a scan that has arrays keys, and needs to skip to
-		 * continue in the current scan direction), moreLeft/moreRight only
-		 * indicate the end of the current primitive index scan.  They must
-		 * never be taken to indicate that the top-level index scan has ended
-		 * (that would be wrong).
-		 *
-		 * We could handle this case by treating the current array keys as
-		 * markPos state.  But depending on the current array state like this
-		 * would add complexity.  Instead, we just unset markPos's copy of
-		 * moreRight or moreLeft (whichever might be affected), while making
-		 * btrestrpos reset the scan's arrays to their initial scan positions.
-		 * In effect, btrestrpos leaves advancing the arrays up to the first
-		 * _bt_readpage call (that takes place after it has restored markPos).
-		 */
-		if (so->needPrimScan)
-		{
-			if (ScanDirectionIsForward(so->currPos.dir))
-				so->markPos.moreRight = true;
-			else
-				so->markPos.moreLeft = true;
-		}
-
-		/* mark/restore not supported by parallel scans */
-		Assert(!scan->parallel_scan);
-	}
-
-	BTScanPosUnpinIfPinned(so->currPos);
+	Assert(BlockNumberIsValid(priorbatch->currPage));
 
 	/* Walk to the next page with data */
 	if (ScanDirectionIsForward(dir))
-		blkno = so->currPos.nextPage;
+		blkno = priorbatch->nextPage;
 	else
-		blkno = so->currPos.prevPage;
-	lastcurrblkno = so->currPos.currPage;
+		blkno = priorbatch->prevPage;
+	lastcurrblkno = priorbatch->currPage;
 
 	/*
-	 * Cancel primitive index scans that were scheduled when the call to
-	 * _bt_readpage for currPos happened to use the opposite direction to the
-	 * one that we're stepping in now.  (It's okay to leave the scan's array
-	 * keys as-is, since the next _bt_readpage will advance them.)
+	 * Cancel primitive index scans that were scheduled when priorbatch's call
+	 * to _bt_readpage happened to use the opposite direction to the one that
+	 * we're stepping in now.  (It's okay to leave the scan's array keys
+	 * as-is, since the next _bt_readpage will advance them.)
 	 */
-	if (so->currPos.dir != dir)
+	if (priorbatch->dir != dir)
 		so->needPrimScan = false;
 
+	if (blkno == P_NONE ||
+		(ScanDirectionIsForward(dir) ?
+		 !priorbatch->moreRight : !priorbatch->moreLeft))
+	{
+		/*
+		 * priorbatch _bt_readpage call ended scan in this direction (though
+		 * if so->needPrimScan was set the scan will continue in _bt_first)
+		 */
+		_bt_parallel_done(scan);
+		return NULL;
+	}
+
+	/* parallel scan must seize the scan to get next blkno */
+	if (scan->parallel_scan != NULL &&
+		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		return NULL;			/* done iff so->needPrimScan wasn't set */
+
 	return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
 }
 
+
 /*
  *	_bt_readfirstpage() -- Read first page containing valid data for _bt_first
  *
@@ -1732,73 +1602,90 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  * to stop the scan on this page by calling _bt_checkkeys against the high
  * key.  See _bt_readpage for full details.
  *
- * On entry, so->currPos must be pinned and locked (so offnum stays valid).
+ * On entry, firstbatch must be pinned and locked (so offnum stays valid).
  * Parallel scan callers must have seized the scan before calling here.
  *
- * On exit, we'll have updated so->currPos and retained locks and pins
+ * On exit, we'll have updated firstbatch and retained locks and pins
  * according to the same rules as those laid out for _bt_readnextpage exit.
- * Like _bt_readnextpage, our return value indicates if there are any matching
- * records in the given direction.
  *
  * We always release the scan for a parallel scan caller, regardless of
  * success or failure; we'll call _bt_parallel_release as soon as possible.
  */
-static bool
-_bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
+static IndexScanBatch
+_bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch,
+				  OffsetNumber offnum, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BlockNumber blkno,
+				lastcurrblkno;
 
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
-
-	/* Initialize so->currPos for the first page (page in so->currPos.buf) */
+	/* Initialize firstbatch's position for the first page */
 	if (so->needPrimScan)
 	{
 		Assert(so->numArrayKeys);
 
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = true;
+		firstbatch->moreLeft = true;
+		firstbatch->moreRight = true;
 		so->needPrimScan = false;
 	}
 	else if (ScanDirectionIsForward(dir))
 	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
+		firstbatch->moreLeft = false;
+		firstbatch->moreRight = true;
 	}
 	else
 	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
+		firstbatch->moreLeft = true;
+		firstbatch->moreRight = false;
 	}
 
 	/*
 	 * Attempt to load matching tuples from the first page.
 	 *
-	 * Note that _bt_readpage will finish initializing the so->currPos fields.
+	 * Note that _bt_readpage will finish initializing the firstbatch fields.
 	 * _bt_readpage also releases parallel scan (even when it returns false).
 	 */
-	if (_bt_readpage(scan, dir, offnum, true))
+	if (_bt_readpage(scan, firstbatch, dir, offnum, true))
 	{
-		Relation	rel = scan->indexRelation;
-
-		/*
-		 * _bt_readpage succeeded.  Drop the lock (and maybe the pin) on
-		 * so->currPos.buf in preparation for btgettuple returning tuples.
-		 */
-		Assert(BTScanPosIsPinned(so->currPos));
-		_bt_drop_lock_and_maybe_pin(rel, so);
-		return true;
+		/* _bt_readpage saved one or more matches in firstbatch.items[] */
+		indexam_util_batch_unlock(scan, firstbatch);
+		return firstbatch;
 	}
 
-	/* There's no actually-matching data on the page in so->currPos.buf */
-	_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+	/* There's no actually-matching data on the page */
+	_bt_relbuf(scan->indexRelation, firstbatch->buf);
+	firstbatch->buf = InvalidBuffer;
 
-	/* Call _bt_readnextpage using its _bt_steppage wrapper function */
-	if (!_bt_steppage(scan, dir))
-		return false;
+	/* Walk to the next page with data */
+	if (ScanDirectionIsForward(dir))
+		blkno = firstbatch->nextPage;
+	else
+		blkno = firstbatch->prevPage;
+	lastcurrblkno = firstbatch->currPage;
 
-	/* _bt_readpage for a later page (now in so->currPos) succeeded */
-	return true;
+	Assert(firstbatch->dir == dir);
+
+	if (blkno == P_NONE ||
+		(ScanDirectionIsForward(dir) ?
+		 !firstbatch->moreRight : !firstbatch->moreLeft))
+	{
+		/*
+		 * firstbatch _bt_readpage call ended scan in this direction (though
+		 * if so->needPrimScan was set the scan will continue in _bt_first)
+		 */
+		indexam_util_batch_release(scan, firstbatch);
+		_bt_parallel_done(scan);
+		return NULL;
+	}
+
+	indexam_util_batch_release(scan, firstbatch);
+
+	/* parallel scan must seize the scan to get next blkno */
+	if (scan->parallel_scan != NULL &&
+		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		return NULL;			/* done iff so->needPrimScan wasn't set */
+
+	return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
 }
 
 /*
@@ -1808,102 +1695,65 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
  * previously-saved right link or left link.  lastcurrblkno is the page that
  * was current at the point where the blkno link was saved, which we use to
  * reason about concurrent page splits/page deletions during backwards scans.
- * In the common case where seized=false, blkno is either so->currPos.nextPage
- * or so->currPos.prevPage, and lastcurrblkno is so->currPos.currPage.
+ * blkno is the prior batch's nextPage or prevPage (depending on the current
+ * scan direction), and lastcurrblkno is the prior batch's currPage.
  *
- * On entry, so->currPos shouldn't be locked by caller.  so->currPos.buf must
- * be InvalidBuffer/unpinned as needed by caller (note that lastcurrblkno
- * won't need to be read again in almost all cases).  Parallel scan callers
- * that seized the scan before calling here should pass seized=true; such a
- * caller's blkno and lastcurrblkno arguments come from the seized scan.
- * seized=false callers just pass us the blkno/lastcurrblkno taken from their
- * so->currPos, which (along with so->currPos itself) can be used to end the
- * scan.  A seized=false caller's blkno can never be assumed to be the page
- * that must be read next during a parallel scan, though.  We must figure that
- * part out for ourselves by seizing the scan (the correct page to read might
- * already be beyond the seized=false caller's blkno during a parallel scan,
- * unless blkno/so->currPos.nextPage/so->currPos.prevPage is already P_NONE,
- * or unless so->currPos.moreRight/so->currPos.moreLeft is already unset).
+ * On entry, no page should be locked by caller.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page, and we return true.  We hold a pin on the buffer on
- * success exit (except during so->dropPin index scans, when we drop the pin
- * eagerly to avoid blocking VACUUM).
+ * On success exit, returns batch containing data from the next page that has
+ * at least one matching item.  If there are no more matching items in the
+ * given scan direction, we just return NULL.
  *
- * If there are no more matching records in the given direction, we invalidate
- * so->currPos (while ensuring it retains no locks or pins), and return false.
- *
- * We always release the scan for a parallel scan caller, regardless of
- * success or failure; we'll call _bt_parallel_release as soon as possible.
+ * Parallel scan callers must seize the scan before calling here.  blkno and
+ * lastcurrblkno should come from the seized scan.  We'll release the scan as
+ * soon as possible.
  */
-static bool
+static IndexScanBatch
 _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
-				 BlockNumber lastcurrblkno, ScanDirection dir, bool seized)
+				 BlockNumber lastcurrblkno, ScanDirection dir, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	IndexScanBatch newbatch;
 
-	Assert(so->currPos.currPage == lastcurrblkno || seized);
-	Assert(!(blkno == P_NONE && seized));
-	Assert(!BTScanPosIsPinned(so->currPos));
+	/* Allocate space for next batch */
+	newbatch = indexam_util_batch_alloc(scan);
 
 	/*
-	 * Remember that the scan already read lastcurrblkno, a page to the left
-	 * of blkno (or remember reading a page to the right, for backwards scans)
+	 * newbatch will be the batch for lastcurrblkno, a page to the left of
+	 * blkno (or to the right, when the scan is moving backwards)
 	 */
-	if (ScanDirectionIsForward(dir))
-		so->currPos.moreLeft = true;
-	else
-		so->currPos.moreRight = true;
+	newbatch->moreLeft = true;
+	newbatch->moreRight = true;
 
 	for (;;)
 	{
 		Page		page;
 		BTPageOpaque opaque;
 
-		if (blkno == P_NONE ||
-			(ScanDirectionIsForward(dir) ?
-			 !so->currPos.moreRight : !so->currPos.moreLeft))
-		{
-			/* most recent _bt_readpage call (for lastcurrblkno) ended scan */
-			Assert(so->currPos.currPage == lastcurrblkno && !seized);
-			BTScanPosInvalidate(so->currPos);
-			_bt_parallel_done(scan);	/* iff !so->needPrimScan */
-			return false;
-		}
-
-		Assert(!so->needPrimScan);
-
-		/* parallel scan must never actually visit so->currPos blkno */
-		if (!seized && scan->parallel_scan != NULL &&
-			!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
-		{
-			/* whole scan is now done (or another primitive scan required) */
-			BTScanPosInvalidate(so->currPos);
-			return false;
-		}
+		Assert(!((BTScanOpaque) scan->opaque)->needPrimScan);
+		Assert(blkno != P_NONE && lastcurrblkno != P_NONE);
 
 		if (ScanDirectionIsForward(dir))
 		{
 			/* read blkno, but check for interrupts first */
 			CHECK_FOR_INTERRUPTS();
-			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			newbatch->buf = _bt_getbuf(rel, blkno, BT_READ);
 		}
 		else
 		{
 			/* read blkno, avoiding race (also checks for interrupts) */
-			so->currPos.buf = _bt_lock_and_validate_left(rel, &blkno,
-														 lastcurrblkno);
-			if (so->currPos.buf == InvalidBuffer)
+			newbatch->buf = _bt_lock_and_validate_left(rel, &blkno,
+													   lastcurrblkno);
+			if (newbatch->buf == InvalidBuffer)
 			{
 				/* must have been a concurrent deletion of leftmost page */
-				BTScanPosInvalidate(so->currPos);
 				_bt_parallel_done(scan);
-				return false;
+				indexam_util_batch_release(scan, newbatch);
+				return NULL;
 			}
 		}
 
-		page = BufferGetPage(so->currPos.buf);
+		page = BufferGetPage(newbatch->buf);
 		opaque = BTPageGetOpaque(page);
 		lastcurrblkno = blkno;
 		if (likely(!P_IGNORE(opaque)))
@@ -1911,17 +1761,17 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 			/* see if there are any matches on this page */
 			if (ScanDirectionIsForward(dir))
 			{
-				/* note that this will clear moreRight if we can stop */
-				if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque), seized))
+				if (_bt_readpage(scan, newbatch, dir,
+								 P_FIRSTDATAKEY(opaque), firstpage))
 					break;
-				blkno = so->currPos.nextPage;
+				blkno = newbatch->nextPage;
 			}
 			else
 			{
-				/* note that this will clear moreLeft if we can stop */
-				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), seized))
+				if (_bt_readpage(scan, newbatch, dir,
+								 PageGetMaxOffsetNumber(page), firstpage))
 					break;
-				blkno = so->currPos.prevPage;
+				blkno = newbatch->prevPage;
 			}
 		}
 		else
@@ -1936,19 +1786,39 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 		}
 
 		/* no matching tuples on this page */
-		_bt_relbuf(rel, so->currPos.buf);
-		seized = false;			/* released by _bt_readpage (or by us) */
+		_bt_relbuf(rel, newbatch->buf);
+		newbatch->buf = InvalidBuffer;
+
+		/* Continue the scan in this direction? */
+		if (blkno == P_NONE ||
+			(ScanDirectionIsForward(dir) ?
+			 !newbatch->moreRight : !newbatch->moreLeft))
+		{
+			/*
+			 * blkno _bt_readpage call ended scan in this direction (though if
+			 * so->needPrimScan was set the scan will continue in _bt_first)
+			 */
+			_bt_parallel_done(scan);
+			indexam_util_batch_release(scan, newbatch);
+			return NULL;
+		}
+
+		/* parallel scan must seize the scan to get next blkno */
+		if (scan->parallel_scan != NULL &&
+			!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		{
+			indexam_util_batch_release(scan, newbatch);
+			return NULL;		/* done iff so->needPrimScan wasn't set */
+		}
+
+		firstpage = false;		/* next page cannot be first */
 	}
 
-	/*
-	 * _bt_readpage succeeded.  Drop the lock (and maybe the pin) on
-	 * so->currPos.buf in preparation for btgettuple returning tuples.
-	 */
-	Assert(so->currPos.currPage == blkno);
-	Assert(BTScanPosIsPinned(so->currPos));
-	_bt_drop_lock_and_maybe_pin(rel, so);
+	/* _bt_readpage saved one or more matches in newbatch.items[] */
+	Assert(newbatch->currPage == blkno);
+	indexam_util_batch_unlock(scan, newbatch);
 
-	return true;
+	return newbatch;
 }
 
 /*
@@ -2174,25 +2044,23 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost)
  * Parallel scan callers must have seized the scan before calling here.
  * Exit conditions are the same as for _bt_first().
  */
-static bool
-_bt_endpoint(IndexScanDesc scan, ScanDirection dir)
+static IndexScanBatch
+_bt_endpoint(IndexScanDesc scan, ScanDirection dir, IndexScanBatch firstbatch)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber start;
 
-	Assert(!BTScanPosIsValid(so->currPos));
-	Assert(!so->needPrimScan);
+	Assert(!((BTScanOpaque) scan->opaque)->needPrimScan);
 
 	/*
 	 * Scan down to the leftmost or rightmost leaf page.  This is a simplified
 	 * version of _bt_search().
 	 */
-	so->currPos.buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
+	firstbatch->buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
 
-	if (!BufferIsValid(so->currPos.buf))
+	if (!BufferIsValid(firstbatch->buf))
 	{
 		/*
 		 * Empty index. Lock the whole relation, as nothing finer to lock
@@ -2203,7 +2071,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 		return false;
 	}
 
-	page = BufferGetPage(so->currPos.buf);
+	page = BufferGetPage(firstbatch->buf);
 	opaque = BTPageGetOpaque(page);
 	Assert(P_ISLEAF(opaque));
 
@@ -2229,9 +2097,5 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * Now load data from the first page of the scan.
 	 */
-	if (!_bt_readfirstpage(scan, start, dir))
-		return false;
-
-	_bt_returnitem(scan, so);
-	return true;
+	return _bt_readfirstpage(scan, firstbatch, start, dir);
 }
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5c50f0dd1..9aee66cd1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -21,15 +21,12 @@
 #include "access/reloptions.h"
 #include "access/relscan.h"
 #include "commands/progress.h"
-#include "common/int.h"
-#include "lib/qunique.h"
 #include "miscadmin.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 
 
-static int	_bt_compare_int(const void *va, const void *vb);
 static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
 						   IndexTuple firstright, BTScanInsert itup_key);
 
@@ -160,111 +157,47 @@ _bt_freestack(BTStack stack)
 	}
 }
 
-/*
- * qsort comparison function for int arrays
- */
-static int
-_bt_compare_int(const void *va, const void *vb)
-{
-	int			a = *((const int *) va);
-	int			b = *((const int *) vb);
-
-	return pg_cmp_s32(a, b);
-}
-
 /*
  * _bt_killitems - set LP_DEAD state for items an indexscan caller has
  * told us were killed
  *
- * scan->opaque, referenced locally through so, contains information about the
- * current page and killed tuples thereon (generally, this should only be
- * called if so->numKilled > 0).
+ * The batch parameter contains information about the current page and killed
+ * tuples thereon (this should only be called if batch->numKilled > 0).
  *
- * Caller should not have a lock on the so->currPos page, but must hold a
- * buffer pin when !so->dropPin.  When we return, it still won't be locked.
- * It'll continue to hold whatever pins were held before calling here.
+ * Caller should not have a lock on the batch position's page.  When we
+ * return, it still won't be locked.  It'll continue to hold whatever pins
+ * were held before calling here.
  *
  * We match items by heap TID before assuming they are the right ones to set
- * LP_DEAD.  If the scan is one that holds a buffer pin on the target page
- * continuously from initially reading the items until applying this function
- * (if it is a !so->dropPin scan), VACUUM cannot have deleted any items on the
- * page, so the page's TIDs can't have been recycled by now.  There's no risk
- * that we'll confuse a new index tuple that happens to use a recycled TID
- * with a now-removed tuple with the same TID (that used to be on this same
- * page).  We can't rely on that during scans that drop buffer pins eagerly
- * (so->dropPin scans), though, so we must condition setting LP_DEAD bits on
- * the page LSN having not changed since back when _bt_readpage saw the page.
- * We totally give up on setting LP_DEAD bits when the page LSN changed.
- *
- * We give up much less often during !so->dropPin scans, but it still happens.
- * We cope with cases where items have moved right due to insertions.  If an
- * item has moved off the current page due to a split, we'll fail to find it
- * and just give up on it.
+ * LP_DEAD.  We must condition setting LP_DEAD bits on the page LSN having not
+ * changed since back when _bt_readpage saw the page.  We totally give up on
+ * setting LP_DEAD bits when the page LSN changed.
  */
 void
-_bt_killitems(IndexScanDesc scan)
+_bt_killitems(IndexScanDesc scan, IndexScanBatch batch)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber minoff;
 	OffsetNumber maxoff;
-	int			numKilled = so->numKilled;
 	bool		killedsomething = false;
 	Buffer		buf;
+	XLogRecPtr	latestlsn;
 
-	Assert(numKilled > 0);
-	Assert(BTScanPosIsValid(so->currPos));
+	Assert(batch->numKilled > 0);
+	Assert(BlockNumberIsValid(batch->currPage));
 	Assert(scan->heapRelation != NULL); /* can't be a bitmap index scan */
 
-	/* Always invalidate so->killedItems[] before leaving so->currPos */
-	so->numKilled = 0;
+	buf = _bt_getbuf(rel, batch->currPage, BT_READ);
 
-	/*
-	 * We need to iterate through so->killedItems[] in leaf page order; the
-	 * loop below expects this (when marking posting list tuples, at least).
-	 * so->killedItems[] is now in whatever order the scan returned items in.
-	 * Scrollable cursor scans might have even saved the same item/TID twice.
-	 *
-	 * Sort and unique-ify so->killedItems[] to deal with all this.
-	 */
-	if (numKilled > 1)
+	latestlsn = BufferGetLSNAtomic(buf);
+	Assert(batch->lsn <= latestlsn);
+	if (batch->lsn != latestlsn)
 	{
-		qsort(so->killedItems, numKilled, sizeof(int), _bt_compare_int);
-		numKilled = qunique(so->killedItems, numKilled, sizeof(int),
-							_bt_compare_int);
-	}
-
-	if (!so->dropPin)
-	{
-		/*
-		 * We have held the pin on this page since we read the index tuples,
-		 * so all we need to do is lock it.  The pin will have prevented
-		 * concurrent VACUUMs from recycling any of the TIDs on the page.
-		 */
-		Assert(BTScanPosIsPinned(so->currPos));
-		buf = so->currPos.buf;
-		_bt_lockbuf(rel, buf, BT_READ);
-	}
-	else
-	{
-		XLogRecPtr	latestlsn;
-
-		Assert(!BTScanPosIsPinned(so->currPos));
-		Assert(RelationNeedsWAL(rel));
-		buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
-
-		latestlsn = BufferGetLSNAtomic(buf);
-		Assert(so->currPos.lsn <= latestlsn);
-		if (so->currPos.lsn != latestlsn)
-		{
-			/* Modified, give up on hinting */
-			_bt_relbuf(rel, buf);
-			return;
-		}
-
-		/* Unmodified, hinting is safe */
+		/* Modified, give up on hinting */
+		_bt_relbuf(rel, buf);
+		return;
 	}
 
 	page = BufferGetPage(buf);
@@ -272,17 +205,16 @@ _bt_killitems(IndexScanDesc scan)
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
-	/* Iterate through so->killedItems[] in leaf page order */
-	for (int i = 0; i < numKilled; i++)
+	/* Iterate through batch->killedItems[] in leaf page order */
+	for (int i = 0; i < batch->numKilled; i++)
 	{
-		int			itemIndex = so->killedItems[i];
-		BTScanPosItem *kitem = &so->currPos.items[itemIndex];
+		int			itemIndex = batch->killedItems[i];
+		BatchMatchingItem *kitem = &batch->items[itemIndex];
 		OffsetNumber offnum = kitem->indexOffset;
 
-		Assert(itemIndex >= so->currPos.firstItem &&
-			   itemIndex <= so->currPos.lastItem);
+		Assert(itemIndex >= batch->firstItem && itemIndex <= batch->lastItem);
 		Assert(i == 0 ||
-			   offnum >= so->currPos.items[so->killedItems[i - 1]].indexOffset);
+			   offnum >= batch->items[batch->killedItems[i - 1]].indexOffset);
 
 		if (offnum < minoff)
 			continue;			/* pure paranoia */
@@ -298,12 +230,6 @@ _bt_killitems(IndexScanDesc scan)
 				int			nposting = BTreeTupleGetNPosting(ituple);
 				int			j;
 
-				/*
-				 * Note that the page may have been modified in almost any way
-				 * since we first read it (in the !so->dropPin case), so it's
-				 * possible that this posting list tuple wasn't a posting list
-				 * tuple when we first encountered its heap TIDs.
-				 */
 				for (j = 0; j < nposting; j++)
 				{
 					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
@@ -311,12 +237,8 @@ _bt_killitems(IndexScanDesc scan)
 					if (!ItemPointerEquals(item, &kitem->heapTid))
 						break;	/* out of posting list loop */
 
-					/*
-					 * kitem must have matching offnum when heap TIDs match,
-					 * though only in the common case where the page can't
-					 * have been concurrently modified
-					 */
-					Assert(kitem->indexOffset == offnum || !so->dropPin);
+					Assert(kitem->indexOffset == offnum);
+					Assert(!kitem->allVisible);
 
 					/*
 					 * Read-ahead to later kitems here.
@@ -332,8 +254,8 @@ _bt_killitems(IndexScanDesc scan)
 					 * kitem is also the last heap TID in the last index tuple
 					 * correctly -- posting tuple still gets killed).
 					 */
-					if (pi < numKilled)
-						kitem = &so->currPos.items[so->killedItems[pi++]];
+					if (pi < batch->numKilled)
+						kitem = &batch->items[batch->killedItems[pi++]];
 				}
 
 				/*
@@ -383,10 +305,7 @@ _bt_killitems(IndexScanDesc scan)
 		MarkBufferDirtyHint(buf, true);
 	}
 
-	if (!so->dropPin)
-		_bt_unlockbuf(rel, buf);
-	else
-		_bt_relbuf(rel, buf);
+	_bt_relbuf(rel, buf);
 }
 
 
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 9f5379b87..a18a2fa9e 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -88,10 +88,11 @@ spghandler(PG_FUNCTION_ARGS)
 		.ambeginscan = spgbeginscan,
 		.amrescan = spgrescan,
 		.amgettuple = spggettuple,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = spggetbitmap,
 		.amendscan = spgendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7bb11168..ea0add1c9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -135,7 +135,7 @@ static void show_recursive_union_info(RecursiveUnionState *rstate,
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
-static void show_indexsearches_info(PlanState *planstate, ExplainState *es);
+static void show_indexscan_info(PlanState *planstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
@@ -1972,7 +1972,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_IndexOnlyScan:
 			show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1986,15 +1986,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
-			if (es->analyze)
-				ExplainPropertyFloat("Heap Fetches", NULL,
-									 planstate->instrument->ntuples2, 0, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_BitmapIndexScan:
 			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
 						   "Index Cond", planstate, ancestors, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_BitmapHeapScan:
 			show_scan_qual(((BitmapHeapScan *) plan)->bitmapqualorig,
@@ -3858,15 +3855,16 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 }
 
 /*
- * Show the total number of index searches for a
+ * Show index scan related executor instrumentation for a
  * IndexScan/IndexOnlyScan/BitmapIndexScan node
  */
 static void
-show_indexsearches_info(PlanState *planstate, ExplainState *es)
+show_indexscan_info(PlanState *planstate, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	SharedIndexScanInstrumentation *SharedInfo = NULL;
-	uint64		nsearches = 0;
+	uint64		nsearches = 0,
+				nheapfetches = 0;
 
 	if (!es->analyze)
 		return;
@@ -3887,6 +3885,7 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 				IndexOnlyScanState *indexstate = ((IndexOnlyScanState *) planstate);
 
 				nsearches = indexstate->ioss_Instrument.nsearches;
+				nheapfetches = indexstate->ioss_Instrument.nheapfetches;
 				SharedInfo = indexstate->ioss_SharedInfo;
 				break;
 			}
@@ -3910,9 +3909,13 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 			IndexScanInstrumentation *winstrument = &SharedInfo->winstrument[i];
 
 			nsearches += winstrument->nsearches;
+			nheapfetches += winstrument->nheapfetches;
 		}
 	}
 
+	if (nodeTag(plan) == T_IndexOnlyScan)
+		ExplainPropertyUInteger("Heap Fetches", NULL, nheapfetches, es);
+
 	ExplainPropertyUInteger("Index Searches", NULL, nsearches, es);
 }
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 635679cc1..54c2403da 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -884,7 +884,7 @@ DefineIndex(ParseState *pstate,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("access method \"%s\" does not support multicolumn indexes",
 						accessMethodName)));
-	if (exclusion && amRoutine->amgettuple == NULL)
+	if (exclusion && amRoutine->amgettuple == NULL && amRoutine->amgetbatch == NULL)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("access method \"%s\" does not support exclusion constraints",
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 90a68c0d1..0dfb01337 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -428,7 +428,7 @@ ExecSupportsMarkRestore(Path *pathnode)
 		case T_IndexOnlyScan:
 
 			/*
-			 * Not all index types support mark/restore.
+			 * Not all index types support restoring a mark
 			 */
 			return castNode(IndexPath, pathnode)->indexinfo->amcanmarkpos;
 
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 6ae0f9595..3041d5c72 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -816,10 +816,12 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index, false, &DirtySnapshot, NULL,
+								 indnkeyatts, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
-	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
+	while (table_index_getnext_slot(index_scan, ForwardScanDirection,
+									existing_slot))
 	{
 		TransactionId xwait;
 		XLTW_Oper	reason_wait;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 72f2bff77..ab487b9e6 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -205,7 +205,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
 	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel, false, &snap, NULL, skey_attoff, 0);
 
 retry:
 	found = false;
@@ -213,7 +213,7 @@ retry:
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
 	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, outslot))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, outslot))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
@@ -666,12 +666,12 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid,
 	 * not yet committed or those just committed prior to the scan are
 	 * excluded in update_most_recent_deletion_info().
 	 */
-	scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel, false, SnapshotAny, NULL, skey_attoff, 0);
 
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
 	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, scanslot))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 058a59ef5..2580a0139 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -202,6 +202,7 @@ ExecEndBitmapIndexScan(BitmapIndexScanState *node)
 		 * which will have a new BitmapIndexScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->biss_Instrument.nsearches;
+		Assert(node->biss_Instrument.nheapfetches == 0);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index c2d093745..ff3e8f302 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -34,14 +34,12 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/tupdesc.h"
-#include "access/visibilitymap.h"
 #include "catalog/pg_type.h"
 #include "executor/executor.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/predicate.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
@@ -65,7 +63,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
-	ItemPointer tid;
 
 	/*
 	 * extract necessary information from index scan node
@@ -90,18 +87,14 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->ioss_RelationDesc,
+								   node->ioss_RelationDesc, true,
 								   estate->es_snapshot,
 								   &node->ioss_Instrument,
 								   node->ioss_NumScanKeys,
 								   node->ioss_NumOrderByKeys);
 
 		node->ioss_ScanDesc = scandesc;
-
-
-		/* Set it up for index-only scan */
-		node->ioss_ScanDesc->xs_want_itup = true;
-		node->ioss_VMBuffer = InvalidBuffer;
+		Assert(node->ioss_ScanDesc->xs_want_itup);
 
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
@@ -118,78 +111,10 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while (table_index_getnext_slot(scandesc, direction, node->ioss_TableSlot))
 	{
-		bool		tuple_from_heap = false;
-
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We can skip the heap fetch if the TID references a heap page on
-		 * which all tuples are known visible to everybody.  In any case,
-		 * we'll use the index tuple not the heap tuple as the data source.
-		 *
-		 * Note on Memory Ordering Effects: visibilitymap_get_status does not
-		 * lock the visibility map buffer, and therefore the result we read
-		 * here could be slightly stale.  However, it can't be stale enough to
-		 * matter.
-		 *
-		 * We need to detect clearing a VM bit due to an insert right away,
-		 * because the tuple is present in the index page but not visible. The
-		 * reading of the TID by this scan (using a shared lock on the index
-		 * buffer) is serialized with the insert of the TID into the index
-		 * (using an exclusive lock on the index buffer). Because the VM bit
-		 * is cleared before updating the index, and locking/unlocking of the
-		 * index page acts as a full memory barrier, we are sure to see the
-		 * cleared bit if we see a recently-inserted TID.
-		 *
-		 * Deletes do not update the index page (only VACUUM will clear out
-		 * the TID), so the clearing of the VM bit by a delete is not
-		 * serialized with this test below, and we may see a value that is
-		 * significantly stale. However, we don't care about the delete right
-		 * away, because the tuple is still visible until the deleting
-		 * transaction commits or the statement ends (if it's our
-		 * transaction). In either case, the lock on the VM buffer will have
-		 * been released (acting as a write barrier) after clearing the bit.
-		 * And for us to have a snapshot that includes the deleting
-		 * transaction (making the tuple invisible), we must have acquired
-		 * ProcArrayLock after that time, acting as a read barrier.
-		 *
-		 * It's worth going through this complexity to avoid needing to lock
-		 * the VM buffer, which could cause significant contention.
-		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
-							ItemPointerGetBlockNumber(tid),
-							&node->ioss_VMBuffer))
-		{
-			/*
-			 * Rats, we have to visit the heap to check visibility.
-			 */
-			InstrCountTuples2(node, 1);
-			if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
-				continue;		/* no visible tuple, try next index entry */
-
-			ExecClearTuple(node->ioss_TableSlot);
-
-			/*
-			 * Only MVCC snapshots are supported here, so there should be no
-			 * need to keep following the HOT chain once a visible entry has
-			 * been found.  If we did want to allow that, we'd need to keep
-			 * more state to remember not to call index_getnext_tid next time.
-			 */
-			if (scandesc->xs_heap_continue)
-				elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
-
-			/*
-			 * Note: at this point we are holding a pin on the heap page, as
-			 * recorded in scandesc->xs_cbuf.  We could release that pin now,
-			 * but it's not clear whether it's a win to do so.  The next index
-			 * entry might require a visit to the same heap page.
-			 */
-
-			tuple_from_heap = true;
-		}
-
 		/*
 		 * Fill the scan tuple slot with data from the index.  This might be
 		 * provided in either HeapTuple or IndexTuple format.  Conceivably an
@@ -238,16 +163,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("lossy distance functions are not supported in index-only scans")));
-
-		/*
-		 * If we didn't access the heap, then we'll need to take a predicate
-		 * lock explicitly, as if we had.  For now we do that at page level.
-		 */
-		if (!tuple_from_heap)
-			PredicateLockPage(scandesc->heapRelation,
-							  ItemPointerGetBlockNumber(tid),
-							  estate->es_snapshot);
-
 		return slot;
 	}
 
@@ -407,13 +322,6 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 	indexRelationDesc = node->ioss_RelationDesc;
 	indexScanDesc = node->ioss_ScanDesc;
 
-	/* Release VM buffer pin, if any. */
-	if (node->ioss_VMBuffer != InvalidBuffer)
-	{
-		ReleaseBuffer(node->ioss_VMBuffer);
-		node->ioss_VMBuffer = InvalidBuffer;
-	}
-
 	/*
 	 * When ending a parallel worker, copy the statistics gathered by the
 	 * worker back into shared memory so that it can be picked up by the main
@@ -433,6 +341,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->ioss_Instrument.nsearches;
+		winstrument->nheapfetches += node->ioss_Instrument.nheapfetches;
 	}
 
 	/*
@@ -784,13 +693,12 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->ioss_RelationDesc,
+								 node->ioss_RelationDesc, true,
 								 &node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan);
-	node->ioss_ScanDesc->xs_want_itup = true;
-	node->ioss_VMBuffer = InvalidBuffer;
+	Assert(node->ioss_ScanDesc->xs_want_itup);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -850,12 +758,12 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->ioss_RelationDesc,
+								 node->ioss_RelationDesc, true,
 								 &node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan);
-	node->ioss_ScanDesc->xs_want_itup = true;
+	Assert(node->ioss_ScanDesc->xs_want_itup);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 84823f0b6..c34a13a87 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -107,7 +107,7 @@ IndexNext(IndexScanState *node)
 		 * serially executing an index scan that was planned to be parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->iss_RelationDesc,
+								   node->iss_RelationDesc, false,
 								   estate->es_snapshot,
 								   &node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -128,7 +128,7 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while (table_index_getnext_slot(scandesc, direction, slot))
 	{
 		CHECK_FOR_INTERRUPTS();
 
@@ -203,7 +203,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * serially executing an index scan that was planned to be parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->iss_RelationDesc,
+								   node->iss_RelationDesc, false,
 								   estate->es_snapshot,
 								   &node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -260,7 +260,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * Fetch next tuple from the index.
 		 */
 next_indextuple:
-		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+		if (!table_index_getnext_slot(scandesc, ForwardScanDirection, slot))
 		{
 			/*
 			 * No more tuples from the index.  But we still need to drain any
@@ -812,6 +812,7 @@ ExecEndIndexScan(IndexScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->iss_Instrument.nsearches;
+		Assert(node->iss_Instrument.nheapfetches == 0);
 	}
 
 	/*
@@ -1719,7 +1720,7 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->iss_RelationDesc,
+								 node->iss_RelationDesc, false,
 								 &node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
@@ -1783,7 +1784,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->iss_RelationDesc,
+								 node->iss_RelationDesc, false,
 								 &node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 29cb60d6b..cfa022f0c 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -43,7 +43,7 @@
 /* Whether we are looking for plain indexscan, bitmap scan, or either */
 typedef enum
 {
-	ST_INDEXSCAN,				/* must support amgettuple */
+	ST_INDEXSCAN,				/* must support amgettuple or amgetbatch */
 	ST_BITMAPSCAN,				/* must support amgetbitmap */
 	ST_ANYSCAN,					/* either is okay */
 } ScanTypeControl;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index a90e1c9ee..39a5ea299 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -313,11 +313,11 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 				info->amsearcharray = amroutine->amsearcharray;
 				info->amsearchnulls = amroutine->amsearchnulls;
 				info->amcanparallel = amroutine->amcanparallel;
-				info->amhasgettuple = (amroutine->amgettuple != NULL);
+				info->amhasgettuple = (amroutine->amgettuple != NULL ||
+									   amroutine->amgetbatch != NULL);
 				info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
 					relation->rd_tableam->scan_bitmap_next_tuple != NULL;
-				info->amcanmarkpos = (amroutine->ammarkpos != NULL &&
-									  amroutine->amrestrpos != NULL);
+				info->amcanmarkpos = amroutine->amposreset != NULL;
 				info->amcostestimate = amroutine->amcostestimate;
 				Assert(info->amcostestimate != NULL);
 
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 0b1d80b5b..76b0d035f 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -890,7 +890,8 @@ IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap)
 	 * The given index access method must implement "amgettuple", which will
 	 * be used later to fetch the tuples.  See RelationFindReplTupleByIndex().
 	 */
-	if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL)
+	if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL &&
+		GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgetbatch == NULL)
 		return false;
 
 	return true;
diff --git a/src/backend/utils/adt/amutils.c b/src/backend/utils/adt/amutils.c
index c81fb61a0..ddfd1b55c 100644
--- a/src/backend/utils/adt/amutils.c
+++ b/src/backend/utils/adt/amutils.c
@@ -363,10 +363,11 @@ indexam_property(FunctionCallInfo fcinfo,
 				PG_RETURN_BOOL(routine->amclusterable);
 
 			case AMPROP_INDEX_SCAN:
-				PG_RETURN_BOOL(routine->amgettuple ? true : false);
+				PG_RETURN_BOOL(routine->amgettuple != NULL ||
+							   routine->amgetbatch != NULL);
 
 			case AMPROP_BITMAP_SCAN:
-				PG_RETURN_BOOL(routine->amgetbitmap ? true : false);
+				PG_RETURN_BOOL(routine->amgetbitmap != NULL);
 
 			case AMPROP_BACKWARD_SCAN:
 				PG_RETURN_BOOL(routine->amcanbackward);
@@ -392,7 +393,8 @@ indexam_property(FunctionCallInfo fcinfo,
 			PG_RETURN_BOOL(routine->amcanmulticol);
 
 		case AMPROP_CAN_EXCLUDE:
-			PG_RETURN_BOOL(routine->amgettuple ? true : false);
+			PG_RETURN_BOOL(routine->amgettuple != NULL ||
+						   routine->amgetbatch != NULL);
 
 		case AMPROP_CAN_INCLUDE:
 			PG_RETURN_BOOL(routine->amcaninclude);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 29fec6555..a7213654a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -102,7 +102,6 @@
 #include "access/gin.h"
 #include "access/table.h"
 #include "access/tableam.h"
-#include "access/visibilitymap.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_statistic.h"
@@ -7124,13 +7123,10 @@ get_actual_variable_endpoint(Relation heapRel,
 	bool		have_data = false;
 	SnapshotData SnapshotNonVacuumable;
 	IndexScanDesc index_scan;
-	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber last_heap_block = InvalidBlockNumber;
-	int			n_visited_heap_pages = 0;
-	ItemPointer tid;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 	MemoryContext oldcontext;
+	IndexScanInstrumentation instrument;
 
 	/*
 	 * We use the index-only-scan machinery for this.  With mostly-static
@@ -7179,56 +7175,32 @@ get_actual_variable_endpoint(Relation heapRel,
 	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
 							  GlobalVisTestFor(heapRel));
 
-	index_scan = index_beginscan(heapRel, indexRel,
-								 &SnapshotNonVacuumable, NULL,
+	/*
+	 * Set it up for instrumented index-only scan.  We need the
+	 * instrumentation to monitor the number of heap fetches.
+	 */
+	memset(&instrument, 0, sizeof(instrument));
+	index_scan = index_beginscan(heapRel, indexRel, true,
+								 &SnapshotNonVacuumable, &instrument,
 								 1, 0);
-	/* Set it up for index-only scan */
-	index_scan->xs_want_itup = true;
+	Assert(index_scan->xs_want_itup);
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
+	while (table_index_getnext_slot(index_scan, indexscandir, tableslot))
 	{
-		BlockNumber block = ItemPointerGetBlockNumber(tid);
+		/* We don't actually need the heap tuple for anything */
+		ExecClearTuple(tableslot);
 
-		if (!VM_ALL_VISIBLE(heapRel,
-							block,
-							&vmbuffer))
-		{
-			/* Rats, we have to visit the heap to check visibility */
-			if (!index_fetch_heap(index_scan, tableslot))
-			{
-				/*
-				 * No visible tuple for this index entry, so we need to
-				 * advance to the next entry.  Before doing so, count heap
-				 * page fetches and give up if we've done too many.
-				 *
-				 * We don't charge a page fetch if this is the same heap page
-				 * as the previous tuple.  This is on the conservative side,
-				 * since other recently-accessed pages are probably still in
-				 * buffers too; but it's good enough for this heuristic.
-				 */
+		/*
+		 * No visible tuple for this index entry, so we need to advance to the
+		 * next entry.  Before doing so, count heap page fetches and give up
+		 * if we've done too many.
+		 */
 #define VISITED_PAGES_LIMIT 100
 
-				if (block != last_heap_block)
-				{
-					last_heap_block = block;
-					n_visited_heap_pages++;
-					if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
-						break;
-				}
-
-				continue;		/* no visible tuple, try next index entry */
-			}
-
-			/* We don't actually need the heap tuple for anything */
-			ExecClearTuple(tableslot);
-
-			/*
-			 * We don't care whether there's more than one visible tuple in
-			 * the HOT chain; if any are visible, that's good enough.
-			 */
-		}
+		if (instrument.nheapfetches > VISITED_PAGES_LIMIT)
+			break;
 
 		/*
 		 * We expect that the index will return data in IndexTuple not
@@ -7261,8 +7233,6 @@ get_actual_variable_endpoint(Relation heapRel,
 		break;
 	}
 
-	if (vmbuffer != InvalidBuffer)
-		ReleaseBuffer(vmbuffer);
 	index_endscan(index_scan);
 
 	return have_data;
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 5111cdc6d..7fd98dba6 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -146,10 +146,11 @@ blhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = blbeginscan,
 		.amrescan = blrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = blgetbitmap,
 		.amendscan = blendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/test/isolation/expected/index-only-scan-visibility.out b/src/test/isolation/expected/index-only-scan-visibility.out
new file mode 100644
index 000000000..616ab19dc
--- /dev/null
+++ b/src/test/isolation/expected/index-only-scan-visibility.out
@@ -0,0 +1,32 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vacuum s2_delete s1_begin s1_prepare s1_fetch_1 s2_vacuum s1_fetch_all s1_commit
+step s2_vacuum: 
+	VACUUM (TRUNCATE false) ios_vis_test;
+
+step s2_delete: 
+	DELETE FROM ios_vis_test WHERE a > 1;
+
+step s1_begin: BEGIN;
+step s1_prepare: 
+	DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_vis_test WHERE a > 0;
+
+step s1_fetch_1: 
+	FETCH FROM foo;
+
+a
+-
+1
+(1 row)
+
+step s2_vacuum: 
+	VACUUM (TRUNCATE false) ios_vis_test;
+
+step s1_fetch_all: 
+	FETCH ALL FROM foo;
+
+a
+-
+(0 rows)
+
+step s1_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 01ff1c658..cff630591 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -17,6 +17,7 @@ test: partial-index
 test: two-ids
 test: multiple-row-versions
 test: index-only-scan
+test: index-only-scan-visibility
 test: index-only-bitmapscan
 test: predicate-lock-hot-tuple
 test: update-conflict-out
diff --git a/src/test/isolation/specs/index-only-scan-visibility.spec b/src/test/isolation/specs/index-only-scan-visibility.spec
new file mode 100644
index 000000000..86dff24ee
--- /dev/null
+++ b/src/test/isolation/specs/index-only-scan-visibility.spec
@@ -0,0 +1,84 @@
+# Index-only scan visibility test
+#
+# Verify that VACUUM doesn't block forever trying to acquire a cleanup lock
+# on an index page while an index-only scan has a cursor open.  With MVCC
+# scans, we eagerly drop the pin on each batch's leaf page after caching
+# visibility info, allowing VACUUM to proceed.
+setup
+{
+	-- Low fillfactor + wide tuples = multiple heap blocks with few rows
+	CREATE TABLE ios_vis_test
+  (a int NOT NULL,
+   b int NOT NULL,
+   pad char(1024) DEFAULT '')
+	WITH (autovacuum_enabled = false, fillfactor = 10);
+
+	INSERT INTO ios_vis_test SELECT g.i, g.i FROM generate_series(1, 10) g(i);
+
+	CREATE INDEX ios_vis_test_a ON ios_vis_test (a);
+}
+
+teardown
+{
+	DROP TABLE ios_vis_test;
+}
+
+
+session s1
+
+# Force index-only scan:
+setup {
+	SET enable_bitmapscan = false;
+	SET enable_indexonlyscan = true;
+	SET enable_indexscan = true;
+}
+
+step s1_begin { BEGIN; }
+step s1_commit { COMMIT; }
+
+step s1_prepare {
+	DECLARE foo NO SCROLL CURSOR FOR SELECT a FROM ios_vis_test WHERE a > 0;
+}
+
+step s1_fetch_1 {
+	FETCH FROM foo;
+}
+
+step s1_fetch_all {
+	FETCH ALL FROM foo;
+}
+
+
+session s2
+
+# Keep row 1 so cursor has a row to "rest" on
+step s2_delete {
+	DELETE FROM ios_vis_test WHERE a > 1;
+}
+
+# Disable truncation to avoid unrelated AccessExclusiveLock waits
+step s2_vacuum {
+	VACUUM (TRUNCATE false) ios_vis_test;
+}
+
+permutation
+	# VACUUM first, to ensure VM exists
+	s2_vacuum
+
+	# Delete nearly all rows
+	s2_delete
+
+	# Open a cursor and fetch one row, pinning the first leaf page
+	s1_begin
+	s1_prepare
+	s1_fetch_1
+
+  # This VACUUM must not block forever waiting for a cleanup lock.  It will
+  # mark heap pages as all-visible, but the scan has already cached visibility
+  # info, so subsequent fetches correctly see no additional rows.
+	s2_vacuum
+
+  # Must return no rows (deleted rows are not visible to our MVCC snapshot)
+	s1_fetch_all
+
+	s1_commit
diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c
index 9eb8f0a6c..9d4fddec4 100644
--- a/src/test/modules/dummy_index_am/dummy_index_am.c
+++ b/src/test/modules/dummy_index_am/dummy_index_am.c
@@ -317,10 +317,11 @@ dihandler(PG_FUNCTION_ARGS)
 		.ambeginscan = dibeginscan,
 		.amrescan = direscan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amfreebatch = NULL,
 		.amgetbitmap = NULL,
 		.amendscan = diendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3f3a888fd..c3f2a9fc8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -226,8 +226,6 @@ BTScanInsertData
 BTScanKeyPreproc
 BTScanOpaque
 BTScanOpaqueData
-BTScanPosData
-BTScanPosItem
 BTShared
 BTSortArrayContext
 BTSpool
@@ -256,6 +254,9 @@ BaseBackupCmd
 BaseBackupTargetHandle
 BaseBackupTargetType
 BatchMVCCState
+BatchMatchingItem
+BatchRingBuffer
+BatchRingItemPos
 BeginDirectModify_function
 BeginForeignInsert_function
 BeginForeignModify_function
@@ -1286,6 +1287,8 @@ IndexOrderByDistance
 IndexPath
 IndexRuntimeKeyInfo
 IndexScan
+IndexScanBatch
+IndexScanBatchData
 IndexScanDesc
 IndexScanDescData
 IndexScanInstrumentation
@@ -3471,18 +3474,17 @@ amcanreturn_function
 amcostestimate_function
 amendscan_function
 amestimateparallelscan_function
+amgetbatch_function
 amgetbitmap_function
 amgettreeheight_function
 amgettuple_function
 aminitparallelscan_function
 aminsert_function
 aminsertcleanup_function
-ammarkpos_function
 amoptions_function
 amparallelrescan_function
 amproperty_function
 amrescan_function
-amrestrpos_function
 amtranslate_cmptype_function
 amtranslate_strategy_function
 amvacuumcleanup_function
-- 
2.51.0



^ permalink  raw  reply  [nested|flat] 2+ messages in thread

end of thread, other threads:[~2026-01-18 23:51 UTC | newest]

Thread overview: 2+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-01-13 20:36 Re: index prefetching Peter Geoghegan <[email protected]>
2026-01-18 23:51 ` Peter Geoghegan <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox