Re: index prefetching - Peter Geoghegan


From: Peter Geoghegan <[email protected]>
To: Andres Freund <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Alexandre Felipe <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Konstantin Knizhnik <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Thu, 12 Mar 2026 15:44:32 -0400
Message-ID: <CAH2-Wzn1j2a0p3OqmqrV6zADtWA_QpG82U6F9yCYG1Uschm_fA@mail.gmail.com> (raw)
In-Reply-To: <vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx@mof5q6li3xzv>
References: <il7jtfowpatrlg33qb5plj7v7pferes4ogerq5fdczszi4kokh@sbwvb2ukfgos>
	<[email protected]>
	<ws47e3wly6skt36b23zy5qfvcxzueo6od3uicunuodsqnxl7os@7v2qi7qkxzbz>
	<CAH2-Wzk-89uCvdJ1Q6NsM6LvDvUEt6Qy66T6A60J=D_voWxZDg@mail.gmail.com>
	<64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep>
	<CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com>
	<d2d4qofb5ajg2ftvm6h56oi4utdwpzkqfjd7z2y4vod5qaub4h@ixyotvfut3mg>
	<CAH2-WznoD7vhjZNDj-5OrLp+1fjvW6ypEUwZ1=ieadefgWaTDQ@mail.gmail.com>
	<ayjpwpm5cn6ng2bgedhz3ckbjrxocbsbywhlghwxxz2p6a5tgr@jubomhsjkvcl>
	<CAH2-Wznxu+AFz-EBOG-XiRA_R3nXLp45NEiGSD3ebx3h=OKPAw@mail.gmail.com>
	<vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx@mof5q6li3xzv>

On Tue, Mar 10, 2026 at 6:29 PM Andres Freund <[email protected]> wrote:
> This seems pretty unrelated to my concern.  I have a problem with the fact
> that heapam knows which specific resources need to be held (&released) to
> prevent concurrency issues during an index only scan.  I am *NOT* concerned
> with there needing to be a pin and heapam triggering the release of that
> resource.

Attached v13 does things this way.

All index AMs that use amgetbatch must register a new amreleasebatch
callback in v13. This callback releases any buffer pin held as an
interlock against unsafe concurrent TID recycling by VACUUM -- at
least for nbtree and hash. The buffer itself is now stored in the
portion of each batch used as opaque index AM state, so nothing stops
other index AMs that support amgetbatch in the future from using a
different kind of interlock, from holding multiple pins instead of
just one, etc.
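
As a rough illustration of the layout described above (a standalone sketch, not code from the patch): the AM-private state sits at a fixed, max-aligned negative offset from the batch pointer, and the release callback is idempotent. BatchSketch, AmBatchData, MAXALIGN_SKETCH, pins_held, and the helper functions are hypothetical stand-ins for IndexScanBatch, HashBatchData, MAXALIGN, and the real buffer manager calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the PostgreSQL definitions */
typedef int Buffer;
#define InvalidBuffer 0
#define MAXALIGN_SKETCH(len) \
	(((size_t) (len) + 7) & ~((size_t) 7))

/* Generic part of a batch, visible to the table AM */
typedef struct BatchSketch
{
	int			firstItem;
	int			lastItem;
} BatchSketch;

/* AM-private part: holds whatever the AM uses as its interlock */
typedef struct AmBatchData
{
	Buffer		buf;
} AmBatchData;

static int	pins_held = 0;		/* simulated pin count */

/* Find the AM-private area at a fixed negative offset from the batch */
static AmBatchData *
batch_get_data(BatchSketch *batch)
{
	return (AmBatchData *) ((char *) batch -
							MAXALIGN_SKETCH(sizeof(AmBatchData)));
}

/* Allocate opaque area and generic batch as one chunk; "pin" buf */
static BatchSketch *
batch_alloc(Buffer buf)
{
	size_t		opaque_size = MAXALIGN_SKETCH(sizeof(AmBatchData));
	char	   *base = malloc(opaque_size + sizeof(BatchSketch));
	BatchSketch *batch;

	if (base == NULL)
		return NULL;
	batch = (BatchSketch *) (base + opaque_size);
	batch->firstItem = 0;
	batch->lastItem = -1;
	batch_get_data(batch)->buf = buf;
	if (buf != InvalidBuffer)
		pins_held++;
	return batch;
}

/* Release callback: safe to call again after the pin is already gone */
static void
release_batch(BatchSketch *batch)
{
	AmBatchData *data = batch_get_data(batch);

	if (data->buf != InvalidBuffer)
	{
		pins_held--;
		data->buf = InvalidBuffer;
	}
}

/* Free the whole chunk, starting from the opaque area */
static void
batch_free(BatchSketch *batch)
{
	free((char *) batch - MAXALIGN_SKETCH(sizeof(AmBatchData)));
}
```

Because the table AM only ever sees the generic batch pointer and the callback, another AM could presumably put several buffers, or some other kind of interlock, in its private area without the caller changing.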

We have a generic policy that determines (at the level of each scan)
whether an interlock is required at all. If an interlock isn't
required, index AMs will drop both their lock and pin together.
Otherwise, they will only drop the lock, and only drop the
pin/abstract interlock when the new amreleasebatch callback is called
by the table AM. Our policy is to assume that we don't need a TID
recycling interlock except during index-only scans and non-MVCC scans
(regardless of the table AM and index AM involved).
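
The policy above can be sketched as a small predicate (an illustrative sketch only; ScanSketch, its fields, and both function names are hypothetical and do not appear in the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-scan facts that the policy looks at */
typedef struct ScanSketch
{
	bool		index_only;		/* is this an index-only scan? */
	bool		mvcc_snapshot;	/* does the scan use an MVCC snapshot? */
} ScanSketch;

/* What the index AM does when it is done reading an index page */
typedef enum PageExitAction
{
	DROP_LOCK_AND_PIN,			/* no interlock required */
	DROP_LOCK_KEEP_PIN			/* pin dropped later, via amreleasebatch */
} PageExitAction;

/*
 * An interlock against TID recycling is assumed to be needed only for
 * index-only scans and non-MVCC scans, whatever table AM and index AM
 * are involved.
 */
static bool
scan_needs_tid_interlock(const ScanSketch *scan)
{
	assert(scan != NULL);
	return scan->index_only || !scan->mvcc_snapshot;
}

static PageExitAction
page_exit_action(const ScanSketch *scan)
{
	return scan_needs_tid_interlock(scan) ? DROP_LOCK_KEEP_PIN
										  : DROP_LOCK_AND_PIN;
}
```

A plain index scan with an MVCC snapshot takes the DROP_LOCK_AND_PIN path here. Every other combination holds the pin until the table AM asks for its release.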

Separately, I removed the requirement that amgetbatch index AMs
support Valgrind instrumentation requests like those nbtree has long
supported. And I removed all such instrumentation from the hash index
AM.

> > Is this purely because of the potential to
> > disrupt the read stream's management of the backend's buffer pin
> > limit?
>
> I'm not particularly bothered by a small number of extra buffer pins,
> particularly for AMs other than nbtree. They won't cause issues in any real
> world setups.

Okay. Then I don't think that we need to do anything here. As you
know, nbtree never holds buffer pins for its own internal purposes.
The hash index AM continues to hang on to up to 2 extra buffer pins
for its own purposes, just like on master.

--
Peter Geoghegan


Attachments:

  [application/octet-stream] v13-0019-Make-hash-index-AM-use-amgetbatch-interface.patch (41.8K, 2-v13-0019-Make-hash-index-AM-use-amgetbatch-interface.patch)
From dd755c2d8d08b59d66a57fdf562215812822ad9e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 25 Nov 2025 18:03:15 -0500
Subject: [PATCH v13 19/19] Make hash index AM use amgetbatch interface.

Replace hashgettuple with hashgetbatch, a function that implements the
new amgetbatch interface.  Plain index scans of hash indexes now return
matching items in batches consisting of all of the matches from a given
bucket or overflow page.  This gives the core executor the ability to
perform optimizations like index prefetching during hash index scans.

Note that hash index scans will now drop index page buffer pins eagerly
(actually, the table AM will do so on behalf of the hash index AM).
This is a hard requirement for any index AM that adopts the new
amgetbatch interface.  Guaranteeing that open batches won't hold buffer
pins on index pages greatly simplifies resource management during index
prefetching, where the read stream is expected to hold many pins on heap
pages (that's why amgetbatch makes this a hard requirement).

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com
---
 src/include/access/hash.h            |  85 ++-----
 src/backend/access/hash/README       |  31 +--
 src/backend/access/hash/hash.c       | 210 +++++++++++------
 src/backend/access/hash/hash_xlog.c  |   4 +-
 src/backend/access/hash/hashpage.c   |  19 +-
 src/backend/access/hash/hashsearch.c | 338 +++++++++++----------------
 src/backend/access/hash/hashutil.c   | 129 +---------
 src/tools/pgindent/typedefs.list     |   2 -
 8 files changed, 325 insertions(+), 493 deletions(-)

diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index a8702f0e5..84ec6bc40 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -100,57 +100,25 @@ typedef HashPageOpaqueData *HashPageOpaque;
  */
 #define HASHO_PAGE_ID		0xFF80
 
-typedef struct HashScanPosItem	/* what we remember about each match */
+/* Per-batch data private to the hash index AM */
+typedef struct HashBatchData
 {
-	ItemPointerData heapTid;	/* TID of referenced heap item */
-	OffsetNumber indexOffset;	/* index item's location within page */
-} HashScanPosItem;
+	Buffer		buf;			/* index page buffer pin (TID reuse interlock) */
+	BlockNumber currPage;		/* index page with matching items */
+	BlockNumber prevPage;		/* currPage's left link */
+	BlockNumber nextPage;		/* currPage's right link */
+} HashBatchData;
 
-typedef struct HashScanPosData
+/*
+ * Access the hash-private per-batch data from an IndexScanBatch pointer.
+ * This follows the standard convention for index AM opaque state: it can be
+ * found at a fixed negative offset from the IndexScanBatch pointer.
+ */
+static inline HashBatchData *
+HashBatchGetData(IndexScanBatch batch)
 {
-	Buffer		buf;			/* if valid, the buffer is pinned */
-	BlockNumber currPage;		/* current hash index page */
-	BlockNumber nextPage;		/* next overflow page */
-	BlockNumber prevPage;		/* prev overflow or bucket page */
-
-	/*
-	 * The items array is always ordered in index order (ie, increasing
-	 * indexoffset).  When scanning backwards it is convenient to fill the
-	 * array back-to-front, so we start at the last slot and fill downwards.
-	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
-	 * itemIndex is a cursor showing which entry was last returned to caller.
-	 */
-	int			firstItem;		/* first valid index in items[] */
-	int			lastItem;		/* last valid index in items[] */
-	int			itemIndex;		/* current index in items[] */
-
-	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
-} HashScanPosData;
-
-#define HashScanPosIsPinned(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BufferIsValid((scanpos).buf) \
-)
-
-#define HashScanPosIsValid(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BlockNumberIsValid((scanpos).currPage) \
-)
-
-#define HashScanPosInvalidate(scanpos) \
-	do { \
-		(scanpos).buf = InvalidBuffer; \
-		(scanpos).currPage = InvalidBlockNumber; \
-		(scanpos).nextPage = InvalidBlockNumber; \
-		(scanpos).prevPage = InvalidBlockNumber; \
-		(scanpos).firstItem = 0; \
-		(scanpos).lastItem = 0; \
-		(scanpos).itemIndex = 0; \
-	} while (0)
+	return (HashBatchData *) ((char *) batch - MAXALIGN(sizeof(HashBatchData)));
+}
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -178,15 +146,6 @@ typedef struct HashScanOpaqueData
 	 * referred only when hashso_buc_populated is true.
 	 */
 	bool		hashso_buc_split;
-	/* info about killed items if any (killedItems is NULL if never used) */
-	int		   *killedItems;	/* currPos.items indexes of killed items */
-	int			numKilled;		/* number of currently stored items */
-
-	/*
-	 * Identify all the matching items on a page and save them in
-	 * HashScanPosData
-	 */
-	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -368,11 +327,15 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
 					   IndexUniqueCheck checkUnique,
 					   bool indexUnchanged,
 					   struct IndexInfo *indexInfo);
-extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch hashgetbatch(IndexScanDesc scan,
+								   IndexScanBatch priorbatch,
+								   ScanDirection dir);
 extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
+extern void hashkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch);
+extern void hashreleasebatch(IndexScanDesc scan, IndexScanBatch batch);
 extern void hashendscan(IndexScanDesc scan);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
@@ -445,8 +408,9 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 							   uint32 lowmask);
 
 /* hashsearch.c */
-extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _hash_next(IndexScanDesc scan, ScanDirection dir,
+								 IndexScanBatch priorbatch);
+extern IndexScanBatch _hash_first(IndexScanDesc scan, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
@@ -476,7 +440,6 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
 extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
 extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 												 uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan);
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index fc9031117..972bb666b 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -255,28 +255,29 @@ The reader algorithm is:
 		retake the buffer content lock on new bucket
 		arrange to scan the old bucket normally and the new bucket for
          tuples which are not moved-by-split
--- then, per read request:
+-- then, per batch (page) request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
 	the pin on the primary bucket throughout the scan)
-	save all the matching tuples from current index page into an items array
-	release pin and content lock (but if it is primary bucket page retain
-	its pin till the end of the scan)
-	get tuple from an item array
+	save all the matching tuples from current index page into a batch
+	release content lock on current page; return batch to table AM (table AM
+	will drop batch's buffer pin, though primary bucket page pin is kept
+	until the end of the scan)
 -- at scan shutdown:
-	release all pins still held
+	release scan-owned pins (e.g., primary bucket page pin) as needed
 
 Holding the buffer pin on the primary bucket page for the whole scan prevents
-the reader's current-tuple pointer from being invalidated by splits or
-compactions.  (Of course, other buckets can still be split or compacted.)
+the bucket from being reorganized by splits or compactions while the scan is
+in progress.  (Of course, other buckets can still be split or compacted.)
 
-To minimize lock/unlock traffic, hash index scan always searches the entire
-hash page to identify all the matching items at once, copying their heap tuple
-IDs into backend-local storage. The heap tuple IDs are then processed while not
-holding any page lock within the index thereby, allowing concurrent insertion
-to happen on the same index page without any requirement of re-finding the
-current scan position for the reader. We do continue to hold a pin on the
-bucket page, to protect against concurrent deletions and bucket split.
+To minimize lock/unlock traffic, hash index scans always search the entire
+hash page to identify all the matching items at once, returning them in
+batches to the table AM.  The table AM processes batches while no page lock
+is held within the index, allowing concurrent insertion to happen on the
+same index page without any requirement of re-finding the current scan
+position for the reader.  The table AM controls when batch buffer pins are
+dropped.  We do continue to hold a pin on the primary bucket page, to
+protect against concurrent bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 92824aa5d..18e747a27 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -101,9 +101,10 @@ hashhandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = hashadjustmembers,
 		.ambeginscan = hashbeginscan,
 		.amrescan = hashrescan,
-		.amgettuple = hashgettuple,
-		.amgetbatch = NULL,
-		.amkillitemsbatch = NULL,
+		.amgettuple = NULL,
+		.amgetbatch = hashgetbatch,
+		.amkillitemsbatch = hashkillitemsbatch,
+		.amreleasebatch = hashreleasebatch,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
 		.amposreset = NULL,
@@ -286,53 +287,28 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
 
 
 /*
- *	hashgettuple() -- Get the next tuple in the scan.
+ *	hashgetbatch() -- Get the first or next batch of tuples in the scan
  */
-bool
-hashgettuple(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+hashgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
-	/*
-	 * If we've already initialized this scan, we can just advance it in the
-	 * appropriate direction.  If we haven't done so yet, we call a routine to
-	 * get the first item in the scan.
-	 */
-	if (!HashScanPosIsValid(so->currPos))
-		res = _hash_first(scan, dir);
-	else
+	if (priorbatch == NULL)
 	{
-		/*
-		 * Check to see if we should kill the previously-fetched tuple.
-		 */
-		if (scan->kill_prior_tuple)
-		{
-			/*
-			 * Yes, so remember it for later. (We'll deal with all such tuples
-			 * at once right after leaving the index page or at end of scan.)
-			 * In case if caller reverses the indexscan direction it is quite
-			 * possible that the same item might get entered multiple times.
-			 * But, we don't detect that; instead, we just forget any excess
-			 * entries.
-			 */
-			if (so->killedItems == NULL)
-				so->killedItems = palloc_array(int, MaxIndexTuplesPerPage);
+		Relation	rel = scan->indexRelation;
 
-			if (so->numKilled < MaxIndexTuplesPerPage)
-				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
-		}
+		_hash_dropscanbuf(rel, so);
 
-		/*
-		 * Now continue the scan.
-		 */
-		res = _hash_next(scan, dir);
+		/* Initialize the scan, and return first batch of matching items */
+		return _hash_first(scan, dir);
 	}
 
-	return res;
+	/* Return batch positioned after caller's batch (in direction 'dir') */
+	return _hash_next(scan, dir, priorbatch);
 }
 
 
@@ -342,26 +318,26 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 int64
 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	bool		res;
+	IndexScanBatch batch;
 	int64		ntids = 0;
-	HashScanPosItem *currItem;
 
-	res = _hash_first(scan, ForwardScanDirection);
+	batch = _hash_first(scan, ForwardScanDirection);
 
-	while (res)
+	while (batch != NULL)
 	{
-		currItem = &so->currPos.items[so->currPos.itemIndex];
+		for (int itemIndex = batch->firstItem;
+			 itemIndex <= batch->lastItem;
+			 itemIndex++)
+		{
+			tbm_add_tuples(tbm, &batch->items[itemIndex].tableTid, 1, true);
+			ntids++;
+		}
 
 		/*
-		 * _hash_first and _hash_next handle eliminate dead index entries
-		 * whenever scan->ignore_killed_tuples is true.  Therefore, there's
-		 * nothing to do here except add the results to the TIDBitmap.
+		 * _hash_next releases the prior batch for bitmap callers before
+		 * allocating the next one, so only one batch is ever used at a time
 		 */
-		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
-		ntids++;
-
-		res = _hash_next(scan, ForwardScanDirection);
+		batch = _hash_next(scan, ForwardScanDirection, batch);
 	}
 
 	return ntids;
@@ -383,17 +359,16 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc_object(HashScanOpaqueData);
-	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
 
-	so->killedItems = NULL;
-	so->numKilled = 0;
-
 	scan->opaque = so;
+	scan->maxitemsbatch = MaxIndexTuplesPerPage;
+	scan->batch_index_opaque_size = MAXALIGN(sizeof(HashBatchData));
+	scan->batch_tuples_workspace = 0;
 
 	return scan;
 }
@@ -408,18 +383,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	if (HashScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_hash_kill_items(scan);
-	}
-
 	_hash_dropscanbuf(rel, so);
 
-	/* set position invalid (this will cause _hash_first call) */
-	HashScanPosInvalidate(so->currPos);
-
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
@@ -428,6 +393,112 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	so->hashso_buc_split = false;
 }
 
+/*
+ *	hashkillitemsbatch() -- Mark dead items' index tuples LP_DEAD
+ */
+void
+hashkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	Relation	rel = scan->indexRelation;
+	HashBatchData *hashbatch = HashBatchGetData(batch);
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum,
+				maxoff;
+	bool		killedsomething = false;
+	XLogRecPtr	latestlsn;
+
+	Assert(batch->numDead > 0);
+
+	buf = _hash_getbuf(rel, hashbatch->currPage, HASH_READ,
+					   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+
+	latestlsn = BufferGetLSNAtomic(buf);
+	Assert(batch->lsn <= latestlsn);
+	if (batch->lsn != latestlsn)
+	{
+		/* Modified, give up on hinting */
+		_hash_relbuf(rel, buf);
+		return;
+	}
+
+	page = BufferGetPage(buf);
+	opaque = HashPageGetOpaque(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate through batch->deadItems[] in index page order */
+	for (int i = 0; i < batch->numDead; i++)
+	{
+		int			itemIndex = batch->deadItems[i];
+		BatchMatchingItem *currItem = &batch->items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= batch->firstItem &&
+			   itemIndex <= batch->lastItem);
+
+		while (offnum <= maxoff)
+		{
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->tableTid))
+			{
+				if (!killedsomething)
+				{
+					/*
+					 * Use the hint bit infrastructure to check if we can
+					 * update the page while just holding a share lock. If we
+					 * are not allowed, there's no point continuing.
+					 */
+					if (!BufferBeginSetHintBits(buf))
+						goto unlock_page;
+				}
+
+				/* found the item */
+				ItemIdMarkDead(iid);
+				killedsomething = true;
+				break;			/* out of inner search loop */
+			}
+			offnum = OffsetNumberNext(offnum);
+		}
+	}
+
+	/*
+	 * Since this can be redone later if needed, mark as dirty hint. Whenever
+	 * we mark anything LP_DEAD, we also set the page's
+	 * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+	 */
+	if (killedsomething)
+	{
+		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+		BufferFinishSetHintBits(buf, true, true);
+	}
+
+unlock_page:
+	_hash_relbuf(rel, buf);
+}
+
+/*
+ *	hashreleasebatch() -- Release batch's index page buffer pin
+ *
+ * Called by the table AM (via amreleasebatch) when it's safe to drop the
+ * buffer pin held to prevent concurrent TID recycling by VACUUM.
+ * Must be idempotent -- safe to call when the pin has already been released.
+ */
+void
+hashreleasebatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	HashBatchData *hashbatch = HashBatchGetData(batch);
+
+	if (BufferIsValid(hashbatch->buf))
+	{
+		ReleaseBuffer(hashbatch->buf);
+		hashbatch->buf = InvalidBuffer;
+	}
+}
+
 /*
  *	hashendscan() -- close down a scan
  */
@@ -437,17 +508,8 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	if (HashScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_hash_kill_items(scan);
-	}
-
 	_hash_dropscanbuf(rel, so);
 
-	if (so->killedItems != NULL)
-		pfree(so->killedItems);
 	pfree(so);
 	scan->opaque = NULL;
 }
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 2060620c7..e26ee8bb9 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1141,14 +1141,14 @@ hash_mask(char *pagedata, BlockNumber blkno)
 		/*
 		 * In hash bucket and overflow pages, it is possible to modify the
 		 * LP_FLAGS without emitting any WAL record. Hence, mask the line
-		 * pointer flags. See hashgettuple(), _hash_kill_items() for details.
+		 * pointer flags. See hashkillitemsbatch() for details.
 		 */
 		mask_lp_flags(page);
 	}
 
 	/*
 	 * It is possible that the hint bit LH_PAGE_HAS_DEAD_TUPLES may remain
-	 * unlogged. So, mask it. See _hash_kill_items() for details.
+	 * unlogged. So, mask it. See hashkillitemsbatch() for details.
 	 */
 	opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
 }
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 263bc73f1..388d2442a 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -280,31 +280,24 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
- *	_hash_dropscanbuf() -- release buffers used in scan.
+ *	_hash_dropscanbuf() -- release buffers owned by scan.
  *
- * This routine unpins the buffers used during scan on which we
- * hold no lock.
+ * This routine unpins the buffers for the primary bucket page and for the
+ * bucket page of a bucket being split as needed.
  */
 void
 _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
-	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->currPos.buf)
+	if (BufferIsValid(so->hashso_bucket_buf))
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
-	/* release pin we hold on primary bucket page  of bucket being split */
-	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->currPos.buf)
+	/* release pin held on primary bucket page of bucket being split */
+	if (BufferIsValid(so->hashso_split_bucket_buf))
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->currPos.buf))
-		_hash_dropbuf(rel, so->currPos.buf);
-	so->currPos.buf = InvalidBuffer;
-
 	/* reset split scan */
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 89d1c5bc6..73babe8ca 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -22,105 +22,87 @@
 #include "storage/predicate.h"
 #include "utils/rel.h"
 
-static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
-						   ScanDirection dir);
+static bool _hash_readpage(IndexScanDesc scan, Buffer buf, ScanDirection dir,
+						   IndexScanBatch batch);
 static int	_hash_load_qualified_items(IndexScanDesc scan, Page page,
-									   OffsetNumber offnum, ScanDirection dir);
-static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ScanDirection dir,
+									   IndexScanBatch batch);
+static inline void _hash_saveitem(IndexScanBatch batch, int itemIndex,
 								  OffsetNumber offnum, IndexTuple itup);
 static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
 						   Page *pagep, HashPageOpaque *opaquep);
 
 /*
- *	_hash_next() -- Get the next item in a scan.
+ *	_hash_next() -- Get the next batch of items in a scan.
  *
- *		On entry, so->currPos describes the current page, which may
- *		be pinned but not locked, and so->currPos.itemIndex identifies
- *		which item was previously returned.
+ *		On entry, priorbatch describes the current page batch with items
+ *		already returned.
  *
- *		On successful exit, scan->xs_heaptid is set to the TID of the next
- *		heap tuple.  so->currPos is updated as needed.
+ *		On successful exit, returns a batch containing matching items from
+ *		next page.  Otherwise returns NULL, indicating that there are no
+ *		further matches.  No locks are ever held when we return.
  *
- *		On failure exit (no more tuples), we return false with pin
- *		held on bucket page but no pins or locks held on overflow
- *		page.
+ *		Retains pins according to the same rules as _hash_first.
  */
-bool
-_hash_next(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+_hash_next(IndexScanDesc scan, ScanDirection dir, IndexScanBatch priorbatch)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	HashScanPosItem *currItem;
+	HashBatchData *hashpriorbatch = HashBatchGetData(priorbatch);
 	BlockNumber blkno;
 	Buffer		buf;
-	bool		end_of_scan = false;
+	IndexScanBatch batch;
 
 	/*
-	 * Advance to the next tuple on the current page; or if done, try to read
-	 * data from the next or previous page based on the scan direction. Before
-	 * moving to the next or previous page make sure that we deal with all the
-	 * killed items.
+	 * Determine which page to read next based on scan direction and details
+	 * taken from the prior batch
 	 */
 	if (ScanDirectionIsForward(dir))
-	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
-		{
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
+		blkno = hashpriorbatch->nextPage;
+	else
+		blkno = hashpriorbatch->prevPage;
 
-			blkno = so->currPos.nextPage;
-			if (BlockNumberIsValid(blkno))
-			{
-				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
-				if (!_hash_readpage(scan, &buf, dir))
-					end_of_scan = true;
-			}
-			else
-				end_of_scan = true;
-		}
-	}
+	/*
+	 * For bitmap scan callers, release the prior batch now so that the
+	 * allocation below can reuse its memory.  This way bitmap scans never
+	 * need more than one batch allocation.
+	 */
+	if (!scan->usebatchring)
+		indexam_util_batch_release(scan, priorbatch);
+
+	if (!BlockNumberIsValid(blkno))
+		return NULL;
+
+	/* Allocate space for next batch */
+	batch = indexam_util_batch_alloc(scan);
+
+	/* Get the buffer for next batch */
+	if (ScanDirectionIsForward(dir))
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
 	else
 	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
-		{
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
+		buf = _hash_getbuf(rel, blkno, HASH_READ,
+						   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 
-			blkno = so->currPos.prevPage;
-			if (BlockNumberIsValid(blkno))
-			{
-				buf = _hash_getbuf(rel, blkno, HASH_READ,
-								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-
-				/*
-				 * We always maintain the pin on bucket page for whole scan
-				 * operation, so releasing the additional pin we have acquired
-				 * here.
-				 */
-				if (buf == so->hashso_bucket_buf ||
-					buf == so->hashso_split_bucket_buf)
-					_hash_dropbuf(rel, buf);
-
-				if (!_hash_readpage(scan, &buf, dir))
-					end_of_scan = true;
-			}
-			else
-				end_of_scan = true;
-		}
+		/*
+		 * We always maintain the pin on bucket page for whole scan operation,
+		 * so releasing the additional pin we have acquired here.
+		 */
+		if (buf == so->hashso_bucket_buf ||
+			buf == so->hashso_split_bucket_buf)
+			_hash_dropbuf(rel, buf);
 	}
 
-	if (end_of_scan)
+	/* Read the next page and load items into allocated batch */
+	if (!_hash_readpage(scan, buf, dir, batch))
 	{
-		_hash_dropscanbuf(rel, so);
-		HashScanPosInvalidate(so->currPos);
-		return false;
+		indexam_util_batch_release(scan, batch);
+		return NULL;
 	}
 
-	/* OK, itemIndex says what to return */
-	currItem = &so->currPos.items[so->currPos.itemIndex];
-	scan->xs_heaptid = currItem->heapTid;
-
-	return true;
+	/* Return the batch containing matched items from next page */
+	return batch;
 }
 
 /*
@@ -270,22 +252,20 @@ _hash_readprev(IndexScanDesc scan,
 }
 
 /*
- *	_hash_first() -- Find the first item in a scan.
+ *	_hash_first() -- Find the first batch of items in a scan.
  *
- *		We find the first item (or, if backward scan, the last item) in the
- *		index that satisfies the qualification associated with the scan
- *		descriptor.
+ *		We find the first batch of items (or, if backward scan, the last
+ *		batch) in the index that satisfies the qualification associated with
+ *		the scan descriptor.
  *
- *		On successful exit, if the page containing current index tuple is an
- *		overflow page, both pin and lock are released whereas if it is a bucket
- *		page then it is pinned but not locked and data about the matching
- *		tuple(s) on the page has been loaded into so->currPos,
- *		scan->xs_heaptid is set to the heap TID of the current tuple.
+ *		On successful exit, returns a batch containing matching items.
+ *		Otherwise returns NULL, indicating that there are no further matches.
+ *		No locks are ever held when we return.
  *
- *		On failure exit (no more tuples), we return false, with pin held on
- *		bucket page but no pins or locks held on overflow page.
+ *		We always retain our own pin on the bucket page.  When we return a
+ *		batch with a bucket page, it will retain its own reference pin.
  */
-bool
+IndexScanBatch
 _hash_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
@@ -296,7 +276,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	HashScanPosItem *currItem;
+	IndexScanBatch batch;
 
 	pgstat_count_index_scan(rel);
 	if (scan->instrument)
@@ -326,7 +306,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	 * items in the index.
 	 */
 	if (cur->sk_flags & SK_ISNULL)
-		return false;
+		return NULL;
 
 	/*
 	 * Okay to compute the hash key.  We want to do this before acquiring any
@@ -419,191 +399,152 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* remember which buffer we have pinned, if any */
-	Assert(BufferIsInvalid(so->currPos.buf));
-	so->currPos.buf = buf;
+	/* Allocate space for first batch */
+	batch = indexam_util_batch_alloc(scan);
 
-	/* Now find all the tuples satisfying the qualification from a page */
-	if (!_hash_readpage(scan, &buf, dir))
-		return false;
+	/* Read the first page and load items into allocated batch */
+	if (!_hash_readpage(scan, buf, dir, batch))
+	{
+		indexam_util_batch_release(scan, batch);
+		return NULL;
+	}
 
-	/* OK, itemIndex says what to return */
-	currItem = &so->currPos.items[so->currPos.itemIndex];
-	scan->xs_heaptid = currItem->heapTid;
-
-	/* if we're here, _hash_readpage found a valid tuples */
-	return true;
+	/* Return the batch containing matched items */
+	return batch;
 }
 
 /*
- *	_hash_readpage() -- Load data from current index page into so->currPos
+ *	_hash_readpage() -- Load data from current index page into batch
  *
  *	We scan all the items in the current index page and save them into
- *	so->currPos if it satisfies the qualification. If no matching items
+ *	the batch if they satisfy the qualification. If no matching items
  *	are found in the current page, we move to the next or previous page
  *	in a bucket chain as indicated by the direction.
  *
  *	Return true if any matching items are found else return false.
  */
 static bool
-_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+_hash_readpage(IndexScanDesc scan, Buffer buf, ScanDirection dir,
+			   IndexScanBatch batch)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Buffer		buf;
+	HashBatchData *hashbatch = HashBatchGetData(batch);
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum;
 	uint16		itemIndex;
 
-	buf = *bufP;
 	Assert(BufferIsValid(buf));
 	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 	page = BufferGetPage(buf);
 	opaque = HashPageGetOpaque(page);
 
-	so->currPos.buf = buf;
-	so->currPos.currPage = BufferGetBlockNumber(buf);
+	hashbatch->buf = buf;
+	hashbatch->currPage = BufferGetBlockNumber(buf);
+	batch->dir = dir;
 
 	if (ScanDirectionIsForward(dir))
 	{
-		BlockNumber prev_blkno = InvalidBlockNumber;
-
 		for (;;)
 		{
 			/* new page, locate starting position by binary search */
 			offnum = _hash_binsearch(page, so->hashso_sk_hash);
 
-			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir,
+												   batch);
 
 			if (itemIndex != 0)
 				break;
 
 			/*
-			 * Could not find any matching tuples in the current page, move to
-			 * the next page. Before leaving the current page, deal with any
-			 * killed items.
+			 * Could not find any matching tuples in the current page, try to
+			 * move to the next page.
 			 */
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			/*
-			 * If this is a primary bucket page, hasho_prevblkno is not a real
-			 * block number.
-			 */
-			if (so->currPos.buf == so->hashso_bucket_buf ||
-				so->currPos.buf == so->hashso_split_bucket_buf)
-				prev_blkno = InvalidBlockNumber;
-			else
-				prev_blkno = opaque->hasho_prevblkno;
-
 			_hash_readnext(scan, &buf, &page, &opaque);
-			if (BufferIsValid(buf))
-			{
-				so->currPos.buf = buf;
-				so->currPos.currPage = BufferGetBlockNumber(buf);
-			}
-			else
-			{
-				/*
-				 * Remember next and previous block numbers for scrollable
-				 * cursors to know the start position and return false
-				 * indicating that no more matching tuples were found. Also,
-				 * don't reset currPage or lsn, because we expect
-				 * _hash_kill_items to be called for the old page after this
-				 * function returns.
-				 */
-				so->currPos.prevPage = prev_blkno;
-				so->currPos.nextPage = InvalidBlockNumber;
-				so->currPos.buf = buf;
+			if (!BufferIsValid(buf))
 				return false;
-			}
+
+			hashbatch->buf = buf;
+			hashbatch->currPage = BufferGetBlockNumber(buf);
 		}
 
-		so->currPos.firstItem = 0;
-		so->currPos.lastItem = itemIndex - 1;
-		so->currPos.itemIndex = 0;
+		batch->firstItem = 0;
+		batch->lastItem = itemIndex - 1;
 	}
 	else
 	{
-		BlockNumber next_blkno = InvalidBlockNumber;
-
 		for (;;)
 		{
 			/* new page, locate starting position by binary search */
 			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 
-			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir,
+												   batch);
 
 			if (itemIndex != MaxIndexTuplesPerPage)
 				break;
 
 			/*
-			 * Could not find any matching tuples in the current page, move to
-			 * the previous page. Before leaving the current page, deal with
-			 * any killed items.
+			 * Could not find any matching tuples in the current page, try to
+			 * move to the previous page.
 			 */
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			if (so->currPos.buf == so->hashso_bucket_buf ||
-				so->currPos.buf == so->hashso_split_bucket_buf)
-				next_blkno = opaque->hasho_nextblkno;
-
 			_hash_readprev(scan, &buf, &page, &opaque);
-			if (BufferIsValid(buf))
-			{
-				so->currPos.buf = buf;
-				so->currPos.currPage = BufferGetBlockNumber(buf);
-			}
-			else
-			{
-				/*
-				 * Remember next and previous block numbers for scrollable
-				 * cursors to know the start position and return false
-				 * indicating that no more matching tuples were found. Also,
-				 * don't reset currPage or lsn, because we expect
-				 * _hash_kill_items to be called for the old page after this
-				 * function returns.
-				 */
-				so->currPos.prevPage = InvalidBlockNumber;
-				so->currPos.nextPage = next_blkno;
-				so->currPos.buf = buf;
+			if (!BufferIsValid(buf))
 				return false;
-			}
+
+			hashbatch->buf = buf;
+			hashbatch->currPage = BufferGetBlockNumber(buf);
 		}
 
-		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		batch->firstItem = itemIndex;
+		batch->lastItem = MaxIndexTuplesPerPage - 1;
 	}
 
-	if (so->currPos.buf == so->hashso_bucket_buf ||
-		so->currPos.buf == so->hashso_split_bucket_buf)
+	/*
+	 * Saved at least one match in batch.items[].  Prepare for hashgetbatch to
+	 * return it by initializing remaining uninitialized fields.
+	 */
+	if (hashbatch->buf == so->hashso_bucket_buf ||
+		hashbatch->buf == so->hashso_split_bucket_buf)
 	{
-		so->currPos.prevPage = InvalidBlockNumber;
-		so->currPos.nextPage = opaque->hasho_nextblkno;
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		/*
+		 * Batch's buffer is either the primary bucket, or a bucket being
+		 * populated due to a split.
+		 *
+		 * Increment local reference count so that batch gets an independent
+		 * buffer reference that can be released (by the core code/table AM)
+		 * before the hashso_bucket_buf/hashso_split_bucket_buf references are
+		 * released.
+		 */
+		IncrBufferRefCount(hashbatch->buf);
+
+		/* Can only use opaque->hasho_nextblkno */
+		hashbatch->prevPage = InvalidBlockNumber;
+		hashbatch->nextPage = opaque->hasho_nextblkno;
 	}
 	else
 	{
-		so->currPos.prevPage = opaque->hasho_prevblkno;
-		so->currPos.nextPage = opaque->hasho_nextblkno;
-		_hash_relbuf(rel, so->currPos.buf);
-		so->currPos.buf = InvalidBuffer;
+		/* Can use opaque->hasho_prevblkno and opaque->hasho_nextblkno */
+		hashbatch->prevPage = opaque->hasho_prevblkno;
+		hashbatch->nextPage = opaque->hasho_nextblkno;
 	}
 
-	Assert(so->currPos.firstItem <= so->currPos.lastItem);
+	/* we saved one or more matches in batch.items[] */
+	indexam_util_batch_unlock(scan, batch, hashbatch->buf);
+
+	Assert(batch->firstItem <= batch->lastItem);
 	return true;
 }
 
 /*
  * Load all the qualified items from a current index page
- * into so->currPos. Helper function for _hash_readpage.
+ * into batch. Helper function for _hash_readpage.
  */
 static int
 _hash_load_qualified_items(IndexScanDesc scan, Page page,
-						   OffsetNumber offnum, ScanDirection dir)
+						   OffsetNumber offnum, ScanDirection dir,
+						   IndexScanBatch batch)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	IndexTuple	itup;
@@ -640,7 +581,7 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 				_hash_checkqual(scan, itup))
 			{
 				/* tuple is qualified, so remember it */
-				_hash_saveitem(so, itemIndex, offnum, itup);
+				_hash_saveitem(batch, itemIndex, offnum, itup);
 				itemIndex++;
 			}
 			else
@@ -687,7 +628,7 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 			{
 				itemIndex--;
 				/* tuple is qualified, so remember it */
-				_hash_saveitem(so, itemIndex, offnum, itup);
+				_hash_saveitem(batch, itemIndex, offnum, itup);
 			}
 			else
 			{
@@ -706,13 +647,14 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 	}
 }
 
-/* Save an index item into so->currPos.items[itemIndex] */
+/* Save an index item into batch->items[itemIndex] */
 static inline void
-_hash_saveitem(HashScanOpaque so, int itemIndex,
+_hash_saveitem(IndexScanBatch batch, int itemIndex,
 			   OffsetNumber offnum, IndexTuple itup)
 {
-	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *currItem = &batch->items[itemIndex];
 
-	currItem->heapTid = itup->t_tid;
+	currItem->tableTid = itup->t_tid;
 	currItem->indexOffset = offnum;
+	currItem->tupleOffset = 0;
 }
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 3e16119d0..331d5f4da 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -16,7 +16,6 @@
 
 #include "access/hash.h"
 #include "access/reloptions.h"
-#include "access/relscan.h"
 #include "port/pg_bitutils.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
@@ -33,7 +32,7 @@ _hash_checkqual(IndexScanDesc scan, IndexTuple itup)
 	/*
 	 * Currently, we can't check any of the scan conditions since we do not
 	 * have the original index entry value to supply to the sk_func. Always
-	 * return true; we expect that hashgettuple already set the recheck flag
+	 * return true; we expect that hashgetbatch already set the recheck flag
 	 * to make the main indexscan code do it.
 	 */
 #ifdef NOT_USED
@@ -505,129 +504,3 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 
 	return new_bucket;
 }
-
-/*
- * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
- * told us were killed.
- *
- * scan->opaque, referenced locally through so, contains information about the
- * current page and killed tuples thereon (generally, this should only be
- * called if so->numKilled > 0).
- *
- * The caller does not have a lock on the page and may or may not have the
- * page pinned in a buffer.  Note that read-lock is sufficient for setting
- * LP_DEAD status (which is only a hint).
- *
- * The caller must have pin on bucket buffer, but may or may not have pin
- * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
- *
- * We match items by heap TID before assuming they are the right ones to
- * delete.
- *
- * There are never any scans active in a bucket at the time VACUUM begins,
- * because VACUUM takes a cleanup lock on the primary bucket page and scans
- * hold a pin.  A scan can begin after VACUUM leaves the primary bucket page
- * but before it finishes the entire bucket, but it can never pass VACUUM,
- * because VACUUM always locks the next page before releasing the lock on
- * the previous one.  Therefore, we don't have to worry about accidentally
- * killing a TID that has been reused for an unrelated tuple.
- */
-void
-_hash_kill_items(IndexScanDesc scan)
-{
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	BlockNumber blkno;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber offnum,
-				maxoff;
-	int			numKilled = so->numKilled;
-	int			i;
-	bool		killedsomething = false;
-	bool		havePin = false;
-
-	Assert(so->numKilled > 0);
-	Assert(so->killedItems != NULL);
-	Assert(HashScanPosIsValid(so->currPos));
-
-	/*
-	 * Always reset the scan state, so we don't look for same items on other
-	 * pages.
-	 */
-	so->numKilled = 0;
-
-	blkno = so->currPos.currPage;
-	if (HashScanPosIsPinned(so->currPos))
-	{
-		/*
-		 * We already have pin on this buffer, so, all we need to do is
-		 * acquire lock on it.
-		 */
-		havePin = true;
-		buf = so->currPos.buf;
-		LockBuffer(buf, BUFFER_LOCK_SHARE);
-	}
-	else
-		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
-
-	page = BufferGetPage(buf);
-	opaque = HashPageGetOpaque(page);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (i = 0; i < numKilled; i++)
-	{
-		int			itemIndex = so->killedItems[i];
-		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
-
-		offnum = currItem->indexOffset;
-
-		Assert(itemIndex >= so->currPos.firstItem &&
-			   itemIndex <= so->currPos.lastItem);
-
-		while (offnum <= maxoff)
-		{
-			ItemId		iid = PageGetItemId(page, offnum);
-			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
-
-			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
-			{
-				if (!killedsomething)
-				{
-					/*
-					 * Use the hint bit infrastructure to check if we can
-					 * update the page while just holding a share lock. If we
-					 * are not allowed, there's no point continuing.
-					 */
-					if (!BufferBeginSetHintBits(so->currPos.buf))
-						goto unlock_page;
-				}
-
-				/* found the item */
-				ItemIdMarkDead(iid);
-				killedsomething = true;
-				break;			/* out of inner search loop */
-			}
-			offnum = OffsetNumberNext(offnum);
-		}
-	}
-
-	/*
-	 * Since this can be redone later if needed, mark as dirty hint. Whenever
-	 * we mark anything LP_DEAD, we also set the page's
-	 * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
-	 */
-	if (killedsomething)
-	{
-		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		BufferFinishSetHintBits(so->currPos.buf, true, true);
-	}
-
-unlock_page:
-	if (so->hashso_bucket_buf == so->currPos.buf ||
-		havePin)
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
-	else
-		_hash_relbuf(rel, buf);
-}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4d470a051..76c26ea46 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1205,8 +1205,6 @@ HashPageStat
 HashPath
 HashScanOpaque
 HashScanOpaqueData
-HashScanPosData
-HashScanPosItem
 HashSkewBucket
 HashState
 HashValueFunc
-- 
2.53.0



  [application/octet-stream] v13-0016-WIP-read_stream-Prevent-distance-from-decaying-t.patch (2.9K, 3-v13-0016-WIP-read_stream-Prevent-distance-from-decaying-t.patch)
From fa7ca9459575f88443d0885f51804add8b587a4f Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 17:25:25 -0500
Subject: [PATCH v13 16/19] WIP: read_stream: Prevent distance from decaying
 too quickly

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 33 ++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index e28ab5de0..e3c16bd17 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -99,6 +99,7 @@ struct ReadStream
 	int16		forwarded_buffers;
 	int16		pinned_buffers;
 	int16		distance;
+	uint16		distance_decay_holdoff;
 	int16		initialized_buffers;
 	int16		resume_distance;
 	int			read_buffers_flags;
@@ -364,9 +365,22 @@ read_stream_start_pending_read(ReadStream *stream)
 	/* Remember whether we need to wait before returning this buffer. */
 	if (!need_wait)
 	{
-		/* Look-ahead distance decays, no I/O necessary. */
-		if (stream->distance > 1)
-			stream->distance--;
+		/*
+		 * If there currently is no IO in progress, and we have not needed to
+		 * issue IO recently, decay the look-ahead distance.  We detect if we
+		 * had to issue IO recently by having a decay holdoff that's set to
+		 * the max lookahead distance whenever we need to do IO.  This is
+		 * important to ensure we eventually reach a high enough distance to
+		 * perform IO asynchronously when starting out with a small lookahead
+		 * distance.
+		 */
+		if (stream->distance > 1 && stream->ios_in_progress == 0)
+		{
+			if (stream->distance_decay_holdoff == 0)
+				stream->distance--;
+			else
+				stream->distance_decay_holdoff--;
+		}
 	}
 	else
 	{
@@ -702,6 +716,7 @@ read_stream_begin_impl(int flags,
 	stream->seq_blocknum = InvalidBlockNumber;
 	stream->seq_until_processed = InvalidBlockNumber;
 	stream->temporary = SmgrIsTemp(smgr);
+	stream->distance_decay_holdoff = 0;
 
 	/*
 	 * Skip the initial ramp-up phase if the caller says we're going to be
@@ -944,6 +959,18 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		if (++stream->oldest_io_index == stream->max_ios)
 			stream->oldest_io_index = 0;
 
+		/*
+		 * As we needed IO, prevent distance from being reduced within our
+		 * maximum lookahead window. This avoids having distance collapse too
+		 * quickly in workloads where most of the required blocks are cached,
+		 * but where the remaining IOs are a significant enough factor to cause
+		 * a substantial slowdown if executed synchronously.
+		 *
+		 * XXX: Not obvious whether we should use max_ios or
+		 * max_pinned_buffers. Or something else entirely.
+		 */
+		stream->distance_decay_holdoff = stream->max_ios;
+
 		/*
 		 * Look-ahead distance ramps up rapidly after we needed to wait for
 		 * IO. We only increase the distance when we needed to wait, to avoid
-- 
2.53.0
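
The decay holdoff introduced by the patch above can be modelled in isolation. The following is a rough sketch, not the read_stream.c code: ToyStream and both functions are hypothetical, only the field names are borrowed from the patch, and the distance ramp-up logic is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant ReadStream fields */
typedef struct ToyStream
{
	int16_t		distance;
	int16_t		max_ios;
	int16_t		ios_in_progress;
	uint16_t	distance_decay_holdoff;
} ToyStream;

/*
 * A block was found without needing to wait for IO.  Distance only decays
 * when no IO is in flight and the holdoff has been fully used up.
 */
static void
toy_block_was_cached(ToyStream *s)
{
	assert(s->distance >= 1);

	if (s->distance > 1 && s->ios_in_progress == 0)
	{
		if (s->distance_decay_holdoff == 0)
			s->distance--;
		else
			s->distance_decay_holdoff--;
	}
}

/* An IO was actually needed: re-arm the holdoff (the patch uses max_ios) */
static void
toy_io_was_needed(ToyStream *s)
{
	s->distance_decay_holdoff = (uint16_t) s->max_ios;
}
```

With distance = 8 and max_ios = 4, one call to toy_io_was_needed followed by four cached blocks leaves distance at 8, and only the fifth cached block reduces it to 7. Whether max_ios is the right value to re-arm with is the open question flagged by the XXX comment in the patch.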



  [application/octet-stream] v13-0018-Add-fake-LSN-support-to-hash-index-AM.patch (14.0K, 4-v13-0018-Add-fake-LSN-support-to-hash-index-AM.patch)
From 5a9042c2c273b19cf59780c58462ce8c592fc30e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sun, 18 Jan 2026 11:32:52 -0500
Subject: [PATCH v13 18/19] Add fake LSN support to hash index AM.

This is preparation for an upcoming patch that will add the amgetbatch
interface and switch hash over to it (from amgettuple).  We need fake
LSNs so that behavior equivalent to nbtree's previous dropPin behavior
can be applied safely to scans of unlogged hash indexes.

The commit that will add hashgetbatch will replace _hash_kill_items with
a new hashkillitemsbatch routine.  This will be very similar to the
btkillitemsbatch routine added by commit XXXXX.  In particular, it will
use the same "did the index page's LSN change since the page was first
read?" trick.

Author: Peter Geoghegan <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
---
 src/backend/access/hash/hash.c       |  21 +++--
 src/backend/access/hash/hashinsert.c |  20 +++--
 src/backend/access/hash/hashovfl.c   | 111 ++++++++++++++++-----------
 src/backend/access/hash/hashpage.c   |  22 +++---
 4 files changed, 105 insertions(+), 69 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5b5c5c6fa..92824aa5d 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -476,6 +476,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	XLogRecPtr	recptr;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -615,7 +616,6 @@ loop_top:
 	if (RelationNeedsWAL(rel))
 	{
 		xl_hash_update_meta_page xlrec;
-		XLogRecPtr	recptr;
 
 		xlrec.ntuples = metap->hashm_ntuples;
 
@@ -625,8 +625,11 @@ loop_top:
 		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
-		PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(BufferGetPage(metabuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -699,6 +702,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	Buffer		buf;
 	Bucket		new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
 	bool		bucket_dirty = false;
+	XLogRecPtr	recptr;
 
 	blkno = bucket_blkno;
 	buf = bucket_buf;
@@ -821,7 +825,6 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			if (RelationNeedsWAL(rel))
 			{
 				xl_hash_delete xlrec;
-				XLogRecPtr	recptr;
 
 				xlrec.clear_dead_marking = clear_dead_marking;
 				xlrec.is_primary_bucket_page = (buf == bucket_buf);
@@ -846,8 +849,11 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 									ndeletable * sizeof(OffsetNumber));
 
 				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
-				PageSetLSN(BufferGetPage(buf), recptr);
 			}
+			else
+				recptr = XLogGetFakeLSN(rel);
+
+			PageSetLSN(BufferGetPage(buf), recptr);
 
 			END_CRIT_SECTION();
 		}
@@ -906,14 +912,15 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		/* XLOG stuff */
 		if (RelationNeedsWAL(rel))
 		{
-			XLogRecPtr	recptr;
-
 			XLogBeginInsert();
 			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
 
 			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
-			PageSetLSN(page, recptr);
 		}
+		else
+			recptr = XLogGetFakeLSN(rel);
+
+		PageSetLSN(page, recptr);
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 0cefbacc9..3395bbc13 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -50,6 +50,7 @@ _hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel, bool sorted)
 	uint32		hashkey;
 	Bucket		bucket;
 	OffsetNumber itup_off;
+	XLogRecPtr	recptr;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -216,7 +217,6 @@ restart_insert:
 	if (RelationNeedsWAL(rel))
 	{
 		xl_hash_insert xlrec;
-		XLogRecPtr	recptr;
 
 		xlrec.offnum = itup_off;
 
@@ -229,10 +229,12 @@ restart_insert:
 		XLogRegisterBufData(0, itup, IndexTupleSize(itup));
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
-
-		PageSetLSN(BufferGetPage(buf), recptr);
-		PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(BufferGetPage(buf), recptr);
+	PageSetLSN(BufferGetPage(metabuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -372,6 +374,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 	Page		page = BufferGetPage(buf);
 	HashPageOpaque pageopaque;
 	HashMetaPage metap;
+	XLogRecPtr	recptr;
 
 	/* Scan each tuple in page to see if it is marked as LP_DEAD */
 	maxoff = PageGetMaxOffsetNumber(page);
@@ -424,7 +427,6 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		if (RelationNeedsWAL(rel))
 		{
 			xl_hash_vacuum_one_page xlrec;
-			XLogRecPtr	recptr;
 
 			xlrec.isCatalogRel = RelationIsAccessibleInLogicalDecoding(hrel);
 			xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
@@ -445,10 +447,12 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 			XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
 
 			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
-
-			PageSetLSN(BufferGetPage(buf), recptr);
-			PageSetLSN(BufferGetPage(metabuf), recptr);
 		}
+		else
+			recptr = XLogGetFakeLSN(rel);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
 
 		END_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 8cfb6ce75..abd1f91fa 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -132,6 +132,7 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	uint32		i,
 				j;
 	bool		page_found = false;
+	XLogRecPtr	recptr;
 
 	/*
 	 * Write-lock the tail page.  Here, we need to maintain locking order such
@@ -381,7 +382,6 @@ found:
 	/* XLOG stuff */
 	if (RelationNeedsWAL(rel))
 	{
-		XLogRecPtr	recptr;
 		xl_hash_add_ovfl_page xlrec;
 
 		xlrec.bmpage_found = page_found;
@@ -408,18 +408,20 @@ found:
 		XLogRegisterBufData(4, &metap->hashm_firstfree, sizeof(uint32));
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
-
-		PageSetLSN(BufferGetPage(ovflbuf), recptr);
-		PageSetLSN(BufferGetPage(buf), recptr);
-
-		if (BufferIsValid(mapbuf))
-			PageSetLSN(BufferGetPage(mapbuf), recptr);
-
-		if (BufferIsValid(newmapbuf))
-			PageSetLSN(BufferGetPage(newmapbuf), recptr);
-
-		PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(BufferGetPage(ovflbuf), recptr);
+	PageSetLSN(BufferGetPage(buf), recptr);
+
+	if (BufferIsValid(mapbuf))
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+	if (BufferIsValid(newmapbuf))
+		PageSetLSN(BufferGetPage(newmapbuf), recptr);
+
+	PageSetLSN(BufferGetPage(metabuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -510,7 +512,11 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	Bucket		bucket PG_USED_FOR_ASSERTS_ONLY;
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		nextbuf = InvalidBuffer;
-	bool		update_metap = false;
+	bool		update_metap = false,
+				mod_wbuf,
+				is_prim_bucket_same_wrt,
+				is_prev_bucket_same_wrt;
+	XLogRecPtr	recptr;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -641,19 +647,21 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		MarkBufferDirty(metabuf);
 	}
 
+	/* Determine which pages WAL record modifies */
+	mod_wbuf = false;
+	is_prim_bucket_same_wrt = (wbuf == bucketbuf);
+	is_prev_bucket_same_wrt = (wbuf == prevbuf);
+
 	/* XLOG stuff */
 	if (RelationNeedsWAL(rel))
 	{
 		xl_hash_squeeze_page xlrec;
-		XLogRecPtr	recptr;
-		int			i;
-		bool		mod_wbuf = false;
 
 		xlrec.prevblkno = prevblkno;
 		xlrec.nextblkno = nextblkno;
 		xlrec.ntups = nitups;
-		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf);
-		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf);
+		xlrec.is_prim_bucket_same_wrt = is_prim_bucket_same_wrt;
+		xlrec.is_prev_bucket_same_wrt = is_prev_bucket_same_wrt;
 
 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, SizeOfHashSqueezePage);
@@ -662,14 +670,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		 * bucket buffer was not changed, but still needs to be registered to
 		 * ensure that we can acquire a cleanup lock on it during replay.
 		 */
-		if (!xlrec.is_prim_bucket_same_wrt)
+		if (!is_prim_bucket_same_wrt)
 		{
 			uint8		flags = REGBUF_STANDARD | REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;
 
 			XLogRegisterBuffer(0, bucketbuf, flags);
 		}
 
-		if (xlrec.ntups > 0)
+		if (nitups > 0)
 		{
 			XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 
@@ -678,10 +686,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 
 			XLogRegisterBufData(1, itup_offsets,
 								nitups * sizeof(OffsetNumber));
-			for (i = 0; i < nitups; i++)
+			for (int i = 0; i < nitups; i++)
 				XLogRegisterBufData(1, itups[i], tups_size[i]);
 		}
-		else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
+		else if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt)
 		{
 			uint8		wbuf_flags;
 
@@ -691,10 +699,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			 * if it is the same as primary bucket buffer or update the
 			 * nextblkno if it is same as the previous bucket buffer.
 			 */
-			Assert(xlrec.ntups == 0);
+			Assert(nitups == 0);
 
 			wbuf_flags = REGBUF_STANDARD;
-			if (!xlrec.is_prev_bucket_same_wrt)
+			if (!is_prev_bucket_same_wrt)
 			{
 				wbuf_flags |= REGBUF_NO_CHANGE;
 			}
@@ -714,7 +722,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		 * prevpage.  During replay, we can directly update the nextblock in
 		 * writepage.
 		 */
-		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+		if (BufferIsValid(prevbuf) && !is_prev_bucket_same_wrt)
 			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
 
 		if (BufferIsValid(nextbuf))
@@ -730,23 +738,33 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		}
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
-
-		/* Set LSN iff wbuf is modified. */
-		if (mod_wbuf)
-			PageSetLSN(BufferGetPage(wbuf), recptr);
-
-		PageSetLSN(BufferGetPage(ovflbuf), recptr);
-
-		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
-			PageSetLSN(BufferGetPage(prevbuf), recptr);
-		if (BufferIsValid(nextbuf))
-			PageSetLSN(BufferGetPage(nextbuf), recptr);
-
-		PageSetLSN(BufferGetPage(mapbuf), recptr);
-
-		if (update_metap)
-			PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
+	else						/* !RelationNeedsWAL(rel) */
+	{
+		recptr = XLogGetFakeLSN(rel);
+
+		/* Determine if wbuf is modified */
+		if (nitups > 0)
+			mod_wbuf = true;
+		else if (is_prev_bucket_same_wrt)
+			mod_wbuf = true;
+	}
+
+	/* Set LSN iff wbuf is modified. */
+	if (mod_wbuf)
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+
+	PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+	if (BufferIsValid(prevbuf) && !is_prev_bucket_same_wrt)
+		PageSetLSN(BufferGetPage(prevbuf), recptr);
+	if (BufferIsValid(nextbuf))
+		PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+	PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+	if (update_metap)
+		PageSetLSN(BufferGetPage(metabuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -959,6 +977,8 @@ readpage:
 
 				if (nitups > 0)
 				{
+					XLogRecPtr	recptr;
+
 					Assert(nitups == ndeletable);
 
 					/*
@@ -986,7 +1006,6 @@ readpage:
 					/* XLOG stuff */
 					if (RelationNeedsWAL(rel))
 					{
-						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
 
 						xlrec.ntups = nitups;
@@ -1018,10 +1037,12 @@ readpage:
 											ndeletable * sizeof(OffsetNumber));
 
 						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
-
-						PageSetLSN(BufferGetPage(wbuf), recptr);
-						PageSetLSN(BufferGetPage(rbuf), recptr);
 					}
+					else
+						recptr = XLogGetFakeLSN(rel);
+
+					PageSetLSN(BufferGetPage(wbuf), recptr);
+					PageSetLSN(BufferGetPage(rbuf), recptr);
 
 					END_CRIT_SECTION();
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 8e220a3ae..263bc73f1 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -630,6 +630,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	uint32		lowmask;
 	bool		metap_update_masks = false;
 	bool		metap_update_splitpoint = false;
+	XLogRecPtr	recptr;
 
 restart_expand:
 
@@ -900,7 +901,6 @@ restart_expand:
 	if (RelationNeedsWAL(rel))
 	{
 		xl_hash_split_allocate_page xlrec;
-		XLogRecPtr	recptr;
 
 		xlrec.new_bucket = maxbucket;
 		xlrec.old_bucket_flag = oopaque->hasho_flag;
@@ -933,11 +933,13 @@ restart_expand:
 		XLogRegisterData(&xlrec, SizeOfHashSplitAllocPage);
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
-
-		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
-		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
-		PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+	PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+	PageSetLSN(BufferGetPage(metabuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -1092,6 +1094,7 @@ _hash_splitbucket(Relation rel,
 	Size		all_tups_size = 0;
 	int			i;
 	uint16		nitups = 0;
+	XLogRecPtr	recptr;
 
 	bucket_obuf = obuf;
 	opage = BufferGetPage(obuf);
@@ -1296,7 +1299,6 @@ _hash_splitbucket(Relation rel,
 
 	if (RelationNeedsWAL(rel))
 	{
-		XLogRecPtr	recptr;
 		xl_hash_split_complete xlrec;
 
 		xlrec.old_bucket_flag = oopaque->hasho_flag;
@@ -1310,10 +1312,12 @@ _hash_splitbucket(Relation rel,
 		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
-
-		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
-		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+	PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
 
 	END_CRIT_SECTION();
 
-- 
2.53.0



  [application/octet-stream] v13-0001-Extract-fake-LSN-infrastructure-from-GiST-index-.patch (16.6K, 5-v13-0001-Extract-fake-LSN-infrastructure-from-GiST-index-.patch)
From 459c5f6dc411b40a8ad9a546fa06e799d00a3118 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sun, 18 Jan 2026 11:18:11 -0500
Subject: [PATCH v13 01/19] Extract fake LSN infrastructure from GiST index AM.

Extract utility functions used by GiST to generate fake LSNs so that
other index AMs can reuse this infrastructure to generate fake LSNs.

Preparation for an upcoming commit that will add the new amgetbatch
interface.  That commit will change the rules around holding on to
buffer pins on leaf pages in unlogged nbtree indexes (actually, in all
cases barring scans that use a non-MVCC snapshot).  Another preparatory
commit will add fake LSN support to nbtree ahead of the amgetbatch
commit.

Bump XLOG_PAGE_MAGIC due to XLOG_GIST_ASSIGN_LSN becoming
XLOG_ASSIGN_LSN.

Author: Peter Geoghegan <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
---
 src/include/access/gist_private.h       |  4 --
 src/include/access/gistxlog.h           |  2 +-
 src/include/access/xlog.h               |  1 +
 src/include/access/xloginsert.h         |  2 +
 src/include/catalog/pg_control.h        |  2 +-
 src/backend/access/gist/gist.c          |  6 +--
 src/backend/access/gist/gistutil.c      | 50 -------------------
 src/backend/access/gist/gistvacuum.c    |  8 +--
 src/backend/access/gist/gistxlog.c      | 21 --------
 src/backend/access/rmgrdesc/gistdesc.c  |  6 ---
 src/backend/access/rmgrdesc/xlogdesc.c  |  7 +++
 src/backend/access/transam/xlog.c       | 28 +++++++++++
 src/backend/access/transam/xloginsert.c | 65 +++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c     |  6 +--
 14 files changed, 115 insertions(+), 93 deletions(-)

diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 552f605c0..44514f1cb 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -457,8 +457,6 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
 								BlockNumber origrlink, GistNSN orignsn,
 								Buffer leftchildbuf, bool markfollowright);
 
-extern XLogRecPtr gistXLogAssignLSN(void);
-
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
@@ -531,8 +529,6 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 							 GISTENTRY *entry2, bool isnull2,
 							 Datum *dst, bool *dstisnull);
 
-extern XLogRecPtr gistGetFakeLSN(Relation rel);
-
 /* gistvacuum.c */
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index d3d1c6549..1c2cf6e81 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,7 +26,7 @@
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
  /* #define XLOG_GIST_CREATE_INDEX		 0x50 */	/* not used anymore */
 #define XLOG_GIST_PAGE_DELETE		0x60
-#define XLOG_GIST_ASSIGN_LSN		0x70	/* nop, assign new LSN */
+ /* #define XLOG_GIST_ASSIGN_LSN		 0x70 */	/* not used anymore */
 
 /*
  * Backup Blk 0: updated page.
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index fdfb57246..553d6fc9c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -258,6 +258,7 @@ extern bool CreateRestartPoint(int flags);
 extern WALAvailability GetWALAvailability(XLogRecPtr targetLSN);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr XLogRestorePoint(const char *rpName);
+extern XLogRecPtr XLogAssignLSN(void);
 extern void UpdateFullPageWrites(void);
 extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
 extern XLogRecPtr GetRedoRecPtr(void);
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 16ebc76e7..91dfbd562 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -64,6 +64,8 @@ extern void log_newpage_range(Relation rel, ForkNumber forknum,
 							  BlockNumber startblk, BlockNumber endblk, bool page_std);
 extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer, bool buffer_std);
 
+extern XLogRecPtr XLogGetFakeLSN(Relation rel);
+
 extern void InitXLogInsert(void);
 
 #endif							/* XLOGINSERT_H */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 7503db1af..77a661e81 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,7 +78,7 @@ typedef struct CheckPoint
 #define XLOG_END_OF_RECOVERY			0x90
 #define XLOG_FPI_FOR_HINT				0xA0
 #define XLOG_FPI						0xB0
-/* 0xC0 is used in Postgres 9.5-11 */
+#define XLOG_ASSIGN_LSN					0xC0
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
 #define XLOG_CHECKPOINT_REDO			0xE0
 #define XLOG_LOGICAL_DECODING_STATUS_CHANGE	0xF0
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index dfffce3e3..8565e225b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -517,7 +517,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = XLogGetFakeLSN(rel);
 		}
 
 		for (ptr = dist; ptr; ptr = ptr->next)
@@ -594,7 +594,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 										leftchildbuf);
 			}
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = XLogGetFakeLSN(rel);
 		}
 		PageSetLSN(page, recptr);
 
@@ -1733,7 +1733,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			PageSetLSN(page, recptr);
 		}
 		else
-			PageSetLSN(page, gistGetFakeLSN(rel));
+			PageSetLSN(page, XLogGetFakeLSN(rel));
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 27972fad2..0f58f6187 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1007,56 +1007,6 @@ gistproperty(Oid index_oid, int attno,
 	return true;
 }
 
-/*
- * Some indexes are not WAL-logged, but we need LSNs to detect concurrent page
- * splits anyway. This function provides a fake sequence of LSNs for that
- * purpose.
- */
-XLogRecPtr
-gistGetFakeLSN(Relation rel)
-{
-	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
-	{
-		/*
-		 * Temporary relations are only accessible in our session, so a simple
-		 * backend-local counter will do.
-		 */
-		static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
-		return counter++;
-	}
-	else if (RelationIsPermanent(rel))
-	{
-		/*
-		 * WAL-logging on this relation will start after commit, so its LSNs
-		 * must be distinct numbers smaller than the LSN at the next commit.
-		 * Emit a dummy WAL record if insert-LSN hasn't advanced after the
-		 * last call.
-		 */
-		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
-		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
-
-		/* Shouldn't be called for WAL-logging relations */
-		Assert(!RelationNeedsWAL(rel));
-
-		/* No need for an actual record if we already have a distinct LSN */
-		if (XLogRecPtrIsValid(lastlsn) && lastlsn == currlsn)
-			currlsn = gistXLogAssignLSN();
-
-		lastlsn = currlsn;
-		return currlsn;
-	}
-	else
-	{
-		/*
-		 * Unlogged relations are accessible from other backends, and survive
-		 * (clean) restarts. GetFakeLSNForUnloggedRel() handles that for us.
-		 */
-		Assert(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
-		return GetFakeLSNForUnloggedRel();
-	}
-}
-
 /*
  * This is a stratnum translation support function for GiST opclasses that use
  * the RT*StrategyNumber constants.
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 9e714980d..686a04180 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,7 +16,7 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
-#include "access/transam.h"
+#include "access/xloginsert.h"
 #include "commands/vacuum.h"
 #include "lib/integerset.h"
 #include "miscadmin.h"
@@ -182,7 +182,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(rel))
 		vstate.startNSN = GetInsertRecPtr();
 	else
-		vstate.startNSN = gistGetFakeLSN(rel);
+		vstate.startNSN = XLogGetFakeLSN(rel);
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -413,7 +413,7 @@ restart:
 				PageSetLSN(page, recptr);
 			}
 			else
-				PageSetLSN(page, gistGetFakeLSN(rel));
+				PageSetLSN(page, XLogGetFakeLSN(rel));
 
 			END_CRIT_SECTION();
 
@@ -707,7 +707,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(info->index))
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
-		recptr = gistGetFakeLSN(info->index);
+		recptr = XLogGetFakeLSN(info->index);
 	PageSetLSN(parentPage, recptr);
 	PageSetLSN(leafPage, recptr);
 
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index c78383849..ae538dc81 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -421,9 +421,6 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_PAGE_DELETE:
 			gistRedoPageDelete(record);
 			break;
-		case XLOG_GIST_ASSIGN_LSN:
-			/* nop. See gistGetFakeLSN(). */
-			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -567,24 +564,6 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
 	return recptr;
 }
 
-/*
- * Write an empty XLOG record to assign a distinct LSN.
- */
-XLogRecPtr
-gistXLogAssignLSN(void)
-{
-	int			dummy = 0;
-
-	/*
-	 * Records other than XLOG_SWITCH must have content. We use an integer 0
-	 * to follow the restriction.
-	 */
-	XLogBeginInsert();
-	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
-	XLogRegisterData(&dummy, sizeof(dummy));
-	return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
-}
-
 /*
  * Write XLOG record about reuse of a deleted page.
  */
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 79a839cc2..67789e025 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,9 +80,6 @@ gist_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_GIST_PAGE_DELETE:
 			out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
 			break;
-		case XLOG_GIST_ASSIGN_LSN:
-			/* No details to write out */
-			break;
 	}
 }
 
@@ -108,9 +105,6 @@ gist_identify(uint8 info)
 		case XLOG_GIST_PAGE_DELETE:
 			id = "PAGE_DELETE";
 			break;
-		case XLOG_GIST_ASSIGN_LSN:
-			id = "ASSIGN_LSN";
-			break;
 	}
 
 	return id;
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index ff078f222..9044b9521 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -175,6 +175,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		memcpy(&enabled, rec, sizeof(bool));
 		appendStringInfoString(buf, enabled ? "true" : "false");
 	}
+	else if (info == XLOG_ASSIGN_LSN)
+	{
+		/* no further information to print */
+	}
 }
 
 const char *
@@ -229,6 +233,9 @@ xlog_identify(uint8 info)
 		case XLOG_LOGICAL_DECODING_STATUS_CHANGE:
 			id = "LOGICAL_DECODING_STATUS_CHANGE";
 			break;
+		case XLOG_ASSIGN_LSN:
+			id = "ASSIGN_LSN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b9b678f37..92e44a501 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8224,6 +8224,30 @@ XLogRestorePoint(const char *rpName)
 	return RecPtr;
 }
 
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ *
+ * This is used by some index AMs when building indexes on permanent relations
+ * with wal_level=minimal.  In that scenario, WAL-logging will start after
+ * commit, but the index AM needs distinct LSNs to detect concurrent page
+ * modifications.  When the current WAL insert position hasn't advanced since
+ * the last call, we emit a dummy record to ensure we get a new, distinct LSN.
+ */
+XLogRecPtr
+XLogAssignLSN(void)
+{
+	int			dummy = 0;
+
+	/*
+	 * Records other than XLOG_SWITCH must have content.  We use an integer 0
+	 * to satisfy this restriction.
+	 */
+	XLogBeginInsert();
+	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+	XLogRegisterData(&dummy, sizeof(dummy));
+	return XLogInsert(RM_XLOG_ID, XLOG_ASSIGN_LSN);
+}
+
 /*
  * Check if any of the GUC parameters that are critical for hot standby
  * have changed, and update the value in pg_control file if necessary.
@@ -8591,6 +8615,10 @@ xlog_redo(XLogReaderState *record)
 	{
 		/* nothing to do here, handled in xlogrecovery.c */
 	}
+	else if (info == XLOG_ASSIGN_LSN)
+	{
+		/* nothing to do here, see XLogGetFakeLSN() */
+	}
 	else if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT)
 	{
 		/*
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index ac3c1a783..4e049982f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -41,6 +41,7 @@
 #include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
+#include "utils/rel.h"
 
 /*
  * Guess the maximum buffer size required to store a compressed version of
@@ -547,6 +548,70 @@ XLogSimpleInsertInt64(RmgrId rmid, uint8 info, int64 value)
 	return XLogInsert(rmid, info);
 }
 
+/*
+ * XLogGetFakeLSN - get a fake LSN for an index page that isn't WAL-logged.
+ *
+ * Some index AMs use LSNs to detect concurrent page modifications, but not
+ * all index pages are WAL-logged.  This function provides a sequence of fake
+ * LSNs for that purpose.
+ *
+ * The behavior depends on the relation's persistence:
+ *
+ * - For temporary relations, we use a simple backend-local counter since
+ *   temporary relations are only accessible within our session.
+ *
+ * - For permanent relations when WAL-logging is disabled (e.g., during index
+ *   creation with wal_level=minimal), we use the current WAL insert position.
+ *   If the insert position hasn't advanced since the last call, we emit a
+ *   dummy WAL record via XLogAssignLSN() to ensure we get a distinct LSN.
+ *
+ * - For unlogged relations, we use the global fake LSN counter maintained
+ *   by GetFakeLSNForUnloggedRel().
+ */
+XLogRecPtr
+XLogGetFakeLSN(Relation rel)
+{
+	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+	{
+		/*
+		 * Temporary relations are only accessible in our session, so a simple
+		 * backend-local counter will do.
+		 */
+		static XLogRecPtr counter = FirstNormalUnloggedLSN;
+
+		return counter++;
+	}
+	else if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
+	{
+		/*
+		 * Unlogged relations are accessible from other backends, and survive
+		 * (clean) restarts.  GetFakeLSNForUnloggedRel() handles that for us.
+		 */
+		return GetFakeLSNForUnloggedRel();
+	}
+	else
+	{
+		/*
+		 * WAL-logging on this relation will start after commit, so its LSNs
+		 * must be distinct numbers smaller than the LSN at the next commit.
+		 * Emit a dummy WAL record if insert-LSN hasn't advanced after the
+		 * last call.
+		 */
+		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
+
+		Assert(!RelationNeedsWAL(rel));
+		Assert(RelationIsPermanent(rel));
+
+		/* No need for an actual record if we already have a distinct LSN */
+		if (XLogRecPtrIsValid(lastlsn) && lastlsn == currlsn)
+			currlsn = XLogAssignLSN();
+
+		lastlsn = currlsn;
+		return currlsn;
+	}
+}
+
 /*
  * Assemble a WAL record from the registered data and buffers into an
  * XLogRecData chain, ready for insertion with XLogInsertRecord().
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f30a2537..00bc60952 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4462,9 +4462,9 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
 	 * LSNs since we never emit WAL records for them, and therefore flushing
 	 * up through the buffer LSN would be useless, but harmless.  However,
-	 * GiST indexes use LSNs internally to track page-splits, and therefore
-	 * unlogged GiST pages bear "fake" LSNs generated by
-	 * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake
+	 * some index AMs use LSNs internally to detect concurrent page
+	 * modifications, and therefore unlogged index pages bear "fake" LSNs
+	 * generated by XLogGetFakeLSN.  It is unlikely but possible that the fake
 	 * LSN counter could advance past the WAL insertion point; and if it did
 	 * happen, attempting to flush WAL through that location would fail, with
 	 * disastrous system-wide consequences.  To make sure that can't happen,
-- 
2.53.0
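
The "permanent relation, WAL-logging disabled" branch of XLogGetFakeLSN() is the least obvious of the three, so here is a minimal standalone sketch of its logic. It is not PostgreSQL code: FakeLsn, mock_insert_pos and mock_assign_lsn() are hypothetical stand-ins for XLogRecPtr, the WAL insert position and XLogAssignLSN(), and 0 plays the role of InvalidXLogRecPtr. The temp-relation counter is included for comparison.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t FakeLsn;		/* hypothetical stand-in for XLogRecPtr */

/* Hypothetical stand-in for the current WAL insert position. */
static FakeLsn mock_insert_pos = 1000;

/*
 * Stand-in for XLogAssignLSN(): pretend that a dummy record was inserted,
 * which advances the insert position, and return the new position.
 */
static FakeLsn
mock_assign_lsn(void)
{
	mock_insert_pos += 8;
	return mock_insert_pos;
}

/* Temp-relation case: a plain backend-local counter is enough. */
static FakeLsn
fake_lsn_temp(void)
{
	static FakeLsn counter = 1000;

	return counter++;
}

/*
 * Permanent relation that is not WAL-logged yet: reuse the insert position,
 * and only force it forward when it has not moved since the last call.
 */
static FakeLsn
fake_lsn_permanent(void)
{
	static FakeLsn lastlsn = 0; /* 0 means "no previous call" */
	FakeLsn		currlsn = mock_insert_pos;

	assert(currlsn != 0);

	/* No dummy record needed if the position is already distinct */
	if (lastlsn != 0 && lastlsn == currlsn)
		currlsn = mock_assign_lsn();

	lastlsn = currlsn;
	return currlsn;
}
```

In this sketch, back-to-back calls with no intervening WAL activity still return strictly increasing values, which is the property callers rely on when they compare page LSNs.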



  [application/octet-stream] v13-0017-WIP-aio-io_uring-Use-IO-size-not-IO-queue-to-tri.patch (4.1K, 6-v13-0017-WIP-aio-io_uring-Use-IO-size-not-IO-queue-to-tri.patch)
From a425314faeeb06ad4f884aaf487a9b3ed286f1af Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 20:23:55 -0500
Subject: [PATCH v13 17/19] WIP: aio: io_uring: Use IO size not IO queue to
 trigger async processing

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/method_io_uring.c | 57 ++++++++++++++---------
 1 file changed, 35 insertions(+), 22 deletions(-)

diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
index 4e1244fa1..041b72629 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -407,7 +407,6 @@ static int
 pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	struct io_uring *uring_instance = &pgaio_my_uring_context->io_uring_ring;
-	int			in_flight_before = dclist_count(&pgaio_my_backend->in_flight_ios);
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -423,27 +422,6 @@ pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 
 		pgaio_io_prepare_submit(ioh);
 		pgaio_uring_sq_from_io(ioh, sqe);
-
-		/*
-		 * io_uring executes IO in process context if possible. That's
-		 * generally good, as it reduces context switching. When performing a
-		 * lot of buffered IO that means that copying between page cache and
-		 * userspace memory happens in the foreground, as it can't be
-		 * offloaded to DMA hardware as is possible when using direct IO. When
-		 * executing a lot of buffered IO this causes io_uring to be slower
-		 * than worker mode, as worker mode parallelizes the copying. io_uring
-		 * can be told to offload work to worker threads instead.
-		 *
-		 * If an IO is buffered IO and we already have IOs in flight or
-		 * multiple IOs are being submitted, we thus tell io_uring to execute
-		 * the IO in the background. We don't do so for the first few IOs
-		 * being submitted as executing in this process' context has lower
-		 * latency.
-		 */
-		if (in_flight_before > 4 && (ioh->flags & PGAIO_HF_BUFFERED))
-			io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
-
-		in_flight_before++;
 	}
 
 	while (true)
@@ -707,6 +685,7 @@ static void
 pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 {
 	struct iovec *iov;
+	size_t		io_size = 0;
 
 	switch ((PgAioOp) ioh->op)
 	{
@@ -719,6 +698,8 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 								   iov->iov_base,
 								   iov->iov_len,
 								   ioh->op_data.read.offset);
+
+				io_size = iov->iov_len;
 			}
 			else
 			{
@@ -728,7 +709,39 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 									ioh->op_data.read.iov_length,
 									ioh->op_data.read.offset);
 
+				for (int i = 0; i < ioh->op_data.read.iov_length; i++, iov++)
+					io_size += iov->iov_len;
 			}
+
+
+			/*
+			 * io_uring executes IO in process context if possible. That's
+			 * generally good, as it reduces context switching. When
+			 * performing a lot of buffered IO that means that copying between
+			 * page cache and userspace memory happens in the foreground, as
+			 * it can't be offloaded to DMA hardware as is possible when using
+			 * direct IO. When executing a lot of buffered IO this causes
+			 * io_uring to be slower than worker mode, as worker mode
+			 * parallelizes the copying. io_uring can be told to offload work
+			 * to worker threads instead.
+			 *
+			 * If the IOs are small, there is no benefit from forcing things
+			 * into the background, as the overhead from context switching is
+			 * higher than the gain.  Therefore we use the size of the read as
+			 * a heuristic.
+			 *
+			 * XXX: We used to not do this for the first few IOs in flight,
+			 * but now we have a heuristic preventing deeper IO queues if IOs
+			 * finish in time, which will often prevent us from ever reaching
+			 * queues that deep.  Maybe there's a better way?
+			 *
+			 * XXX: Need to evaluate at what number of blocks IOSQE_ASYNC
+			 * starts to make sense.
+			 */
+			if (io_size >= (BLCKSZ * 4) &&
+				(ioh->flags & PGAIO_HF_BUFFERED))
+				io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
+
 			break;
 
 		case PGAIO_OP_WRITEV:
-- 
2.53.0
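
The size heuristic from this patch can be sketched in isolation as follows. This is only an illustration under stated assumptions: SKETCH_BLCKSZ assumes the default 8kB block size, the sketch_* names are hypothetical, and the "buffered" argument stands in for the PGAIO_HF_BUFFERED flag test. The loop visits exactly iovcnt entries of the vector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

#define SKETCH_BLCKSZ			8192	/* assumed: default BLCKSZ */
#define SKETCH_ASYNC_MIN_BLOCKS	4		/* threshold used by the WIP patch */

/* Sum the lengths of all iovcnt entries of an IO vector. */
static size_t
sketch_total_io_size(const struct iovec *iov, int iovcnt)
{
	size_t		total = 0;

	assert(iovcnt >= 0);

	for (int i = 0; i < iovcnt; i++)
		total += iov[i].iov_len;

	return total;
}

/*
 * Decide whether a read should be pushed to the background: only buffered
 * reads of at least SKETCH_ASYNC_MIN_BLOCKS blocks qualify.
 */
static bool
sketch_should_force_async(const struct iovec *iov, int iovcnt, bool buffered)
{
	size_t		threshold = (size_t) SKETCH_BLCKSZ * SKETCH_ASYNC_MIN_BLOCKS;

	return buffered && sketch_total_io_size(iov, iovcnt) >= threshold;
}
```

With these assumptions a two-block buffered read stays in process context, while a four-block buffered read would get IOSQE_ASYNC set.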



  [application/octet-stream] v13-0015-WIP-read_stream-Only-increase-distance-when-wait.patch (2.3K, 7-v13-0015-WIP-read_stream-Only-increase-distance-when-wait.patch)
From 82c33c2363e6caa3f488fd92782a0c24157bf637 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 18:00:53 -0500
Subject: [PATCH v13 15/19] WIP: read_stream: Only increase distance when
 waiting for IO

This avoids increasing the distance to the maximum in cases where the IO
subsystem is already keeping up.

TODO: This might be too aggressive without a subsequent patch that reduces how
often we decrease the distance.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 3667d67ab..e28ab5de0 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -931,22 +931,36 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	{
 		int16		io_index = stream->oldest_io_index;
 		int32		distance;	/* wider temporary value, clamped below */
+		bool		needed_wait;
 
 		/* Sanity check that we still agree on the buffers. */
 		Assert(stream->ios[io_index].op.buffers ==
 			   &stream->buffers[oldest_buffer_index]);
 
-		WaitReadBuffers(&stream->ios[io_index].op);
+		needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
 
 		Assert(stream->ios_in_progress > 0);
 		stream->ios_in_progress--;
 		if (++stream->oldest_io_index == stream->max_ios)
 			stream->oldest_io_index = 0;
 
-		/* Look-ahead distance ramps up rapidly after we do I/O. */
-		distance = stream->distance * 2;
-		distance = Min(distance, stream->max_pinned_buffers);
-		stream->distance = distance;
+		/*
+		 * Look-ahead distance ramps up rapidly after we needed to wait for
+		 * IO. We only increase the distance when we needed to wait, to avoid
+		 * increasing the distance further than necessary, as unnecessarily
+		 * pinning many buffers can be costly.
+		 *
+		 * NB: Must not increase the distance if we reached the end of the
+		 * stream.
+		 */
+		if (stream->distance > 0 && needed_wait)
+		{
+			distance = stream->distance * 2;
+			if (distance && distance < PG_INT16_MAX)
+				distance++;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
 
 		/*
 		 * If we've reached the first block of a sequential region we're
-- 
2.53.0
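
For reference, the ramp-up rule in this patch can be modelled as a pure function. This is a sketch rather than the read_stream.c code itself: sketch_next_distance() is a hypothetical name, and the parameters stand in for stream->distance, stream->max_pinned_buffers and the value returned by WaitReadBuffers().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the next look-ahead distance.  The distance only grows when the
 * caller actually had to wait for IO, and never grows once it is zero.
 */
static int16_t
sketch_next_distance(int16_t distance, int16_t max_pinned, bool needed_wait)
{
	int32_t		wide;			/* wider temporary value, clamped below */

	assert(max_pinned > 0);

	if (distance <= 0 || !needed_wait)
		return distance;

	wide = (int32_t) distance * 2;
	if (wide < INT16_MAX)
		wide++;
	if (wide > max_pinned)
		wide = max_pinned;

	return (int16_t) wide;
}
```

Starting from a distance of 1 and waiting on every IO, this yields 3, 7, 15 and so on until the cap is hit; when IO completes in time the distance stays where it is.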



  [application/octet-stream] v13-0013-WIP-aio-io_uring-Allow-IO-methods-to-check-if-IO.patch (4.6K, 8-v13-0013-WIP-aio-io_uring-Allow-IO-methods-to-check-if-IO.patch)
From a2330fd7539188f1c5bf0b112da67217ef9cb4c9 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 16:40:35 -0500
Subject: [PATCH v13 13/19] WIP: aio: io_uring: Allow IO methods to check if IO
 completed in the background

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/aio_internal.h        | 15 ++++++++
 src/backend/storage/aio/aio.c             | 15 ++++++++
 src/backend/storage/aio/method_io_uring.c | 47 +++++++++++++++++++++++
 3 files changed, 77 insertions(+)

diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 5feea15be..33e1e2dc0 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -328,6 +328,21 @@ typedef struct IoMethodOps
 	 */
 	void		(*wait_one) (PgAioHandle *ioh,
 							 uint64 ref_generation);
+
+	/* ---
+	 * Check if IO has already completed. Optional.
+	 *
+	 * Some IO methods need to poll a kernel object to see if IO has already
+	 * completed in the background. This callback allows them to do so.
+	 *
+	 * This callback must not wait for IO to complete. Waiting for short-lived
+	 * locks is allowed, although not desirable. It is ok from a correctness
+	 * perspective to not process any/all available completions; doing so
+	 * can just lead to inferior performance.
+	 * ---
+	 */
+	void		(*check_one) (PgAioHandle *ioh,
+							  uint64 ref_generation);
 } IoMethodOps;
 
 
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index e4ae3031f..4e742038d 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -1019,6 +1019,21 @@ pgaio_wref_check_done(PgAioWaitRef *iow)
 
 	am_owner = ioh->owner_procno == MyProcNumber;
 
+	/*
+	 * If the IO is not executing synchronously, allow the IO method to check
+	 * if the IO already has completed.
+	 */
+	if (pgaio_method_ops->check_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS))
+	{
+		pgaio_method_ops->check_one(ioh, ref_generation);
+
+		if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+			return true;
+
+		if (state == PGAIO_HS_IDLE)
+			return true;
+	}
+
 	if (state == PGAIO_HS_COMPLETED_SHARED ||
 		state == PGAIO_HS_COMPLETED_LOCAL)
 	{
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
index ed6e71bcd..4e1244fa1 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -54,6 +54,7 @@ static void pgaio_uring_shmem_init(bool first_time);
 static void pgaio_uring_init_backend(void);
 static int	pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
 static void pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation);
+static void pgaio_uring_check_one(PgAioHandle *ioh, uint64 ref_generation);
 
 /* helper functions */
 static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe);
@@ -75,6 +76,7 @@ const IoMethodOps pgaio_uring_ops = {
 
 	.submit = pgaio_uring_submit,
 	.wait_one = pgaio_uring_wait_one,
+	.check_one = pgaio_uring_check_one,
 };
 
 /*
@@ -656,6 +658,51 @@ pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation)
 				waited);
 }
 
+static void
+pgaio_uring_check_one(PgAioHandle *ioh, uint64 ref_generation)
+{
+	ProcNumber	owner_procno = ioh->owner_procno;
+	PgAioUringContext *owner_context = &pgaio_uring_contexts[owner_procno];
+	int			waited = 0;
+
+	/*
+	 * This check is not reliable when not holding the completion lock, but
+	 * it's a useful cheap pre-check to see if it's worth trying to get the
+	 * completion lock.
+	 */
+	if (!io_uring_cq_ready(&owner_context->io_uring_ring))
+		return;
+
+	/*
+	 * If the completion lock is currently held, the holder will likely
+	 * process any pending completions, so give up.
+	 */
+	if (!LWLockConditionalAcquire(&owner_context->completion_lock, LW_EXCLUSIVE))
+		return;
+
+	pgaio_debug_io(DEBUG3, ioh,
+				   "check_one io_gen: %" PRIu64 ", ref_gen: %" PRIu64 ", cycle %d",
+				   ioh->generation,
+				   ref_generation,
+				   waited);
+
+	/*
+	 * Recheck if there are any completions; another backend could have
+	 * processed them since we checked above, or our unlocked pre-check could
+	 * have been reading outdated values.
+	 *
+	 * It is possible that the IO handle has been reused since the start of
+	 * the call, but now that we have the lock, we can just as well drain all
+	 * completions.
+	 */
+	if (io_uring_cq_ready(&owner_context->io_uring_ring))
+	{
+		pgaio_uring_drain_locked(owner_context);
+	}
+
+	LWLockRelease(&owner_context->completion_lock);
+}
+
 static void
 pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 {
-- 
2.53.0
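
The shape of pgaio_uring_check_one() (cheap unlocked pre-check, conditional lock acquisition, recheck under the lock) can be shown without io_uring or LWLocks. The sketch below uses C11 atomics purely as stand-ins: SketchQueue and the sketch_* functions are hypothetical, "ready" plays the role of io_uring_cq_ready(), and the atomic_flag plays the role of the completion lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct SketchQueue
{
	atomic_int	ready;			/* completions waiting to be reaped */
	atomic_flag lock;			/* stand-in for the completion lock */
	int			reaped;			/* completions processed; guarded by lock */
} SketchQueue;

static void
sketch_queue_init(SketchQueue *q)
{
	atomic_init(&q->ready, 0);
	atomic_flag_clear(&q->lock);
	q->reaped = 0;
}

/*
 * Non-blocking check.  Returns true if this call drained any completions.
 * Never waits: if the lock is held, whoever holds it is expected to drain.
 */
static bool
sketch_check_one(SketchQueue *q)
{
	int			n;

	/* Unreliable but cheap pre-check, done without the lock */
	if (atomic_load(&q->ready) == 0)
		return false;

	/* Conditional acquire: give up immediately if the lock is taken */
	if (atomic_flag_test_and_set(&q->lock))
		return false;

	/* Recheck under the lock; someone else may have drained already */
	n = atomic_exchange(&q->ready, 0);
	if (n > 0)
		q->reaped += n;

	atomic_flag_clear(&q->lock);
	return n > 0;
}
```

The point of the pattern is that a caller which only wants to know "is it done yet?" never ends up sleeping behind another backend that is already processing completions.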



  [application/octet-stream] v13-0014-bufmgr-Return-whether-WaitReadBuffers-needed-to-.patch (2.8K, 9-v13-0014-bufmgr-Return-whether-WaitReadBuffers-needed-to-.patch)
From de1a37d6acabc89e641fd60bc060f09cf463dc3f Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 16:50:50 -0500
Subject: [PATCH v13 14/19] bufmgr: Return whether WaitReadBuffers() needed to
 wait

In a subsequent commit read_stream.c will use this as an input to the read
ahead distance.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/bufmgr.h        |  2 +-
 src/backend/storage/buffer/bufmgr.c | 18 +++++++++++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8e1baf691..33bd4802c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -250,7 +250,7 @@ extern bool StartReadBuffers(ReadBuffersOperation *operation,
 							 BlockNumber blockNum,
 							 int *nblocks,
 							 int flags);
-extern void WaitReadBuffers(ReadBuffersOperation *operation);
+extern bool WaitReadBuffers(ReadBuffersOperation *operation);
 
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3c48ae1f8..0806505ff 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1731,12 +1731,20 @@ ProcessReadBuffersResult(ReadBuffersOperation *operation)
 	Assert(operation->nblocks_done <= operation->nblocks);
 }
 
-void
+/*
+ * Wait for the IO operation initiated by StartReadBuffers() et al to
+ * complete.
+ *
+ * Returns true if we needed to wait for the IO operation to complete, i.e. if
+ * it had not already completed by the time of this call.
+ */
+bool
 WaitReadBuffers(ReadBuffersOperation *operation)
 {
 	PgAioReturn *aio_ret = &operation->io_return;
 	IOContext	io_context;
 	IOObject	io_object;
+	bool		needed_wait = false;
 
 	if (operation->persistence == RELPERSISTENCE_TEMP)
 	{
@@ -1798,6 +1806,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
 
 				pgaio_wref_wait(&operation->io_wref);
+				needed_wait = true;
 
 				/*
 				 * The IO operation itself was already counted earlier, in
@@ -1850,6 +1859,12 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * If the IO completed only partially, we need to perform additional
+		 * work; consider that a form of having had to wait.
+		 */
+		needed_wait = true;
+
 		/*
 		 * This may only complete the IO partially, either because some
 		 * buffers were already valid, or because of a partial read.
@@ -1866,6 +1881,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	CheckReadBuffersOperation(operation, true);
 
 	/* NB: READ_DONE tracepoint was already executed in completion callback */
+	return needed_wait;
 }
 
 /*
-- 
2.53.0
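
For readers following along, the contract of the new return value can be modelled outside the tree. The following is only a toy sketch: ToyReadOp, toy_wait_read() and toy_count_stalls() are hypothetical names, not PostgreSQL APIs, and nothing in this patch says how callers consume the result. It shows just the "true means we had to wait" convention implied by needed_wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a read operation; not a PostgreSQL type. */
typedef struct ToyReadOp
{
	bool		io_done;		/* has the simulated IO already finished? */
} ToyReadOp;

/*
 * Toy analogue of the patched WaitReadBuffers(): returns true only if the
 * caller had to wait, i.e. the IO had not completed when we were called.
 */
static bool
toy_wait_read(ToyReadOp *op)
{
	bool		needed_wait = false;

	if (!op->io_done)
	{
		op->io_done = true;		/* pretend we blocked until completion */
		needed_wait = true;
	}

	assert(op->io_done);
	return needed_wait;
}

/*
 * Hypothetical consumer: count how many of n operations stalled.  What a
 * real caller does with this information is an assumption of this sketch.
 */
static int
toy_count_stalls(ToyReadOp *ops, int n)
{
	int			stalls = 0;

	for (int i = 0; i < n; i++)
	{
		if (toy_wait_read(&ops[i]))
			stalls++;
	}
	return stalls;
}
```

With three operations of which only the second is still pending, toy_count_stalls() reports one stall; calling it again on the same array reports none, since everything has completed by then.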



  [application/octet-stream] v13-0012-WIP-read_stream-Issue-IO-synchronously-while-in-.patch (2.0K, 10-v13-0012-WIP-read_stream-Issue-IO-synchronously-while-in-.patch)
From da79f0b2e65f5262e986d300ba94fc1d2a64e73c Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 16:25:41 -0500
Subject: [PATCH v13 12/19] WIP: read_stream: Issue IO synchronously while in
 fast path

While in fast-path, execute any IO that we might encounter
synchronously. Because we are, right now, not reading ahead, handing the
occasional IO to workers would incur dispatch overhead without any
realistic chance of the IO completing before we need it.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index cd54c1a74..3667d67ab 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -833,6 +833,19 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			if (stream->advice_enabled)
 				flags |= READ_BUFFERS_ISSUE_ADVICE;
 
+			/*
+			 * While in fast-path, execute any IO that we might encounter
+			 * synchronously. Because we are, right now, not reading ahead,
+			 * dispatching any occasional IO to workers would have the
+			 * overhead of dispatching to workers, without any realistic
+			 * chance of the IO completing before we need it. We will switch
+			 * to non-synchronous IO after this.
+			 *
+			 * XXX: Should we do so only for worker, or also io_uring? There's
+			 * not much dispatch overhead with io_uring, compared to worker...
+			 */
+			flags |= READ_BUFFERS_SYNCHRONOUSLY;
+
 			/*
 			 * Pin a buffer for the next call.  Same buffer entry, and
 			 * arbitrary I/O entry (they're all free).  We don't have to
@@ -860,6 +873,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			stream->ios_in_progress = 1;
 			stream->ios[0].buffer_index = oldest_buffer_index;
 			stream->seq_blocknum = next_blocknum + 1;
+
+			/* FIXME: it would probably be worth issuing readahead here */
 		}
 		else
 		{
-- 
2.53.0



  [application/octet-stream] v13-0011-Don-t-wait-for-already-in-progress-IO.patch (20.6K, 11-v13-0011-Don-t-wait-for-already-in-progress-IO.patch)
From 8ee8ea2673f4a22319a45293a4c3e5188494e899 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 23 Jan 2026 14:00:31 -0500
Subject: [PATCH v13 11/19] Don't wait for already in-progress IO
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When a backend attempts to start a read on a buffer and finds that I/O
is already in progress, it previously waited for that I/O to complete
before initiating reads for any other buffers. Although the backend must
still wait for the I/O to finish when later acquiring the buffer, it
should not need to wait at read start time. Other buffers may be
available for I/O, and in some workloads this waiting significantly
reduces concurrency.

For example, index scans may repeatedly request the same heap block. If
the backend waits each time it encounters an in-progress read, the
access pattern effectively degenerates into synchronous I/O. By
introducing the concept of foreign I/O operations, a backend can record
the buffer’s wait reference and defer waiting until WaitReadBuffers()
when it actually acquires the buffer.

In rare cases, a backend may still need to wait when starting a read if
it encounters a buffer after another backend has set BM_IO_IN_PROGRESS
but before the buffer descriptor’s wait reference has been set. Such
windows should be brief and uncommon.
---
 src/include/storage/bufmgr.h        |   1 +
 src/backend/storage/buffer/bufmgr.c | 491 ++++++++++++++++++----------
 src/tools/pgindent/typedefs.list    |   1 +
 3 files changed, 325 insertions(+), 168 deletions(-)

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4017896f9..8e1baf691 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -147,6 +147,7 @@ struct ReadBuffersOperation
 	int			flags;
 	int16		nblocks;
 	int16		nblocks_done;
+	bool		foreign_io;
 	PgAioWaitRef io_wref;
 	PgAioReturn io_return;
 };
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 70a5dba73..3c48ae1f8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -185,6 +185,21 @@ typedef struct SMgrSortArray
 	SMgrRelation srel;
 } SMgrSortArray;
 
+
+/*
+ * In AsyncReadBuffers(), when preparing a buffer for reading and setting
+ * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may
+ * already contain the desired block. AsyncReadBuffers() must distinguish
+ * between these cases (and the case where it should initiate I/O) so it can
+ * mark an in-progress buffer as foreign I/O rather than waiting on it.
+ */
+typedef enum PrepareReadBuffer_Status
+{
+	READ_BUFFER_ALREADY_DONE,
+	READ_BUFFER_IN_PROGRESS,
+	READ_BUFFER_READY_FOR_IO,
+} PrepareReadBuffer_Status;
+
 /* GUC variables */
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
@@ -1628,45 +1643,6 @@ CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete)
 #endif
 }
 
-/* helper for ReadBuffersCanStartIO(), to avoid repetition */
-static inline bool
-ReadBuffersCanStartIOOnce(Buffer buffer, bool nowait)
-{
-	if (BufferIsLocal(buffer))
-		return StartLocalBufferIO(GetLocalBufferDescriptor(-buffer - 1),
-								  true, nowait);
-	else
-		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
-}
-
-/*
- * Helper for AsyncReadBuffers that tries to get the buffer ready for IO.
- */
-static inline bool
-ReadBuffersCanStartIO(Buffer buffer, bool nowait)
-{
-	/*
-	 * If this backend currently has staged IO, we need to submit the pending
-	 * IO before waiting for the right to issue IO, to avoid the potential for
-	 * deadlocks (and, more commonly, unnecessary delays for other backends).
-	 */
-	if (!nowait && pgaio_have_staged())
-	{
-		if (ReadBuffersCanStartIOOnce(buffer, true))
-			return true;
-
-		/*
-		 * Unfortunately StartBufferIO() returning false doesn't allow to
-		 * distinguish between the buffer already being valid and IO already
-		 * being in progress. Since IO already being in progress is quite
-		 * rare, this approach seems fine.
-		 */
-		pgaio_submit_staged();
-	}
-
-	return ReadBuffersCanStartIOOnce(buffer, nowait);
-}
-
 /*
  * We track various stats related to buffer hits. Because this is done in a
  * few separate places, this helper exists for convenience.
@@ -1816,7 +1792,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			 *
 			 * we first check if we already know the IO is complete.
 			 */
-			if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
+			if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
 				!pgaio_wref_check_done(&operation->io_wref))
 			{
 				instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
@@ -1835,11 +1811,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				Assert(pgaio_wref_check_done(&operation->io_wref));
 			}
 
-			/*
-			 * We now are sure the IO completed. Check the results. This
-			 * includes reporting on errors if there were any.
-			 */
-			ProcessReadBuffersResult(operation);
+			if (unlikely(operation->foreign_io))
+			{
+				Buffer		buffer = operation->buffers[operation->nblocks_done];
+				BufferDesc *desc = BufferIsLocal(buffer) ?
+					GetLocalBufferDescriptor(-buffer - 1) :
+					GetBufferDescriptor(buffer - 1);
+				uint64		buf_state = pg_atomic_read_u64(&desc->state);
+
+				if (buf_state & BM_VALID)
+				{
+					operation->nblocks_done += 1;
+					Assert(operation->nblocks_done <= operation->nblocks);
+
+					ProcessBufferHit(operation->strategy,
+									 operation->rel, operation->persistence,
+									 operation->smgr, operation->forknum,
+									 operation->blocknum + operation->nblocks_done);
+				}
+			}
+			else
+			{
+				/*
+				 * We now are sure the IO completed. Check the results. This
+				 * includes reporting on errors if there were any.
+				 */
+				ProcessReadBuffersResult(operation);
+			}
 		}
 
 		/*
@@ -1870,6 +1868,159 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	/* NB: READ_DONE tracepoint was already executed in completion callback */
 }
 
+/*
+ * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
+ * avoid an external function call.
+ */
+static PrepareReadBuffer_Status
+PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
+							Buffer buffer)
+{
+	BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
+	uint64		buf_state = pg_atomic_read_u64(&desc->state);
+
+	/* Already valid, no work to do */
+	if (buf_state & BM_VALID)
+	{
+		pgaio_wref_clear(&operation->io_wref);
+		return READ_BUFFER_ALREADY_DONE;
+	}
+
+	pgaio_submit_staged();
+
+	if (pgaio_wref_valid(&desc->io_wref))
+	{
+		operation->io_wref = desc->io_wref;
+		operation->foreign_io = true;
+		return READ_BUFFER_IN_PROGRESS;
+	}
+
+	return READ_BUFFER_READY_FOR_IO;
+}
+
+/*
+ * Try to start IO on the first buffer in a new run of blocks. If AIO is in
+ * progress, be it in this backend or another backend, we just associate the
+ * wait reference with the operation and wait in WaitReadBuffers(). This turns
+ * out to be important for performance in two workloads:
+ *
+ * 1) A read stream that has to read the same block multiple times within the
+ *    readahead distance. This can happen e.g. for the table accesses of an
+ *    index scan.
+ *
+ * 2) Concurrent scans by multiple backends on the same relation.
+ *
+ * If we were to synchronously wait for the in-progress IO, we'd not be able
+ * to keep enough I/O in flight.
+ *
+ * If we do find there is ongoing I/O for the buffer, we set up a 1-block
+ * ReadBuffersOperation that WaitReadBuffers then can wait on.
+ *
+ * It's possible that another backend has started IO on the buffer but not yet
+ * set its wait reference. In this case, we have no choice but to wait for
+ * either the wait reference to be valid or the IO to be done.
+ */
+static PrepareReadBuffer_Status
+PrepareNewReadBufferIO(ReadBuffersOperation *operation,
+					   Buffer buffer)
+{
+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+		return PrepareNewLocalReadBufferIO(operation, buffer);
+
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	desc = GetBufferDescriptor(buffer - 1);
+
+	for (;;)
+	{
+		buf_state = LockBufHdr(desc);
+
+		/* Already valid, no work to do */
+		if (buf_state & BM_VALID)
+		{
+			UnlockBufHdr(desc);
+			pgaio_wref_clear(&operation->io_wref);
+			return READ_BUFFER_ALREADY_DONE;
+		}
+
+		if (buf_state & BM_IO_IN_PROGRESS)
+		{
+			/* Join existing read */
+			if (pgaio_wref_valid(&desc->io_wref))
+			{
+				operation->io_wref = desc->io_wref;
+				operation->foreign_io = true;
+				UnlockBufHdr(desc);
+				return READ_BUFFER_IN_PROGRESS;
+			}
+
+			/*
+			 * If the wait ref is not valid but the IO is in progress, someone
+			 * else started IO but hasn't set the wait ref yet. We have no
+			 * choice but to wait until the wait ref is set or the IO
+			 * completes.
+			 */
+			UnlockBufHdr(desc);
+			pgaio_submit_staged();
+			WaitIO(desc);
+			continue;
+		}
+
+		/*
+		 * No IO in progress and not already valid; we will start IO. It's
+		 * possible that the IO was in progress and never became valid because
+		 * the IO errored out. We'll do the IO ourselves.
+		 */
+		UnlockBufHdrExt(desc, buf_state, BM_IO_IN_PROGRESS, 0, 0);
+		ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+									  BufferDescriptorGetBuffer(desc));
+
+		return READ_BUFFER_READY_FOR_IO;
+	}
+}
+
+
+/*
+ * When building a new IO from multiple buffers, we won't include buffers
+ * that are already valid or already in progress. This function should only be
+ * used for additional adjacent buffers following the head buffer in a new IO.
+ *
+ * Returns true if the buffer was successfully prepared for IO and false if it
+ * is rejected and the read IO should not include this buffer.
+ */
+static bool
+PrepareAdditionalReadBuffer(Buffer buffer)
+{
+	uint64		buf_state;
+	BufferDesc *desc;
+
+	if (BufferIsLocal(buffer))
+	{
+		desc = GetLocalBufferDescriptor(-buffer - 1);
+		buf_state = pg_atomic_read_u64(&desc->state);
+		/* Local buffers don't use BM_IO_IN_PROGRESS */
+		if (buf_state & BM_VALID || pgaio_wref_valid(&desc->io_wref))
+			return false;
+	}
+	else
+	{
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		desc = GetBufferDescriptor(buffer - 1);
+		buf_state = LockBufHdr(desc);
+		if (buf_state & (BM_VALID | BM_IO_IN_PROGRESS))
+		{
+			UnlockBufHdr(desc);
+			return false;
+		}
+		UnlockBufHdrExt(desc, buf_state, BM_IO_IN_PROGRESS, 0, 0);
+		ResourceOwnerRememberBufferIO(CurrentResourceOwner, buffer);
+	}
+
+	return true;
+}
+
 /*
  * Initiate IO for the ReadBuffersOperation
  *
@@ -1903,7 +2054,75 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	void	   *io_pages[MAX_IO_COMBINE_LIMIT];
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		did_start_io;
+	instr_time	io_start;
+	PrepareReadBuffer_Status status;
+
+	/*
+	 * We must get an IO handle before PrepareNewReadBufferIO(), as
+	 * pgaio_io_acquire() might block, which we don't want after setting
+	 * IO_IN_PROGRESS. If we don't need to do the IO, we'll release the
+	 * handle.
+	 *
+	 * If we need to wait for IO before we can get a handle, submit
+	 * already-staged IO first, so that other backends don't need to wait.
+	 * There wouldn't be a deadlock risk, as pgaio_io_acquire() just needs to
+	 * wait for already submitted IO, which doesn't require additional locks,
+	 * but it could still cause undesirable waits.
+	 *
+	 * A secondary benefit is that this would allow us to measure the time in
+	 * pgaio_io_acquire() without causing undue timer overhead in the common,
+	 * non-blocking, case.  However, currently the pgstats infrastructure
+	 * doesn't really allow that, as it a) asserts that an operation can't
+	 * have time without operations b) doesn't have an API to report
+	 * "accumulated" time.
+	 */
+	ioh = pgaio_io_acquire_nb(CurrentResourceOwner, &operation->io_return);
+	if (unlikely(!ioh))
+	{
+		pgaio_submit_staged();
+		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
+	}
+
+	operation->foreign_io = false;
+
+	/* Check if we can start IO on the first to-be-read buffer */
+	if ((status = PrepareNewReadBufferIO(operation, buffers[nblocks_done])) <
+		READ_BUFFER_READY_FOR_IO)
+	{
+		pgaio_io_release(ioh);
+		*nblocks_progress = 1;
+		if (status == READ_BUFFER_ALREADY_DONE)
+		{
+			/*
+			 * Someone else has already completed this block, we're done.
+			 *
+			 * When IO is necessary, ->nblocks_done is updated in
+			 * ProcessReadBuffersResult(), but that is not called if no IO is
+			 * necessary. Thus update here.
+			 */
+			operation->nblocks_done += 1;
+			Assert(operation->nblocks_done <= operation->nblocks);
+
+			/*
+			 * Report and track this as a 'hit' for this backend, even though
+			 * it must have started out as a miss in PinBufferForBlock(). The
+			 * other backend will track this as a 'read'.
+			 */
+			ProcessBufferHit(operation->strategy,
+							 operation->rel, operation->persistence,
+							 operation->smgr, operation->forknum,
+							 operation->blocknum + operation->nblocks_done);
+			return false;
+		}
+
+		/* The IO is already in-progress */
+		Assert(status == READ_BUFFER_IN_PROGRESS);
+		CheckReadBuffersOperation(operation, false);
+		return true;
+	}
+
+	/* We can read in at least the head buffer. */
+	Assert(status == READ_BUFFER_READY_FOR_IO);
 
 	/*
 	 * When this IO is executed synchronously, either because the caller will
@@ -1954,138 +2173,74 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	 */
 	pgstat_prepare_report_checksum_failure(operation->smgr->smgr_rlocator.locator.dbOid);
 
-	/*
-	 * Get IO handle before ReadBuffersCanStartIO(), as pgaio_io_acquire()
-	 * might block, which we don't want after setting IO_IN_PROGRESS.
-	 *
-	 * If we need to wait for IO before we can get a handle, submit
-	 * already-staged IO first, so that other backends don't need to wait.
-	 * There wouldn't be a deadlock risk, as pgaio_io_acquire() just needs to
-	 * wait for already submitted IO, which doesn't require additional locks,
-	 * but it could still cause undesirable waits.
-	 *
-	 * A secondary benefit is that this would allow us to measure the time in
-	 * pgaio_io_acquire() without causing undue timer overhead in the common,
-	 * non-blocking, case.  However, currently the pgstats infrastructure
-	 * doesn't really allow that, as it a) asserts that an operation can't
-	 * have time without operations b) doesn't have an API to report
-	 * "accumulated" time.
-	 */
-	ioh = pgaio_io_acquire_nb(CurrentResourceOwner, &operation->io_return);
-	if (unlikely(!ioh))
-	{
-		pgaio_submit_staged();
-
-		ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
-	}
+	Assert(io_buffers[0] == buffers[nblocks_done]);
+	io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
+	io_buffers_len = 1;
 
 	/*
-	 * Check if we can start IO on the first to-be-read buffer.
-	 *
-	 * If an I/O is already in progress in another backend, we want to wait
-	 * for the outcome: either done, or something went wrong and we will
-	 * retry.
+	 * How many neighboring-on-disk blocks can we scatter-read into other
+	 * buffers at the same time?  In this case we don't wait if we see an I/O
+	 * already in progress.  We already set BM_IO_IN_PROGRESS for the head
+	 * block, so we should get on with that I/O as soon as possible.
 	 */
-	if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
+	for (int i = nblocks_done + 1; i < operation->nblocks; i++)
 	{
-		/*
-		 * Someone else has already completed this block, we're done.
-		 *
-		 * When IO is necessary, ->nblocks_done is updated in
-		 * ProcessReadBuffersResult(), but that is not called if no IO is
-		 * necessary. Thus update here.
-		 */
-		operation->nblocks_done += 1;
-		*nblocks_progress = 1;
+		if (!PrepareAdditionalReadBuffer(buffers[i]))
+			break;
+		/* Must be consecutive block numbers. */
+		Assert(BufferGetBlockNumber(buffers[i - 1]) ==
+			   BufferGetBlockNumber(buffers[i]) - 1);
+		Assert(io_buffers[io_buffers_len] == buffers[i]);
 
-		pgaio_io_release(ioh);
-		pgaio_wref_clear(&operation->io_wref);
-		did_start_io = false;
-
-		/*
-		 * Report and track this as a 'hit' for this backend, even though it
-		 * must have started out as a miss in PinBufferForBlock(). The other
-		 * backend will track this as a 'read'.
-		 */
-		ProcessBufferHit(operation->strategy, operation->rel, persistence,
-						 operation->smgr, forknum,
-						 blocknum + operation->nblocks_done);
+		io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
 	}
+
+	/* get a reference to wait for in WaitReadBuffers() */
+	pgaio_io_get_wref(ioh, &operation->io_wref);
+
+	/* provide the list of buffers to the completion callbacks */
+	pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
+
+	pgaio_io_register_callbacks(ioh,
+								persistence == RELPERSISTENCE_TEMP ?
+								PGAIO_HCB_LOCAL_BUFFER_READV :
+								PGAIO_HCB_SHARED_BUFFER_READV,
+								flags);
+
+	pgaio_io_set_flag(ioh, ioh_flags);
+
+	/* ---
+	 * Even though we're trying to issue IO asynchronously, track the time
+	 * in smgrstartreadv():
+	 * - if io_method == IOMETHOD_SYNC, we will always perform the IO
+	 *   immediately
+	 * - the io method might not support the IO (e.g. worker IO for a temp
+	 *   table)
+	 * ---
+	 */
+	io_start = pgstat_prepare_io_time(track_io_timing);
+	smgrstartreadv(ioh, operation->smgr, forknum,
+				   blocknum + nblocks_done,
+				   io_pages, io_buffers_len);
+	pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
+							io_start, 1, io_buffers_len * BLCKSZ);
+
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_read += io_buffers_len;
 	else
-	{
-		instr_time	io_start;
+		pgBufferUsage.shared_blks_read += io_buffers_len;
 
-		/* We found a buffer that we need to read in. */
-		Assert(io_buffers[0] == buffers[nblocks_done]);
-		io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
-		io_buffers_len = 1;
+	/*
+	 * Track vacuum cost when issuing IO, not after waiting for it. Otherwise
+	 * we could end up issuing a lot of IO in a short timespan, despite a low
+	 * cost limit.
+	 */
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
 
-		/*
-		 * How many neighboring-on-disk blocks can we scatter-read into other
-		 * buffers at the same time?  In this case we don't wait if we see an
-		 * I/O already in progress.  We already set BM_IO_IN_PROGRESS for the
-		 * head block, so we should get on with that I/O as soon as possible.
-		 */
-		for (int i = nblocks_done + 1; i < operation->nblocks; i++)
-		{
-			if (!ReadBuffersCanStartIO(buffers[i], true))
-				break;
-			/* Must be consecutive block numbers. */
-			Assert(BufferGetBlockNumber(buffers[i - 1]) ==
-				   BufferGetBlockNumber(buffers[i]) - 1);
-			Assert(io_buffers[io_buffers_len] == buffers[i]);
+	*nblocks_progress = io_buffers_len;
 
-			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
-		}
-
-		/* get a reference to wait for in WaitReadBuffers() */
-		pgaio_io_get_wref(ioh, &operation->io_wref);
-
-		/* provide the list of buffers to the completion callbacks */
-		pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
-
-		pgaio_io_register_callbacks(ioh,
-									persistence == RELPERSISTENCE_TEMP ?
-									PGAIO_HCB_LOCAL_BUFFER_READV :
-									PGAIO_HCB_SHARED_BUFFER_READV,
-									flags);
-
-		pgaio_io_set_flag(ioh, ioh_flags);
-
-		/* ---
-		 * Even though we're trying to issue IO asynchronously, track the time
-		 * in smgrstartreadv():
-		 * - if io_method == IOMETHOD_SYNC, we will always perform the IO
-		 *   immediately
-		 * - the io method might not support the IO (e.g. worker IO for a temp
-		 *   table)
-		 * ---
-		 */
-		io_start = pgstat_prepare_io_time(track_io_timing);
-		smgrstartreadv(ioh, operation->smgr, forknum,
-					   blocknum + nblocks_done,
-					   io_pages, io_buffers_len);
-		pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
-								io_start, 1, io_buffers_len * BLCKSZ);
-
-		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_read += io_buffers_len;
-		else
-			pgBufferUsage.shared_blks_read += io_buffers_len;
-
-		/*
-		 * Track vacuum cost when issuing IO, not after waiting for it.
-		 * Otherwise we could end up issuing a lot of IO in a short timespan,
-		 * despite a low cost limit.
-		 */
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
-
-		*nblocks_progress = io_buffers_len;
-		did_start_io = true;
-	}
-
-	return did_start_io;
+	return true;
 }
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d331aedfa..4d470a051 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2349,6 +2349,7 @@ PredicateLockData
 PredicateLockTargetType
 PrefetchBufferResult
 PrepParallelRestorePtrType
+PrepareReadBuffer_Status
 PrepareStmt
 PreparedStatement
 PresortedKeyData
-- 
2.53.0
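
The three-way outcome that PrepareNewReadBufferIO() reports can be illustrated with a small single-threaded model. This is a sketch under loud assumptions: every Toy* name is hypothetical, the wait reference is reduced to a plain int, and the brief window where IO is in progress but the wait reference is not yet set (which the patch handles with WaitIO() and a retry) is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the three outcomes named in the patch; everything else is a toy. */
typedef enum ToyPrepareStatus
{
	TOY_ALREADY_DONE,
	TOY_IN_PROGRESS,
	TOY_READY_FOR_IO,
} ToyPrepareStatus;

/* Hypothetical, single-threaded stand-in for a buffer descriptor. */
typedef struct ToyBufferDesc
{
	bool		valid;			/* analogue of BM_VALID */
	bool		io_in_progress; /* analogue of BM_IO_IN_PROGRESS */
	int			wait_ref;		/* 0 means "no wait reference set" */
} ToyBufferDesc;

/* Hypothetical stand-in for the per-read operation state. */
typedef struct ToyOperation
{
	bool		foreign_io;		/* are we waiting on someone else's IO? */
	int			wait_ref;		/* what to wait on later, 0 if nothing */
} ToyOperation;

static ToyPrepareStatus
toy_prepare_read(ToyOperation *op, ToyBufferDesc *desc, int my_wait_ref)
{
	assert(my_wait_ref != 0);
	op->foreign_io = false;

	/* Block is already there: nothing to wait for */
	if (desc->valid)
	{
		op->wait_ref = 0;
		return TOY_ALREADY_DONE;
	}

	if (desc->io_in_progress)
	{
		/* The toy omits the "wait ref not set yet" window entirely */
		assert(desc->wait_ref != 0);

		/* Join the existing read: remember its wait ref, do not block */
		op->wait_ref = desc->wait_ref;
		op->foreign_io = true;
		return TOY_IN_PROGRESS;
	}

	/* Nobody is reading this block: claim it for our own IO */
	desc->io_in_progress = true;
	desc->wait_ref = my_wait_ref;
	op->wait_ref = my_wait_ref;
	return TOY_READY_FOR_IO;
}
```

In this model the first caller gets TOY_READY_FOR_IO and claims the buffer, while a second caller arriving before completion gets TOY_IN_PROGRESS and merely records the first caller's wait reference. That deferral, rather than an immediate wait, is what lets other reads be started in the meantime.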



  [application/octet-stream] v13-0007-Limit-get_actual_variable_range-to-scan-three-in.patch (9.4K, 12-v13-0007-Limit-get_actual_variable_range-to-scan-three-in.patch)
From 9862c776caa934b7d1e4f954163bf44391e32b58 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Mon, 2 Feb 2026 23:34:36 -0500
Subject: [PATCH v13 07/19] Limit get_actual_variable_range to scan three index
 pages.

get_actual_variable_range scans an index to find actual min/max values
for planner selectivity estimation.  Since this happens during planning,
we can't afford to spend too much time on it.  Commit 9c6ad5eaa9 added
VISITED_PAGES_LIMIT (a limit of 100 heap page visits) to bound the
amount of work performed, giving up and falling back to the pg_statistic
extremal value when the limit is exceeded.  But that isn't effective in
cases with more extreme concentrations of dead index tuples.

Recent benchmark results from Mark Callaghan show that
VISITED_PAGES_LIMIT isn't effective once the dead index tuple problem
gets out of hand (which is expected with queue-like tables that
continually delete older records and insert newer ones).  The root cause
is that VISITED_PAGES_LIMIT counts heap page visits, but when many index
tuples are marked LP_DEAD, _bt_readpage traverses arbitrarily many index
pages without returning any tuples -- the heap page counter in
selfuncs.c never gets a chance to increment, so VISITED_PAGES_LIMIT
never triggers.  Furthermore, the design of setting LP_DEAD bits to help
future calls is ultimately counterproductive: each LP_DEAD tuple is one
fewer that counts against VISITED_PAGES_LIMIT, so the more LP_DEAD bits
we set, the less effective the limit becomes at bailing out early.

Replace VISITED_PAGES_LIMIT with a mechanism that limits
get_actual_variable_range to scanning only the extremal index leaf page,
and two additional index pages, rather than counting heap page visits.
This provides a hard guarantee on the maximum work per call.  Unlike
VISITED_PAGES_LIMIT, this limit cannot be eroded by LP_DEAD bits.

This approach also has the merit of being compatible with the index
prefetching commit's new table_index_getnext_slot() interface.  That
interface hides heap access details from callers like selfuncs.c, making
VISITED_PAGES_LIMIT impractical to implement without pushing ad-hoc
logic into the table AM layer.

Author: Peter Geoghegan <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wzkt1WkKp4VRJu3qHfmKXc8W+XYv1RXg5d2d3fSvAeO=rg@mail.gmail.com
---
 src/include/access/relscan.h             |  8 ++++++
 src/backend/access/heap/heapam.c         |  3 ---
 src/backend/access/heap/heapam_handler.c | 11 +++++++++
 src/backend/access/index/genam.c         |  1 +
 src/backend/access/nbtree/nbtsearch.c    |  9 ++++++-
 src/backend/utils/adt/selfuncs.c         | 31 +++++++++++++++---------
 6 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ede5e6aa3..2db2b678c 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -382,6 +382,14 @@ typedef struct IndexScanDescData
 
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
+
+	/*
+	 * Counter to request early abort during get_actual_variable_range scans.
+	 * When nonzero, the scan will read at most this many leaf pages before
+	 * giving up (regardless of whether those pages had matching items).
+	 * Zero means disabled (normal scan behavior).
+	 */
+	int			xs_read_extremal_only;
 } IndexScanDescData;
 
 /* Generic structure for parallel scans */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8f1c11a93..3cb536d6a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1884,9 +1884,6 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		 * If we can't see it, maybe no one else can either.  At caller
 		 * request, check whether all chain members are dead to all
 		 * transactions.
-		 *
-		 * Note: if you change the criterion here for what is "dead", fix the
-		 * planner's get_actual_variable_range() function to match.
 		 */
 		if (all_dead && *all_dead)
 		{
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 73ae4925e..fb25e203a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -557,6 +557,17 @@ heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction,
 
 		/* Append batch to the end of ring buffer/write it to buffer index */
 		index_scan_batch_append(scan, batch);
+
+		/*
+		 * xs_read_extremal_only scans are used by get_actual_variable_range
+		 * to find min/max values.  They only need a value from one of the
+		 * extremal leaf pages, so once we have one batch, we give up.
+		 */
+		if (unlikely(scan->xs_read_extremal_only) && priorBatch)
+		{
+			Assert(scan->xs_want_itup);
+			return NULL;
+		}
 	}
 	else
 	{
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 6e87169c2..d50e3fa71 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_itupdesc = NULL;
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
+	scan->xs_read_extremal_only = 0;
 
 	scan->batch_index_opaque_size = 0;
 	scan->batch_tuples_workspace = 0;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fe9c6f605..2b7f5c3b1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1702,13 +1702,18 @@ _bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch,
 	Assert(firstbatch->dir == dir);
 
 	if (blkno == P_NONE ||
+		(scan->xs_read_extremal_only && --scan->xs_read_extremal_only == 0) ||
 		(ScanDirectionIsForward(dir) ?
 		 !btfirstbatch->moreRight : !btfirstbatch->moreLeft))
 	{
 		/*
 		 * firstbatch _bt_readpage call ended scan in this direction (though
-		 * if so->needPrimScan was set the scan will continue in _bt_first)
+		 * if so->needPrimScan was set the scan will continue in _bt_first).
+		 *
+		 * Also cut our losses during xs_read_extremal_only scans, which are
+		 * limited to scanning only a few leaf pages in the index.
 		 */
+		Assert(!scan->xs_read_extremal_only || !so->needPrimScan);
 		indexam_util_batch_release(scan, firstbatch);
 		_bt_parallel_done(scan);
 		return NULL;
@@ -1828,6 +1833,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 
 		/* Continue the scan in this direction? */
 		if (blkno == P_NONE ||
+			(scan->xs_read_extremal_only &&
+			 --scan->xs_read_extremal_only == 0) ||
 			(ScanDirectionIsForward(dir) ?
 			 !btnewbatch->moreRight : !btnewbatch->moreLeft))
 		{
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6d80ae003..09f2b9652 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7081,13 +7081,12 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
  *
  * scankeys is a 1-element scankey array set up to reject nulls.
  * typLen/typByVal describe the datatype of the index's first column.
- * tableslot is a slot suitable to hold table tuples, in case we need
- * to probe the heap.
+ * tableslot is a slot suitable to hold table tuples.
  * (We could compute these values locally, but that would mean computing them
  * twice when get_actual_variable_range needs both the min and the max.)
  *
- * Failure occurs either when the index is empty, or we decide that it's
- * taking too long to find a suitable tuple.
+ * Failure occurs either when the index is empty, or when it takes too long to
+ * find a suitable tuple.
  */
 static bool
 get_actual_variable_endpoint(Relation heapRel,
@@ -7147,22 +7146,30 @@ get_actual_variable_endpoint(Relation heapRel,
 	 *
 	 * Despite all this care, there are situations where we might find many
 	 * non-visible tuples near the end of the index.  We don't want to expend
-	 * a huge amount of time here, so we give up once we've read too many heap
-	 * pages.  When we fail for that reason, the caller will end up using
-	 * whatever extremal value is recorded in pg_statistic.
-	 *
-	 * XXX This can't work with the new table_index_getnext_slot interface,
-	 * which simply won't return a tuple that isn't visible to our snapshot.
-	 * table_index_getnext_slot will need some kind of callback that provides
-	 * a way for the scan to give up when the costs start to get out of hand.
+	 * a huge amount of time here, so we give up after reading a few extremal
+	 * index leaf pages without finding matching items (generally only seen
+	 * when pages have many index tuples with set LP_DEAD bits).  When we give
+	 * up, the caller will end up using whatever extremal value is recorded in
+	 * pg_statistic.
 	 */
 	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
 							  GlobalVisTestFor(heapRel));
 
+	/* Set up an index-only scan */
 	index_scan = index_beginscan(heapRel, indexRel, true,
 								 &SnapshotNonVacuumable, NULL,
 								 1, 0);
 	Assert(index_scan->xs_want_itup);
+
+	/*
+	 * Make our scan read at most 3 index leaf pages before it just gives up.
+	 * This is on the conservative side; giving up after the first leaf page
+	 * would work just as well in most cases.  But it's possible that the
+	 * index's leftmost/rightmost leaf page is one with very few index tuples
+	 * (with or without their LP_DEAD bits set).
+	 */
+	index_scan->xs_read_extremal_only = 3;
+
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-- 
2.53.0



  [application/octet-stream] v13-0008-Add-heapam-index-scan-I-O-prefetching.patch (43.4K, 13-v13-0008-Add-heapam-index-scan-I-O-prefetching.patch)
From d3405036c393821b6648d91eeed8f84e553a4260 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sat, 15 Nov 2025 14:03:58 -0500
Subject: [PATCH v13 08/19] Add heapam index scan I/O prefetching.

This commit implements I/O prefetching for index scans (and index-only
scans that require heap fetches). This was made possible by the recent
addition of batching interfaces to both the table AM and index AM APIs.

The amgetbatch index AM interface provides batches of matching TIDs
(rather than one tuple at a time), each of which must be taken from
index tuples that appear together on a single index page.  This allows
multiple batches to be held open simultaneously.  Giving the table AM an
explicit understanding of index AM concepts/index page boundaries allows
it to consider all of the relevant costs and benefits.

Prefetching is implemented using a prefetching position under the
control of the table AM and core code.  This is closely related to the
scan position added by commit FIXME, which introduced the amgetbatch
interface.  A read stream callback advances the read stream as needed to
provide sufficiently many heap block numbers to maintain the read
stream's target prefetch distance.
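
As a rough illustration of the callback's job (a minimal sketch, not the
patch's code), the following walks a prefetch position over the heap block
numbers of matching TIDs and never hands the same block to the stream twice
in succession.  SketchPrefetch, sketch_next_block and SKETCH_INVALID_BLOCK
are hypothetical names invented for this example; the real callback also has
to load further batches and skip all-visible items.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "no more blocks" marker, used only by this sketch */
#define SKETCH_INVALID_BLOCK UINT32_MAX

/* Hypothetical prefetch state: one heap block number per matching TID */
typedef struct SketchPrefetch
{
	const uint32_t *blocks;		/* heap block of each TID, in index order */
	size_t		nitems;			/* number of entries in blocks[] */
	size_t		next;			/* prefetch position: next entry to examine */
	uint32_t	last_returned;	/* last block handed to the stream */
} SketchPrefetch;

/* Return the next block to prefetch, skipping immediate repeats */
static uint32_t
sketch_next_block(SketchPrefetch *p)
{
	while (p->next < p->nitems)
	{
		uint32_t	blk = p->blocks[p->next++];

		if (blk == p->last_returned)
			continue;			/* same block as the previous item */
		p->last_returned = blk;
		return blk;
	}

	return SKETCH_INVALID_BLOCK;	/* ran out of items */
}
```

Only adjacent repeats are suppressed here: a block that shows up again after
some other block is returned a second time, since the scan will need to read
it again at that point.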

Testing has shown that index prefetching can make index scans much
faster.  Large range scans that return many tuples can be as much as 35x
faster.

An important goal of the amgetbatch design is to enable the table AM's
read stream callback to advance its prefetch position using TIDs that
appear on a leaf page that's ahead of the current scan position's leaf
page.  This is crucial with scans of indexes where each leaf page
happens to have relatively few distinct heap blocks among its matching
TIDs (as well as with scans with leaf pages that have relatively few
total matching items).  Index scans can have as many as 64 open batches,
which testing has shown to be about the maximum number that can ever be
useful.  Batches are maintained in scan order using a simple ring buffer
data structure.
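
A minimal sketch of such a ring buffer follows, assuming uint8 head/next
batch numbers that are allowed to wrap around (the diff's comments mention
uint8 batch numbers; everything else here, including the SketchRing type and
its int payload, is a hypothetical simplification).  Because 64 divides 256,
mapping a batch number to a slot with a modulo stays consistent across the
wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RING_SLOTS 64	/* must divide 256 for safe uint8 wraparound */

typedef struct SketchRing
{
	uint8_t		head;			/* oldest open batch */
	uint8_t		next;			/* number the next appended batch will get */
	int			slots[SKETCH_RING_SLOTS];	/* payload, e.g. a leaf page id */
} SketchRing;

/* Number of open batches; the cast makes the subtraction wrap modulo 256 */
static int
sketch_ring_count(const SketchRing *r)
{
	return (uint8_t) (r->next - r->head);
}

static bool
sketch_ring_full(const SketchRing *r)
{
	return sketch_ring_count(r) == SKETCH_RING_SLOTS;
}

/* Append a batch in scan order; fails when every slot is in use */
static bool
sketch_ring_append(SketchRing *r, int payload)
{
	if (sketch_ring_full(r))
		return false;
	r->slots[r->next % SKETCH_RING_SLOTS] = payload;
	r->next++;
	return true;
}

/* Remove the oldest batch, freeing its slot for reuse */
static bool
sketch_ring_pop_head(SketchRing *r, int *payload)
{
	if (sketch_ring_count(r) == 0)
		return false;
	*payload = r->slots[r->head % SKETCH_RING_SLOTS];
	r->head++;
	return true;
}
```

Batches only ever enter at the tail and leave at the head, which is what
keeps them in scan order without any sorting.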

In rare cases where the scan would need more than this quasi-arbitrary
limit of 64 open batches, the read stream is temporarily paused.
Prefetching (via the read stream) resumes only after the scan position
advances beyond its current open batch, which is then freed and removed
from the scan's batch ring buffer.  Testing has shown that it isn't very
common for scans to hold open more than about 10 batches to get the
desired I/O prefetch distance.
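
The pause/resume handshake can be sketched as below, under the simplifying
assumption that all that matters is a count of open batches and a paused
flag.  SketchScan, sketch_prefetch_open and sketch_scan_release are
hypothetical names for this example only; in the patch the corresponding
work is split between the read stream callback and the code that advances
the scan position.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_OPEN 64		/* limit on open batches, as described above */

typedef struct SketchScan
{
	int			nopen;			/* batches currently held open */
	bool		paused;			/* prefetching paused until a slot frees up? */
} SketchScan;

/* Prefetch side: try to open one more batch; pause when no slot is free */
static bool
sketch_prefetch_open(SketchScan *s)
{
	assert(!s->paused);
	if (s->nopen == SKETCH_MAX_OPEN)
	{
		s->paused = true;
		return false;
	}
	s->nopen++;
	return true;
}

/* Scan side: the scan stepped off its oldest batch; free it and resume */
static void
sketch_scan_release(SketchScan *s)
{
	assert(s->nopen > 0);
	s->nopen--;
	s->paused = false;			/* a slot is free again */
}
```

Only the scan side can unpause, since only it can free a slot.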

Author: Tomas Vondra <[email protected]>
Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/include/access/heapam.h                   |  13 +
 src/include/access/relscan.h                  |  35 ++
 src/include/optimizer/cost.h                  |   1 +
 src/backend/access/heap/heapam_handler.c      | 370 +++++++++++++++++-
 src/backend/access/index/indexbatch.c         |  52 ++-
 src/backend/access/nbtree/README              |   2 +-
 src/backend/optimizer/path/costsize.c         |   1 +
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml                      |  16 +
 doc/src/sgml/indexam.sgml                     | 105 ++++-
 doc/src/sgml/tableam.sgml                     |   8 +
 src/test/regress/expected/sysviews.out        |   3 +-
 13 files changed, 591 insertions(+), 23 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 136019925..14908f41b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -123,6 +123,19 @@ typedef struct IndexFetchHeapData
 	Buffer		xs_vmbuf;		/* visibility map buffer */
 	int			xs_vm_items;	/* # items to resolve visibility info for */
 
+	/* For batch index scans that use read stream for prefetching */
+	ReadStream *xs_read_stream;
+
+	/*
+	 * The read stream is allocated at the beginning of the scan and reset on
+	 * rescan or when the scan direction changes. The scan direction is saved
+	 * each time a new tuple is requested. If the scan direction changes from
+	 * one tuple to the next, the read stream releases all previously pinned
+	 * buffers and resets the prefetch block.
+	 */
+	ScanDirection xs_read_stream_dir;	/* index scan direction */
+	BlockNumber xs_prefetch_block;	/* last block returned to xs_read_stream */
+	bool		xs_paused;		/* paused until next batch is read? */
 	bool		xs_lastinblock; /* last TID on this block in current batch? */
 
 	/* NB: if xs_cbuf or vmbuf are not InvalidBuffer, we hold a pin */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 2db2b678c..14782b599 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -203,6 +203,10 @@ typedef struct IndexScanBatchData
 	 * This allows table AMs to avoid redundant amgetbatch calls with the same
 	 * priorbatch -- the index AM might need to read additional index pages to
 	 * determine there are no more matching items beyond caller's priorbatch.
+	 * In particular, during prefetching the read stream callback discovers
+	 * the end-of-scan via prefetchBatch.  The table AM checks these flags so
+	 * that the scan side doesn't repeat the same amgetbatch call when it
+	 * later reaches that batch as scanBatch.
 	 */
 	bool		knownEndBackward;
 	bool		knownEndForward;
@@ -261,12 +265,21 @@ typedef struct IndexScanBatchData *IndexScanBatch;
  * matches in.  However, table AMs are free to fetch table tuples in whatever
  * order is most convenient/efficient -- provided that such reordering cannot
  * affect the order that table_index_getnext_slot later returns tuples in.
+ *
+ * This data structure also provides table AMs with a way to read ahead of the
+ * current read position by _multiple_ batches/index pages.  The further out
+ * the table AM reads ahead like this, the further it can see into the future.
+ * That way the table AM is able to reorder work as aggressively as desired.
+ * For example, index scans sometimes need to read ahead by as many as a few
+ * dozen amgetbatch batches in order to maintain an optimal I/O prefetch
+ * distance (distance for reading table blocks/fetching table tuples).
  */
 typedef struct BatchRingBuffer
 {
 	/* current positions in batches[] for scan */
 	BatchRingItemPos scanPos;	/* scan's read position */
 	BatchRingItemPos markPos;	/* mark/restore position */
+	BatchRingItemPos prefetchPos;	/* prefetching position */
 
 	IndexScanBatch markBatch;
 
@@ -478,6 +491,28 @@ index_scan_batch_append(IndexScanDescData *scan, IndexScanBatch batch)
 	ringbuf->nextBatch++;
 }
 
+/*
+ * Compare two batch ring positions in the given scan direction.
+ *
+ * Returns negative if pos1 is behind pos2, 0 if equal, positive if pos1 is
+ * ahead of pos2.
+ */
+static inline int
+index_scan_pos_cmp(BatchRingItemPos *pos1, BatchRingItemPos *pos2,
+				   ScanDirection direction)
+{
+	int8		batchdiff = (int8) (pos1->batch - pos2->batch);
+
+	if (batchdiff != 0)
+		return batchdiff;
+
+	/* Same batch, compare items */
+	if (ScanDirectionIsForward(direction))
+		return pos1->item - pos2->item;
+	else
+		return pos2->item - pos1->item;
+}
+
 /*
  * Advance position to its next item in the batch.
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index f2fd5d315..419300a6b 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -52,6 +52,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
 extern PGDLLIMPORT bool enable_seqscan;
 extern PGDLLIMPORT bool enable_indexscan;
 extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexscan_prefetch;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index fb25e203a..8d582e3d9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -37,6 +37,7 @@
 #include "commands/progress.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
@@ -60,6 +61,9 @@ static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
 static bool BitmapHeapScanNextBlock(TableScanDesc scan,
 									bool *recheck,
 									uint64 *lossy_pages, uint64 *exact_pages);
+static BlockNumber heapam_getnext_stream(ReadStream *stream,
+										 void *callback_private_data,
+										 void *per_buffer_data);
 
 
 /* ------------------------------------------------------------------------
@@ -101,6 +105,17 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
 	/* Rescans should avoid an excessive number of VM lookups */
 	hscan->xs_vm_items = 1;
 
+	/* Reset read stream direction unconditionally */
+	hscan->xs_read_stream_dir = NoMovementScanDirection;
+
+	/* Reset read stream itself, and other associated state */
+	if (hscan->xs_read_stream)
+	{
+		hscan->xs_prefetch_block = InvalidBlockNumber;
+		hscan->xs_paused = false;
+		read_stream_reset(hscan->xs_read_stream);
+	}
+
 	/*
 	 * Deliberately avoid dropping any pins now held in xs_cbuf and xs_vmbuf.
 	 * This saves cycles during certain tight nested loop joins, and during
@@ -122,6 +137,9 @@ heapam_index_fetch_end(IndexFetchTableData *scan)
 	if (BufferIsValid(hscan->xs_vmbuf))
 		ReleaseBuffer(hscan->xs_vmbuf);
 
+	if (hscan->xs_read_stream)
+		read_stream_end(hscan->xs_read_stream);
+
 	pfree(hscan);
 }
 
@@ -191,7 +209,14 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		if (BufferIsValid(hscan->xs_cbuf))
 			ReleaseBuffer(hscan->xs_cbuf);
 
-		hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
+		/*
+		 * When using a read stream, the stream will already know which block
+		 * number comes next (though an assertion will verify a match below)
+		 */
+		if (hscan->xs_read_stream)
+			hscan->xs_cbuf = read_stream_next_buffer(hscan->xs_read_stream, NULL);
+		else
+			hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
 
 		/*
 		 * Prune page when it is pinned for the first time
@@ -276,6 +301,30 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
  * (important for inner index scans of anti-joins and semi-joins), and the
  * need to not hold onto index leaf pages for too long.
  *
+ * Dropping leaf page pins early
+ * -----------------------------
+ *
+ * In no event will the scan be allowed to hold onto more than one batch's
+ * leaf page pin at a time.  The primary reason for this restriction is to
+ * avoid unintended interactions with the read stream, which has its own
+ * strategy for keeping the number of pins held by the backend under control.
+ *
+ * Once we've resolved visibility for all items in a batch, we can safely drop
+ * its leaf page pin.  This is safe with respect to concurrent VACUUM because
+ * index vacuuming will block on acquiring a conflicting cleanup lock on the
+ * batch's index page due to our holding a pin on that same page.  Copying the
+ * relevant visibility map data into our local cache suffices to prevent unsafe
+ * concurrent TID recycling: if any of these TIDs point to dead heap tuples,
+ * VACUUM cannot possibly return from ambulkdelete and mark the pointed-to
+ * heap pages as all-visible.  VACUUM _can_ do so once we release the batch's
+ * pin, but that's okay; we'll be working off of cached visibility info that
+ * indicates that the dead TIDs are NOT all-visible.
+ *
+ * Note: We cannot drop the pin early when the scan uses a non-MVCC snapshot;
+ * we must delay it until all heap fetches for the loaded batch have taken
+ * place.  This is why we don't support prefetching during such scans.  See
+ * doc/src/sgml/indexam.sgml.
+ *
  * Note on Memory Ordering Effects
  * -------------------------------
  *
@@ -492,11 +541,13 @@ static pg_attribute_hot IndexScanBatch
 heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction,
 					 IndexScanBatch priorBatch, BatchRingItemPos *pos)
 {
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
 	IndexScanBatch batch = NULL;
 	BatchRingBuffer *batchringbuf PG_USED_FOR_ASSERTS_ONLY = &scan->batchringbuf;
 
 	/* XXX: we should assert that a snapshot is pushed or registered */
 	Assert(TransactionIdIsValid(RecentXmin));
+	Assert(direction == hscan->xs_read_stream_dir);
 
 	if (!priorBatch)
 	{
@@ -565,9 +616,39 @@ heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction,
 		 */
 		if (unlikely(scan->xs_read_extremal_only) && priorBatch)
 		{
+			Assert(!hscan->xs_read_stream);
 			Assert(scan->xs_want_itup);
 			return NULL;
 		}
+
+		/*
+		 * Delay initializing stream until reading from scan's second batch.
+		 * This heuristic avoids wasting cycles on starting a read stream for
+		 * very selective index scans.  We can likely improve upon this, but
+		 * it works well enough for now.
+		 *
+		 * Also avoid prefetching during scans where we're unable to drop each
+		 * batch's buffer pin right away (non-MVCC snapshot scans).  We are
+		 * not prepared to sensibly limit the total number of buffer pins held
+		 * (read stream handles all pin resource management for us, and knows
+		 * nothing about pins held on index pages/within batches).
+		 *
+		 * Also delay creating a read stream during index-only scans that
+		 * haven't done any heap fetches yet.  We don't want to waste any
+		 * cycles on allocating a read stream until we have a demonstrated
+		 * need to perform heap fetches.
+		 */
+		if (!hscan->xs_read_stream && priorBatch && scan->MVCCScan &&
+			hscan->xs_blk != InvalidBlockNumber &&	/* for index-only scans */
+			enable_indexscan_prefetch)
+		{
+			Assert(!batchringbuf->prefetchPos.valid);
+
+			hscan->xs_read_stream =
+				read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
+										   scan->heapRelation, MAIN_FORKNUM,
+										   heapam_getnext_stream, scan, 0);
+		}
 	}
 	else
 	{
@@ -592,6 +673,32 @@ heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction,
 	return batch;
 }
 
+/*
+ * Handle a change in index scan direction (at the tuple granularity).
+ *
+ * Resets the read stream, since we can't rely on scanPos continuing to agree
+ * with the blocks that read stream already consumed using prefetchPos.
+ *
+ * Note: iff the scan _continues_ in this new direction, and actually steps
+ * off scanBatch to an earlier index page, heapam_batch_getnext will deal with
+ * it.  But that might never happen; the scan might yet change direction again
+ * (or just end before returning more items).
+ */
+static pg_noinline void
+heapam_dirchange_readstream_reset(IndexFetchHeapData *hscan,
+								  ScanDirection direction,
+								  BatchRingBuffer *batchringbuf)
+{
+	/* Reset read stream state */
+	batchringbuf->prefetchPos.valid = false;
+	hscan->xs_paused = false;
+	hscan->xs_read_stream_dir = direction;
+
+	/* Reset read stream itself */
+	if (hscan->xs_read_stream)
+		read_stream_reset(hscan->xs_read_stream);
+}
+
 /* ----------------
  *		heapam_batch_getnext_tid - get next TID from batch ring buffer
  *
@@ -610,6 +717,12 @@ heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
 	Assert(!scanPos->valid || batchringbuf->headBatch == scanPos->batch);
 	Assert(scanPos->valid || index_scan_batch_count(scan) == 0);
 
+	/* Handle resetting the read stream when scan direction changes */
+	if (hscan->xs_read_stream_dir == NoMovementScanDirection)
+		hscan->xs_read_stream_dir = direction;	/* first call */
+	else if (unlikely(hscan->xs_read_stream_dir != direction))
+		heapam_dirchange_readstream_reset(hscan, direction, batchringbuf);
+
 	/*
 	 * Check if there's an existing loaded scanBatch for us to return the next
 	 * matching item's TID/index tuple from
@@ -618,7 +731,7 @@ heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
 	{
 		/*
 		 * scanPos is valid, so scanBatch must already be loaded in batch ring
-		 * buffer.  We rely on that here.
+		 * buffer.  We rely on that here (can't do this with prefetchBatch).
 		 */
 		Assert(batchringbuf->headBatch == scanPos->batch);
 
@@ -663,21 +776,274 @@ heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
 	{
 		IndexScanBatch headBatch = index_scan_batch(scan,
 													batchringbuf->headBatch);
+		BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos;
 
 		/* free obsolescent head batch (unless it is scan's markBatch) */
 		tableam_util_free_batch(scan, headBatch);
 
+		/*
+		 * If we're about to release the batch that prefetchPos currently
+		 * points to, just invalidate prefetchPos.  We'll reinitialize it
+		 * using scanPos if and when heapam_getnext_stream is next called. (We
+		 * must avoid confusing a prefetchPos->batch that's actually before
+		 * headBatch with one that's after nextBatch due to uint8 overflow;
+		 * simplest way is to invalidate prefetchPos like this.)
+		 */
+		if (prefetchPos->valid &&
+			prefetchPos->batch == batchringbuf->headBatch)
+			prefetchPos->valid = false;
+
 		/* Remove the batch from the ring buffer */
 		batchringbuf->headBatch++;
+
+		if (hscan->xs_paused)
+		{
+			/*
+			 * The scan's read stream was paused by heapam_getnext_stream due
+			 * to exhausting all available free batch slots.  We just freed up
+			 * one such slot now, though.  Resume the read stream to re-enable
+			 * prefetching.
+			 */
+			Assert(!index_scan_batch_full(scan));
+			read_stream_resume(hscan->xs_read_stream);
+			hscan->xs_paused = false;
+		}
 	}
 
 	/* In practice scanBatch will always be the ring buffer's headBatch */
 	Assert(batchringbuf->headBatch == scanPos->batch);
+	Assert(!hscan->xs_paused);
 
 	return heapam_batch_return_tid(scan, hscan, direction, scanBatch, scanPos,
 								   all_visible);
 }
 
+/*
+ * heapam_getnext_stream
+ *		return the next block to pass to the read stream
+ *
+ * The initial batch is always loaded by heapam_batch_getnext_tid.  We don't
+ * get called until the first read_stream_next_buffer() call, when a heap
+ * block is requested from the scan's stream for the first time.
+ *
+ * The position of the read_stream is stored in prefetchPos.  It is typical for
+ * prefetchPos to consistently stay ahead of the scanPos position that's used to
+ * track the next TID to be returned to the scan by heapam_batch_getnext_tid
+ * after the first time we get called.  However, that isn't a precondition.
+ * There is a strict postcondition, though: when we return we'll always leave
+ * scanPos <= prefetchPos (except in cases where we return InvalidBlockNumber).
+ */
+static BlockNumber
+heapam_getnext_stream(ReadStream *stream, void *callback_private_data,
+					  void *per_buffer_data)
+{
+	IndexScanDesc scan = (IndexScanDesc) callback_private_data;
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &batchringbuf->scanPos;
+	BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos;
+	ScanDirection xs_read_stream_dir = hscan->xs_read_stream_dir;
+	IndexScanBatch prefetchBatch;
+	bool		fromScanPos = false;
+
+	/*
+	 * scanPos must always be valid when prefetching takes place.  There has
+	 * to be at least one batch, loaded as our scanBatch.  The scan direction
+	 * must be established, too.
+	 */
+	Assert(index_scan_batch_count(scan) > 0);
+	Assert(scan->MVCCScan);
+	Assert(scanPos->valid);
+	Assert(!hscan->xs_paused);
+	Assert(xs_read_stream_dir != NoMovementScanDirection);
+
+	/*
+	 * prefetchPos might not yet be valid.  It might have also fallen behind
+	 * scanPos.  Deal with both.
+	 *
+	 * If prefetchPos has not been initialized yet, that typically indicates
+	 * that this is the first call here for the entire scan.  We initialize
+	 * prefetchPos using the current scanPos, since the current scanBatch
+	 * item's TID should have its block number returned by the read stream
+	 * first.  It's likely that prefetchPos will get ahead of scanPos before
+	 * long, but that hasn't happened yet.
+	 *
+	 * It's also possible for prefetchPos to "fall behind" scanPos, at least
+	 * in a trivial sense: if many adjacent items are returned that contain
+	 * TIDs that point to the same heap block, scanPos can actually overtake
+	 * prefetchPos (prefetchPos can't advance until the scan actually calls
+	 * read_stream_next_buffer).  Reinitializing from scanPos is enough to
+	 * ensure that prefetchPos still fetches the next heap block that scanPos
+	 * will require (prefetchPos can never fall behind "by more than one group
+	 * of items that all point to the same heap block", so this is safe).
+	 *
+	 * Note: when heapam_batch_getnext_tid frees a batch that prefetchPos
+	 * points to, it'll invalidate prefetchPos for us.  This removes any
+	 * danger of prefetchPos.batch falling so far behind scanPos.batch that it
+	 * wraps around (and appears to be ahead of scanPos instead of behind it).
+	 */
+	if (!prefetchPos->valid ||
+		index_scan_pos_cmp(prefetchPos, scanPos, xs_read_stream_dir) < 0)
+	{
+		hscan->xs_prefetch_block = InvalidBlockNumber;
+		*prefetchPos = *scanPos;
+		fromScanPos = true;
+
+		/*
+		 * We must avoid holding on to any batch's buffer pin for more than an
+		 * instant, to avoid undesirable interactions with the scan's read
+		 * stream.  batchImmediateRelease scans always get this behavior
+		 * automatically.  Other types of scans (these are all index-only
+		 * scans in practice) are made to drop their buffer pin eagerly by
+		 * always setting the visibility info for all of the batch's items
+		 * in one go.
+		 */
+		if (scan->xs_want_itup)
+		{
+			HeapBatchData *hbatch;
+
+			/* Make heapam_batch_resolve_visibility release resources eagerly */
+			hscan->xs_vm_items = scan->maxitemsbatch;
+
+			/* Make sure that this new prefetchBatch has no resources held */
+			prefetchBatch = index_scan_batch(scan, prefetchPos->batch);
+			hbatch = heap_batch_data(prefetchBatch, scan);
+
+			/* Set visibility info not set through scanBatch */
+			heapam_batch_resolve_visibility(scan, xs_read_stream_dir,
+											prefetchBatch, hbatch,
+											prefetchPos);
+		}
+		else
+			Assert(scan->batchImmediateRelease);
+	}
+
+	prefetchBatch = index_scan_batch(scan, prefetchPos->batch);
+	for (;;)
+	{
+		BatchMatchingItem *item;
+		BlockNumber prefetch_block;
+
+		/*
+		 * We never call amgetbatch without immediately releasing the batch's
+		 * index AM resources (which requires special care during index-only
+		 * scans).  The read stream is sensitive to buffer shortages, so we
+		 * defensively avoid anything that visibly affects the per-backend
+		 * buffer limit.
+		 */
+
+		if (fromScanPos)
+		{
+			/*
+			 * Don't increment item when prefetchPos was just initialized
+			 * using scanPos.  We'll return the scanPos item's heap block
+			 * directly on the first call here.  In other words, we'll return
+			 * the heap block for the TID passed to heapam_index_fetch_tuple
+			 * at the point where it called read_stream_next_buffer for the
+			 * first time during the scan.
+			 */
+			fromScanPos = false;
+		}
+		else if (!index_scan_pos_advance(xs_read_stream_dir,
+										 prefetchBatch, prefetchPos))
+		{
+			/*
+			 * Ran out of items from prefetchBatch.  Try to advance to the
+			 * scan's next batch.
+			 */
+			if (unlikely(index_scan_batch_full(scan)))
+			{
+				/*
+				 * Can't advance prefetchBatch because all available
+				 * batchringbuf batch slots are currently in use.
+				 *
+				 * Deal with this by momentarily pausing the read stream.
+				 * heapam_batch_getnext_tid will resume the read stream later,
+				 * though only after scanPos has consumed all remaining items
+				 * from scanBatch (at which point scanBatch will be freed,
+				 * making its slot available for reuse by a later batch).
+				 *
+				 * In practice we hardly ever need to do this.  It would be
+				 * possible to avoid the need to pause the read stream by
+				 * dynamically allocating slots, but that would add complexity
+				 * for no real benefit.
+				 */
+				hscan->xs_paused = true;
+				return read_stream_pause(stream);
+			}
+
+			prefetchBatch = heapam_batch_getnext(scan, xs_read_stream_dir,
+												 prefetchBatch, prefetchPos);
+			if (!prefetchBatch)
+			{
+				/*
+				 * Failed to load next batch, so all the batches that the scan
+				 * will ever require (barring a change in scan direction) are
+				 * now loaded
+				 */
+				return InvalidBlockNumber;
+			}
+
+			/* Position prefetchPos to the start of new prefetchBatch */
+			index_scan_pos_nextbatch(xs_read_stream_dir,
+									 prefetchBatch, prefetchPos);
+
+			if (scan->xs_want_itup)
+			{
+				HeapBatchData *hbatch = heap_batch_data(prefetchBatch, scan);
+
+				/* make sure we have visibility info for the entire batch */
+				heapam_batch_resolve_visibility(scan, xs_read_stream_dir,
+												prefetchBatch, hbatch,
+												prefetchPos);
+			}
+			else
+				Assert(scan->batchImmediateRelease);
+		}
+
+		/*
+		 * prefetchPos now points to the next item whose TID's heap block
+		 * number might need to be prefetched
+		 */
+		Assert(index_scan_batch(scan, prefetchPos->batch) == prefetchBatch);
+		Assert(prefetchPos->item >= prefetchBatch->firstItem &&
+			   prefetchPos->item <= prefetchBatch->lastItem);
+		/* scanPos is always <= prefetchPos when we return */
+		Assert(index_scan_pos_cmp(scanPos, prefetchPos, xs_read_stream_dir) <= 0);
+
+		if (scan->xs_want_itup)
+		{
+			HeapBatchData *hbatch = heap_batch_data(prefetchBatch, scan);
+
+			Assert(hbatch->visInfo[prefetchPos->item] & BATCH_VIS_CHECKED);
+			if (hbatch->visInfo[prefetchPos->item] & BATCH_VIS_ALL_VISIBLE)
+			{
+				/* item is known to be all-visible -- don't prefetch */
+				continue;
+			}
+		}
+
+		item = &prefetchBatch->items[prefetchPos->item];
+		prefetch_block = ItemPointerGetBlockNumber(&item->tableTid);
+
+		if (prefetch_block == hscan->xs_prefetch_block)
+		{
+			/*
+			 * prefetch_block matches the last prefetchPos item's TID's heap
+			 * block number; we must not return the same prefetch_block twice
+			 * (twice in succession)
+			 */
+			continue;
+		}
+
+		/* We have a new heap block number to return to read stream */
+		hscan->xs_prefetch_block = prefetch_block;
+		return prefetch_block;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /* ----------------
  *		index_fetch_heap - get the scan's next heap tuple
  *
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
index a9a72810f..e272dab9e 100644
--- a/src/backend/access/index/indexbatch.c
+++ b/src/backend/access/index/indexbatch.c
@@ -10,7 +10,10 @@
  * approach enables efficient prefetching of table AM blocks during ordered
  * index scans.
  *
- * The ring buffer loads batches in index key space order.
+ * The ring buffer loads batches in index key space order.  This allows the
+ * table AM to maintain an adequate prefetch distance: its read stream
+ * callback is thereby able to request table blocks referenced by index pages
+ * that are well ahead of the current scan position's index page.
  *
  * There's three types of functions in this module:
  *
@@ -28,6 +31,28 @@
  *    AMs that implement the amgetbatch interface.  These manage batch
  *    allocation, index page buffer lock release, and batch memory recycling.
  *
+ * These three layers coordinate without explicit coupling: the core lifecycle
+ * functions assume that table AMs use scanPos/scanBatch and prefetchPos/
+ * prefetchBatch in a standardized way (see heapam_handler.c for the reference
+ * implementation), while table AMs assume that index AMs free and unlock
+ * batches according to the conventions established here.  See indexam.sgml
+ * for the full specification of the amgetbatch/amkillitemsbatch contract.
+ *
+ * The table AM fully controls the read stream as its own private state.
+ * When the scan direction changes, the table AM must immediately reset its
+ * read stream and invalidate prefetchPos -- blocks already requested via
+ * prefetchPos will no longer match what scanPos needs to return.
+ *
+ * Crossing a batch boundary in a new scan direction is a separate process,
+ * handled here: table AMs are required to call tableam_util_batch_dirchange
+ * to leave the scan's batch ring buffer in a consistent state.  The current
+ * implementation handles this by simply discarding most batches.  The key
+ * invariant is that all loaded batches must be in a consistent scan direction
+ * order.  (During cross-batch direction changes, the current scanBatch will
+ * have its IndexScanBatchData.dir flipped, but we have no provision for
+ * keeping all other loaded batches.  It's not clear that it'd be useful to
+ * hold onto them; the scan direction is unlikely to change back.)
+ *
  * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
@@ -60,6 +85,7 @@ index_batchscan_init(IndexScanDesc scan)
 
 	scan->batchringbuf.scanPos.valid = false;
 	scan->batchringbuf.markPos.valid = false;
+	scan->batchringbuf.prefetchPos.valid = false;
 
 	scan->batchringbuf.markBatch = NULL;
 	scan->batchringbuf.headBatch = 0;	/* initial head batch */
@@ -84,6 +110,7 @@ index_batchscan_reset(IndexScanDesc scan)
 
 	batchringbuf->scanPos.valid = false;
 	batchringbuf->markPos.valid = false;
+	batchringbuf->prefetchPos.valid = false;
 
 	/*
 	 * Ensure tableam_util_free_batch won't skip the old markBatch in the loop
@@ -220,7 +247,13 @@ index_batchscan_mark_pos(IndexScanDesc scan)
  * the current scanBatch when needed.
  *
  * We just discard all batches (other than markBatch/restored scanBatch),
- * except when markBatch is already the scan's current scanBatch.
+ * except when markBatch is already the scan's current scanBatch.  We always
+ * invalidate prefetchPos.  The read stream and related prefetching state are
+ * reset by table_index_fetch_reset(), called before this function.  This
+ * approach keeps things simple for table AMs: most code that deals with
+ * batches is thereby able to assume that the common case where scan direction
+ * never changes is the only case (tableam_util_batch_dirchange takes a
+ * similar approach to handling a cross-batch change in scan direction).
  */
 void
 index_batchscan_restore_pos(IndexScanDesc scan)
@@ -235,6 +268,14 @@ index_batchscan_restore_pos(IndexScanDesc scan)
 	Assert(!batchringbuf->done);
 	Assert(markPos->valid);
 
+	/*
+	 * Restoring a mark always requires stopping prefetching.  This is similar
+	 * to the handling table AMs implement to deal with a tuple-level change
+	 * in the scan's direction.  The read stream must have already been reset
+	 * by the caller (via table_index_fetch_reset).
+	 */
+	batchringbuf->prefetchPos.valid = false;
+
 	if (scanBatch == markBatch)
 	{
 		/* markBatch is already scanBatch; needn't change batchringbuf */
@@ -304,6 +345,13 @@ index_batchscan_restore_pos(IndexScanDesc scan)
  * point on batchringbuf will look as if our new scan direction had been used
  * from the start.  This approach isn't particularly efficient, but it works
  * well enough for what ought to be a relatively rare occurrence.
+ *
+ * Caller must have reset the scan's read stream before calling here.  That
+ * needs to happen as soon as the scan requests a tuple in whatever scan
+ * direction is opposite-to-current.  We only deal with the case where the
+ * scan backs up by enough items to cross a batch boundary (when the scan
+ * resumes scanning in its original direction/ends before crossing a boundary,
+ * there isn't any need to call here).
  */
 void
 tableam_util_batch_dirchange(IndexScanDesc scan)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index e75577a7e..3939391ae 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -186,7 +186,7 @@ interface.  (See also, doc/src/sgml/indexam.sgml).
 Blocking VACUUM like this can be disruptive, so table AMs avoid it whenever
 possible.  The heap table AM usually drops leaf page pins right away, though
 not during scans that use a non-MVCC snapshot.  Index-only scans may also
-retain pins in some cases.
+retain pins in some cases, though prefetching requires dropping them.
 
 Opportunistic index tuple deletion performs the same page-level
 modifications as VACUUM, while only holding an exclusive lock.  This is
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 89ca4e08b..78d87cd8b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -145,6 +145,7 @@ int			max_parallel_workers_per_gather = 2;
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
+bool		enable_indexscan_prefetch = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index a5a0edf25..78c3c647f 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -891,6 +891,13 @@
   boot_val => 'true',
 },
 
+{ name => 'enable_indexscan_prefetch', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
+  short_desc => 'Enables prefetching for index scans and index-only scans.',
+  flags => 'GUC_EXPLAIN',
+  variable => 'enable_indexscan_prefetch',
+  boot_val => 'true',
+},
+
 { name => 'enable_material', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
   short_desc => 'Enables the planner\'s use of materialization.',
   flags => 'GUC_EXPLAIN',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e686d88af..aad256ea8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -417,6 +417,7 @@
 #enable_incremental_sort = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
+#enable_indexscan_prefetch = on
 #enable_material = on
 #enable_memoize = on
 #enable_mergejoin = on
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cdd826fb..4a4a09ad7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5712,6 +5712,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-indexscan-prefetch" xreflabel="enable_indexscan_prefetch">
+      <term><varname>enable_indexscan_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_indexscan_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables prefetching for index-scan and index-only-scan
+        plan types.  Prefetching can improve performance by reading table AM
+        pages ahead of when they are needed during index scans.  The default
+        is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-material" xreflabel="enable_material">
       <term><varname>enable_material</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 219fb73e6..146f27778 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -809,9 +809,12 @@ amgetbatch (IndexScanDesc scan,
   <para>
    The <function>amgetbatch</function> interface is an alternative to
    <function>amgettuple</function> that returns matching index entries in batches
-   rather than one at a time.  By returning all matching index entries from a
-   single index page together, the table AM gains visibility into which table
-   blocks will be needed in the near future.
+   rather than one at a time. This enables the table access method to
+   optimize table block access patterns and perform I/O prefetching.
+   By returning all matching index entries from a single index page together,
+   the table AM can read ahead through the index and identify which table
+   blocks will be needed, allowing prefetching of table AM pages during
+   ordered index scans.
   </para>
 
   <para>
@@ -1362,6 +1365,63 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    or vice versa, if its internal implementation is unsuited to one API or the other.
   </para>
 
+  <sect2 id="index-scanning-batches">
+   <title>Table AM Considerations for Batch Scanning</title>
+
+   <para>
+    This section is primarily relevant to
+    <link linkend="tableam">table access method</link> authors.
+    When an index scan uses the <function>amgetbatch</function> interface,
+    the table AM is responsible for managing position state within the
+    <structname>IndexScanDesc</structname>'s
+    <structfield>batchringbuf</structfield> and for controlling when
+    buffer pins on index pages are released.
+   </para>
+
+   <para>
+    The <structfield>scanPos</structfield> field within
+    <structfield>batchringbuf</structfield> tracks which batch and item within
+    that batch will be returned next to the executor.  The table AM must advance
+    <structfield>scanPos</structfield> as tuples are returned by
+    <function>table_index_getnext_slot</function>.  The core code may also
+    modify this field during operations such as mark/restore.
+   </para>
+
+   <para>
+    The <structfield>prefetchPos</structfield> field tracks the position used
+    for I/O prefetching.  It is generally advanced by initializing it from
+    <structfield>scanPos</structfield> within a read stream callback, allowing
+    the table AM to prefetch table blocks pointed to by items that are well
+    ahead of the current scan position.  Initially
+    <structfield>prefetchPos</structfield> starts at
+    <structfield>scanPos</structfield>, but as the read stream ramps up it can
+    get far ahead &mdash; spanning multiple index pages if necessary to
+    maintain an optimal I/O prefetch distance for table block reads.  A major
+    goal of the <function>amgetbatch</function> interface is to allow the
+    table AM to prefetch without being limited to items from the current
+    <structfield>scanPos</structfield> batch's index leaf page.
+   </para>
+
+   <para>
+    Both <structfield>scanPos</structfield> and
+    <structfield>prefetchPos</structfield> are controlled by the table AM and
+    core code; index access methods should not access or manipulate these
+    fields.  See the <filename>src/backend/access/heap/</filename>
+    implementation for a reference example.
+   </para>
+
+   <para>
+    Index page resources held by <function>amgetbatch</function> batches
+    (typically buffer pins, stored in the index AM's per-batch opaque area)
+    are owned by the index AM but released under the table AM's control via
+    the <function>amreleasebatch</function> callback.  See the
+    <function>amgetbatch</function>, <function>amreleasebatch</function>, and
+    <function>amkillitemsbatch</function> descriptions in
+    <xref linkend="index-functions"/> for details.
+   </para>
+
+  </sect2>
+
  </sect1>
 
  <sect1 id="index-locking">
@@ -1457,29 +1517,40 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    This solution requires that <function>amgettuple</function> index scans be
    <quote>synchronous</quote>: the table AM must fetch each heap tuple
    immediately after scanning the corresponding index entry.  This is
-   expensive for a number of reasons.  An
-   <quote>asynchronous</quote> scan in which we collect many TIDs from the
-   index, and only visit the heap tuples sometime later, requires much less
-   index locking overhead and can allow a more efficient heap access pattern.
+   expensive for a number of reasons.  The
+   <function>amgetbatch</function> interface, by contrast, was designed to
+   allow scans to be <quote>asynchronous</quote>: by collecting batches of
+   TIDs from multiple index pages, the table AM can prefetch the corresponding
+   table blocks well ahead of the current scan position (using asynchronous
+   I/O when available), requiring much less index locking overhead and allowing
+   a more efficient heap access pattern.  Not all scans end up being
+   asynchronous in practice, but the interface is designed to allow it.
    Per the above analysis, we must use the synchronous approach for
    non-MVCC-compliant snapshots, but an asynchronous scan is workable
    for a query using an MVCC snapshot.
   </para>
 
   <para>
-   Index page resources held by <function>amgetbatch</function> batches
-   (typically buffer pins, stored in the index AM's per-batch opaque area)
-   are owned by the index AM but released under the table AM's control via
-   the <function>amreleasebatch</function> callback.  See the
-   <function>amgetbatch</function>, <function>amreleasebatch</function>, and
-   <function>amkillitemsbatch</function> descriptions in
-   <xref linkend="index-functions"/> for details.
+   Because the table AM reads multiple index leaf pages ahead via
+   <function>amgetbatch</function> to facilitate this prefetching, it cannot
+   practically hold pins on all those pages simultaneously.  Therefore,
+   I/O prefetching with
+   <function>amgetbatch</function> is only possible when an MVCC-compliant
+   snapshot is in use.  In practice, the heap table AM (and any table AM
+   with similar concurrency rules) usually releases resources eagerly for
+   plain MVCC index scans, but retains them for non-MVCC snapshot scans.
+   Index-only scans may retain resources in some cases, while plain index
+   scans that use an MVCC snapshot always release their resources eagerly.
+   The table AM decides <emphasis>when</emphasis> to call
+   <function>amreleasebatch</function>; the index AM decides
+   <emphasis>what</emphasis> to release.
   </para>
 
   <para>
-   In an <function>amgetbitmap</function> index scan, the access method does
-   not keep an index pin on any of the returned tuples.  Therefore
-   it is only safe to use such scans with MVCC-compliant snapshots.
+   Similarly, an <function>amgetbitmap</function> index scan is inherently
+   asynchronous: all matching TIDs are collected into a bitmap before any heap
+   access begins.  Such scans therefore require an MVCC-compliant snapshot,
+   and there is no need for the access method to hold index page pins.
   </para>
 
   <para>
diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml
index 9ccf5b739..8e70a6196 100644
--- a/doc/src/sgml/tableam.sgml
+++ b/doc/src/sgml/tableam.sgml
@@ -129,6 +129,14 @@ my_tableam_handler(PG_FUNCTION_ARGS)
   optional), the block number needs to provide locality.
  </para>
 
+ <para>
+  Table access methods can support ordered index scans using the
+  <function>amgetbatch</function> interface. See also
+  <xref linkend="index-scanning-batches"/> for details on interfacing with
+  <function>amgetbatch</function> index access methods, and managing the
+  scan's position.
+ </para>
+
  <para>
   For crash safety, an AM can use postgres' <link
   linkend="wal"><acronym>WAL</acronym></link>, or a custom implementation.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 132b56a58..32bc3dd3e 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -166,6 +166,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_incremental_sort        | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
+ enable_indexscan_prefetch      | on
  enable_material                | on
  enable_memoize                 | on
  enable_mergejoin               | on
@@ -180,7 +181,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(25 rows)
+(26 rows)
 
 -- There are always wait event descriptions for various types.  InjectionPoint
 -- may be present or absent, depending on history since last postmaster start.
-- 
2.53.0
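
The relationship between scanPos and prefetchPos described in the indexam.sgml text above can be sketched as a toy model.  This is only an illustration under stated assumptions: ToyPos, ToyRing, and the toy_* functions are hypothetical names that do not appear in the patch, every batch is assumed to hold the same number of items, and the scan is assumed to be positioned on its first item already.  The real read stream callback and batch ring buffer are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a batch ring buffer position (not the patch's type) */
typedef struct ToyPos
{
	bool		valid;
	int			batch;			/* index of a loaded batch */
	int			item;			/* item offset within that batch */
} ToyPos;

typedef struct ToyRing
{
	ToyPos		scanPos;		/* next item to return to the executor */
	ToyPos		prefetchPos;	/* last item handed to the read stream */
	int			items_per_batch;	/* assumed fixed, for simplicity */
} ToyRing;

/* Start with the scan positioned on the first item and no prefetching yet */
static void
toy_ring_init(ToyRing *ring, int items_per_batch)
{
	assert(items_per_batch > 0);
	ring->scanPos = (ToyPos) {.valid = true, .batch = 0, .item = 0};
	ring->prefetchPos = (ToyPos) {.valid = false, .batch = 0, .item = 0};
	ring->items_per_batch = items_per_batch;
}

/* Step a position forward by one item, crossing into the next batch if needed */
static void
toy_pos_advance(const ToyRing *ring, ToyPos *pos)
{
	pos->item++;
	if (pos->item == ring->items_per_batch)
	{
		pos->batch++;
		pos->item = 0;
	}
}

/*
 * Models one call of a read stream callback: an invalid prefetchPos is
 * initialized from scanPos, otherwise it moves one item further ahead.
 */
static ToyPos
toy_prefetch_next(ToyRing *ring)
{
	assert(ring->scanPos.valid);
	if (!ring->prefetchPos.valid)
		ring->prefetchPos = ring->scanPos;
	else
		toy_pos_advance(ring, &ring->prefetchPos);
	return ring->prefetchPos;
}

/* Models a direction change or mark restore: prefetching starts over */
static void
toy_prefetch_invalidate(ToyRing *ring)
{
	ring->prefetchPos.valid = false;
}
```

With three items per batch, five calls to toy_prefetch_next() leave prefetchPos in the second batch while scanPos has not moved.  After toy_prefetch_invalidate(), the next call starts again from wherever scanPos is at that point.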



  [application/octet-stream] v13-0010-Make-buffer-hit-helper.patch (6.0K, 14-v13-0010-Make-buffer-hit-helper.patch)
  download | inline diff:
From fd71a733176b00ea014907b4748904cf2207a580 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 23 Jan 2026 13:54:02 -0500
Subject: [PATCH v13 10/19] Make buffer hit helper

Two places already count buffer hits, and each needs quite a few lines of
code because a hit updates several separate counters.  Future commits will
add more locations, so refactor into a helper.

Note: I (pgeoghegan) have changed this from Melanie's original by
inlining the helper function:

https://postgr.es/m/CAH2-Wzk02vPjsJPx7EGNmgvsKKgyHn=XtGjJcPE+eQTP3xQt7w@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++--------------
 1 file changed, 56 insertions(+), 55 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 00bc60952..70a5dba73 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -648,6 +648,10 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 									  bool *foundPtr, IOContext io_context);
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
+static inline void ProcessBufferHit(BufferAccessStrategy strategy,
+									Relation rel, char persistence,
+									SMgrRelation smgr, ForkNumber forknum,
+									BlockNumber blocknum);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
@@ -1226,8 +1230,6 @@ PinBufferForBlock(Relation rel,
 				  bool *foundPtr)
 {
 	BufferDesc *bufHdr;
-	IOContext	io_context;
-	IOObject	io_object;
 
 	Assert(blockNum != P_NEW);
 
@@ -1236,17 +1238,6 @@ PinBufferForBlock(Relation rel,
 			persistence == RELPERSISTENCE_PERMANENT ||
 			persistence == RELPERSISTENCE_UNLOGGED));
 
-	if (persistence == RELPERSISTENCE_TEMP)
-	{
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-	}
-	else
-	{
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-	}
-
 	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
 									   smgr->smgr_rlocator.locator.spcOid,
 									   smgr->smgr_rlocator.locator.dbOid,
@@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
 									   smgr->smgr_rlocator.backend);
 
 	if (persistence == RELPERSISTENCE_TEMP)
-	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
-		if (*foundPtr)
-			pgBufferUsage.local_blks_hit++;
-	}
 	else
-	{
 		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
-							 strategy, foundPtr, io_context);
-		if (*foundPtr)
-			pgBufferUsage.shared_blks_hit++;
-	}
+							 strategy, foundPtr, IOContextForStrategy(strategy));
+
 	if (rel)
 	{
 		/*
@@ -1274,22 +1258,10 @@ PinBufferForBlock(Relation rel,
 		 * zeroed instead), the per-relation stats always count them.
 		 */
 		pgstat_count_buffer_read(rel);
-		if (*foundPtr)
-			pgstat_count_buffer_hit(rel);
 	}
-	if (*foundPtr)
-	{
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  true);
-	}
+	if (*foundPtr)
+		ProcessBufferHit(strategy, rel, persistence, smgr, forkNum, blockNum);
 
 	return BufferDescriptorGetBuffer(bufHdr);
 }
@@ -1695,6 +1667,51 @@ ReadBuffersCanStartIO(Buffer buffer, bool nowait)
 	return ReadBuffersCanStartIOOnce(buffer, nowait);
 }
 
+/*
+ * We track various stats related to buffer hits. Because this is done in a
+ * few separate places, this helper exists for convenience.
+ */
+static inline void
+ProcessBufferHit(BufferAccessStrategy strategy,
+				 Relation rel, char persistence, SMgrRelation smgr,
+				 ForkNumber forknum, BlockNumber blocknum)
+{
+	IOContext	io_context;
+	IOObject	io_object;
+
+	if (persistence == RELPERSISTENCE_TEMP)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum,
+									  blocknum,
+									  smgr->smgr_rlocator.locator.spcOid,
+									  smgr->smgr_rlocator.locator.dbOid,
+									  smgr->smgr_rlocator.locator.relNumber,
+									  smgr->smgr_rlocator.backend,
+									  true);
+
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_hit += 1;
+	else
+		pgBufferUsage.shared_blks_hit += 1;
+
+	if (rel)
+		pgstat_count_buffer_hit(rel);
+
+	pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
+
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageHit;
+}
+
 /*
  * Helper for WaitReadBuffers() that processes the results of a readv
  * operation, raising an error if necessary.
@@ -1990,25 +2007,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		 * must have started out as a miss in PinBufferForBlock(). The other
 		 * backend will track this as a 'read'.
 		 */
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
-										  operation->smgr->smgr_rlocator.locator.spcOid,
-										  operation->smgr->smgr_rlocator.locator.dbOid,
-										  operation->smgr->smgr_rlocator.locator.relNumber,
-										  operation->smgr->smgr_rlocator.backend,
-										  true);
-
-		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_hit += 1;
-		else
-			pgBufferUsage.shared_blks_hit += 1;
-
-		if (operation->rel)
-			pgstat_count_buffer_hit(operation->rel);
-
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
-
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		ProcessBufferHit(operation->strategy, operation->rel, persistence,
+						 operation->smgr, forknum,
+						 blocknum + operation->nblocks_done);
 	}
 	else
 	{
-- 
2.53.0



  [application/octet-stream] v13-0009-Use-ExecSetTupleBound-hint-during-index-scans.patch (10.9K, 15-v13-0009-Use-ExecSetTupleBound-hint-during-index-scans.patch)
  download | inline diff:
From 29b06356ecf9e4ee77bc1267071c43aec9274404 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Thu, 22 Jan 2026 13:07:13 -0500
Subject: [PATCH v13 09/19] Use ExecSetTupleBound hint during index scans.

This gives index scans a way to avoid using a read stream during certain
kinds of queries that are very unlikely to benefit from prefetching:
queries whose plan involves a LIMIT node that consumes tuples from an
index scan (or index-only scan) node.

Testing has shown this to be particularly important with nested loop
joins with a LIMIT on an inner index scan.  This is typical of nested
loop anti-joins, and nested loop semi-joins.

XXX This is still very much a WIP.

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]>
---
 src/include/access/relscan.h             |  2 +
 src/include/nodes/execnodes.h            |  4 ++
 src/backend/access/heap/heapam_handler.c |  5 +++
 src/backend/access/index/genam.c         |  1 +
 src/backend/executor/execProcnode.c      | 50 ++++++++++++++++++++++++
 src/backend/executor/nodeIndexonlyscan.c | 10 +++++
 src/backend/executor/nodeIndexscan.c     | 13 ++++++
 7 files changed, 85 insertions(+)

diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 14782b599..72d517487 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -396,6 +396,8 @@ typedef struct IndexScanDescData
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
 
+	int64		tuples_needed;
+
 	/*
 	 * Counter to request early abort during get_actual_variable_range scans.
 	 * When nonzero, the scan will read at most this many leaf pages before
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7c1b427fb..12dd533c3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1697,6 +1697,7 @@ typedef struct
  *		ScanDesc		   index scan descriptor
  *		Instrument		   local index scan instrumentation
  *		SharedInfo		   parallel worker instrumentation (no leader entry)
+ *		TuplesNeeded	   tuple bound, see ExecSetTupleBound
  *
  *		ReorderQueue	   tuples that need reordering due to re-check
  *		ReachedEnd		   have we fetched all tuples from index already?
@@ -1725,6 +1726,7 @@ typedef struct IndexScanState
 	struct IndexScanDescData *iss_ScanDesc;
 	IndexScanInstrumentation *iss_Instrument;
 	SharedIndexScanInstrumentation *iss_SharedInfo;
+	int64		iss_TuplesNeeded;
 
 	/* These are needed for re-checking ORDER BY expr ordering */
 	pairingheap *iss_ReorderQueue;
@@ -1753,6 +1755,7 @@ typedef struct IndexScanState
  *		ScanDesc		   index scan descriptor
  *		Instrument		   local index scan instrumentation
  *		SharedInfo		   parallel worker instrumentation (no leader entry)
+ *		TuplesNeeded	   tuple bound, see ExecSetTupleBound
  *		TableSlot		   slot for holding tuples fetched from the table
  *		PscanLen		   size of parallel index-only scan descriptor
  *		NameCStringAttNums attnums of name typed columns to pad to NAMEDATALEN
@@ -1775,6 +1778,7 @@ typedef struct IndexOnlyScanState
 	struct IndexScanDescData *ioss_ScanDesc;
 	IndexScanInstrumentation *ioss_Instrument;
 	SharedIndexScanInstrumentation *ioss_SharedInfo;
+	int64		ioss_TuplesNeeded;
 	TupleTableSlot *ioss_TableSlot;
 	Size		ioss_PscanLen;
 	AttrNumber *ioss_NameCStringAttNums;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8d582e3d9..d2e84d51b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -637,9 +637,14 @@ heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction,
 		 * haven't done any heap fetches yet.  We don't want to waste any
 		 * cycles on allocating a read stream until we have a demonstrated
 		 * need for perform heap fetches.
+		 *
+		 * Also avoid prefetching when the core executor passes the scan a
+		 * tuples_needed hint that indicates that the scan is likely to end
+		 * before long.
 		 */
 		if (!hscan->xs_read_stream && priorBatch && scan->MVCCScan &&
 			hscan->xs_blk != InvalidBlockNumber &&	/* for index-only scans */
+			(scan->tuples_needed == -1 || scan->tuples_needed > 20) &&
 			enable_indexscan_prefetch)
 		{
 			Assert(!batchringbuf->prefetchPos.valid);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index d50e3fa71..aa44a0ec2 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_itupdesc = NULL;
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
+	scan->tuples_needed = -1;	/* no limit */
 	scan->xs_read_extremal_only = 0;
 
 	scan->batch_index_opaque_size = 0;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7e40b8525..aaed3506a 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -72,6 +72,7 @@
  */
 #include "postgres.h"
 
+#include "access/relscan.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
 #include "executor/nodeAppend.h"
@@ -840,6 +841,12 @@ ExecShutdownNode_walker(PlanState *node, void *context)
  * Any negative tuples_needed value means "no limit", which should be the
  * default assumption when this is not called at all for a particular node.
  *
+ * Note: for nodes like Sort, tuples_needed is a hard limit -- the node can
+ * stop after producing exactly that many tuples.  For index scans, however,
+ * tuples_needed is only an approximation, because non-index quals may filter
+ * out some tuples.  The actual number of tuples fetched from the index may
+ * need to exceed tuples_needed to satisfy the caller's requirements.
+ *
  * Note: if this is called repeatedly on a plan tree, the exact same set
  * of nodes must be updated with the new limit each time; be careful that
  * only unchanging conditions are tested here.
@@ -977,6 +984,49 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 
 		ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
 	}
+	else if (IsA(child_node, IndexScanState))
+	{
+		/*
+		 * If it is an IndexScan, save the tuples_needed in the state so it
+		 * can be propagated to the IndexScanDesc when the scan is started.
+		 *
+		 * Note: As with Sort, the index scan node is responsible for reacting
+		 * properly to changes to this parameter.
+		 */
+		IndexScanState *isstate = (IndexScanState *) child_node;
+
+		isstate->iss_TuplesNeeded = tuples_needed;
+
+		/* If scan already started, update the IndexScanDesc too */
+		if (isstate->iss_ScanDesc)
+			isstate->iss_ScanDesc->tuples_needed = tuples_needed;
+	}
+	else if (IsA(child_node, IndexOnlyScanState))
+	{
+		/* Same comments as for IndexScan */
+		IndexOnlyScanState *iosstate = (IndexOnlyScanState *) child_node;
+
+		iosstate->ioss_TuplesNeeded = tuples_needed;
+
+		/* If scan already started, update the IndexScanDesc too */
+		if (iosstate->ioss_ScanDesc)
+			iosstate->ioss_ScanDesc->tuples_needed = tuples_needed;
+	}
+	else if (IsA(child_node, NestLoopState))
+	{
+		/*
+		 * For NestLoop joins where each outer tuple produces at most one
+		 * output tuple, we can propagate the bound to the outer child.
+		 */
+		NestLoopState *nlstate = (NestLoopState *) child_node;
+		JoinType	jointype = nlstate->js.jointype;
+
+		if (jointype == JOIN_SEMI || jointype == JOIN_ANTI ||
+			nlstate->js.single_match)
+		{
+			ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
+		}
+	}
 
 	/*
 	 * In principle we could descend through any plan node type that is
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 84bff60ce..f3fc63abe 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -96,6 +96,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		node->ioss_ScanDesc = scandesc;
 		Assert(node->ioss_ScanDesc->xs_want_itup);
 
+		/* Pass down any tuple bound */
+		scandesc->tuples_needed = node->ioss_TuplesNeeded;
+
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
 		 * pass the scankeys to the index AM.
@@ -528,6 +531,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	indexstate->ioss_RuntimeKeysReady = false;
 	indexstate->ioss_RuntimeKeys = NULL;
 	indexstate->ioss_NumRuntimeKeys = 0;
+	indexstate->ioss_TuplesNeeded = -1;
 
 	/*
 	 * build the index scan keys from the index qualification
@@ -704,6 +708,9 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 								 piscan);
 	Assert(node->ioss_ScanDesc->xs_want_itup);
 
+	/* Pass down any tuple bound */
+	node->ioss_ScanDesc->tuples_needed = node->ioss_TuplesNeeded;
+
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
 	 * the scankeys to the index AM.
@@ -769,6 +776,9 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 								 piscan);
 	Assert(node->ioss_ScanDesc->xs_want_itup);
 
+	/* Pass down any tuple bound */
+	node->ioss_ScanDesc->tuples_needed = node->ioss_TuplesNeeded;
+
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
 	 * the scankeys to the index AM.
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 67822947a..3e174cb65 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,6 +115,9 @@ IndexNext(IndexScanState *node)
 
 		node->iss_ScanDesc = scandesc;
 
+		/* Pass down any tuple bound */
+		scandesc->tuples_needed = node->iss_TuplesNeeded;
+
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
 		 * pass the scankeys to the index AM.
@@ -211,6 +214,9 @@ IndexNextWithReorder(IndexScanState *node)
 
 		node->iss_ScanDesc = scandesc;
 
+		/* Pass down any tuple bound */
+		scandesc->tuples_needed = node->iss_TuplesNeeded;
+
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
 		 * pass the scankeys to the index AM.
@@ -986,6 +992,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	indexstate->iss_RuntimeKeysReady = false;
 	indexstate->iss_RuntimeKeys = NULL;
 	indexstate->iss_NumRuntimeKeys = 0;
+	indexstate->iss_TuplesNeeded = -1;
 
 	/*
 	 * build the index scan keys from the index qualification
@@ -1730,6 +1737,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 								 node->iss_NumOrderByKeys,
 								 piscan);
 
+	/* Pass down any tuple bound */
+	node->iss_ScanDesc->tuples_needed = node->iss_TuplesNeeded;
+
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
 	 * the scankeys to the index AM.
@@ -1794,6 +1804,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 								 node->iss_NumOrderByKeys,
 								 piscan);
 
+	/* Pass down any tuple bound */
+	node->iss_ScanDesc->tuples_needed = node->iss_TuplesNeeded;
+
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
 	 * the scankeys to the index AM.
-- 
2.53.0
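
The two decisions this patch adds can be restated as small predicates.  This is a sketch under stated assumptions rather than the executor code: the TOY_* names and toy_* functions are hypothetical.  The threshold of 20 and the comparison against exactly -1 are copied from the heapam_batch_getnext condition in the patch, and the join test follows the new NestLoopState branch of ExecSetTupleBound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NO_BOUND			(-1)	/* value RelationGetIndexScan assigns */
#define TOY_PREFETCH_MIN_TUPLES	20		/* constant used in the patch's check */

/* Hypothetical subset of join types, enough to show the propagation rule */
typedef enum ToyJoinType
{
	TOY_JOIN_INNER,
	TOY_JOIN_LEFT,
	TOY_JOIN_SEMI,
	TOY_JOIN_ANTI
} ToyJoinType;

/*
 * Does the tuple bound hint leave room for prefetching?  Scans with no bound,
 * or with a bound above the threshold, may go on to set up a read stream.
 */
static bool
toy_bound_allows_prefetch(int64_t tuples_needed)
{
	return tuples_needed == TOY_NO_BOUND ||
		tuples_needed > TOY_PREFETCH_MIN_TUPLES;
}

/*
 * Should a nested loop join pass the bound down to its outer child?  Only
 * when each outer tuple can produce at most one output tuple.
 */
static bool
toy_nestloop_propagates_bound(ToyJoinType jointype, bool single_match)
{
	assert(jointype >= TOY_JOIN_INNER && jointype <= TOY_JOIN_ANTI);

	return jointype == TOY_JOIN_SEMI ||
		jointype == TOY_JOIN_ANTI ||
		single_match;
}
```

Under these assumptions a scan feeding a LIMIT of 10 never sets up a read stream, while an unbounded scan or one with a LIMIT of 1000 still can.  As the new ExecSetTupleBound comment says, the bound is only approximate for index scans, so the check is a heuristic and not a guarantee about how many tuples get fetched.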



  [application/octet-stream] v13-0006-Add-interfaces-that-enable-index-prefetching.patch (250.2K, 16-v13-0006-Add-interfaces-that-enable-index-prefetching.patch)
From 7edfba681dd7fbf6716fc6be4eb937b36660f313 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 9 Sep 2025 19:50:03 -0400
Subject: [PATCH v13 06/19] Add interfaces that enable index prefetching.

Add a new amgetbatch index AM interface that allows index access methods
to implement plain/ordered index scans that return index entries in
batches comprising all matching items from an index page, rather than
one match at a time.

This commit also adds a new table AM interface callback, called by the
core executor through the new table_index_getnext_slot shim function.
This allows the table AM to directly manage the progress of index scans
rather than having individual TIDs passed in by the caller one by one.
The amgetbatch interface is tightly coupled with the new approach to
index scans added to the table AM.  The table AM can apply knowledge of
which TIDs will be returned to the scan in the near future to perform
I/O prefetching.  Prefetching will be added by an upcoming commit.
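
As a rough illustration of the division of labor described above, here is
a minimal, self-contained sketch of a table-AM-style driver that pulls
whole batches from an amgetbatch-like callback by passing back the prior
batch.  Every name in it (ToyBatch, toy_getbatch, toy_scan_all) is
hypothetical, and the "index" is just three fake pages of four items each;
it is not the patch's IndexScanBatchData or heapam code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the patch's IndexScanBatchData/IndexScanDesc */
#define TOY_NPAGES			3
#define TOY_ITEMS_PER_PAGE	4

typedef struct ToyBatch
{
	int			page;			/* toy "leaf page" this batch was read from */
	int			firstItem;		/* first valid index in items[] */
	int			lastItem;		/* last valid index in items[] */
	int			items[TOY_ITEMS_PER_PAGE];	/* toy "TIDs" */
} ToyBatch;

static ToyBatch toy_pages[TOY_NPAGES];

/*
 * amgetbatch-like callback: returns the batch that follows priorbatch (or
 * the first batch when priorbatch is NULL), or NULL once the scan is done.
 */
static ToyBatch *
toy_getbatch(ToyBatch *priorbatch)
{
	int			next = (priorbatch == NULL) ? 0 : priorbatch->page + 1;
	ToyBatch   *batch;

	if (next >= TOY_NPAGES)
		return NULL;

	batch = &toy_pages[next];
	batch->page = next;
	batch->firstItem = 0;
	batch->lastItem = TOY_ITEMS_PER_PAGE - 1;
	for (int i = 0; i < TOY_ITEMS_PER_PAGE; i++)
		batch->items[i] = next * 100 + i;
	return batch;
}

/*
 * Table-AM-like driver: it owns the scan position and asks for a new batch
 * only when the current one is exhausted.  Because it sees every item of a
 * batch up front, it could also look ahead to decide what to prefetch.
 */
static int
toy_scan_all(int *out, int outcap)
{
	int			n = 0;
	ToyBatch   *batch = NULL;

	while ((batch = toy_getbatch(batch)) != NULL)
	{
		for (int i = batch->firstItem; i <= batch->lastItem; i++)
		{
			assert(n < outcap);
			out[n++] = batch->items[i];
		}
	}
	return n;
}
```

The point of the shape is only that the caller, not the index AM, decides
when to advance; the real interface additionally takes a scan descriptor
and a scan direction.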

Index access methods that support plain index scans must now implement
either the amgetbatch interface OR the amgettuple interface.  The
amgettuple interface will still be used by index AMs that require direct
control over the progress of index scans (e.g., GiST with KNN ordered
scans).  Almost all existing callers that perform index scans now use
the new table_index_getnext_slot interface, regardless of whether the
underlying index AM uses amgetbatch or amgettuple.

The amgetbatch interface returns batches that hold a buffer pin on an
index page that can be used by the table AM as an interlock against
concurrent TID recycling by VACUUM.  Now heapam only needs to hold on to
such a pin for an instant -- except during scans that use a non-MVCC
snapshot.  Non-MVCC scans continue to need to hold the pin until all of
the batch's TIDs have been fetched from the heap.
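
The pin lifetime rule in the paragraph above can be sketched as follows.
This is a toy model under stated assumptions: ToyPinBatch and both
functions are hypothetical names, the "pin" is a plain bool, and the model
ignores the index-only scan exception discussed further down.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one batch; "pinned" stands in for a buffer pin */
typedef struct ToyPinBatch
{
	bool		pinned;			/* index page pin still held? */
	int			nitems;			/* matching TIDs in this batch */
	int			nfetched;		/* TIDs already fetched from the table */
} ToyPinBatch;

/* Called when the index AM has just returned the batch */
static void
toy_batch_returned(ToyPinBatch *batch, bool mvcc_scan)
{
	batch->pinned = true;
	batch->nfetched = 0;

	/* MVCC snapshot: the interlock is only needed for an instant */
	if (mvcc_scan)
		batch->pinned = false;
}

/* Called each time one of the batch's TIDs has been fetched */
static void
toy_item_fetched(ToyPinBatch *batch, bool mvcc_scan)
{
	assert(batch->nfetched < batch->nitems);
	batch->nfetched++;

	/* Non-MVCC snapshot: keep the pin until the last TID was fetched */
	if (!mvcc_scan && batch->nfetched == batch->nitems)
		batch->pinned = false;
}
```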

This extends the dropPin mechanism added to nbtree by commit 2ed5b87f,
and generalizes it to work with all index AMs that support the new
amgetbatch interface.  We can always safely drop index page pins
eagerly, provided the scan uses an MVCC snapshot (unlike the nbtree
dropPin optimization, which had a couple of additional restrictions).

An upcoming commit that will add index prefetching will use a read
stream to read heap pages during index scans.  Read stream is careful to
limit how many things it pins, lest we run into problems due to having
too many buffers pinned.  Simply never holding on to index page buffer
pins greatly simplifies resource management for index prefetching;
there's no risk of unintended interactions between the read stream and
index AM.  The only downside is that we cannot support prefetching
during scans that use a non-MVCC snapshot, which seems quite acceptable.

In practice, heapam doesn't drop each batch's index page buffer pin at
the earliest opportunity during index-only scans.  This was deemed
necessary to avoid regressing index-only scans with a LIMIT, in
particular with nestloop anti-joins and nestloop semi-joins; eagerly
loading all the visibility information up front regressed such queries.
The new amgetbatch interface gives table AMs the authority to decide
when and where to drop index page pins, so this can be considered a
heapam implementation detail (index AMs don't need to know about it).
This scheme still allows index prefetching to consistently hold no more
than one batch index page pin at a time, even when an index-only scan
(that must perform some heap fetches) holds open several index batches
at once in order to maintain an adequate prefetch distance.
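
The "several index batches held open at once" are tracked in a ring buffer
(BatchRingBuffer in relscan.h, with uint8 headBatch/nextBatch indexes and
an INDEX_SCAN_MAX_BATCHES of 64).  A minimal sketch of how such a ring
could work is shown below.  The free-running uint8 counters reduced modulo
64 are an assumption made for illustration, as are all of the toy_* names;
the patch's actual index arithmetic may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BATCHES 64		/* mirrors INDEX_SCAN_MAX_BATCHES */

/* Hypothetical ring of open batches; NOT the patch's BatchRingBuffer */
typedef struct ToyRing
{
	uint8_t		head;			/* oldest batch still held open */
	uint8_t		next;			/* next free slot */
	void	   *batches[TOY_MAX_BATCHES];
} ToyRing;

/* Number of open batches; uint8 wraparound makes the subtraction safe */
static int
toy_ring_count(const ToyRing *ring)
{
	return (uint8_t) (ring->next - ring->head);
}

static bool
toy_ring_full(const ToyRing *ring)
{
	return toy_ring_count(ring) == TOY_MAX_BATCHES;
}

/* Append a newly read batch; fails when the ring is full */
static bool
toy_ring_append(ToyRing *ring, void *batch)
{
	if (toy_ring_full(ring))
		return false;
	ring->batches[ring->next % TOY_MAX_BATCHES] = batch;
	ring->next++;
	return true;
}

/* Drop the oldest batch once the scan has moved past it */
static void *
toy_ring_pop_head(ToyRing *ring)
{
	void	   *batch;

	if (toy_ring_count(ring) == 0)
		return NULL;
	batch = ring->batches[ring->head % TOY_MAX_BATCHES];
	ring->head++;
	return batch;
}
```

Because 256 is a multiple of 64, letting the uint8 counters wrap does not
disturb the slot mapping in this sketch.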

Author: Tomas Vondra <[email protected]>
Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/[email protected]
Discussion: https://postgr.es/m/efac3238-6f34-41ea-a393-26cc0441b506%40vondra.me
---
 src/include/access/amapi.h                    |  27 +-
 src/include/access/genam.h                    |  31 +-
 src/include/access/heapam.h                   |  27 +-
 src/include/access/nbtree.h                   | 188 ++---
 src/include/access/relscan.h                  | 326 +++++++-
 src/include/access/tableam.h                  |  65 ++
 src/include/executor/instrument_node.h        |   6 +
 src/include/nodes/execnodes.h                 |   2 -
 src/include/nodes/pathnodes.h                 |   6 +-
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/ginget.c               |   6 +-
 src/backend/access/gin/ginutil.c              |   6 +-
 src/backend/access/gist/gist.c                |   6 +-
 src/backend/access/hash/hash.c                |   5 +-
 src/backend/access/heap/heapam_handler.c      | 691 +++++++++++++++-
 src/backend/access/index/Makefile             |   3 +-
 src/backend/access/index/amapi.c              |   5 +
 src/backend/access/index/genam.c              |  18 +-
 src/backend/access/index/indexam.c            | 169 ++--
 src/backend/access/index/indexbatch.c         | 752 ++++++++++++++++++
 src/backend/access/index/meson.build          |   1 +
 src/backend/access/nbtree/README              |  66 +-
 src/backend/access/nbtree/nbtpage.c           |  13 +-
 src/backend/access/nbtree/nbtreadpage.c       | 193 +++--
 src/backend/access/nbtree/nbtree.c            | 470 ++++++-----
 src/backend/access/nbtree/nbtsearch.c         | 542 ++++++-------
 src/backend/access/nbtree/nbtutils.c          | 245 ------
 src/backend/access/nbtree/nbtxlog.c           |   6 +-
 src/backend/access/spgist/spgutils.c          |   6 +-
 src/backend/commands/explain.c                |  23 +-
 src/backend/commands/indexcmds.c              |   2 +-
 src/backend/executor/execAmi.c                |   2 +-
 src/backend/executor/execIndexing.c           |   6 +-
 src/backend/executor/execReplication.c        |   8 +-
 src/backend/executor/nodeBitmapIndexscan.c    |   1 +
 src/backend/executor/nodeIndexonlyscan.c      | 108 +--
 src/backend/executor/nodeIndexscan.c          |  13 +-
 src/backend/executor/nodeMergejoin.c          |   4 +-
 src/backend/optimizer/path/indxpath.c         |   6 +-
 src/backend/optimizer/util/plancat.c          |   8 +-
 src/backend/replication/logical/relation.c    |   3 +-
 src/backend/utils/adt/amutils.c               |   8 +-
 src/backend/utils/adt/selfuncs.c              |  61 +-
 contrib/amcheck/verify_nbtree.c               |   2 +-
 contrib/bloom/blutils.c                       |   6 +-
 doc/src/sgml/indexam.sgml                     | 436 ++++++++--
 doc/src/sgml/ref/create_table.sgml            |  13 +-
 .../modules/dummy_index_am/dummy_index_am.c   |   6 +-
 src/tools/pgindent/typedefs.list              |  10 +-
 49 files changed, 3118 insertions(+), 1495 deletions(-)
 create mode 100644 src/backend/access/index/indexbatch.c
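
Reader's note on the batch memory layout documented above
IndexScanBatchData in relscan.h (table AM opaque area, then index AM
opaque area, then the batch itself): the negative-offset convention used
by heap_batch_data() and BTBatchGetData() can be tried out in isolation
with the sketch below.  It is only an approximation under stated
assumptions: all Toy* names are hypothetical, alignment uses C11
_Alignof(max_align_t) in place of MAXALIGN, and the trailing per-item data
and currTuples workspace are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round len up to the strictest fundamental alignment (stand-in for MAXALIGN) */
#define TOY_ALIGN(len) \
	(((len) + _Alignof(max_align_t) - 1) & ~(_Alignof(max_align_t) - 1))

/* Hypothetical opaque areas and batch header; NOT the patch's structs */
typedef struct ToyTableOpaque { uint8_t *visInfo; } ToyTableOpaque;
typedef struct ToyIndexOpaque { int currPage; } ToyIndexOpaque;
typedef struct ToyBatch
{
	int			firstItem;
	int			lastItem;
	int			items[];		/* flexible array, like items[] in the patch */
} ToyBatch;

typedef struct ToyScan
{
	size_t		index_opaque_size;	/* aligned size of index AM area */
	size_t		table_offset;	/* distance from batch back to table AM area */
} ToyScan;

/* One allocation: [table opaque][index opaque][ToyBatch][items...] */
static ToyBatch *
toy_batch_alloc(ToyScan *scan, int maxitems)
{
	char	   *base;

	scan->index_opaque_size = TOY_ALIGN(sizeof(ToyIndexOpaque));
	scan->table_offset = scan->index_opaque_size +
		TOY_ALIGN(sizeof(ToyTableOpaque));

	base = calloc(1, scan->table_offset + sizeof(ToyBatch) +
				  (size_t) maxitems * sizeof(int));
	if (base == NULL)
		return NULL;
	return (ToyBatch *) (base + scan->table_offset);
}

/* Accessors step backwards from the batch pointer by a fixed offset */
static ToyTableOpaque *
toy_table_opaque(ToyScan *scan, ToyBatch *batch)
{
	return (ToyTableOpaque *) ((char *) batch - scan->table_offset);
}

static ToyIndexOpaque *
toy_index_opaque(ToyScan *scan, ToyBatch *batch)
{
	return (ToyIndexOpaque *) ((char *) batch - scan->index_opaque_size);
}

/* The pointer handed to free() must be the start of the allocation */
static void
toy_batch_free(ToyScan *scan, ToyBatch *batch)
{
	free((char *) batch - scan->table_offset);
}
```

Keeping both opaque areas in front of the batch lets either AM reach its
private state from the same pointer without knowing how large the other
AM's area is beyond a single stored offset.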

diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index ecfbd017d..b8c247caf 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -198,6 +198,19 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 									 ScanDirection direction);
 
+/* next batch of valid tuples */
+typedef IndexScanBatch (*amgetbatch_function) (IndexScanDesc scan,
+											   IndexScanBatch priorbatch,
+											   ScanDirection direction);
+
+/* mark dead items in index page */
+typedef void (*amkillitemsbatch_function) (IndexScanDesc scan,
+										   IndexScanBatch batch);
+
+/* release batch resources held to prevent concurrent TID recycling */
+typedef void (*amreleasebatch_function) (IndexScanDesc scan,
+										 IndexScanBatch batch);
+
 /* fetch all valid tuples */
 typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 									   TIDBitmap *tbm);
@@ -205,11 +218,9 @@ typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 /* end index scan */
 typedef void (*amendscan_function) (IndexScanDesc scan);
 
-/* mark current scan position */
-typedef void (*ammarkpos_function) (IndexScanDesc scan);
-
-/* restore marked scan position */
-typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* invalidate index AM state that independently tracks scan's position */
+typedef void (*amposreset_function) (IndexScanDesc scan,
+									 IndexScanBatch batch);
 
 /*
  * Callback function signatures - for parallel index scans.
@@ -309,10 +320,12 @@ typedef struct IndexAmRoutine
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
 	amgettuple_function amgettuple; /* can be NULL */
+	amgetbatch_function amgetbatch; /* can be NULL */
+	amkillitemsbatch_function amkillitemsbatch; /* can be NULL */
+	amreleasebatch_function amreleasebatch;
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
-	ammarkpos_function ammarkpos;	/* can be NULL */
-	amrestrpos_function amrestrpos; /* can be NULL */
+	amposreset_function amposreset; /* can be NULL */
 
 	/* interface functions to support parallel index scans */
 	amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 4c0429cc6..849075ac3 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -94,6 +94,7 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 
 /* struct definitions appear in relscan.h */
 typedef struct IndexScanDescData *IndexScanDesc;
+typedef struct IndexScanBatchData *IndexScanBatch;
 typedef struct SysScanDescData *SysScanDesc;
 
 typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
@@ -154,6 +155,7 @@ extern void index_insert_cleanup(Relation indexRelation,
 
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
+									 bool xs_want_itup,
 									 Snapshot snapshot,
 									 IndexScanInstrumentation *instrument,
 									 int nkeys, int norderbys);
@@ -180,14 +182,12 @@ extern void index_parallelscan_initialize(Relation heapRelation,
 extern void index_parallelrescan(IndexScanDesc scan);
 extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel,
+											  bool xs_want_itup,
 											  IndexScanInstrumentation *instrument,
 											  int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
-extern bool index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot);
-extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-							   TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
@@ -251,4 +251,29 @@ extern void systable_inplace_update_begin(Relation relation,
 extern void systable_inplace_update_finish(void *state, HeapTuple tuple);
 extern void systable_inplace_update_cancel(void *state);
 
+/*
+ * amgetbatch utilities called by indexam.c (in indexbatch.c)
+ */
+extern void index_batchscan_init(IndexScanDesc scan);
+extern void index_batchscan_reset(IndexScanDesc scan);
+extern void index_batchscan_end(IndexScanDesc scan);
+extern void index_batchscan_mark_pos(IndexScanDesc scan);
+extern void index_batchscan_restore_pos(IndexScanDesc scan);
+
+/*
+ * amgetbatch utilities called by table AMs (in indexbatch.c)
+ */
+extern void tableam_util_batch_dirchange(IndexScanDesc scan);
+extern void tableam_util_kill_scanpositem(IndexScanDesc scan);
+extern void tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch);
+extern void tableam_util_release_batch(IndexScanDesc scan, IndexScanBatch batch);
+
+/*
+ * amgetbatch utilities called by index AMs (in indexbatch.c)
+ */
+extern void indexam_util_batch_unlock(IndexScanDesc scan, IndexScanBatch batch,
+									  Buffer buf);
+extern IndexScanBatch indexam_util_batch_alloc(IndexScanDesc scan);
+extern void indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch);
+
 #endif							/* GENAM_H */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a859c90f4..136019925 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -118,9 +118,34 @@ typedef struct IndexFetchHeapData
 
 	Buffer		xs_cbuf;		/* current heap buffer in scan, if any */
 	BlockNumber xs_blk;			/* xs_cbuf's block number, if any */
-	/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+
+	/* For index-only scans that must access the visibility map */
+	Buffer		xs_vmbuf;		/* visibility map buffer */
+	int			xs_vm_items;	/* # items to resolve visibility info for */
+
+	bool		xs_lastinblock; /* last TID on this block in current batch? */
+
+	/* NB: if xs_cbuf or xs_vmbuf is not InvalidBuffer, we hold a pin on it */
 } IndexFetchHeapData;
 
+/*
+ * Per-batch data private to the heap table AM.
+ *
+ * Stored at a negative offset from the IndexScanBatch pointer, in the
+ * table AM opaque area of each batch allocation.
+ */
+typedef struct HeapBatchData
+{
+	uint8	   *visInfo;		/* per-item visibility flags, or NULL */
+} HeapBatchData;
+
+/* Access the heap-private per-batch data from an IndexScanBatch pointer */
+static inline HeapBatchData *
+heap_batch_data(IndexScanBatch batch, IndexScanDesc scan)
+{
+	return (HeapBatchData *) ((char *) batch - scan->batch_table_offset);
+}
+
 /* Result codes for HeapTupleSatisfiesVacuum */
 typedef enum
 {
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index da7503c57..fedc4b4e3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -924,111 +924,27 @@ typedef struct BTVacuumPostingData
 
 typedef BTVacuumPostingData *BTVacuumPosting;
 
+/* Per-batch data private to the btree index AM */
+typedef struct BTBatchData
+{
+	Buffer		buf;			/* index page buffer pin (TID reuse interlock) */
+	BlockNumber currPage;		/* index page with matching items */
+	BlockNumber prevPage;		/* currPage's left sibling */
+	BlockNumber nextPage;		/* currPage's right sibling */
+	bool		moreLeft;		/* more matching pages to the left? */
+	bool		moreRight;		/* more matching pages to the right? */
+} BTBatchData;
+
 /*
- * BTScanOpaqueData is the btree-private state needed for an indexscan.
- * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
- * details of the preprocessing), information about the current location
- * of the scan, and information about the marked location, if any.  (We use
- * BTScanPosData to represent the data needed for each of current and marked
- * locations.)	In addition we can remember some known-killed index entries
- * that must be marked before we can move off the current page.
- *
- * Index scans work a page at a time: we pin and read-lock the page, identify
- * all the matching items on the page and save them in BTScanPosData, then
- * release the read-lock while returning the items to the caller for
- * processing.  This approach minimizes lock/unlock traffic.  We must always
- * drop the lock to make it okay for caller to process the returned items.
- * Whether or not we can also release the pin during this window will vary.
- * We drop the pin (when so->dropPin) to avoid blocking progress by VACUUM
- * (see nbtree/README section about making concurrent TID recycling safe).
- * We'll always release both the lock and the pin on the current page before
- * moving on to its sibling page.
- *
- * If we are doing an index-only scan, we save the entire IndexTuple for each
- * matched item, otherwise only its heap TID and offset.  The IndexTuples go
- * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.  Posting list tuples store a "base" tuple once,
- * allowing the same key to be returned for each TID in the posting list
- * tuple.
+ * Access the btree-private per-batch data from an IndexScanBatch pointer.
+ * This follows the standard convention for index AM opaque state: it can be
+ * found at a fixed negative offset from the IndexScanBatch pointer.
  */
-
-typedef struct BTScanPosItem	/* what we remember about each match */
+static inline BTBatchData *
+BTBatchGetData(IndexScanBatch batch)
 {
-	ItemPointerData heapTid;	/* TID of referenced heap item */
-	OffsetNumber indexOffset;	/* index item's location within page */
-	LocationIndex tupleOffset;	/* IndexTuple's offset in workspace, if any */
-} BTScanPosItem;
-
-typedef struct BTScanPosData
-{
-	Buffer		buf;			/* currPage buf (invalid means unpinned) */
-
-	/* page details as of the saved position's call to _bt_readpage */
-	BlockNumber currPage;		/* page referenced by items array */
-	BlockNumber prevPage;		/* currPage's left link */
-	BlockNumber nextPage;		/* currPage's right link */
-	XLogRecPtr	lsn;			/* currPage's LSN (when so->dropPin) */
-
-	/* scan direction for the saved position's call to _bt_readpage */
-	ScanDirection dir;
-
-	/*
-	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
-	 */
-	int			nextTupleOffset;
-
-	/*
-	 * moreLeft and moreRight track whether we think there may be matching
-	 * index entries to the left and right of the current page, respectively.
-	 */
-	bool		moreLeft;
-	bool		moreRight;
-
-	/*
-	 * The items array is always ordered in index order (ie, increasing
-	 * indexoffset).  When scanning backwards it is convenient to fill the
-	 * array back-to-front, so we start at the last slot and fill downwards.
-	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
-	 * itemIndex is a cursor showing which entry was last returned to caller.
-	 */
-	int			firstItem;		/* first valid index in items[] */
-	int			lastItem;		/* last valid index in items[] */
-	int			itemIndex;		/* current index in items[] */
-
-	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
-} BTScanPosData;
-
-typedef BTScanPosData *BTScanPos;
-
-#define BTScanPosIsPinned(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BufferIsValid((scanpos).buf) \
-)
-#define BTScanPosUnpin(scanpos) \
-	do { \
-		ReleaseBuffer((scanpos).buf); \
-		(scanpos).buf = InvalidBuffer; \
-	} while (0)
-#define BTScanPosUnpinIfPinned(scanpos) \
-	do { \
-		if (BTScanPosIsPinned(scanpos)) \
-			BTScanPosUnpin(scanpos); \
-	} while (0)
-
-#define BTScanPosIsValid(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BlockNumberIsValid((scanpos).currPage) \
-)
-#define BTScanPosInvalidate(scanpos) \
-	do { \
-		(scanpos).buf = InvalidBuffer; \
-		(scanpos).currPage = InvalidBlockNumber; \
-	} while (0)
+	return (BTBatchData *) ((char *) batch - MAXALIGN(sizeof(BTBatchData)));
+}
 
 /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
 typedef struct BTArrayKeyInfo
@@ -1050,6 +966,28 @@ typedef struct BTArrayKeyInfo
 	ScanKey		high_compare;	/* array's < or <= upper bound */
 } BTArrayKeyInfo;
 
+/*
+ * BTScanOpaqueData is the btree-private state needed for an indexscan.
+ * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
+ * details of the preprocessing), and information about the current array
+ * keys.  There are assumptions about how the current array keys track the
+ * progress of the index scan through the index's key space (see _bt_readpage
+ * and _bt_advance_array_keys), but we don't actually track anything about the
+ * current scan position in this opaque struct.
+ *
+ * Index scans work a page at a time, as required by the amgetbatch contract:
+ * we pin and read-lock the page, identify all the matching items on the page
+ * and return them in a newly allocated batch.  We then release the read-lock
+ * using amgetbatch utility routines.  This approach minimizes lock/unlock
+ * traffic.  _bt_next is passed priorbatch, which contains details of which
+ * page is next in line to be read (priorbatch is provided as an argument to
+ * btgetbatch by core code).
+ *
+ * If we are doing an index-only scan, we save the entire IndexTuple for each
+ * matched item, otherwise only its heap TID and offset.  This is also per the
+ * amgetbatch contract.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each TID in the posting list.
+ */
 typedef struct BTScanOpaqueData
 {
 	/* these fields are set by _bt_preprocess_keys(): */
@@ -1066,32 +1004,6 @@ typedef struct BTScanOpaqueData
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	FmgrInfo   *orderProcs;		/* ORDER procs for required equality keys */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
-
-	/* info about killed items if any (killedItems is NULL if never used) */
-	int		   *killedItems;	/* currPos.items indexes of killed items */
-	int			numKilled;		/* number of currently stored items */
-	bool		dropPin;		/* drop leaf pin before btgettuple returns? */
-
-	/*
-	 * If we are doing an index-only scan, these are the tuple storage
-	 * workspaces for the currPos and markPos respectively.  Each is of size
-	 * BLCKSZ, so it can hold as much as a full page's worth of tuples.
-	 */
-	char	   *currTuples;		/* tuple storage for currPos */
-	char	   *markTuples;		/* tuple storage for markPos */
-
-	/*
-	 * If the marked position is on the same page as current position, we
-	 * don't use markPos, but just keep the marked itemIndex in markItemIndex
-	 * (all the rest of currPos is valid for the mark position). Hence, to
-	 * determine if there is a mark, first look at markItemIndex, then at
-	 * markPos.
-	 */
-	int			markItemIndex;	/* itemIndex, or -1 if not valid */
-
-	/* keep these last in struct for efficiency */
-	BTScanPosData currPos;		/* current position data */
-	BTScanPosData markPos;		/* marked position, if any */
 } BTScanOpaqueData;
 
 typedef BTScanOpaqueData *BTScanOpaque;
@@ -1160,14 +1072,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
 extern Size btestimateparallelscan(Relation rel, int nkeys, int norderbys);
 extern void btinitparallelscan(void *target);
-extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch btgetbatch(IndexScanDesc scan,
+								 IndexScanBatch priorbatch,
+								 ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					 ScanKey orderbys, int norderbys);
+extern void btkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch);
+extern void btreleasebatch(IndexScanDesc scan, IndexScanBatch batch);
 extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
-extern void btmarkpos(IndexScanDesc scan);
-extern void btrestrpos(IndexScanDesc scan);
+extern void btposreset(IndexScanDesc scan, IndexScanBatch batch);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
@@ -1271,8 +1186,9 @@ extern void _bt_preprocess_keys(IndexScanDesc scan);
 /*
  * prototypes for functions in nbtreadpage.c
  */
-extern bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
-						 OffsetNumber offnum, bool firstpage);
+extern bool _bt_readpage(IndexScanDesc scan, IndexScanBatch newbatch,
+						 ScanDirection dir, OffsetNumber offnum,
+						 bool firstpage);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
 extern int	_bt_binsrch_array_skey(FmgrInfo *orderproc,
 								   bool cur_elem_trig, ScanDirection dir,
@@ -1287,15 +1203,15 @@ extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access, bool returnstack);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
-extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _bt_first(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _bt_next(IndexScanDesc scan, ScanDirection dir,
+							   IndexScanBatch priorbatch);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
 
 /*
  * prototypes for functions in nbtutils.c
  */
 extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
-extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
 extern void _bt_end_vacuum(Relation rel);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce340c076..ede5e6aa3 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,8 +16,10 @@
 
 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "access/sdir.h"
 #include "nodes/tidbitmap.h"
 #include "port/atomics.h"
+#include "storage/buf.h"
 #include "storage/relfilelocator.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
@@ -122,8 +124,177 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 typedef struct IndexFetchTableData
 {
 	Relation	rel;
+
+	/* Table AM per-batch opaque area size (MAXALIGN'd), set by AM */
+	uint16		batch_opaque_size;
+
+	/* Per-item trailing data size in each batch */
+	uint16		batch_per_item_size;
 } IndexFetchTableData;
 
+/*
+ * Location of a BatchMatchingItem that appears in an IndexScanBatch returned
+ * by (and subsequently passed to) an amgetbatch routine
+ */
+typedef struct BatchRingItemPos
+{
+	/* Position references a valid BatchRingBuffer.batches[] entry? */
+	bool		valid;
+
+	/* BatchRingBuffer.batches[]-wise index to relevant IndexScanBatch */
+	uint8		batch;
+
+	/* IndexScanBatch.items[]-wise index to relevant BatchMatchingItem */
+	int			item;
+
+} BatchRingItemPos;
+
+/*
+ * Matching item returned by amgetbatch (in returned IndexScanBatch) during an
+ * index scan.  Used by table AM to locate relevant matching table tuple.
+ */
+typedef struct BatchMatchingItem
+{
+	ItemPointerData tableTid;	/* TID of referenced table item */
+	OffsetNumber indexOffset;	/* index item's location within page */
+	LocationIndex tupleOffset;	/* IndexTuple's offset in workspace, if any */
+} BatchMatchingItem;
+
+/*
+ * Per-item visibility flags for index-only scans.  Stored in a separate
+ * array owned by the table AM (e.g., heapam's HeapBatchData.visInfo) rather
+ * than in BatchMatchingItem, to keep the hot items array compact.
+ */
+#define BATCH_VIS_CHECKED		0x01	/* checked item in VM? */
+#define BATCH_VIS_ALL_VISIBLE	0x02	/* block is known all-visible? */
+
+/*
+ * Data about one batch of items returned by (and passed to) amgetbatch during
+ * index scans.
+ *
+ * Each batch allocation has the following memory layout:
+ *
+ *   [table AM opaque area]    <- at -(batch_table_offset) from batch ptr
+ *   [index AM opaque area]    <- at -(batch_index_opaque_size) from batch ptr
+ *   [IndexScanBatchData]      <- the returned pointer
+ *   [items[maxitemsbatch]]
+ *   [table AM trailing data]  <- e.g. per-item visibility flags
+ *   [currTuples workspace]    <- sized by index AM (batch_tuples_workspace)
+ *
+ * The AM-specific opaque areas are accessed via accessor functions defined by
+ * each table AM and index AM that supports the batch interfaces.
+ */
+typedef struct IndexScanBatchData
+{
+	XLogRecPtr	lsn;			/* index page's LSN */
+
+	/* scan direction when the index page was read */
+	ScanDirection dir;
+
+	/*
+	 * knownEndBackward and knownEndForward are set by the table AM to
+	 * indicate that this batch is the last one with matching items in the
+	 * relevant scan direction.  When amgetbatch returns NULL for a given
+	 * direction, the table AM sets the corresponding flag on the priorbatch
+	 * that was passed to that call.  We cannot know this when a batch is
+	 * first returned by amgetbatch; it only becomes apparent when we try and
+	 * fail to continue the scan past it.
+	 *
+	 * This allows table AMs to avoid redundant amgetbatch calls with the same
+	 * priorbatch -- the index AM might need to read additional index pages to
+	 * determine there are no more matching items beyond caller's priorbatch.
+	 */
+	bool		knownEndBackward;
+	bool		knownEndForward;
+
+	/*
+	 * Matching items state for this batch.  Output by index AM for table AM.
+	 *
+	 * The items array is always ordered in index order (ie, by increasing
+	 * indexoffset).  When scanning backwards it is convenient for index AMs
+	 * to fill the array back-to-front, so we start at the last item slot and
+	 * fill downwards.  This is why we need both a first-valid-entry and a
+	 * last-valid-entry counter.
+	 *
+	 * Note: these are signed because it's sometimes convenient to use -1 to
+	 * represent an out-of-bounds space just before firstItem (when it's 0).
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+
+	/* info about dead items if any (deadItems is NULL if never used) */
+	int			numDead;		/* number of currently stored items */
+	int		   *deadItems;		/* indexes of dead items */
+
+	/*
+	 * If we are doing an index-only scan, this is the tuple storage workspace
+	 * for the matching tuples (tuples referenced by items[]).  The workspace
+	 * size is determined by the index AM (batch_tuples_workspace).
+	 *
+	 * currTuples points into the trailing portion of this allocation, past
+	 * items[] and any table AM trailing data.  It is NULL for plain index
+	 * scans.
+	 */
+	char	   *currTuples;		/* tuple storage for items[] */
+	BatchMatchingItem items[FLEXIBLE_ARRAY_MEMBER]; /* matching items */
+} IndexScanBatchData;
+
+typedef struct IndexScanBatchData *IndexScanBatch;
+
+/*
+ * Maximum number of batches (leaf pages) we can keep in memory.  We need a
+ * minimum of two, since we'll only consider releasing one batch when another
+ * is read.
+ *
+ * The current maximum of 64 batches is a somewhat arbitrary limit.  Very few
+ * scans ever get near this limit in practice.
+ */
+#define INDEX_SCAN_MAX_BATCHES		64
+#define INDEX_SCAN_CACHE_BATCHES	2
+
+/*
+ * State used by table AMs to manage an index scan that uses the amgetbatch
+ * interface.  Scans use a ring buffer of batches returned by amgetbatch.
+ *
+ * Batches are kept in the order in which amgetbatch returned them, since
+ * that is also the order in which table_index_getnext_slot will return
+ * matches.  However, table AMs are free to fetch table tuples in whatever
+ * order is most convenient/efficient -- provided that such reordering cannot
+ * affect the order in which table_index_getnext_slot later returns tuples.
+ */
+typedef struct BatchRingBuffer
+{
+	/* current positions in batches[] for scan */
+	BatchRingItemPos scanPos;	/* scan's read position */
+	BatchRingItemPos markPos;	/* mark/restore position */
+
+	IndexScanBatch markBatch;
+
+	/*
+	 * headBatch is an index to the earliest still-valid batch in 'batches'.
+	 * In practice this must be the scan's current scanPos batch (scanBatch).
+	 */
+	uint8		headBatch;
+
+	/*
+	 * nextBatch is an index to the next empty batch slot in 'batches'.  This
+	 * is only actually usable when the scan is !index_scan_batch_full().
+	 */
+	uint8		nextBatch;
+
+	/*
+	 * Should indexam_util_batch_release save caller's batch in cache[]?
+	 */
+	bool		done;
+
+	/* Array of pointers to cached recyclable batches */
+	IndexScanBatch cache[INDEX_SCAN_CACHE_BATCHES];
+
+	/* Array of pointers to ring buffer batches */
+	IndexScanBatch batches[INDEX_SCAN_MAX_BATCHES];
+
+} BatchRingBuffer;
+
 struct IndexScanInstrumentation;
 
 /*
@@ -141,6 +312,20 @@ typedef struct IndexScanDescData
 	int			numberOfOrderBys;	/* number of ordering operators */
 	struct ScanKeyData *keyData;	/* array of index qualifier descriptors */
 	struct ScanKeyData *orderByData;	/* array of ordering op descriptors */
+
+	/* index access method's private state */
+	void	   *opaque;			/* access-method-specific info */
+
+	/* table access method's private amgetbatch state */
+	BatchRingBuffer batchringbuf;	/* amgetbatch related state */
+
+	bool		usebatchring;	/* scan uses amgetbatch/batchringbuf? */
+	bool		batchImmediateRelease;	/* AM releases batch resources in
+										 * indexam_util_batch_unlock? */
+
+	/* Cached batch for amgetbitmap callers (avoids repeated alloc/free) */
+	IndexScanBatch xs_bitmap_batch;
+
 	bool		xs_want_itup;	/* caller requests index tuples */
 	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
 
@@ -149,9 +334,8 @@ typedef struct IndexScanDescData
 	bool		ignore_killed_tuples;	/* do not return killed entries */
 	bool		xactStartedInRecovery;	/* prevents killing/seeing killed
 										 * tuples */
-
-	/* index access method's private state */
-	void	   *opaque;			/* access-method-specific info */
+	/* xs_snapshot uses an MVCC snapshot? */
+	bool		MVCCScan;
 
 	/*
 	 * Instrumentation counters maintained by all index AMs during both
@@ -160,10 +344,10 @@ typedef struct IndexScanDescData
 	struct IndexScanInstrumentation *instrument;
 
 	/*
-	 * In an index-only scan, a successful amgettuple call must fill either
-	 * xs_itup (and xs_itupdesc) or xs_hitup (and xs_hitupdesc) to provide the
-	 * data returned by the scan.  It can fill both, in which case the heap
-	 * format will be used.
+	 * In an index-only scan, a successful table_index_getnext_slot call must
+	 * fill either xs_itup (and xs_itupdesc) or xs_hitup (and xs_hitupdesc) to
+	 * provide the data returned by the scan.  It can fill both, in which case
+	 * the heap format will be used.
 	 */
 	IndexTuple	xs_itup;		/* index tuple returned by AM */
 	struct TupleDescData *xs_itupdesc;	/* rowtype descriptor of xs_itup */
@@ -176,6 +360,14 @@ typedef struct IndexScanDescData
 	IndexFetchTableData *xs_heapfetch;
 
 	bool		xs_recheck;		/* T means scan keys must be rechecked */
+	uint16		maxitemsbatch;	/* set by ambeginscan when amgetbatch used */
+
+	/* Per-batch opaque area sizes, set by index AM in ambeginscan */
+	uint16		batch_index_opaque_size;	/* MAXALIGN'd index AM opaque size */
+	uint16		batch_tuples_workspace; /* currTuples workspace size */
+
+	/* Computed offset from batch pointer to table AM opaque (includes both) */
+	uint16		batch_table_offset;
 
 	/*
 	 * When fetching with an ordering operator, the values of the ORDER BY
@@ -215,4 +407,124 @@ typedef struct SysScanDescData
 	struct TupleTableSlot *slot;
 } SysScanDescData;
 
+/*
+ * Return the true allocation base of a batch (accounting for AM opaque areas
+ * stored before the IndexScanBatchData pointer).
+ */
+static inline void *
+batch_alloc_base(IndexScanBatch batch, IndexScanDescData *scan)
+{
+	return (char *) batch - scan->batch_table_offset;
+}
+
+/*
+ * Count how many batches are currently loaded in the ring buffer.
+ */
+static inline uint8
+index_scan_batch_count(IndexScanDescData *scan)
+{
+	return (uint8) (scan->batchringbuf.nextBatch -
+					scan->batchringbuf.headBatch);
+}
+
+/*
+ * Did we already load batch with the requested index?
+ */
+static inline bool
+index_scan_batch_loaded(IndexScanDescData *scan, uint8 idx)
+{
+	return (int8) (idx - scan->batchringbuf.headBatch) >= 0 &&
+		(int8) (idx - scan->batchringbuf.nextBatch) < 0;
+}
+
+/*
+ * Have we loaded the maximum number of batches?
+ */
+static inline bool
+index_scan_batch_full(IndexScanDescData *scan)
+{
+	return index_scan_batch_count(scan) == INDEX_SCAN_MAX_BATCHES;
+}
+
+/*
+ * Return batch for the provided index.
+ */
+static inline IndexScanBatch
+index_scan_batch(IndexScanDescData *scan, uint8 idx)
+{
+	Assert(index_scan_batch_loaded(scan, idx));
+
+	return scan->batchringbuf.batches[idx & (INDEX_SCAN_MAX_BATCHES - 1)];
+}
+
+/*
+ * Append given batch to scan's batch ring buffer.
+ */
+static inline void
+index_scan_batch_append(IndexScanDescData *scan, IndexScanBatch batch)
+{
+	BatchRingBuffer *ringbuf = &scan->batchringbuf;
+	uint8		nextBatch = ringbuf->nextBatch;
+
+	ringbuf->batches[nextBatch & (INDEX_SCAN_MAX_BATCHES - 1)] = batch;
+	ringbuf->nextBatch++;
+}
+
+/*
+ * Advance position to its next item in the batch.
+ *
+ * Advance to the next item within the provided batch (or to the previous item,
+ * when scanning backwards).
+ *
+ * Returns true if the position could be advanced.  Returns false when there
+ * are no more items in the batch in the given direction.
+ */
+static inline bool
+index_scan_pos_advance(ScanDirection direction,
+					   IndexScanBatch batch, BatchRingItemPos *pos)
+{
+	Assert(pos->valid);
+
+	if (ScanDirectionIsForward(direction))
+	{
+		if (++pos->item > batch->lastItem)
+			return false;
+	}
+	else						/* ScanDirectionIsBackward */
+	{
+		if (--pos->item < batch->firstItem)
+			return false;
+	}
+
+	/* Advanced within batch */
+	return true;
+}
+
+/*
+ * Advance batch position to the start of its new batch.
+ *
+ * Sets the given position to the first item in the given scan direction (or to
+ * the last item, when scanning backwards).  Also advances/increments batch
+ * offset from position such that it points to newBatch.
+ */
+static inline void
+index_scan_pos_nextbatch(ScanDirection direction,
+						 IndexScanBatch newBatch, BatchRingItemPos *pos)
+{
+	Assert(newBatch->dir == direction);
+
+	/* Increment batch (often wraps uint8 batch field) */
+	if (pos->valid)
+		pos->batch++;
+	else
+		pos->batch = 0;
+
+	pos->valid = true;
+
+	if (ScanDirectionIsForward(direction))
+		pos->item = newBatch->firstItem;
+	else
+		pos->item = newBatch->lastItem;
+}
+
 #endif							/* RELSCAN_H */
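
Note on the relscan.h ring buffer helpers above: headBatch and nextBatch are
free-running uint8 counters that are only masked down to a slot number when
batches[] is indexed, so the count and "loaded" tests rely on modulo-256
arithmetic.  The following is a minimal standalone sketch of that arithmetic,
not code from the patch.  DemoRing, RING_SLOTS and the demo_ring_* functions
are hypothetical stand-ins for BatchRingBuffer, INDEX_SCAN_MAX_BATCHES and the
index_scan_batch_* helpers; the slots hold plain ints instead of batches, and
demo_ring_pop_head is an assumed way of retiring the head entry that the hunk
above does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 64			/* power of two, no larger than 128 */

/* Hypothetical stand-in for BatchRingBuffer; slots hold ints, not batches */
typedef struct DemoRing
{
	uint8_t		head;			/* free-running index of oldest live entry */
	uint8_t		next;			/* free-running index of next free slot */
	int			slots[RING_SLOTS];
} DemoRing;

/* Number of live entries; stays correct after the counters wrap past 255 */
static inline uint8_t
demo_ring_count(const DemoRing *ring)
{
	return (uint8_t) (ring->next - ring->head);
}

static inline bool
demo_ring_full(const DemoRing *ring)
{
	return demo_ring_count(ring) == RING_SLOTS;
}

/* Is free-running index idx inside the live window [head, next)? */
static inline bool
demo_ring_loaded(const DemoRing *ring, uint8_t idx)
{
	return (int8_t) (idx - ring->head) >= 0 &&
		(int8_t) (idx - ring->next) < 0;
}

/* Store value in the next free slot; only the slot number is masked */
static inline void
demo_ring_append(DemoRing *ring, int value)
{
	assert(!demo_ring_full(ring));
	ring->slots[ring->next & (RING_SLOTS - 1)] = value;
	ring->next++;
}

static inline int
demo_ring_get(const DemoRing *ring, uint8_t idx)
{
	assert(demo_ring_loaded(ring, idx));
	return ring->slots[idx & (RING_SLOTS - 1)];
}

/* Retire the oldest entry (assumed here; not shown in the hunk above) */
static inline void
demo_ring_pop_head(DemoRing *ring)
{
	assert(demo_ring_count(ring) > 0);
	ring->head++;
}
```

Starting a DemoRing with head = next = 250 and appending nine values leaves
next at 3 while the count is still 9, which is the wraparound case that the
comment in index_scan_pos_nextbatch alludes to.  The signed int8 comparisons
can only tell "before head" apart from "at or after head" while the live
window spans at most 128 entries, which a 64 slot ring always satisfies.
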
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 060847522..3011f4eda 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -433,11 +433,42 @@ typedef struct TableAmRoutine
 	 */
 	void		(*index_fetch_end) (struct IndexFetchTableData *data);
 
+	/*
+	 * Initialize table AM's per-batch opaque area within a batch allocation.
+	 *
+	 * Called by indexam_util_batch_alloc for each new or recycled batch.
+	 * Table AMs should set up their opaque area (at a negative offset from the
+	 * batch pointer) and any trailing per-item data (e.g. visibility flags).
+	 *
+	 * 'new_alloc' is true for freshly palloc'd batches, false for batches
+	 * recycled from the cache.
+	 */
+	void		(*index_batch_init) (IndexScanDesc scan, IndexScanBatch batch,
+									 bool new_alloc);
+
+	/*
+	 * Fetch the next tuple from an index scan into slot, scanning in the
+	 * specified direction, and return true if a tuple was found, false
+	 * otherwise.
+	 *
+	 * This callback allows the table AM to directly manage the scan process,
+	 * including interfacing with the index AM. The caller simply specifies
+	 * the direction of the scan; the table AM takes care of retrieving TIDs
+	 * from the index, performing visibility checks, and returning tuples in
+	 * the slot.
+	 */
+	bool		(*index_getnext_slot) (IndexScanDesc scan,
+									   ScanDirection direction,
+									   TupleTableSlot *slot);
+
 	/*
 	 * Fetch tuple at `tid` into `slot`, after doing a visibility test
 	 * according to `snapshot`. If a tuple was found and passed the visibility
 	 * test, return true, false otherwise.
 	 *
+	 * This is a lower-level callback that takes a TID from the caller.
+	 * Callers should favor the index_getnext_slot callback whenever possible.
+	 *
 	 * Note that AMs that do not necessarily update indexes when indexed
 	 * columns do not change, need to return the current/correct version of
 	 * the tuple that is visible to the snapshot, even if the tid points to an
@@ -1207,6 +1238,37 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
 	scan->rel->rd_tableam->index_fetch_end(scan);
 }
 
+/*
+ * Initialize table AM's per-batch opaque area within a batch allocation.
+ *
+ * Called by indexam_util_batch_alloc for each new or recycled batch.
+ */
+static inline void
+table_index_batch_init(IndexScanDesc scan, IndexScanBatch batch, bool new_alloc)
+{
+	scan->heapRelation->rd_tableam->index_batch_init(scan, batch, new_alloc);
+}
+
+/*
+ * Fetch the next tuple from an index scan into `slot`, scanning in the
+ * specified direction. Returns true if a tuple was found, false otherwise.
+ *
+ * The index scan should have been started via table_index_fetch_begin().
+ * Callers must check scan->xs_recheck and recheck scan keys if required.
+ *
+ * Index-only scan callers (that pass xs_want_itup=true to index_beginscan)
+ * can consume index tuple results by examining IndexScanDescData fields such
+ * as xs_itup and xs_hitup.  The table AM won't usually fetch a heap tuple
+ * into the provided slot in the case of xs_want_itup=true callers.
+ */
+static inline bool
+table_index_getnext_slot(IndexScanDesc iscan, ScanDirection direction,
+						 TupleTableSlot *slot)
+{
+	return iscan->heapRelation->rd_tableam->index_getnext_slot(iscan,
+															   direction, slot);
+}
+
 /*
  * Fetches, as part of an index scan, tuple at `tid` into `slot`, after doing
  * a visibility test according to `snapshot`. If a tuple was found and passed
@@ -1230,6 +1292,9 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
  * entry (like heap's HOT). Whereas table_tuple_fetch_row_version() only
  * evaluates the tuple exactly at `tid`. Outside of index entry ->table tuple
  * lookups, table_tuple_fetch_row_version() is what's usually needed.
+ *
+ * This is a lower-level interface that takes a TID from the caller.  Callers
+ * should favor the table_index_getnext_slot interface whenever possible.
  */
 static inline bool
 table_index_fetch_tuple(struct IndexFetchTableData *scan,
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 8847d7f94..b5b8f509a 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -48,6 +48,12 @@ typedef struct IndexScanInstrumentation
 {
 	/* Index search count (incremented with pgstat_count_index_scan call) */
 	uint64		nsearches;
+
+	/*
+	 * heap blocks fetched counts (incremented by index_getnext_slot calls
+	 * within table AMs, though only during index-only scans)
+	 */
+	uint64		nheapfetches;
 } IndexScanInstrumentation;
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 51782d1fc..7c1b427fb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1754,7 +1754,6 @@ typedef struct IndexScanState
  *		Instrument		   local index scan instrumentation
  *		SharedInfo		   parallel worker instrumentation (no leader entry)
  *		TableSlot		   slot for holding tuples fetched from the table
- *		VMBuffer		   buffer in use for visibility map testing, if any
  *		PscanLen		   size of parallel index-only scan descriptor
  *		NameCStringAttNums attnums of name typed columns to pad to NAMEDATALEN
  *		NameCStringCount   number of elements in the NameCStringAttNums array
@@ -1777,7 +1776,6 @@ typedef struct IndexOnlyScanState
 	IndexScanInstrumentation *ioss_Instrument;
 	SharedIndexScanInstrumentation *ioss_SharedInfo;
 	TupleTableSlot *ioss_TableSlot;
-	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
 	AttrNumber *ioss_NameCStringAttNums;
 	int			ioss_NameCStringCount;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 27758ec16..0661fc03d 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1425,12 +1425,12 @@ typedef struct IndexOptInfo
 	bool		amoptionalkey;
 	bool		amsearcharray;
 	bool		amsearchnulls;
-	/* does AM have amgettuple interface? */
-	bool		amhasgettuple;
+	/* does AM have amgetbatch (or amgettuple) interface? */
+	bool		amhasgetbatch;
 	/* does AM have amgetbitmap interface? */
 	bool		amhasgetbitmap;
 	bool		amcanparallel;
-	/* does AM have ammarkpos interface? */
+	/* is AM prepared for us to restore a mark? */
 	bool		amcanmarkpos;
 	/* AM's cost estimator */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1909c3254..7ad8443eb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -296,10 +296,12 @@ brinhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = brinbeginscan,
 		.amrescan = brinrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amreleasebatch = NULL,
 		.amgetbitmap = bringetbitmap,
 		.amendscan = brinendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 6b148e69a..8f7033d62 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -1953,9 +1953,9 @@ gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	 * into the main index, and so we might visit it a second time during the
 	 * main scan.  This is okay because we'll just re-set the same bit in the
 	 * bitmap.  (The possibility of duplicate visits is a major reason why GIN
-	 * can't support the amgettuple API, however.) Note that it would not do
-	 * to scan the main index before the pending list, since concurrent
-	 * cleanup could then make us miss entries entirely.
+	 * can't support either the amgettuple or amgetbatch API.) Note that it
+	 * would not do to scan the main index before the pending list, since
+	 * concurrent cleanup could then make us miss entries entirely.
 	 */
 	scanPendingInsert(scan, tbm, &ntids);
 
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index ff927279c..0c0e895e7 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -82,10 +82,12 @@ ginhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = ginbeginscan,
 		.amrescan = ginrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amreleasebatch = NULL,
 		.amgetbitmap = gingetbitmap,
 		.amendscan = ginendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8565e225b..a88de39d4 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -103,10 +103,12 @@ gisthandler(PG_FUNCTION_ARGS)
 		.ambeginscan = gistbeginscan,
 		.amrescan = gistrescan,
 		.amgettuple = gistgettuple,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amreleasebatch = NULL,
 		.amgetbitmap = gistgetbitmap,
 		.amendscan = gistendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e88ddb32a..5b5c5c6fa 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -102,10 +102,11 @@ hashhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = hashbeginscan,
 		.amrescan = hashrescan,
 		.amgettuple = hashgettuple,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d7b05aa14..73ae4925e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -19,6 +19,7 @@
  */
 #include "postgres.h"
 
+#include "access/amapi.h"
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/heaptoast.h"
@@ -84,8 +85,10 @@ heapam_index_fetch_begin(Relation rel)
 	IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);
 
 	hscan->xs_base.rel = rel;
-	hscan->xs_cbuf = InvalidBuffer;
+	hscan->xs_base.batch_opaque_size = MAXALIGN(sizeof(HeapBatchData));
+	hscan->xs_base.batch_per_item_size = sizeof(uint8); /* visInfo element size */
 	hscan->xs_blk = InvalidBlockNumber;
+	hscan->xs_vm_items = 1;
 
 	return &hscan->xs_base;
 }
@@ -95,12 +98,15 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
 
-	if (BufferIsValid(hscan->xs_cbuf))
-	{
-		ReleaseBuffer(hscan->xs_cbuf);
-		hscan->xs_cbuf = InvalidBuffer;
-	}
-	hscan->xs_blk = InvalidBlockNumber;
+	/* Rescans should avoid an excessive number of VM lookups */
+	hscan->xs_vm_items = 1;
+
+	/*
+	 * Deliberately avoid dropping any pins now held in xs_cbuf and xs_vmbuf.
+	 * This saves cycles during certain tight nested loop joins, and during
+	 * merge joins that frequently restore a saved mark.  It can also avoid
+	 * repeated pinning and unpinning of the same buffer across rescans.
+	 */
 }
 
 static void
@@ -108,11 +114,53 @@ heapam_index_fetch_end(IndexFetchTableData *scan)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
 
-	heapam_index_fetch_reset(scan);
+	/* drop pin if there's a pinned heap page */
+	if (BufferIsValid(hscan->xs_cbuf))
+		ReleaseBuffer(hscan->xs_cbuf);
+
+	/* drop pin if there's a pinned visibility map page */
+	if (BufferIsValid(hscan->xs_vmbuf))
+		ReleaseBuffer(hscan->xs_vmbuf);
 
 	pfree(hscan);
 }
 
+/*
+ * Initialize the heap table AM's per-batch opaque area (HeapBatchData).
+ *
+ * Called by indexam_util_batch_alloc for each new or recycled batch.
+ * Sets up the visInfo pointer for index-only scans, or NULL otherwise.
+ */
+static void
+heapam_index_batch_init(IndexScanDesc scan, IndexScanBatch batch,
+						bool new_alloc)
+{
+	HeapBatchData *hbatch = heap_batch_data(batch, scan);
+
+	if (scan->xs_want_itup)
+	{
+		if (new_alloc)
+		{
+			/*
+			 * Point visInfo into the trailing per-item area that follows
+			 * items[] in the batch allocation.
+			 */
+			Size		itemsEnd;
+
+			itemsEnd = MAXALIGN(offsetof(IndexScanBatchData, items) +
+								sizeof(BatchMatchingItem) * scan->maxitemsbatch);
+			hbatch->visInfo = (uint8 *) ((char *) batch + itemsEnd);
+		}
+
+		/* Clear visibility flags (needed for both new and recycled batches) */
+		memset(hbatch->visInfo, 0, scan->maxitemsbatch);
+	}
+	else
+	{
+		hbatch->visInfo = NULL;
+	}
+}
+
 static bool
 heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 						 ItemPointer tid,
@@ -134,6 +182,12 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		/* Remember this buffer's block number for next time */
 		hscan->xs_blk = ItemPointerGetBlockNumber(tid);
 
+		/*
+		 * Drop the xs_blk pin independently held by slot (if any) now.  See
+		 * comments around ExecStorePinnedBufferHeapTuple call below.
+		 */
+		ExecClearTuple(slot);
+
 		if (BufferIsValid(hscan->xs_cbuf))
 			ReleaseBuffer(hscan->xs_cbuf);
 
@@ -170,7 +224,33 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		*call_again = !IsMVCCSnapshot(snapshot);
 
 		slot->tts_tableOid = RelationGetRelid(scan->rel);
-		ExecStoreBufferHeapTuple(&bslot->base.tupdata, slot, hscan->xs_cbuf);
+
+		/*
+		 * If this is the last TID on the current heap block within the batch,
+		 * transfer our buffer pin to the slot rather than having the slot
+		 * increment the pin count.  This saves a pair of IncrBufferRefCount
+		 * and ReleaseBuffer calls, since the caller would just release its
+		 * pin on xs_cbuf when switching to the next block anyway.
+		 *
+		 * We can only do this when call_again is false, since otherwise the
+		 * caller will need xs_cbuf to remain valid for the next call.
+		 */
+		if (hscan->xs_lastinblock && !*call_again)
+		{
+			ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata, slot,
+										   hscan->xs_cbuf);
+			hscan->xs_cbuf = InvalidBuffer;
+			hscan->xs_blk = InvalidBlockNumber;
+
+			/*
+			 * Note: the pin now owned by the slot is expected to be released
+			 * on the next call here, via an explicit ExecClearTuple.  This
+			 * avoids churn in the backend's private refcount cache.
+			 */
+		}
+		else
+			ExecStoreBufferHeapTuple(&bslot->base.tupdata, slot,
+									 hscan->xs_cbuf);
 	}
 	else
 	{
@@ -181,6 +261,591 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 	return got_heap_tuple;
 }
 
+/*
+ * heapam_batch_resolve_visibility
+ *		Obtain visibility information for a TID from caller's batch.
+ *
+ * Called during index-only scans.  We always check the visibility of caller's
+ * item (an offset into caller's batch->items[] array).  We might also set
+ * visibility info for other items from caller's batch more proactively when
+ * that makes sense.
+ *
+ * We keep two competing considerations in balance when determining whether to
+ * check additional items: the need to keep the cost of visibility map access
+ * under control when most items will never be returned by the scan anyway
+ * (important for inner index scans of anti-joins and semi-joins), and the
+ * need to not hold onto index leaf pages for too long.
+ *
+ * Note on Memory Ordering Effects
+ * -------------------------------
+ *
+ * visibilitymap_get_status does not lock the visibility map buffer, and
+ * therefore the result we read here could be slightly stale.  However, it
+ * can't be stale enough to matter.
+ *
+ * We need to detect clearing a VM bit due to an insert right away, because
+ * the tuple is present in the index page but not visible.  The reading of the
+ * TID by this scan (using a shared lock on the index buffer) is serialized
+ * with the insert of the TID into the index (using an exclusive lock on the
+ * index buffer).  Because the VM bit is cleared before updating the index,
+ * and locking/unlocking of the index page acts as a full memory barrier, we
+ * are sure to see the cleared bit if we see a recently-inserted TID.
+ *
+ * Deletes do not update the index page (only VACUUM will clear out the TID),
+ * so the clearing of the VM bit by a delete is not serialized with this test
+ * below, and we may see a value that is significantly stale.  However, we
+ * don't care about the delete right away, because the tuple is still visible
+ * until the deleting transaction commits or the statement ends (if it's our
+ * transaction).  In either case, the lock on the VM buffer will have been
+ * released (acting as a write barrier) after clearing the bit.  And for us to
+ * have a snapshot that includes the deleting transaction (making the tuple
+ * invisible), we must have acquired ProcArrayLock after that time, acting as
+ * a read barrier.
+ *
+ * It's worth going through this complexity to avoid needing to lock the VM
+ * buffer, which could cause significant contention.
+ */
+static void
+heapam_batch_resolve_visibility(IndexScanDesc scan, ScanDirection direction,
+								IndexScanBatch batch, HeapBatchData *hbatch,
+								BatchRingItemPos *pos)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+	int			posItem = pos->item;
+	int			noSetItem,
+				step;
+	bool		allbatchitemvisible;
+	BlockNumber curvmheapblkno = InvalidBlockNumber;
+	uint8		curvmheapblkflags = 0;
+
+	Assert(hbatch == heap_batch_data(batch, scan));
+
+	/*
+	 * We had better still hold the index AM's TID recycling interlock
+	 * (generally a pin on its index page) for this batch
+	 */
+	Assert(!scan->batchImmediateRelease);
+
+	/* Determine the range of items to set visibility for */
+	if (ScanDirectionIsForward(direction))
+	{
+		noSetItem = Min(batch->lastItem + 1, posItem + hscan->xs_vm_items);
+		allbatchitemvisible = noSetItem > batch->lastItem &&
+			(posItem == batch->firstItem ||
+			 (hbatch->visInfo[batch->firstItem] & BATCH_VIS_CHECKED));
+		step = 1;
+	}
+	else
+	{
+		noSetItem = Max(batch->firstItem - 1, posItem - hscan->xs_vm_items);
+		allbatchitemvisible = noSetItem < batch->firstItem &&
+			(posItem == batch->lastItem ||
+			 (hbatch->visInfo[batch->lastItem] & BATCH_VIS_CHECKED));
+		step = -1;
+	}
+
+	/*
+	 * Set visibility info for a range of items, in scan order.
+	 *
+	 * noSetItem is the first item (in the given scan direction) that won't be
+	 * set during this call.  noSetItem often points to just past the end of
+	 * (or just before the start of) the batch's 'items' array.
+	 *
+	 * We iterate this way to avoid the need for 2 direction-specific loops,
+	 * since this is a hot code path that's sensitive to code size increases.
+	 */
+	for (int setItem = posItem; setItem != noSetItem; setItem += step)
+	{
+		ItemPointer tid = &batch->items[setItem].tableTid;
+		BlockNumber heapblkno = ItemPointerGetBlockNumber(tid);
+		uint8		flags;
+
+		if (heapblkno == curvmheapblkno)
+		{
+			/* contiguous heap block -- just reuse last item's flags */
+			hbatch->visInfo[setItem] = curvmheapblkflags;
+			continue;
+		}
+
+		flags = BATCH_VIS_CHECKED;
+		if (VM_ALL_VISIBLE(scan->heapRelation, heapblkno, &hscan->xs_vmbuf))
+			flags |= BATCH_VIS_ALL_VISIBLE;
+
+		hbatch->visInfo[setItem] = curvmheapblkflags = flags;
+		curvmheapblkno = heapblkno;
+	}
+
+	/*
+	 * It's safe to drop the batch's index AM resources as soon as we've
+	 * resolved the visibility status of all of its items
+	 */
+	if (allbatchitemvisible && scan->MVCCScan)
+	{
+		Assert(hbatch->visInfo[batch->firstItem] & BATCH_VIS_CHECKED);
+		Assert(hbatch->visInfo[batch->lastItem] & BATCH_VIS_CHECKED);
+
+		tableam_util_release_batch(scan, batch);
+	}
+
+	/*
+	 * Else check visibility for twice as many items next time, or all items.
+	 * We check all items in one go once we're past the scan's first batch.
+	 */
+	else if (hscan->xs_vm_items < (batch->lastItem - batch->firstItem))
+		hscan->xs_vm_items *= 2;
+	else
+		hscan->xs_vm_items = scan->maxitemsbatch;
+}
+
+static inline ItemPointer
+heapam_batch_return_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
+						ScanDirection direction, IndexScanBatch scanBatch,
+						BatchRingItemPos *scanPos, bool *all_visible)
+{
+	HeapBatchData *hbatch;
+
+	pgstat_count_index_tuples(scan->indexRelation, 1);
+
+	/* Set xs_heaptid, which heapam_index_getnext_slot will need */
+	scan->xs_heaptid = scanBatch->items[scanPos->item].tableTid;
+
+	if (!scan->xs_want_itup)
+	{
+		int			nextItem;
+		bool		hasNext;
+
+		/*
+		 * Plain index scan.
+		 *
+		 * Determine if the next item in the current scan direction is on a
+		 * different heap block.  When it is, heapam_index_fetch_tuple can
+		 * transfer its buffer pin to the slot instead of incrementing the pin
+		 * count, saving a pair of IncrBufferRefCount/ReleaseBuffer calls.
+		 *
+		 * Note: We cannot do this for index-only scans because all-visible
+		 * items are skipped by both the scan and the read stream callback.
+		 * Skipped items can break the block deduplication symmetry between
+		 * the stream and the scan: the stream deduplicates consecutive
+		 * non-all-visible items by block, but after invalidating xs_blk the
+		 * scan would try to re-fetch a block that the stream already returned
+		 * and deduplicated away.
+		 */
+		if (ScanDirectionIsForward(direction))
+		{
+			nextItem = scanPos->item + 1;
+			hasNext = (nextItem <= scanBatch->lastItem);
+		}
+		else
+		{
+			nextItem = scanPos->item - 1;
+			hasNext = (nextItem >= scanBatch->firstItem);
+		}
+
+		hscan->xs_lastinblock = hasNext &&
+			ItemPointerGetBlockNumber(&scanBatch->items[nextItem].tableTid) !=
+			ItemPointerGetBlockNumber(&scan->xs_heaptid);
+
+		return &scan->xs_heaptid;
+	}
+
+	/*
+	 * Index-only scan.
+	 *
+	 * Also set xs_itup, which heapam_index_getnext_slot needs too.
+	 */
+	scan->xs_itup = (IndexTuple) (scanBatch->currTuples +
+								  scanBatch->items[scanPos->item].tupleOffset);
+
+	/*
+	 * Set visibility info for the current scanPos item (plus possibly some
+	 * additional items in the current scan direction) as needed
+	 */
+	hbatch = heap_batch_data(scanBatch, scan);
+	if (!(hbatch->visInfo[scanPos->item] & BATCH_VIS_CHECKED))
+		heapam_batch_resolve_visibility(scan, direction, scanBatch, hbatch,
+										scanPos);
+
+	/* Finally, set all_visible for heapam_index_getnext_slot */
+	*all_visible =
+		(hbatch->visInfo[scanPos->item] & BATCH_VIS_ALL_VISIBLE) != 0;
+
+	return &scan->xs_heaptid;
+}
+
+/* ----------------
+ *		heapam_batch_getnext - get the next batch of TIDs from a scan
+ *
+ * Called when we need to load the next batch of index entries to process in
+ * the given direction.  Caller passes us a batch and a batch position, which
+ * has just been used to read all items from the batch in the direction passed
+ * by caller.
+ *
+ * Returns the next batch to be processed by the index scan, or NULL when
+ * there are no more matches in the given scan direction.  Does not advance
+ * caller's batch position; that is left up to caller.
+ *
+ * This is also where batches are appended to the scan's ring buffer.  We
+ * don't free any batches here, though; that is also left up to caller.
+ * ----------------
+ */
+static pg_attribute_hot IndexScanBatch
+heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction,
+					 IndexScanBatch priorBatch, BatchRingItemPos *pos)
+{
+	IndexScanBatch batch = NULL;
+	BatchRingBuffer *batchringbuf PG_USED_FOR_ASSERTS_ONLY = &scan->batchringbuf;
+
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
+
+	if (!priorBatch)
+	{
+		/* First call for the scan */
+		Assert(pos == &batchringbuf->scanPos);
+	}
+	else if (unlikely(priorBatch->dir != direction))
+	{
+		/*
+		 * We detected a change in scan direction across batches.  Prepare
+		 * scan's batchringbuf state for us to get the next batch for the
+		 * opposite scan direction to the one used when priorBatch was
+		 * returned by amgetbatch.
+		 */
+		tableam_util_batch_dirchange(scan);
+
+		/* priorBatch is now batchringbuf's only batch */
+		Assert(pos->batch == batchringbuf->headBatch);
+		Assert(index_scan_batch_count(scan) == 1);
+	}
+	else if (index_scan_batch_loaded(scan, pos->batch + 1))
+	{
+		/* Next batch already loaded for us */
+		batch = index_scan_batch(scan, pos->batch + 1);
+
+		Assert(priorBatch->dir == direction);
+		Assert(batch->dir == direction);
+		return batch;
+	}
+
+	/*
+	 * Assert preconditions for calling amgetbatch.
+	 *
+	 * priorBatch had better be the last valid batch currently in the ring
+	 * buffer (batches must stay in scan order).  If it isn't then we should
+	 * have already returned some existing loaded batch earlier.
+	 */
+	Assert(!index_scan_batch_full(scan));
+	Assert(!priorBatch ||
+		   (index_scan_batch_count(scan) > 0 && priorBatch->dir == direction &&
+			index_scan_batch(scan, batchringbuf->nextBatch - 1) == priorBatch));
+
+	/*
+	 * Before we call amgetbatch again, check if priorBatch is already known
+	 * to be the last batch with matching items in this scan direction
+	 */
+	if (priorBatch &&
+		((ScanDirectionIsForward(direction) && priorBatch->knownEndForward) ||
+		 (ScanDirectionIsBackward(direction) && priorBatch->knownEndBackward)))
+		return NULL;
+
+	batch = scan->indexRelation->rd_indam->amgetbatch(scan, priorBatch,
+													  direction);
+	if (batch)
+	{
+		/* We got the batch from the AM */
+		Assert(batch->dir == direction);
+
+		/* Append batch to the end of ring buffer/write it to buffer index */
+		index_scan_batch_append(scan, batch);
+	}
+	else
+	{
+		/* amgetbatch returned NULL */
+		if (priorBatch)
+		{
+			/*
+			 * There are no further matches to be found in the current scan
+			 * direction, following priorBatch.  Remember that priorBatch is
+			 * the last batch with matching items.
+			 */
+			if (ScanDirectionIsForward(direction))
+				priorBatch->knownEndForward = true;
+			else
+				priorBatch->knownEndBackward = true;
+		}
+	}
+
+	/* xs_hitup isn't currently supported by amgetbatch scans */
+	Assert(!scan->xs_hitup);
+
+	return batch;
+}
+
+/* ----------------
+ *		heapam_batch_getnext_tid - get next TID from batch ring buffer
+ *
+ * Get the next TID from the scan's batch ring buffer, when moving in the
+ * given scan direction.
+ * ----------------
+ */
+static pg_attribute_hot ItemPointer
+heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
+						 ScanDirection direction, bool *all_visible)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &batchringbuf->scanPos;
+	IndexScanBatch scanBatch = NULL;
+
+	Assert(!scanPos->valid || batchringbuf->headBatch == scanPos->batch);
+	Assert(scanPos->valid || index_scan_batch_count(scan) == 0);
+
+	/*
+	 * Check if there's an existing loaded scanBatch for us to return the next
+	 * matching item's TID/index tuple from
+	 */
+	if (scanPos->valid)
+	{
+		/*
+		 * scanPos is valid, so scanBatch must already be loaded in batch ring
+		 * buffer.  We rely on that here.
+		 */
+		Assert(batchringbuf->headBatch == scanPos->batch);
+
+		scanBatch = index_scan_batch(scan, scanPos->batch);
+
+		if (index_scan_pos_advance(direction, scanBatch, scanPos))
+			return heapam_batch_return_tid(scan, hscan, direction,
+										   scanBatch, scanPos,
+										   all_visible);
+	}
+
+	/*
+	 * Either ran out of items from our existing scanBatch, or it hasn't been
+	 * loaded yet (because this is the first call here for the entire scan).
+	 * Try to advance scanBatch to the next batch (or get the first batch).
+	 */
+	scanBatch = heapam_batch_getnext(scan, direction, scanBatch, scanPos);
+
+	if (!scanBatch)
+	{
+		/*
+		 * We're done; no more batches in the current scan direction.
+		 *
+		 * Note: scanPos is generally still valid at this point.  The scan
+		 * might still back up in the other direction.
+		 */
+		return NULL;
+	}
+
+	/*
+	 * Advanced scanBatch.  Now position scanPos to the start of new
+	 * scanBatch.
+	 */
+	index_scan_pos_nextbatch(direction, scanBatch, scanPos);
+	Assert(index_scan_batch(scan, scanPos->batch) == scanBatch);
+
+	/*
+	 * Remove the head batch from the batch ring buffer (except when this new
+	 * scanBatch is our only one)
+	 */
+	if (batchringbuf->headBatch != scanPos->batch)
+	{
+		IndexScanBatch headBatch = index_scan_batch(scan,
+													batchringbuf->headBatch);
+
+		/* free obsolescent head batch (unless it is scan's markBatch) */
+		tableam_util_free_batch(scan, headBatch);
+
+		/* Remove the batch from the ring buffer */
+		batchringbuf->headBatch++;
+	}
+
+	/* In practice scanBatch will always be the ring buffer's headBatch */
+	Assert(batchringbuf->headBatch == scanPos->batch);
+
+	return heapam_batch_return_tid(scan, hscan, direction, scanBatch, scanPos,
+								   all_visible);
+}
+
+/* ----------------
+ *		index_fetch_heap - get the scan's next heap tuple
+ *
+ * The result is true if a visible heap tuple associated with the index TID
+ * most recently fetched by our caller in scan->xs_heaptid was stored in slot,
+ * or false if no more matching tuples exist.  (There can be more than one
+ * matching tuple because of HOT chains, although when using an MVCC snapshot
+ * it should be impossible for more than one such tuple to exist.)
+ *
+ * On success, the buffer containing the heap tup is pinned.  The pin must be
+ * dropped elsewhere.
+ * ----------------
+ */
+static pg_attribute_hot bool
+index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
+{
+	bool		all_dead = false;
+	bool		found;
+
+	found = heapam_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
+									 scan->xs_snapshot, slot,
+									 &scan->xs_heap_continue, &all_dead);
+
+	if (found)
+		pgstat_count_heap_fetch(scan->indexRelation);
+
+	/*
+	 * If we scanned a whole HOT chain and found only dead tuples, remember it
+	 * for later.  We do not do this when in recovery because it may violate
+	 * MVCC to do so.  See comments in RelationGetIndexScan().
+	 */
+	if (!scan->xactStartedInRecovery)
+	{
+		if (scan->usebatchring)
+		{
+			if (all_dead)
+				tableam_util_kill_scanpositem(scan);
+		}
+		else
+		{
+			/*
+			 * Tell amgettuple-based index AM to kill its entry for that TID
+			 * (this will take effect in the next call, in index_getnext_tid)
+			 */
+			scan->kill_prior_tuple = all_dead;
+		}
+	}
+
+	return found;
+}
+
+/* ----------------
+ *		heapam_index_getnext_slot - get the next tuple from a scan
+ *
+ * The result is true if a tuple satisfying the scan keys and the snapshot was
+ * found, false otherwise.  The tuple is stored in the specified slot.
+ *
+ * On success, resources (like buffer pins) are likely to be held, and will be
+ * dropped by a future call here (or by a later call to index_endscan).
+ *
+ * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * scan keys if required.  We do not do that here because we don't have
+ * enough information to do it efficiently in the general case.
+ * ----------------
+ */
+static pg_attribute_hot bool
+heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
+						  TupleTableSlot *slot)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+	ItemPointer tid = NULL;
+	bool		all_visible = false;
+
+	for (;;)
+	{
+		if (!scan->xs_heap_continue)
+		{
+			/*
+			 * Scans that use an amgetbatch index AM are managed by heapam's
+			 * index scan manager.  This gives heapam the ability to read heap
+			 * tuples in a flexible order that is attuned to both costs and
+			 * benefits on the heapam and table AM side.
+			 *
+			 * Scans that use an amgettuple index AM simply call through to
+			 * index_getnext_tid to get the next TID returned by index AM. The
+			 * progress of the scan will be under the control of index AM (we
+			 * just pass it through a direction to get the next tuple in), so
+			 * we cannot reorder any work.
+			 */
+			if (scan->usebatchring)
+				tid = heapam_batch_getnext_tid(scan, hscan, direction,
+											   &all_visible);
+			else
+			{
+				tid = index_getnext_tid(scan, direction);
+
+				if (tid != NULL && scan->xs_want_itup)
+					all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+												 ItemPointerGetBlockNumber(tid),
+												 &hscan->xs_vmbuf);
+			}
+
+			/* If we're out of index entries, we're done */
+			if (tid == NULL)
+				break;
+		}
+
+		/*
+		 * Fetch the next (or only) visible heap tuple for this index entry.
+		 * If we don't find anything, loop around and grab the next TID from
+		 * the index.
+		 */
+		Assert(ItemPointerIsValid(&scan->xs_heaptid));
+		if (!scan->xs_want_itup)
+		{
+			/* Plain index scan */
+			if (index_fetch_heap(scan, slot))
+				return true;
+		}
+		else
+		{
+			/*
+			 * Index-only scan.
+			 *
+			 * We can skip the heap fetch if the TID references a heap page on
+			 * which all tuples are known visible to everybody.  In any case,
+			 * we'll use the index tuple not the heap tuple as the data
+			 * source.
+			 */
+			if (!all_visible)
+			{
+				/*
+				 * Rats, we have to visit the heap to check visibility.
+				 */
+				if (scan->instrument)
+					scan->instrument->nheapfetches++;
+
+				if (!index_fetch_heap(scan, slot))
+					continue;	/* no visible tuple, try next index entry */
+
+				ExecClearTuple(slot);
+
+				/*
+				 * Only MVCC snapshots are supported with standard index-only
+				 * scans, so there should be no need to keep following the HOT
+				 * chain once a visible entry has been found.  Other callers
+				 * (currently only selfuncs.c) use SnapshotNonVacuumable, and
+				 * want us to assume that just having one visible tuple in the
+				 * HOT chain is always good enough.
+				 */
+				Assert(!(scan->xs_heap_continue &&
+						 IsMVCCSnapshot(scan->xs_snapshot)));
+			}
+			else
+			{
+				/*
+				 * We didn't access the heap, so we'll need to take a
+				 * predicate lock explicitly, as if we had.  For now we do
+				 * that at page level.
+				 */
+				PredicateLockPage(hscan->xs_base.rel,
+								  ItemPointerGetBlockNumber(tid),
+								  scan->xs_snapshot);
+			}
+
+			/*
+			 * Return matching index tuple now set in scan->xs_itup (or return
+			 * matching heap tuple now set in scan->xs_hitup).
+			 *
+			 * Note: we won't usually have fetched a heap tuple into caller's
+			 * table slot.  This is per the table_index_getnext_slot contract
+			 * for scan->xs_want_itup callers.
+			 */
+			return true;
+		}
+	}
+
+	return false;
+}
 
 /* ------------------------------------------------------------------------
  * Callbacks for non-modifying operations on individual tuples for heap AM
@@ -761,7 +1426,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, false, SnapshotAny,
+									NULL, 0, 0);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
@@ -798,7 +1464,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		if (indexScan != NULL)
 		{
-			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot))
+			if (!heapam_index_getnext_slot(indexScan, ForwardScanDirection,
+										   slot))
 				break;
 
 			/* Since we used no scan keys, should never need to recheck */
@@ -2657,6 +3324,8 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_begin = heapam_index_fetch_begin,
 	.index_fetch_reset = heapam_index_fetch_reset,
 	.index_fetch_end = heapam_index_fetch_end,
+	.index_batch_init = heapam_index_batch_init,
+	.index_getnext_slot = heapam_index_getnext_slot,
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
diff --git a/src/backend/access/index/Makefile b/src/backend/access/index/Makefile
index 6f2e3061a..e6d681b40 100644
--- a/src/backend/access/index/Makefile
+++ b/src/backend/access/index/Makefile
@@ -16,6 +16,7 @@ OBJS = \
 	amapi.o \
 	amvalidate.o \
 	genam.o \
-	indexam.o
+	indexam.o \
+	indexbatch.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/index/amapi.c b/src/backend/access/index/amapi.c
index efa007030..1de2a88ac 100644
--- a/src/backend/access/index/amapi.c
+++ b/src/backend/access/index/amapi.c
@@ -55,6 +55,11 @@ GetIndexAmRoutine(Oid amhandler)
 	Assert(routine->amrescan != NULL);
 	Assert(routine->amendscan != NULL);
 
+	/* Assert that AM doesn't have an invalid combination of callbacks */
+	Assert(routine->amkillitemsbatch == NULL || routine->amgetbatch != NULL);
+	Assert((routine->amgetbatch != NULL) == (routine->amreleasebatch != NULL));
+	Assert(routine->amgetbatch != NULL || routine->amposreset == NULL);
+
 	return routine;
 }
 
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 5e89b86a6..6e87169c2 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -89,6 +89,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_snapshot = InvalidSnapshot;	/* caller must initialize this */
 	scan->numberOfKeys = nkeys;
 	scan->numberOfOrderBys = norderbys;
+	scan->usebatchring = false; /* set later for amgetbatch callers */
+	scan->xs_bitmap_batch = NULL;
+	scan->xs_want_itup = false; /* caller must initialize this */
 
 	/*
 	 * We allocate key workspace here, but it won't get filled until amrescan.
@@ -102,8 +105,6 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	else
 		scan->orderByData = NULL;
 
-	scan->xs_want_itup = false; /* may be set later */
-
 	/*
 	 * During recovery we ignore killed tuples and don't bother to kill them
 	 * either. We do this because the xmin on the primary node could easily be
@@ -126,6 +127,10 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
 
+	scan->batch_index_opaque_size = 0;
+	scan->batch_tuples_workspace = 0;
+	scan->batch_table_offset = 0;
+
 	return scan;
 }
 
@@ -454,7 +459,7 @@ systable_beginscan(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
-		sysscan->iscan = index_beginscan(heapRelation, irel,
+		sysscan->iscan = index_beginscan(heapRelation, irel, false,
 										 snapshot, NULL, nkeys, 0);
 		index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 		sysscan->scan = NULL;
@@ -517,7 +522,8 @@ systable_getnext(SysScanDesc sysscan)
 
 	if (sysscan->irel)
 	{
-		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot))
+		if (table_index_getnext_slot(sysscan->iscan, ForwardScanDirection,
+									 sysscan->slot))
 		{
 			bool		shouldFree;
 
@@ -715,7 +721,7 @@ systable_beginscan_ordered(Relation heapRelation,
 	if (TransactionIdIsValid(CheckXidAlive))
 		bsysscan = true;
 
-	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
+	sysscan->iscan = index_beginscan(heapRelation, indexRelation, false,
 									 snapshot, NULL, nkeys, 0);
 	index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
 	sysscan->scan = NULL;
@@ -734,7 +740,7 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	HeapTuple	htup = NULL;
 
 	Assert(sysscan->irel);
-	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
+	if (table_index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
 		htup = ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
 
 	/* See notes in systable_getnext */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 43f64a0e7..8d8be9442 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -24,9 +24,7 @@
  *		index_parallelscan_initialize - initialize parallel scan
  *		index_parallelrescan  - (re)start a parallel scan of an index
  *		index_beginscan_parallel - join parallel index scan
- *		index_getnext_tid	- get the next TID from a scan
- *		index_fetch_heap		- get the scan's next heap tuple
- *		index_getnext_slot	- get the next tuple from a scan
+ *		index_getnext_tid	- amgettuple table AM helper routine
  *		index_getbitmap - get all tuples from a scan
  *		index_bulk_delete	- bulk deletion of index tuples
  *		index_vacuum_cleanup	- post-deletion cleanup of an index
@@ -255,6 +253,7 @@ index_insert_cleanup(Relation indexRelation,
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
+				bool xs_want_itup,
 				Snapshot snapshot,
 				IndexScanInstrumentation *instrument,
 				int nkeys, int norderbys)
@@ -281,7 +280,14 @@ index_beginscan(Relation heapRelation,
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->MVCCScan = IsMVCCSnapshot(snapshot);
 	scan->instrument = instrument;
+	scan->xs_want_itup = xs_want_itup;
+	scan->usebatchring = false;
+	scan->batchImmediateRelease = (scan->MVCCScan && !xs_want_itup);
+
+	if (indexRelation->rd_indam->amgetbatch != NULL)
+		index_batchscan_init(scan);
 
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
@@ -312,6 +318,7 @@ index_beginscan_bitmap(Relation indexRelation,
 	 * up by RelationGetIndexScan.
 	 */
 	scan->xs_snapshot = snapshot;
+	scan->MVCCScan = IsMVCCSnapshot(snapshot);
 	scan->instrument = instrument;
 
 	return scan;
@@ -373,13 +380,19 @@ index_rescan(IndexScanDesc scan,
 	Assert(nkeys == scan->numberOfKeys);
 	Assert(norderbys == scan->numberOfOrderBys);
 
-	/* Release resources (like buffer pins) from table accesses */
+	/* reset table AM state for rescan */
 	if (scan->xs_heapfetch)
 		table_index_fetch_reset(scan->xs_heapfetch);
 
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
 
+	if (scan->usebatchring)
+	{
+		Assert(!scan->batchringbuf.done);
+		index_batchscan_reset(scan);
+	}
+
 	scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
 											orderbys, norderbys);
 }
@@ -394,6 +407,17 @@ index_endscan(IndexScanDesc scan)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amendscan);
 
+	/* Clean up batching, so that the AM can release pins and so on. */
+	if (scan->usebatchring)
+		index_batchscan_end(scan);
+
+	/* Free cached bitmap batch if any */
+	if (scan->xs_bitmap_batch != NULL)
+	{
+		pfree(batch_alloc_base(scan->xs_bitmap_batch, scan));
+		scan->xs_bitmap_batch = NULL;
+	}
+
 	/* Release resources (like buffer pins) from table accesses */
 	if (scan->xs_heapfetch)
 	{
@@ -422,24 +446,25 @@ void
 index_markpos(IndexScanDesc scan)
 {
 	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(ammarkpos);
+	CHECK_SCAN_PROCEDURE(amgetbatch);
 
-	scan->indexRelation->rd_indam->ammarkpos(scan);
+	/* Only amgetbatch index AMs support mark and restore */
+	index_batchscan_mark_pos(scan);
 }
 
 /* ----------------
  *		index_restrpos	- restore a scan position
  *
- * NOTE: this only restores the internal scan state of the index AM.  See
- * comments for ExecRestrPos().
+ * NOTE: this only restores the batch positional state shared by the table and
+ * index AMs.  See comments for ExecRestrPos().
  *
  * NOTE: For heap, in the presence of HOT chains, mark/restore only works
  * correctly if the scan's snapshot is MVCC-safe; that ensures that there's at
  * most one returnable tuple in each HOT chain, and so restoring the prior
- * state at the granularity of the index AM is sufficient.  Since the only
- * current user of mark/restore functionality is nodeMergejoin.c, this
- * effectively means that merge-join plans only work for MVCC snapshots.  This
- * could be fixed if necessary, but for now it seems unimportant.
+ * state at the scan item granularity is sufficient.  Since the only current
+ * user of mark/restore functionality is nodeMergejoin.c, this effectively
+ * means that merge-join plans only work for MVCC snapshots.  This could be
+ * fixed if necessary, but for now it seems unimportant.
  * ----------------
  */
 void
@@ -448,16 +473,16 @@ index_restrpos(IndexScanDesc scan)
 	Assert(IsMVCCSnapshot(scan->xs_snapshot));
 
 	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(amrestrpos);
+	CHECK_SCAN_PROCEDURE(amgetbatch);
 
-	/* release resources (like buffer pins) from table accesses */
+	/* reset table AM state before restoring the mark */
 	if (scan->xs_heapfetch)
 		table_index_fetch_reset(scan->xs_heapfetch);
 
-	scan->kill_prior_tuple = false; /* for safety */
-	scan->xs_heap_continue = false;
+	/* also notify table AM and index AM */
+	index_batchscan_restore_pos(scan);
 
-	scan->indexRelation->rd_indam->amrestrpos(scan);
+	scan->xs_heap_continue = false; /* for safety */
 }
 
 /*
@@ -579,6 +604,12 @@ index_parallelrescan(IndexScanDesc scan)
 	if (scan->xs_heapfetch)
 		table_index_fetch_reset(scan->xs_heapfetch);
 
+	if (scan->usebatchring)
+	{
+		Assert(!scan->batchringbuf.done);
+		index_batchscan_reset(scan);
+	}
+
 	/* amparallelrescan is optional; assume no-op if not provided by AM */
 	if (scan->indexRelation->rd_indam->amparallelrescan != NULL)
 		scan->indexRelation->rd_indam->amparallelrescan(scan);
@@ -591,6 +622,7 @@ index_parallelrescan(IndexScanDesc scan)
  */
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel,
+						 bool xs_want_itup,
 						 IndexScanInstrumentation *instrument,
 						 int nkeys, int norderbys,
 						 ParallelIndexScanDesc pscan)
@@ -612,7 +644,13 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 	 */
 	scan->heapRelation = heaprel;
 	scan->xs_snapshot = snapshot;
+	scan->MVCCScan = IsMVCCSnapshot(snapshot);
 	scan->instrument = instrument;
+	scan->xs_want_itup = xs_want_itup;
+	scan->batchImmediateRelease = (scan->MVCCScan && !xs_want_itup);
+
+	if (indexrel->rd_indam->amgetbatch != NULL)
+		index_batchscan_init(scan);
 
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heaprel);
@@ -621,10 +659,14 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 }
 
 /* ----------------
- * index_getnext_tid - get the next TID from a scan
+ * index_getnext_tid - amgettuple interface
  *
  * The result is the next TID satisfying the scan keys,
  * or NULL if no more matching tuples exist.
+ *
+ * This should only be called by table AM's index_getnext_slot implementation,
+ * and only given an index AM that supports the single-tuple amgettuple
+ * interface.
  * ----------------
  */
 ItemPointer
@@ -667,97 +709,6 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	return &scan->xs_heaptid;
 }
 
-/* ----------------
- *		index_fetch_heap - get the scan's next heap tuple
- *
- * The result is a visible heap tuple associated with the index TID most
- * recently fetched by index_getnext_tid, or NULL if no more matching tuples
- * exist.  (There can be more than one matching tuple because of HOT chains,
- * although when using an MVCC snapshot it should be impossible for more than
- * one such tuple to exist.)
- *
- * On success, the buffer containing the heap tup is pinned (the pin will be
- * dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
- * call).
- *
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
- * scan keys if required.  We do not do that here because we don't have
- * enough information to do it efficiently in the general case.
- * ----------------
- */
-bool
-index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
-{
-	bool		all_dead = false;
-	bool		found;
-
-	found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
-									scan->xs_snapshot, slot,
-									&scan->xs_heap_continue, &all_dead);
-
-	if (found)
-		pgstat_count_heap_fetch(scan->indexRelation);
-
-	/*
-	 * If we scanned a whole HOT chain and found only dead tuples, tell index
-	 * AM to kill its entry for that TID (this will take effect in the next
-	 * amgettuple call, in index_getnext_tid).  We do not do this when in
-	 * recovery because it may violate MVCC to do so.  See comments in
-	 * RelationGetIndexScan().
-	 */
-	if (!scan->xactStartedInRecovery)
-		scan->kill_prior_tuple = all_dead;
-
-	return found;
-}
-
-/* ----------------
- *		index_getnext_slot - get the next tuple from a scan
- *
- * The result is true if a tuple satisfying the scan keys and the snapshot was
- * found, false otherwise.  The tuple is stored in the specified slot.
- *
- * On success, resources (like buffer pins) are likely to be held, and will be
- * dropped by a future index_getnext_tid, index_fetch_heap or index_endscan
- * call).
- *
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
- * scan keys if required.  We do not do that here because we don't have
- * enough information to do it efficiently in the general case.
- * ----------------
- */
-bool
-index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
-{
-	for (;;)
-	{
-		if (!scan->xs_heap_continue)
-		{
-			ItemPointer tid;
-
-			/* Time to fetch the next TID from the index */
-			tid = index_getnext_tid(scan, direction);
-
-			/* If we're out of index entries, we're done */
-			if (tid == NULL)
-				break;
-
-			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
-		}
-
-		/*
-		 * Fetch the next (or only) visible heap tuple for this index entry.
-		 * If we don't find anything, loop around and grab the next TID from
-		 * the index.
-		 */
-		Assert(ItemPointerIsValid(&scan->xs_heaptid));
-		if (index_fetch_heap(scan, slot))
-			return true;
-	}
-
-	return false;
-}
-
 /* ----------------
  *		index_getbitmap - get all tuples at once from an index scan
  *
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
new file mode 100644
index 000000000..a9a72810f
--- /dev/null
+++ b/src/backend/access/index/indexbatch.c
@@ -0,0 +1,752 @@
+/*-------------------------------------------------------------------------
+ *
+ * indexbatch.c
+ *	  Batch-based index scan infrastructure for the amgetbatch interface.
+ *
+ * This module provides the core infrastructure for batch-based index scans,
+ * which allow index AMs to return multiple matching TIDs per page in a single
+ * call.  The batch ring buffer is managed by the table AM, with help from us,
+ * and with help from the ring buffer inline functions in relscan.h.  This
+ * approach enables efficient prefetching of table AM blocks during ordered
+ * index scans.
+ *
+ * The ring buffer loads batches in index key space order.
+ *
+ * There are three types of functions in this module:
+ *
+ * 1. Core batch scan lifecycle (index_batchscan_*): Functions that manage
+ *    batch scan state including initialization, reset, cleanup, and the
+ *    mark/restore operations needed for merge joins.  Called by indexam.c
+ *    routines that manage index scans on behalf of the core executor.
+ *
+ * 2. Table AM utilities (tableam_util_*): Helper functions called by table
+ *    AMs during amgetbatch index scans.  These handle cross-batch direction
+ *    changes, recording dead items for a later call to amkillitemsbatch, and
+ *    freeing batches when the table AM is done with them.
+ *
+ * 3. Index AM utilities (indexam_util_*): Helper functions called by index
+ *    AMs that implement the amgetbatch interface.  These manage batch
+ *    allocation, index page buffer lock release, and batch memory recycling.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/index/indexbatch.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/amapi.h"
+#include "access/tableam.h"
+#include "common/int.h"
+#include "lib/qunique.h"
+
+static int	batch_compare_int(const void *va, const void *vb);
+
+/*
+ * Sets up the batch ring buffer structure for use by an index scan.
+ *
+ * Only call here when all of the index related fields in 'scan' were already
+ * initialized.
+ */
+void
+index_batchscan_init(IndexScanDesc scan)
+{
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+
+	scan->batchringbuf.scanPos.valid = false;
+	scan->batchringbuf.markPos.valid = false;
+
+	scan->batchringbuf.markBatch = NULL;
+	scan->batchringbuf.headBatch = 0;	/* initial head batch */
+	scan->batchringbuf.nextBatch = 0;	/* initial batch starts empty */
+	scan->batchringbuf.done = false;
+	memset(&scan->batchringbuf.cache, 0, sizeof(scan->batchringbuf.cache));
+
+	scan->usebatchring = true;
+}
+
+/*
+ * Reset state used for a batch index scan
+ */
+void
+index_batchscan_reset(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+	bool		markBatchFreed = false;
+
+	Assert(scan->xs_heapfetch);
+
+	batchringbuf->scanPos.valid = false;
+	batchringbuf->markPos.valid = false;
+
+	/*
+	 * Ensure tableam_util_free_batch won't skip the old markBatch in the loop
+	 * below
+	 */
+	batchringbuf->markBatch = NULL;
+
+	for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
+	{
+		IndexScanBatch batch = index_scan_batch(scan, i);
+
+		if (batch == markBatch)
+			markBatchFreed = true;
+
+		tableam_util_free_batch(scan, batch);
+	}
+
+	if (!markBatchFreed && unlikely(markBatch))
+		tableam_util_free_batch(scan, markBatch);
+
+	batchringbuf->headBatch = 0;
+	batchringbuf->nextBatch = 0;
+}
+
+/*
+ * Free resources at end of batch index scan
+ *
+ * Called when an index scan is being ended, right before the owning scan
+ * descriptor goes away.  Cleans up all batch related resources.
+ */
+void
+index_batchscan_end(IndexScanDesc scan)
+{
+	/* Free all remaining loaded batches (even markBatch) */
+	scan->batchringbuf.done = true;
+	index_batchscan_reset(scan);
+
+	for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+	{
+		IndexScanBatch cached = scan->batchringbuf.cache[i];
+
+		if (cached == NULL)
+			continue;
+
+		if (cached->deadItems)
+			pfree(cached->deadItems);
+		pfree(batch_alloc_base(cached, scan));
+	}
+}
+
+/*
+ * Set a mark from scanPos position
+ *
+ * Saves the current scan position and associated batch so that the scan can
+ * be restored to this point later, via a call to index_batchscan_restore_pos.
+ * The marked batch is retained and not freed until a new mark is set or the
+ * scan ends (or until the mark is restored).
+ */
+void
+index_batchscan_mark_pos(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
+	BatchRingItemPos *markPos = &batchringbuf->markPos;
+	IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch);
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+	bool		freeMarkBatch;
+
+	Assert(scan->MVCCScan);
+
+	/*
+	 * Free the previous mark batch (if any) -- but only if it isn't our
+	 * scanBatch (defensively make sure that markBatch isn't some later
+	 * still-needed batch, too)
+	 */
+	if (!markBatch || markBatch == scanBatch)
+	{
+		/* Definitely no markBatch that we should free now */
+		freeMarkBatch = false;
+	}
+	else if (likely(!index_scan_batch_loaded(scan, markPos->batch)))
+	{
+		/* Definitely have a no-longer-loaded markBatch to free */
+		freeMarkBatch = true;
+	}
+	else
+	{
+		/*
+		 * It looks like markBatch is loaded/still needed within batchringbuf.
+		 *
+		 * index_scan_batch_loaded indicates that markPos->batch is loaded
+		 * already, but we cannot fully trust it here.  It's just about
+		 * possible that markPos->batch falls within a since-recycled range of
+		 * batch offset numbers (following uint8 overflow).
+		 *
+		 * Make sure that markBatch really is loaded by directly comparing it
+		 * against all loaded batches.  We must not fail to release markBatch
+		 * when nobody else will later on.
+		 *
+		 * Note: in practice we're very unlikely to end up here.  It is very
+		 * atypical for an index scan on the inner side of a merge join to
+		 * hold on to a mark that trails the current scanBatch this much.
+		 */
+		freeMarkBatch = true;	/* i.e. index_scan_batch_loaded lied to us */
+
+		for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
+		{
+			if (index_scan_batch(scan, i) == markBatch)
+			{
+				/* index_scan_batch_loaded was right/no overflow happened */
+				freeMarkBatch = false;
+				break;
+			}
+		}
+	}
+
+	if (freeMarkBatch)
+	{
+		/* Free markBatch, since it isn't loaded/needed for batchringbuf */
+		batchringbuf->markBatch = NULL; /* else call won't free markBatch */
+		tableam_util_free_batch(scan, markBatch);
+	}
+
+	/* copy the scan's position */
+	batchringbuf->markPos = *scanPos;
+	batchringbuf->markBatch = scanBatch;
+}
+
+/*
+ * Restore mark to scanPos position
+ *
+ * Restores the scan to a position saved by index_batchscan_mark_pos earlier.
+ * The scan's markPos becomes its scanPos.  The marked batch is restored as
+ * the current scanBatch when needed.
+ *
+ * We just discard all batches (other than markBatch/restored scanBatch),
+ * except when markBatch is already the scan's current scanBatch.
+ */
+void
+index_batchscan_restore_pos(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
+	BatchRingItemPos *markPos = &batchringbuf->markPos;
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+	IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch);
+
+	Assert(scan->MVCCScan);
+	Assert(!batchringbuf->done);
+	Assert(markPos->valid);
+
+	if (scanBatch == markBatch)
+	{
+		/* markBatch is already scanBatch; needn't change batchringbuf */
+		Assert(scanPos->batch == markPos->batch);
+
+		scanPos->item = markPos->item;
+		return;
+	}
+
+	/*
+	 * markBatch is behind scanBatch, and so must not be saved in ring buffer
+	 * anymore.  We have to deal with restoring the mark the hard way: by
+	 * invalidating all other loaded batches.  This is similar to the case
+	 * where the scan direction changes and the scan actually crosses
+	 * batch/index page boundaries (see tableam_util_batch_dirchange).
+	 *
+	 * First, free all batches that are still in the ring buffer.
+	 */
+	for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
+	{
+		IndexScanBatch batch = index_scan_batch(scan, i);
+
+		Assert(batch != markBatch);
+
+		tableam_util_free_batch(scan, batch);
+	}
+
+	/*
+	 * Next "append" standalone markBatch, making the ring buffer appear as if
+	 * markBatch was the first batch ever returned by amgetbatch for the scan
+	 */
+	markPos->batch = 0;
+	batchringbuf->scanPos = *markPos;
+	batchringbuf->nextBatch = batchringbuf->headBatch = markPos->batch;
+	index_scan_batch_append(scan, markBatch);
+	Assert(index_scan_batch(scan, batchringbuf->scanPos.batch) == markBatch);
+
+	/*
+	 * Finally, call amposreset to let index AM know to invalidate any private
+	 * state that independently tracks the scan's progress
+	 */
+	if (scan->indexRelation->rd_indam->amposreset)
+		scan->indexRelation->rd_indam->amposreset(scan, markBatch);
+
+	/*
+	 * Note: markBatch.deadItems[] might already contain dead items, and might
+	 * yet have more dead items saved.  tableam_util_free_batch is prepared
+	 * for that.
+	 */
+}
+
+/* ----------------------------------------------------------------
+ *			utility functions called by table AMs
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Handle cross-batch change in scan direction
+ *
+ * Called by table AM when its scan changes direction in a way that
+ * necessitates backing the scan up to an index page originally associated
+ * with a now-freed batch.
+ *
+ * When we return, batchringbuf will only contain one batch (the current
+ * headBatch/scanBatch).  Caller can then safely pass this batch to amgetbatch
+ * to determine which batch comes next in the new scan direction.  From that
+ * point on batchringbuf will look as if our new scan direction had been used
+ * from the start.  This approach isn't particularly efficient, but it works
+ * well enough for what ought to be a relatively rare occurrence.
+ */
+void
+tableam_util_batch_dirchange(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	IndexScanBatch head;
+
+	/*
+	 * Release batches starting from the current "tail" batch, working
+	 * backwards until the current head batch (which must also be the current
+	 * scanBatch) is the only batch that hasn't been freed
+	 */
+	while (index_scan_batch_count(scan) > 1)
+	{
+		IndexScanBatch tail = index_scan_batch(scan,
+											   batchringbuf->nextBatch - 1);
+
+		tableam_util_free_batch(scan, tail);
+		batchringbuf->nextBatch--;
+	}
+
+	/* scanBatch is now the only batch still loaded */
+	Assert(batchringbuf->headBatch == batchringbuf->scanPos.batch);
+
+	/*
+	 * Deal with index AM state that independently tracks the progress of the
+	 * scan.  Do this by flipping the batch-level scan direction, and then
+	 * calling the index AM's amposreset.
+	 */
+	head = index_scan_batch(scan, batchringbuf->headBatch);
+	head->dir = -head->dir;
+	if (scan->indexRelation->rd_indam->amposreset)
+		scan->indexRelation->rd_indam->amposreset(scan, head);
+}
+
+/*
+ * Record that scanPos item is dead
+ *
+ * Records an offset to the current scanBatch/scanPos item, saving it in
+ * scanBatch's deadItems array.  The index tuples of all saved items will
+ * later be marked LP_DEAD when the current scanBatch is freed.
+ */
+void
+tableam_util_kill_scanpositem(IndexScanDesc scan)
+{
+	BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
+	IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch);
+
+	if (scanBatch->deadItems == NULL)
+		scanBatch->deadItems = palloc_array(int, scan->maxitemsbatch);
+	if (scanBatch->numDead < scan->maxitemsbatch)
+		scanBatch->deadItems[scanBatch->numDead++] = scanPos->item;
+}
+
+/*
+ * Release resources associated with a batch
+ *
+ * Called by table AM's ordered index scan implementation when it is finished
+ * with a batch and wishes to release its resources.
+ *
+ * We call amreleasebatch to release any index AM resources (e.g. buffer pins)
+ * that haven't been released yet.  For plain MVCC scans, the pin was already
+ * released eagerly, so amreleasebatch is a no-op.  Index-only scans must
+ * delay dropping the pin until visibility is resolved for all items in the
+ * batch, so amreleasebatch may still need to release here.  For non-MVCC
+ * snapshot scans, the pin is always held until amreleasebatch releases it.
+ *
+ * When the batch has dead items (numDead > 0) and the index AM provides an
+ * amkillitemsbatch callback, we call it to set LP_DEAD bits in the index
+ * page.  Finally, we cache the batch's memory for reuse (or free it).
+ *
+ * Note: Calling here when 'batch' is also batchringbuf.markBatch is a no-op.
+ * Callers that don't want this should set batchringbuf.markBatch to NULL
+ * before calling us.  Note that markBatch has to be explicitly freed.
+ */
+void
+tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	/* don't free caller's batch if it is scan's current markBatch */
+	if (batch == scan->batchringbuf.markBatch)
+		return;
+
+	/* Release interlock (e.g., buffer pin) when still held by index AM */
+	if (!scan->batchImmediateRelease)
+		tableam_util_release_batch(scan, batch);
+
+	/*
+	 * Let the index AM set LP_DEAD bits in the index page, if applicable.
+	 *
+	 * batch.deadItems[] is now in whatever order the scan returned items in.
+	 * We might have even saved the same item/TID twice.
+	 *
+	 * Sort and unique-ify deadItems[].  That way the index AM can safely
+	 * assume that items will always be in their original index page order.
+	 */
+	if (batch->numDead > 0 &&
+		scan->indexRelation->rd_indam->amkillitemsbatch != NULL)
+	{
+		if (batch->numDead > 1)
+		{
+			qsort(batch->deadItems, batch->numDead, sizeof(int),
+				  batch_compare_int);
+			batch->numDead = qunique(batch->deadItems, batch->numDead,
+									 sizeof(int), batch_compare_int);
+		}
+
+		scan->indexRelation->rd_indam->amkillitemsbatch(scan, batch);
+	}
+
+	/*
+	 * Use cache, just like indexam_util_batch_release does it.
+	 */
+	for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+	{
+		if (scan->batchringbuf.cache[i] == NULL)
+		{
+			/* found empty slot, we're done */
+			scan->batchringbuf.cache[i] = batch;
+			return;
+		}
+	}
+
+	if (batch->deadItems)
+		pfree(batch->deadItems);
+	pfree(batch_alloc_base(batch, scan));
+}
+
+/*
+ * Release batch resources held by the index AM
+ *
+ * Called by the table AM when it's safe to release whatever resources the
+ * index AM holds to prevent unsafe concurrent TID recycling by VACUUM
+ * (typically a buffer pin on the batch's index page in batch's opaque area).
+ */
+void
+tableam_util_release_batch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	/* Only supposed to be called during !batchImmediateRelease scans */
+	Assert(!scan->batchImmediateRelease);
+
+	scan->indexRelation->rd_indam->amreleasebatch(scan, batch);
+}
+
+/* ----------------------------------------------------------------
+ *			utility functions called by amgetbatch index AMs
+ *
+ * These functions manage batch allocation, unlock/pin management, and batch
+ * resource recycling.  Index AMs implementing amgetbatch should use these
+ * rather than managing buffers directly.
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Release batch's index page buffer lock
+ *
+ * Unlocks the given buffer in preparation for amgetbatch returning items
+ * saved in that batch.  Performs extra steps required by amgetbatch callers
+ * in passing.
+ *
+ * Only call here when a batch has one or more matching items to return using
+ * amgetbatch (or for amgetbitmap to load into its bitmap of matching TIDs).
+ * When an index page has no matches, it's always safe for index AMs to drop
+ * both the lock and the pin for themselves.
+ *
+ * Note: It is convenient for index AMs that implement both amgetbatch and
+ * amgetbitmap to consistently use the same batch management approach, since
+ * that avoids introducing special cases to lower-level code.  We drop both
+ * the lock and the pin on batch's page on behalf of amgetbitmap callers.
+ *
+ * For amgetbatch callers, when batchImmediateRelease is set (plain MVCC
+ * scans), we also release the pin here.  Otherwise the table AM will call
+ * amreleasebatch later when it's safe to drop the pin.
+ */
+void
+indexam_util_batch_unlock(IndexScanDesc scan, IndexScanBatch batch, Buffer buf)
+{
+	/* batch must have one or more matching items returned by index AM */
+	Assert(batch->firstItem >= 0 && batch->firstItem <= batch->lastItem);
+
+	if (scan->usebatchring)
+	{
+		/* amgetbatch (not amgetbitmap) caller */
+		Assert(scan->heapRelation != NULL);
+
+		/*
+		 * Have to set batch->lsn so that amkillitemsbatch has a way to detect
+		 * when concurrent heap TID recycling by VACUUM might have taken
+		 * place.  It'll only be safe to set any index tuple LP_DEAD bits when
+		 * the page LSN hasn't advanced.
+		 */
+		batch->lsn = BufferGetLSNAtomic(buf);
+
+		/* Drop the lock */
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+		if (scan->batchImmediateRelease)
+		{
+			/*
+			 * Plain MVCC scan: release the pin now.  No amreleasebatch
+			 * callback will be needed later.  The index AM caller must clear
+			 * its own opaque buf field after we return.
+			 */
+			ReleaseBuffer(buf);
+		}
+
+		/* else: table AM will call amreleasebatch when ready */
+	}
+	else
+	{
+		/* amgetbitmap (not amgetbatch) caller */
+		Assert(scan->heapRelation == NULL);
+
+		/* drop both the lock and the pin */
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buf);
+	}
+}
+
+/*
+ * Allocate a new batch
+ *
+ * Used by index AMs that support amgetbatch interface (both during amgetbatch
+ * and amgetbitmap scans).
+ *
+ * Returns IndexScanBatch with space to fit scan->maxitemsbatch-many
+ * BatchMatchingItem entries.  This will either be a newly allocated batch, or
+ * a batch recycled from the cache managed by indexam_util_batch_release.  See
+ * comments above indexam_util_batch_release.
+ *
+ * Housekeeping fields (buf, knownEndBackward/Forward, firstItem, lastItem,
+ * numDead, deadItems, currTuples) are initialized here.  The table AM's
+ * batch_init callback is invoked here to initialize the table AM opaque area.
+ * The index AM caller is responsible for filling in its per-batch opaque
+ * fields and the matching items[] array.
+ *
+ * Once populated, caller either passes the batch to indexam_util_batch_unlock
+ * (ahead of amgetbatch returning it), or to indexam_util_batch_release (when
+ * the page had no matches).
+ */
+IndexScanBatch
+indexam_util_batch_alloc(IndexScanDesc scan)
+{
+	IndexScanBatch batch = NULL;
+	bool		new_alloc = false;
+
+	/*
+	 * Lazily compute batch_table_offset on first allocation.  This combines
+	 * the table AM and index AM opaque sizes into a single offset that can be
+	 * used to find the table AM opaque area (and the true allocation base)
+	 * from the batch pointer.
+	 */
+	if (scan->batch_table_offset == 0 &&
+		(scan->batch_index_opaque_size > 0 ||
+		 (scan->xs_heapfetch && scan->xs_heapfetch->batch_opaque_size > 0)))
+	{
+		uint16		table_opaque = scan->xs_heapfetch ?
+			scan->xs_heapfetch->batch_opaque_size : 0;
+
+		scan->batch_table_offset = table_opaque +
+			scan->batch_index_opaque_size;
+	}
+
+	/* First look for an existing batch from the cache */
+	if (scan->usebatchring)
+	{
+		for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+		{
+			if (scan->batchringbuf.cache[i] != NULL)
+			{
+				/* Return cached unreferenced batch */
+				batch = scan->batchringbuf.cache[i];
+				scan->batchringbuf.cache[i] = NULL;
+				break;
+			}
+		}
+	}
+	else if (scan->xs_bitmap_batch != NULL)
+	{
+		/*
+		 * Reuse batch cached by a prior amgetbitmap iteration.  This path is
+		 * taken on every call here during an amgetbitmap scan after the first.
+		 */
+		batch = scan->xs_bitmap_batch;
+		scan->xs_bitmap_batch = NULL;
+	}
+
+	if (!batch)
+	{
+		Size		prefix_sz;
+		Size		base_sz;
+		Size		trailing_sz;
+		Size		allocsz;
+		char	   *raw;
+
+		/* AM opaque areas before the batch pointer */
+		prefix_sz = scan->batch_table_offset;
+
+		/* IndexScanBatchData header + items[] */
+		base_sz = offsetof(IndexScanBatchData, items) +
+			sizeof(BatchMatchingItem) * scan->maxitemsbatch;
+
+		/*
+		 * Trailing data after items[]: table AM per-item data (e.g. visInfo)
+		 * and currTuples index AM tuple workspace.
+		 */
+		trailing_sz = 0;
+		if (scan->xs_want_itup)
+		{
+			if (scan->xs_heapfetch &&
+				scan->xs_heapfetch->batch_per_item_size > 0)
+				trailing_sz += MAXALIGN(scan->xs_heapfetch->batch_per_item_size *
+										scan->maxitemsbatch);
+			trailing_sz += scan->batch_tuples_workspace;
+		}
+
+		allocsz = prefix_sz + MAXALIGN(base_sz) + trailing_sz;
+		raw = palloc(allocsz);
+		batch = (IndexScanBatch) (raw + prefix_sz);
+
+		/* Set up currTuples pointer for index-only scans */
+		if (scan->xs_want_itup && scan->batch_tuples_workspace > 0)
+		{
+			Size		itemsEnd = MAXALIGN(base_sz);
+			Size		tableTrailing = 0;
+
+			if (scan->xs_heapfetch &&
+				scan->xs_heapfetch->batch_per_item_size > 0)
+				tableTrailing = MAXALIGN(scan->xs_heapfetch->batch_per_item_size *
+										 scan->maxitemsbatch);
+			batch->currTuples = (char *) batch + itemsEnd + tableTrailing;
+		}
+		else
+			batch->currTuples = NULL;
+
+		/*
+		 * Batches allocate deadItems lazily (though note that cached batches
+		 * keep their deadItems allocation when recycled)
+		 */
+		batch->deadItems = NULL;
+		new_alloc = true;
+	}
+
+	/* xs_want_itup scans must get a currTuples space */
+	Assert(!(scan->xs_want_itup && scan->batch_tuples_workspace > 0 &&
+			 batch->currTuples == NULL));
+
+	/* Let the table AM initialize its per-batch opaque area */
+	if (scan->xs_heapfetch)
+		table_index_batch_init(scan, batch, new_alloc);
+
+	/* shared initialization */
+	batch->knownEndBackward = false;
+	batch->knownEndForward = false;
+	batch->firstItem = -1;
+	batch->lastItem = -1;
+	batch->numDead = 0;
+
+	return batch;
+}
+
+/*
+ * Release allocated batch
+ *
+ * This function is called by index AMs to release a batch allocated by
+ * indexam_util_batch_alloc.  Batches are cached here for reuse (when scan
+ * hasn't already finished) to reduce palloc/pfree overhead.
+ *
+ * It's safe to release a batch immediately when it was used to read a page
+ * that returned no matches to the scan.  Batches actually returned by index
+ * AM's amgetbatch routine (i.e. batches for pages with one or more matches)
+ * must be released by tableam_util_free_batch, which recycles them through
+ * the same cache after the index AM's amkillitemsbatch routine (if any).
+ * Index AMs that use batches should call here to release a batch from their
+ * amgetbatch or amgetbitmap routines.
+ *
+ * The rules for batch ownership differ slightly for amgetbitmap scans; see
+ * the amgetbitmap documentation in doc/src/sgml/indexam.sgml for details.
+ */
+void
+indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch)
+{
+	if (scan->usebatchring)
+	{
+		/* amgetbatch scan caller */
+		Assert(scan->heapRelation != NULL);
+
+		if (scan->batchringbuf.done)
+		{
+			/* Don't bother using cache when scan is ending */
+		}
+		else
+		{
+			/*
+			 * Use cache.  This is generally only beneficial when there are
+			 * many small rescans of an index.
+			 */
+			for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+			{
+				if (scan->batchringbuf.cache[i] == NULL)
+				{
+					/* found empty slot, we're done */
+					scan->batchringbuf.cache[i] = batch;
+					return;
+				}
+			}
+		}
+
+		/*
+		 * Failed to find a free slot for this batch.  We'll just free it
+		 * ourselves.  This isn't really expected; it's just defensive.
+		 */
+		if (batch->deadItems)
+			pfree(batch->deadItems);
+	}
+	else
+	{
+		/*
+		 * amgetbitmap scan caller.
+		 *
+		 * amgetbitmap routines are required to allocate no more than one
+		 * batch at a time, so we'll always have a free slot.
+		 */
+		Assert(scan->xs_bitmap_batch == NULL);
+		Assert(scan->heapRelation == NULL);
+		Assert(batch->deadItems == NULL);
+		Assert(batch->currTuples == NULL);
+
+		scan->xs_bitmap_batch = batch;
+		return;
+	}
+
+	/* no free slot to save this batch (or scan is ending), so free it */
+	pfree(batch_alloc_base(batch, scan));
+}
+
+/*
+ * qsort comparison function for int arrays
+ */
+static int
+batch_compare_int(const void *va, const void *vb)
+{
+	int			a = *((const int *) va);
+	int			b = *((const int *) vb);
+
+	return pg_cmp_s32(a, b);
+}
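Aside on the deadItems[] handling in tableam_util_free_batch above: the sort-then-unique step can be tried in isolation with the minimal standalone sketch below. It is not part of the patch. The sketch_* names are hypothetical, and sketch_unique_int is only a stand-in that assumes behavior similar to PostgreSQL's qunique() (drop adjacent duplicates from an already-sorted array).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Three-way int comparison that cannot overflow (unlike "a - b") */
static int
sketch_compare_int(const void *va, const void *vb)
{
	int			a = *(const int *) va;
	int			b = *(const int *) vb;

	return (a > b) - (a < b);
}

/*
 * Hypothetical stand-in for qunique(): remove adjacent duplicates from a
 * sorted int array in place, returning the new number of elements.
 */
static size_t
sketch_unique_int(int *items, size_t nitems)
{
	size_t		last = 0;

	if (nitems <= 1)
		return nitems;

	for (size_t i = 1; i < nitems; i++)
	{
		if (items[i] != items[last])
			items[++last] = items[i];
	}

	return last + 1;
}

/*
 * Put dead item offsets into ascending order with no repeats, so that a
 * consumer can assume original index page order.  Returns the new count.
 */
static size_t
sketch_normalize_dead_items(int *deadItems, size_t numDead)
{
	if (numDead > 1)
	{
		qsort(deadItems, numDead, sizeof(int), sketch_compare_int);
		numDead = sketch_unique_int(deadItems, numDead);
	}

	return numDead;
}
```

Passing an array such as {7, 2, 7, 5, 2} with a count of 5 should leave {2, 5, 7} at the front of the array and return 3. Sorting first is what lets a single linear pass catch items that were saved twice.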
diff --git a/src/backend/access/index/meson.build b/src/backend/access/index/meson.build
index da64cb595..83dfa3f2b 100644
--- a/src/backend/access/index/meson.build
+++ b/src/backend/access/index/meson.build
@@ -5,4 +5,5 @@ backend_sources += files(
   'amvalidate.c',
   'genam.c',
   'indexam.c',
+  'indexbatch.c',
 )
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index cb921ca2e..e75577a7e 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -179,18 +179,21 @@ hold on to the pin (used when reading from the leaf page) until _after_
 they're done visiting the heap (for TIDs from pinned leaf page) prevents
 concurrent TID recycling.  VACUUM cannot get a conflicting cleanup lock
 until the index scan is totally finished processing its leaf page.
+Note that the table AM determines when and where index page pins are
+dropped.  This is required by any index AM that implements the amgetbatch
+interface.  (See also doc/src/sgml/indexam.sgml.)
 
-This approach is fairly coarse, so we avoid it whenever possible.  In
-practice most index scans won't hold onto their pin, and so won't block
-VACUUM.  These index scans must deal with TID recycling directly, which is
-more complicated and not always possible.  See later section on making
-concurrent TID recycling safe.
+Blocking VACUUM like this can be disruptive, so table AMs avoid it whenever
+possible.  The heap table AM usually drops leaf page pins right away, though
+not during scans that use a non-MVCC snapshot.  Index-only scans may also
+retain pins in some cases.
 
-Opportunistic index tuple deletion performs almost the same page-level
-modifications while only holding an exclusive lock.  This is safe because
-there is no question of TID recycling taking place later on -- only VACUUM
-can make TIDs recyclable.  See also simple deletion and bottom-up
-deletion, below.
+Opportunistic index tuple deletion performs the same page-level
+modifications as VACUUM, while only holding an exclusive lock.  This is
+safe because there is no question of TID recycling taking place -- only
+VACUUM can make TIDs recyclable.  In other words, VACUUM's cleanup lock
+serves to protect non-MVCC snapshot scans from concurrent TID recycling
+hazards; it doesn't protect the B-Tree structure itself.
 
 Because a pin is not always held, and a page can be split even while
 someone does hold a pin on it, it is possible that an indexscan will
@@ -444,44 +447,25 @@ Making concurrent TID recycling safe
 ------------------------------------
 
 As explained in the earlier section about deleting index tuples during
-VACUUM, we implement a locking protocol that allows individual index scans
-to avoid concurrent TID recycling.  Index scans opt-out (and so drop their
-leaf page pin when visiting the heap) whenever it's safe to do so, though.
-Dropping the pin early is useful because it avoids blocking progress by
-VACUUM.  This is particularly important with index scans used by cursors,
-since idle cursors sometimes stop for relatively long periods of time.  In
-extreme cases, a client application may hold on to an idle cursors for
-hours or even days.  Blocking VACUUM for that long could be disastrous.
+VACUUM, we implement a locking protocol that helps table AMs deal with TID
+recycling hazards during scans that use a non-MVCC snapshot.
 
 Index scans that don't hold on to a buffer pin are protected by holding an
 MVCC snapshot instead.  This more limited interlock prevents wrong answers
 to queries, but it does not prevent concurrent TID recycling itself (only
 holding onto the leaf page pin while accessing the heap ensures that).
+For the most part, it is up to the table AM to deal with concurrent TID
+recycling hazards.  But we still need to directly consider such hazards when
+marking a known-dead index tuple LP_DEAD.
 
-Index-only scans can never drop their buffer pin, since they are unable to
-tolerate having a referenced TID become recyclable.  Index-only scans
-typically just visit the visibility map (not the heap proper), and so will
-not reliably notice that any stale TID reference (for a TID that pointed
-to a dead-to-all heap item at first) was concurrently marked LP_UNUSED in
-the heap by VACUUM.  This could easily allow VACUUM to set the whole heap
-page to all-visible in the visibility map immediately afterwards.  An MVCC
-snapshot is only sufficient to avoid problems during plain index scans
-because they must access granular visibility information from the heap
-proper.  A plain index scan will even recognize LP_UNUSED items in the
-heap (items that could be recycled but haven't been just yet) as "not
-visible" -- even when the heap page is generally considered all-visible.
-
-LP_DEAD setting of index tuples by the kill_prior_tuple optimization
-(described in full in simple deletion, below) is also more complicated for
-index scans that drop their leaf page pins.  We must be careful to avoid
-LP_DEAD-marking any new index tuple that looks like a known-dead index
-tuple because it happens to share the same TID, following concurrent TID
-recycling.  It's just about possible that some other session inserted a
-new, unrelated index tuple, on the same leaf page, which has the same
-original TID.  It would be totally wrong to LP_DEAD-set this new,
+We must avoid LP_DEAD-marking any new index tuple that looks like a
+known-dead index tuple because it happens to share the same TID, following
+concurrent TID recycling.  It's just about possible that some other session
+inserted a new, unrelated index tuple, on the same leaf page, which has the
+same original TID.  It would be totally wrong to LP_DEAD-set this new,
 unrelated index tuple.
 
-We handle this kill_prior_tuple race condition by having affected index
+We handle this LP_DEAD setting race condition by having all index
 scans conservatively assume that any change to the leaf page at all
 implies that it was reached by btbulkdelete in the interim period when no
 buffer pin was held.  This is implemented by not setting any LP_DEAD bits
@@ -734,7 +718,7 @@ of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
 During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
+and we never LP_DEAD-mark tuples. We do this because the oldest xmin
 on the standby server can be older than the oldest xmin on the primary
 server, which means tuples can be marked LP_DEAD even when they are
 still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
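Aside on the README's "Making concurrent TID recycling safe" text above: the conservative LSN check it describes can be modeled with the toy sketch below. It is a standalone illustration, not the nbtree code. ModelPage, ModelBatch and model_kill_items are hypothetical names, and the real amkillitemsbatch callback (not shown in this part of the patch) has to deal with locking, pins and posting lists, which are all omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_ITEMS 8

/* Toy leaf page: an LSN plus one "LP_DEAD" flag per item offset */
typedef struct ModelPage
{
	uint64_t	lsn;
	bool		lp_dead[MODEL_MAX_ITEMS];
} ModelPage;

/* Toy batch: the page LSN saved when items were read, plus dead offsets */
typedef struct ModelBatch
{
	uint64_t	lsn;
	int			deadItems[MODEL_MAX_ITEMS];
	int			numDead;
} ModelBatch;

/*
 * Set dead flags only when the page LSN is unchanged since the batch was
 * read.  Any change at all is conservatively treated as possible TID
 * recycling, in which case nothing is marked.  Returns the number of
 * flags newly set.
 */
static int
model_kill_items(ModelPage *page, const ModelBatch *batch)
{
	int			nset = 0;

	if (page->lsn != batch->lsn)
		return 0;				/* page changed; give up on all items */

	for (int i = 0; i < batch->numDead; i++)
	{
		int			off = batch->deadItems[i];

		if (off < 0 || off >= MODEL_MAX_ITEMS)
			continue;			/* ignore out-of-range offsets in this toy */
		if (!page->lp_dead[off])
		{
			page->lp_dead[off] = true;
			nset++;
		}
	}

	return nset;
}
```

The check is all-or-nothing: one differing LSN discards every hint saved for the batch. That matches the README's description of giving up on LP_DEAD setting whenever the leaf page changed at all while no pin was held.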
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index cc9c45dc4..c4ff6de2b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1037,6 +1037,9 @@ _bt_relbuf(Relation rel, Buffer buf)
  * Lock is acquired without acquiring another pin.  This is like a raw
  * LockBuffer() call, but performs extra steps needed by Valgrind.
  *
+ * Note: _bt_batch_unlock in nbtsearch.c (indexam_util_batch_unlock wrapper
+ * function) has matching Valgrind buffer lock instrumentation.
+ *
  * Note: Caller may need to call _bt_checkpage() with buf when pin on buf
  * wasn't originally acquired in _bt_getbuf() or _bt_relandgetbuf().
  */
@@ -1078,13 +1081,19 @@ _bt_unlockbuf(Relation rel, Buffer buf)
 	 * Buffer is pinned and locked, which means that it is expected to be
 	 * defined and addressable.  Check that proactively.
 	 */
-	VALGRIND_CHECK_MEM_IS_DEFINED(BufferGetPage(buf), BLCKSZ);
+#if defined(USE_VALGRIND)
+	Page		page = BufferGetPage(buf);
+
+	VALGRIND_CHECK_MEM_IS_DEFINED(page, BLCKSZ);
+#endif
 
 	/* LockBuffer() asserts that pin is held by this backend */
 	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
+#if defined(USE_VALGRIND)
 	if (!RelationUsesLocalBuffers(rel))
-		VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(buf), BLCKSZ);
+		VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ);
+#endif
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtreadpage.c b/src/backend/access/nbtree/nbtreadpage.c
index 2ba1ca660..43fc67629 100644
--- a/src/backend/access/nbtree/nbtreadpage.c
+++ b/src/backend/access/nbtree/nbtreadpage.c
@@ -32,6 +32,7 @@ typedef struct BTReadPageState
 {
 	/* Input parameters, set by _bt_readpage for _bt_checkkeys */
 	ScanDirection dir;			/* current scan direction */
+	BlockNumber currpage;		/* current page being read */
 	OffsetNumber minoff;		/* Lowest non-pivot tuple's offset */
 	OffsetNumber maxoff;		/* Highest non-pivot tuple's offset */
 	IndexTuple	finaltup;		/* Needed by scans with array keys */
@@ -63,14 +64,13 @@ static bool _bt_scanbehind_checkkeys(IndexScanDesc scan, ScanDirection dir,
 									 IndexTuple finaltup);
 static bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
 								  IndexTuple finaltup);
-static void _bt_saveitem(BTScanOpaque so, int itemIndex,
-						 OffsetNumber offnum, IndexTuple itup);
-static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
-								  OffsetNumber offnum, const ItemPointerData *heapTid,
-								  IndexTuple itup);
-static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
-									   OffsetNumber offnum,
-									   ItemPointer heapTid, int tupleOffset);
+static void _bt_saveitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+						 IndexTuple itup, int *tupleOffset);
+static int	_bt_setuppostingitems(IndexScanBatch newbatch, int itemIndex,
+								  OffsetNumber offnum, const ItemPointerData *tableTid,
+								  IndexTuple itup, int *tupleOffset);
+static inline void _bt_savepostingitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+									   ItemPointer tableTid, int baseOffset);
 static bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
 						  IndexTuple tuple, int tupnatts);
 static bool _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
@@ -111,15 +111,15 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
 
 
 /*
- *	_bt_readpage() -- Load data from current index page into so->currPos
+ *	_bt_readpage() -- Load data from current index page into newbatch.
  *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here.  Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate.  All other fields of so->currPos are
+ * Caller must have pinned and read-locked newbatch.buf; the buffer's state is
+ * not changed here.  Also, newbatch's moreLeft and moreRight must be valid;
+ * they are updated as appropriate.  All other fields of newbatch are
  * initialized from scratch here.
  *
  * We scan the current page starting at offnum and moving in the indicated
- * direction.  All items matching the scan keys are loaded into currPos.items.
+ * direction.  All items matching the scan keys are saved in newbatch.items.
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction
  * (could just be for the current primitive index scan when scan has arrays).
@@ -131,11 +131,12 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
  * Returns true if any matching items found on the page, false if none.
  */
 bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
-			 bool firstpage)
+_bt_readpage(IndexScanDesc scan, IndexScanBatch newbatch, ScanDirection dir,
+			 OffsetNumber offnum, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btnewbatch = BTBatchGetData(newbatch);
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber minoff;
@@ -144,23 +145,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	bool		arrayKeys,
 				ignore_killed_tuples = scan->ignore_killed_tuples;
 	int			itemIndex,
+				tupleOffset = 0,
 				indnatts;
 
 	/* save the page/buffer block number, along with its sibling links */
-	page = BufferGetPage(so->currPos.buf);
+	page = BufferGetPage(btnewbatch->buf);
 	opaque = BTPageGetOpaque(page);
-	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-	so->currPos.prevPage = opaque->btpo_prev;
-	so->currPos.nextPage = opaque->btpo_next;
-	/* delay setting so->currPos.lsn until _bt_drop_lock_and_maybe_pin */
-	pstate.dir = so->currPos.dir = dir;
-	so->currPos.nextTupleOffset = 0;
+	pstate.currpage = btnewbatch->currPage = BufferGetBlockNumber(btnewbatch->buf);
+	btnewbatch->prevPage = opaque->btpo_prev;
+	btnewbatch->nextPage = opaque->btpo_next;
+	pstate.dir = newbatch->dir = dir;
 
 	/* either moreRight or moreLeft should be set now (may be unset later) */
-	Assert(ScanDirectionIsForward(dir) ? so->currPos.moreRight :
-		   so->currPos.moreLeft);
+	Assert(ScanDirectionIsForward(dir) ? btnewbatch->moreRight : btnewbatch->moreLeft);
 	Assert(!P_IGNORE(opaque));
-	Assert(BTScanPosIsPinned(so->currPos));
 	Assert(!so->needPrimScan);
 
 	/* initialize local variables */
@@ -188,14 +186,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	{
 		/* allow next/prev page to be read by other worker without delay */
 		if (ScanDirectionIsForward(dir))
-			_bt_parallel_release(scan, so->currPos.nextPage,
-								 so->currPos.currPage);
+			_bt_parallel_release(scan, btnewbatch->nextPage,
+								 btnewbatch->currPage);
 		else
-			_bt_parallel_release(scan, so->currPos.prevPage,
-								 so->currPos.currPage);
+			_bt_parallel_release(scan, btnewbatch->prevPage,
+								 btnewbatch->currPage);
 	}
 
-	PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);
+	PredicateLockPage(rel, pstate.currpage, scan->xs_snapshot);
 
 	if (ScanDirectionIsForward(dir))
 	{
@@ -212,11 +210,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 					!_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup))
 				{
 					/* Schedule another primitive index scan after all */
-					so->currPos.moreRight = false;
+					btnewbatch->moreRight = false;
 					so->needPrimScan = true;
 					if (scan->parallel_scan)
 						_bt_parallel_primscan_schedule(scan,
-													   so->currPos.currPage);
+													   btnewbatch->currPage);
 					return false;
 				}
 			}
@@ -280,26 +278,26 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				if (!BTreeTupleIsPosting(itup))
 				{
 					/* Remember it */
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset);
 					itemIndex++;
 				}
 				else
 				{
-					int			tupleOffset;
+					int			baseOffset;
 
 					/* Set up posting list state (and remember first TID) */
-					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+					baseOffset =
+						_bt_setuppostingitems(newbatch, itemIndex, offnum,
 											  BTreeTupleGetPostingN(itup, 0),
-											  itup);
+											  itup, &tupleOffset);
 					itemIndex++;
 
 					/* Remember all later TIDs (must be at least one) */
 					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
 					{
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(newbatch, itemIndex, offnum,
 											BTreeTupleGetPostingN(itup, i),
-											tupleOffset);
+											baseOffset);
 						itemIndex++;
 					}
 				}
@@ -339,12 +337,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		}
 
 		if (!pstate.continuescan)
-			so->currPos.moreRight = false;
+			btnewbatch->moreRight = false;
 
 		Assert(itemIndex <= MaxTIDsPerBTreePage);
-		so->currPos.firstItem = 0;
-		so->currPos.lastItem = itemIndex - 1;
-		so->currPos.itemIndex = 0;
+		newbatch->firstItem = 0;
+		newbatch->lastItem = itemIndex - 1;
 	}
 	else
 	{
@@ -361,11 +358,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 					!_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup))
 				{
 					/* Schedule another primitive index scan after all */
-					so->currPos.moreLeft = false;
+					btnewbatch->moreLeft = false;
 					so->needPrimScan = true;
 					if (scan->parallel_scan)
 						_bt_parallel_primscan_schedule(scan,
-													   so->currPos.currPage);
+													   btnewbatch->currPage);
 					return false;
 				}
 			}
@@ -466,27 +463,27 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				{
 					/* Remember it */
 					itemIndex--;
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset);
 				}
 				else
 				{
 					uint16		nitems = BTreeTupleGetNPosting(itup);
-					int			tupleOffset;
+					int			baseOffset;
 
 					/* Set up posting list state (and remember last TID) */
 					itemIndex--;
-					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+					baseOffset =
+						_bt_setuppostingitems(newbatch, itemIndex, offnum,
 											  BTreeTupleGetPostingN(itup, nitems - 1),
-											  itup);
+											  itup, &tupleOffset);
 
 					/* Remember all prior TIDs (must be at least one) */
 					for (int i = nitems - 2; i >= 0; i--)
 					{
 						itemIndex--;
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(newbatch, itemIndex, offnum,
 											BTreeTupleGetPostingN(itup, i),
-											tupleOffset);
+											baseOffset);
 					}
 				}
 			}
@@ -502,12 +499,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		 * be found there
 		 */
 		if (!pstate.continuescan)
-			so->currPos.moreLeft = false;
+			btnewbatch->moreLeft = false;
 
 		Assert(itemIndex >= 0);
-		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
-		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+		newbatch->firstItem = itemIndex;
+		newbatch->lastItem = MaxTIDsPerBTreePage - 1;
 	}
 
 	/*
@@ -524,7 +520,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	 */
 	Assert(!pstate.forcenonrequired);
 
-	return (so->currPos.firstItem <= so->currPos.lastItem);
+	return (newbatch->firstItem <= newbatch->lastItem);
 }
 
 /*
@@ -1027,90 +1023,91 @@ _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
 	return true;
 }
 
-/* Save an index item into so->currPos.items[itemIndex] */
+/* Save an index item into newbatch->items[itemIndex] */
 static void
-_bt_saveitem(BTScanOpaque so, int itemIndex,
-			 OffsetNumber offnum, IndexTuple itup)
+_bt_saveitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+			 IndexTuple itup, int *tupleOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
-
 	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
 
-	currItem->heapTid = itup->t_tid;
-	currItem->indexOffset = offnum;
-	if (so->currTuples)
+	newbatch->items[itemIndex].tableTid = itup->t_tid;
+	newbatch->items[itemIndex].indexOffset = offnum;
+
+	if (newbatch->currTuples)
 	{
 		Size		itupsz = IndexTupleSize(itup);
 
-		currItem->tupleOffset = so->currPos.nextTupleOffset;
-		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
-		so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+		newbatch->items[itemIndex].tupleOffset = *tupleOffset;
+		memcpy(newbatch->currTuples + *tupleOffset, itup, itupsz);
+		*tupleOffset += MAXALIGN(itupsz);
 	}
 }
 
 /*
  * Setup state to save TIDs/items from a single posting list tuple.
  *
- * Saves an index item into so->currPos.items[itemIndex] for TID that is
- * returned to scan first.  Second or subsequent TIDs for posting list should
- * be saved by calling _bt_savepostingitem().
+ * Saves an index item into newbatch->items[itemIndex] for TID that is returned
+ * to scan first.  Second or subsequent TIDs for posting list should be saved
+ * by calling _bt_savepostingitem().
  *
- * Returns an offset into tuple storage space that main tuple is stored at if
- * needed.
+ * Returns baseOffset, an offset into tuple storage space that main tuple is
+ * stored at if needed.
  */
 static int
-_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
-					  const ItemPointerData *heapTid, IndexTuple itup)
+_bt_setuppostingitems(IndexScanBatch newbatch, int itemIndex,
+					  OffsetNumber offnum, const ItemPointerData *tableTid,
+					  IndexTuple itup, int *tupleOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *item = &newbatch->items[itemIndex];
 
 	Assert(BTreeTupleIsPosting(itup));
 
-	currItem->heapTid = *heapTid;
-	currItem->indexOffset = offnum;
-	if (so->currTuples)
+	item->tableTid = *tableTid;
+	item->indexOffset = offnum;
+
+	if (newbatch->currTuples)
 	{
 		/* Save base IndexTuple (truncate posting list) */
 		IndexTuple	base;
 		Size		itupsz = BTreeTupleGetPostingOffset(itup);
 
 		itupsz = MAXALIGN(itupsz);
-		currItem->tupleOffset = so->currPos.nextTupleOffset;
-		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		item->tupleOffset = *tupleOffset;
+		base = (IndexTuple) (newbatch->currTuples + *tupleOffset);
 		memcpy(base, itup, itupsz);
 		/* Defensively reduce work area index tuple header size */
 		base->t_info &= ~INDEX_SIZE_MASK;
 		base->t_info |= itupsz;
-		so->currPos.nextTupleOffset += itupsz;
+		*tupleOffset += itupsz;
 
-		return currItem->tupleOffset;
+		return item->tupleOffset;
 	}
 
 	return 0;
 }
 
 /*
- * Save an index item into so->currPos.items[itemIndex] for current posting
+ * Save an index item into newbatch->items[itemIndex] for current posting
  * tuple.
  *
  * Assumes that _bt_setuppostingitems() has already been called for current
- * posting list tuple.  Caller passes its return value as tupleOffset.
+ * posting list tuple.  Caller passes its return value as baseOffset.
  */
 static inline void
-_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
-					ItemPointer heapTid, int tupleOffset)
+_bt_savepostingitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+					ItemPointer tableTid, int baseOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *item = &newbatch->items[itemIndex];
 
-	currItem->heapTid = *heapTid;
-	currItem->indexOffset = offnum;
+	item->tableTid = *tableTid;
+	item->indexOffset = offnum;
 
 	/*
 	 * Have index-only scans return the same base IndexTuple for every TID
 	 * that originates from the same posting list
 	 */
-	if (so->currTuples)
-		currItem->tupleOffset = tupleOffset;
+	if (newbatch->currTuples)
+		item->tupleOffset = baseOffset;
 }
 
 #define LOOK_AHEAD_REQUIRED_RECHECKS 	3
@@ -2822,13 +2819,13 @@ new_prim_scan:
 	 * Note: We make a soft assumption that the current scan direction will
 	 * also be used within _bt_next, when it is asked to step off this page.
 	 * It is up to _bt_next to cancel this scheduled primitive index scan
-	 * whenever it steps to a page in the direction opposite currPos.dir.
+	 * whenever it steps to a page in the direction opposite pstate->dir.
 	 */
 	pstate->continuescan = false;	/* Tell _bt_readpage we're done... */
 	so->needPrimScan = true;	/* ...but call _bt_first again */
 
 	if (scan->parallel_scan)
-		_bt_parallel_primscan_schedule(scan, so->currPos.currPage);
+		_bt_parallel_primscan_schedule(scan, pstate->currpage);
 
 	/* Caller's tuple doesn't match the new qual */
 	return false;
@@ -2913,14 +2910,6 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
 	 * Restore the array keys to the state they were in immediately before we
 	 * were called.  This ensures that the arrays only ever ratchet in the
 	 * current scan direction.
-	 *
-	 * Without this, scans could overlook matching tuples when the scan
-	 * direction gets reversed just before btgettuple runs out of items to
-	 * return, but just after _bt_readpage prepares all the items from the
-	 * scan's final page in so->currPos.  When we're on the final page it is
-	 * typical for so->currPos to get invalidated once btgettuple finally
-	 * returns false, which'll effectively invalidate the scan's array keys.
-	 * That hasn't happened yet, though -- and in general it may never happen.
 	 */
 	_bt_start_array_keys(scan, -dir);
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 0da48b42a..34aec59ca 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -160,11 +160,13 @@ bthandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = btadjustmembers,
 		.ambeginscan = btbeginscan,
 		.amrescan = btrescan,
-		.amgettuple = btgettuple,
+		.amgettuple = NULL,
+		.amgetbatch = btgetbatch,
+		.amkillitemsbatch = btkillitemsbatch,
+		.amreleasebatch = btreleasebatch,
 		.amgetbitmap = btgetbitmap,
 		.amendscan = btendscan,
-		.ammarkpos = btmarkpos,
-		.amrestrpos = btrestrpos,
+		.amposreset = btposreset,
 		.amestimateparallelscan = btestimateparallelscan,
 		.aminitparallelscan = btinitparallelscan,
 		.amparallelrescan = btparallelrescan,
@@ -223,13 +225,13 @@ btinsert(Relation rel, Datum *values, bool *isnull,
 }
 
 /*
- *	btgettuple() -- Get the next tuple in the scan.
+ *	btgetbatch() -- Get the first or next batch of tuples in the scan
  */
-bool
-btgettuple(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+btgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	bool		res;
+	IndexScanBatch batch = priorbatch;
 
 	Assert(scan->heapRelation != NULL);
 
@@ -242,45 +244,20 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * If we've already initialized this scan, we can just advance it in
 		 * the appropriate direction.  If we haven't done so yet, we call
-		 * _bt_first() to get the first item in the scan.
+		 * _bt_first() to get the first batch in the scan.
 		 */
-		if (!BTScanPosIsValid(so->currPos))
-			res = _bt_first(scan, dir);
+		if (batch == NULL)
+			batch = _bt_first(scan, dir);
 		else
-		{
-			/*
-			 * Check to see if we should kill the previously-fetched tuple.
-			 */
-			if (scan->kill_prior_tuple)
-			{
-				/*
-				 * Yes, remember it for later. (We'll deal with all such
-				 * tuples at once right before leaving the index page.)  The
-				 * test for numKilled overrun is not just paranoia: if the
-				 * caller reverses direction in the indexscan then the same
-				 * item might get entered multiple times. It's not worth
-				 * trying to optimize that, so we don't detect it, but instead
-				 * just forget any excess entries.
-				 */
-				if (so->killedItems == NULL)
-					so->killedItems = palloc_array(int, MaxTIDsPerBTreePage);
-				if (so->numKilled < MaxTIDsPerBTreePage)
-					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
-			}
+			batch = _bt_next(scan, dir, batch);
 
-			/*
-			 * Now continue the scan.
-			 */
-			res = _bt_next(scan, dir);
-		}
-
-		/* If we have a tuple, return it ... */
-		if (res)
+		/* If we have a batch, return it ... */
+		if (batch)
 			break;
 		/* ... otherwise see if we need another primitive index scan */
 	} while (so->numArrayKeys && _bt_start_prim_scan(scan));
 
-	return res;
+	return batch;
 }
 
 /*
@@ -290,38 +267,43 @@ int64
 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	IndexScanBatch batch;
 	int64		ntids = 0;
-	ItemPointer heapTid;
+	ItemPointer tableTid;
 
 	Assert(scan->heapRelation == NULL);
 
 	/* Each loop iteration performs another primitive index scan */
 	do
 	{
-		/* Fetch the first page & tuple */
-		if (_bt_first(scan, ForwardScanDirection))
+		/* Fetch the first batch */
+		if ((batch = _bt_first(scan, ForwardScanDirection)))
 		{
-			/* Save tuple ID, and continue scanning */
-			heapTid = &scan->xs_heaptid;
-			tbm_add_tuples(tbm, heapTid, 1, false);
+			int			itemIndex = 0;
+
+			/* Save first tuple's TID */
+			tableTid = &batch->items[itemIndex].tableTid;
+			tbm_add_tuples(tbm, tableTid, 1, false);
 			ntids++;
 
 			for (;;)
 			{
-				/*
-				 * Advance to next tuple within page.  This is the same as the
-				 * easy case in _bt_next().
-				 */
-				if (++so->currPos.itemIndex > so->currPos.lastItem)
+				/* Advance to next TID within page-sized batch */
+				if (++itemIndex > batch->lastItem)
 				{
-					/* let _bt_next do the heavy lifting */
-					if (!_bt_next(scan, ForwardScanDirection))
+					/*
+					 * _bt_next releases the prior batch for bitmap callers
+					 * before allocating the next one, so only one batch is
+					 * ever used at a time
+					 */
+					itemIndex = 0;
+					batch = _bt_next(scan, ForwardScanDirection, batch);
+					if (!batch)
 						break;
 				}
 
-				/* Save tuple ID, and continue scanning */
-				heapTid = &so->currPos.items[so->currPos.itemIndex].heapTid;
-				tbm_add_tuples(tbm, heapTid, 1, false);
+				tableTid = &batch->items[itemIndex].tableTid;
+				tbm_add_tuples(tbm, tableTid, 1, false);
 				ntids++;
 			}
 		}
@@ -348,8 +330,6 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 
 	/* allocate private workspace */
 	so = palloc_object(BTScanOpaqueData);
-	BTScanPosInvalidate(so->currPos);
-	BTScanPosInvalidate(so->markPos);
 	if (scan->numberOfKeys > 0)
 		so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
 	else
@@ -363,19 +343,11 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->orderProcs = NULL;
 	so->arrayContext = NULL;
 
-	so->killedItems = NULL;		/* until needed */
-	so->numKilled = 0;
-
-	/*
-	 * We don't know yet whether the scan will be index-only, so we do not
-	 * allocate the tuple workspace arrays until btrescan.  However, we set up
-	 * scan->xs_itupdesc whether we'll need it or not, since that's so cheap.
-	 */
-	so->currTuples = so->markTuples = NULL;
-
-	scan->xs_itupdesc = RelationGetDescr(rel);
-
 	scan->opaque = so;
+	scan->xs_itupdesc = RelationGetDescr(rel);
+	scan->maxitemsbatch = MaxTIDsPerBTreePage;
+	scan->batch_index_opaque_size = MAXALIGN(sizeof(BTBatchData));
+	scan->batch_tuples_workspace = BLCKSZ;
 
 	return scan;
 }
@@ -389,64 +361,186 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pins */
-	if (BTScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_bt_killitems(scan);
-		BTScanPosUnpinIfPinned(so->currPos);
-		BTScanPosInvalidate(so->currPos);
-	}
-
-	/*
-	 * We prefer to eagerly drop leaf page pins before btgettuple returns.
-	 * This avoids making VACUUM wait to acquire a cleanup lock on the page.
-	 *
-	 * We cannot safely drop leaf page pins during index-only scans due to a
-	 * race condition involving VACUUM setting pages all-visible in the VM.
-	 * It's also unsafe for plain index scans that use a non-MVCC snapshot.
-	 *
-	 * Also opt out of dropping leaf page pins eagerly during bitmap scans.
-	 * Pins cannot be held for more than an instant during bitmap scans either
-	 * way, so we might as well avoid wasting cycles on acquiring page LSNs.
-	 *
-	 * See nbtree/README section on making concurrent TID recycling safe.
-	 *
-	 * Note: so->dropPin should never change across rescans.
-	 */
-	so->dropPin = (!scan->xs_want_itup &&
-				   IsMVCCSnapshot(scan->xs_snapshot) &&
-				   scan->heapRelation != NULL);
-
-	so->markItemIndex = -1;
-	so->needPrimScan = false;
-	so->scanBehind = false;
-	so->oppositeDirCheck = false;
-	BTScanPosUnpinIfPinned(so->markPos);
-	BTScanPosInvalidate(so->markPos);
-
-	/*
-	 * Allocate tuple workspace arrays, if needed for an index-only scan and
-	 * not already done in a previous rescan call.  To save on palloc
-	 * overhead, both workspaces are allocated as one palloc block; only this
-	 * function and btendscan know that.
-	 */
-	if (scan->xs_want_itup && so->currTuples == NULL)
-	{
-		so->currTuples = (char *) palloc(BLCKSZ * 2);
-		so->markTuples = so->currTuples + BLCKSZ;
-	}
-
 	/*
 	 * Reset the scan keys
 	 */
 	if (scankey && scan->numberOfKeys > 0)
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
+	so->needPrimScan = false;
+	so->scanBehind = false;
+	so->oppositeDirCheck = false;
 	so->numberOfKeys = 0;		/* until _bt_preprocess_keys sets it */
 	so->numArrayKeys = 0;		/* ditto */
 }
 
+/*
+ *	btkillitemsbatch() -- Mark dead items' index tuples LP_DEAD
+ */
+void
+btkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	Relation	rel = scan->indexRelation;
+	BTBatchData *btbatch = BTBatchGetData(batch);
+	Page		page;
+	BTPageOpaque opaque;
+	OffsetNumber minoff;
+	OffsetNumber maxoff;
+	bool		killedsomething = false;
+	Buffer		buf;
+	XLogRecPtr	latestlsn;
+
+	/* Table AM should have already released batch page's pin by now */
+	Assert(batch->numDead > 0);
+
+	buf = _bt_getbuf(rel, btbatch->currPage, BT_READ);
+
+	latestlsn = BufferGetLSNAtomic(buf);
+	Assert(batch->lsn <= latestlsn);
+	if (batch->lsn != latestlsn)
+	{
+		/* Modified, give up on hinting */
+		_bt_relbuf(rel, buf);
+		return;
+	}
+
+	page = BufferGetPage(buf);
+	opaque = BTPageGetOpaque(page);
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate through batch->deadItems[] in leaf page order */
+	for (int i = 0; i < batch->numDead; i++)
+	{
+		int			itemIndex = batch->deadItems[i];
+		BatchMatchingItem *kitem = &batch->items[itemIndex];
+		OffsetNumber offnum = kitem->indexOffset;
+
+		Assert(itemIndex >= batch->firstItem && itemIndex <= batch->lastItem);
+		Assert(i == 0 ||
+			   offnum >= batch->items[batch->deadItems[i - 1]].indexOffset);
+
+		if (offnum < minoff)
+			continue;			/* pure paranoia */
+		while (offnum <= maxoff)
+		{
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
+
+			if (BTreeTupleIsPosting(ituple))
+			{
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->tableTid))
+						break;	/* out of posting list loop */
+
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < batch->numDead)
+						kitem = &batch->items[batch->deadItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing dead items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->tableTid))
+				killtuple = true;
+
+			/*
+			 * Mark index item as dead, if it isn't already.  Since this
+			 * happens while holding a shared buffer lock, it's possible that
+			 * multiple processes attempt to do this simultaneously, leading
+			 * to multiple full-page images being sent to WAL (if
+			 * wal_log_hints or data checksums are enabled), which is
+			 * undesirable.
+			 */
+			if (killtuple && !ItemIdIsDead(iid))
+			{
+				if (!killedsomething)
+				{
+					/*
+					 * Use the hint bit infrastructure to check if we can
+					 * update the page while just holding a share lock. If we
+					 * are not allowed, there's no point continuing.
+					 */
+					if (!BufferBeginSetHintBits(buf))
+						goto unlock_page;
+				}
+
+				/* found the item/all posting list items */
+				ItemIdMarkDead(iid);
+				killedsomething = true;
+				break;			/* out of inner search loop */
+			}
+			offnum = OffsetNumberNext(offnum);
+		}
+	}
+
+	/*
+	 * Since this can be redone later if needed, mark as dirty hint.
+	 *
+	 * Whenever we mark anything LP_DEAD, we also set the page's
+	 * BTP_HAS_GARBAGE flag, which is likewise just a hint.  (Note that we
+	 * only rely on the page-level flag in !heapkeyspace indexes.)
+	 */
+	if (killedsomething)
+	{
+		opaque->btpo_flags |= BTP_HAS_GARBAGE;
+		BufferFinishSetHintBits(buf, true, true);
+	}
+
+unlock_page:
+	_bt_relbuf(rel, buf);
+}
+
+/*
+ *	btreleasebatch() -- Release batch's index page buffer pin
+ *
+ * Called by the table AM (via amreleasebatch) when it's safe to drop the
+ * buffer pin held to prevent concurrent TID recycling by VACUUM.
+ * Must be idempotent -- safe to call when the pin has already been released.
+ */
+void
+btreleasebatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	BTBatchData *btbatch = BTBatchGetData(batch);
+
+	if (BufferIsValid(btbatch->buf))
+	{
+		ReleaseBuffer(btbatch->buf);
+		btbatch->buf = InvalidBuffer;
+	}
+}
+
 /*
  *	btendscan() -- close down a scan
  */
@@ -455,116 +549,63 @@ btendscan(IndexScanDesc scan)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pins */
-	if (BTScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_bt_killitems(scan);
-		BTScanPosUnpinIfPinned(so->currPos);
-	}
-
-	so->markItemIndex = -1;
-	BTScanPosUnpinIfPinned(so->markPos);
-
-	/* No need to invalidate positions, the RAM is about to be freed. */
-
 	/* Release storage */
 	if (so->keyData != NULL)
 		pfree(so->keyData);
 	/* so->arrayKeys and so->orderProcs are in arrayContext */
 	if (so->arrayContext != NULL)
 		MemoryContextDelete(so->arrayContext);
-	if (so->killedItems != NULL)
-		pfree(so->killedItems);
-	if (so->currTuples != NULL)
-		pfree(so->currTuples);
-	/* so->markTuples should not be pfree'd, see btrescan */
 	pfree(so);
 }
 
 /*
- *	btmarkpos() -- save current scan position
+ *	btposreset() -- reset array key state for scan position change
+ *
+ * Called by the core system when the scan's logical position is about to
+ * change in a way that invalidates our array key state.  This happens when
+ * restoring a marked position, or when the scan crosses a batch boundary
+ * while moving in the opposite direction to the one originally used.
+ *
+ * For direction changes, the core system will have already flipped the
+ * batch's dir field before calling here; we use this updated direction when
+ * resetting our array keys.  For mark restoration, the batch's dir will
+ * retain its original value (from when btgetbatch returned it).
  */
 void
-btmarkpos(IndexScanDesc scan)
+btposreset(IndexScanDesc scan, IndexScanBatch batch)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btbatch = BTBatchGetData(batch);
 
-	/* There may be an old mark with a pin (but no lock). */
-	BTScanPosUnpinIfPinned(so->markPos);
+	if (!so->numArrayKeys)
+		return;
 
 	/*
-	 * Just record the current itemIndex.  If we later step to next page
-	 * before releasing the marked position, _bt_steppage makes a full copy of
-	 * the currPos struct in markPos.  If (as often happens) the mark is moved
-	 * before we leave the page, we don't have to do that work.
+	 * Reset array keys to initial state for the batch's scan direction.  Also
+	 * clear needPrimScan and related flags.  These were set based on the soft
+	 * assumption that the scan would always proceed in the same direction.
+	 *
+	 * These steps work around the soft assumption being violated: they force
+	 * the scan to step to the next/previous page, making the arrays recover.
+	 * When we go to read that page, _bt_readpage will reliably determine if a
+	 * primitive scan really is needed based on the page's tuples.  If there's
+	 * a primitive scan, it will reposition the scan using new array values
+	 * (based on the tuples from the neighboring page we'll step on to).
+	 *
+	 * We need to reset the array key state in the correct direction so that
+	 * we won't get confused.  When the array keys are behind the key space
+	 * for the page we're stepping on to (behind in terms of the scan dir),
+	 * they will catch up automatically.  But when they're ahead of that
+	 * page's key space, the scan could miss matching tuples.
 	 */
-	if (BTScanPosIsValid(so->currPos))
-		so->markItemIndex = so->currPos.itemIndex;
+	_bt_start_array_keys(scan, batch->dir);
+	if (ScanDirectionIsForward(batch->dir))
+		btbatch->moreRight = true;
 	else
-	{
-		BTScanPosInvalidate(so->markPos);
-		so->markItemIndex = -1;
-	}
-}
-
-/*
- *	btrestrpos() -- restore scan to last saved position
- */
-void
-btrestrpos(IndexScanDesc scan)
-{
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
-	if (so->markItemIndex >= 0)
-	{
-		/*
-		 * The scan has never moved to a new page since the last mark.  Just
-		 * restore the itemIndex.
-		 *
-		 * NB: In this case we can't count on anything in so->markPos to be
-		 * accurate.
-		 */
-		so->currPos.itemIndex = so->markItemIndex;
-	}
-	else
-	{
-		/*
-		 * The scan moved to a new page after last mark or restore, and we are
-		 * now restoring to the marked page.  We aren't holding any read
-		 * locks, but if we're still holding the pin for the current position,
-		 * we must drop it.
-		 */
-		if (BTScanPosIsValid(so->currPos))
-		{
-			/* Before leaving current page, deal with any killed items */
-			if (so->numKilled > 0)
-				_bt_killitems(scan);
-			BTScanPosUnpinIfPinned(so->currPos);
-		}
-
-		if (BTScanPosIsValid(so->markPos))
-		{
-			/* bump pin on mark buffer for assignment to current buffer */
-			if (BTScanPosIsPinned(so->markPos))
-				IncrBufferRefCount(so->markPos.buf);
-			memcpy(&so->currPos, &so->markPos,
-				   offsetof(BTScanPosData, items[1]) +
-				   so->markPos.lastItem * sizeof(BTScanPosItem));
-			if (so->currTuples)
-				memcpy(so->currTuples, so->markTuples,
-					   so->markPos.nextTupleOffset);
-			/* Reset the scan's array keys (see _bt_steppage for why) */
-			if (so->numArrayKeys)
-			{
-				_bt_start_array_keys(scan, so->currPos.dir);
-				so->needPrimScan = false;
-			}
-		}
-		else
-			BTScanPosInvalidate(so->currPos);
-	}
+		btbatch->moreLeft = true;
+	so->needPrimScan = false;
+	so->scanBehind = false;
+	so->oppositeDirCheck = false;
 }
 
 /*
@@ -880,15 +921,6 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
 	*next_scan_page = InvalidBlockNumber;
 	*last_curr_page = InvalidBlockNumber;
 
-	/*
-	 * Reset so->currPos, and initialize moreLeft/moreRight such that the next
-	 * call to _bt_readnextpage treats this backend similarly to a serial
-	 * backend that steps from *last_curr_page to *next_scan_page (unless this
-	 * backend's so->currPos is initialized by _bt_readfirstpage before then).
-	 */
-	BTScanPosInvalidate(so->currPos);
-	so->currPos.moreLeft = so->currPos.moreRight = true;
-
 	if (first)
 	{
 		/*
@@ -1038,8 +1070,6 @@ _bt_parallel_done(IndexScanDesc scan)
 	BTParallelScanDesc btscan;
 	bool		status_changed = false;
 
-	Assert(!BTScanPosIsValid(so->currPos));
-
 	/* Do nothing, for non-parallel scans */
 	if (parallel_scan == NULL)
 		return;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index aae6acb7f..fe9c6f605 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -23,53 +23,49 @@
 #include "pgstat.h"
 #include "storage/predicate.h"
 #include "utils/lsyscache.h"
+#include "utils/memdebug.h"
 #include "utils/rel.h"
 
 
-static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so);
+static inline void _bt_batch_unlock(IndexScanDesc scan, IndexScanBatch batch,
+									Buffer buf);
 static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
 							Buffer buf, bool forupdate, BTStack stack,
 							int access);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
-static inline void _bt_returnitem(IndexScanDesc scan, BTScanOpaque so);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum,
-							  ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
-							 BlockNumber lastcurrblkno, ScanDirection dir,
-							 bool seized);
+static IndexScanBatch _bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch,
+										OffsetNumber offnum, ScanDirection dir);
+static IndexScanBatch _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+									   BlockNumber lastcurrblkno,
+									   ScanDirection dir, bool firstpage);
 static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
 										 BlockNumber lastcurrblkno);
-static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
+static IndexScanBatch _bt_endpoint(IndexScanDesc scan, ScanDirection dir,
+								   IndexScanBatch firstbatch);
 
 
 /*
- *	_bt_drop_lock_and_maybe_pin()
+ * _bt_batch_unlock() -- nbtree wrapper for indexam_util_batch_unlock.
  *
- * Unlock so->currPos.buf.  If scan is so->dropPin, drop the pin, too.
- * Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
+ * Performs the same Valgrind instrumentation as _bt_unlockbuf.
  */
 static inline void
-_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
+_bt_batch_unlock(IndexScanDesc scan, IndexScanBatch batch, Buffer buf)
 {
-	if (!so->dropPin)
-	{
-		/* Just drop the lock (not the pin) */
-		_bt_unlockbuf(rel, so->currPos.buf);
-		return;
-	}
+#if defined(USE_VALGRIND)
+	Page		page = BufferGetPage(buf);
 
-	/*
-	 * Drop both the lock and the pin.
-	 *
-	 * Have to set so->currPos.lsn so that _bt_killitems has a way to detect
-	 * when concurrent heap TID recycling by VACUUM might have taken place.
-	 */
-	so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-	_bt_relbuf(rel, so->currPos.buf);
-	so->currPos.buf = InvalidBuffer;
+	VALGRIND_CHECK_MEM_IS_DEFINED(page, BLCKSZ);
+#endif
+
+	indexam_util_batch_unlock(scan, batch, buf);
+
+#if defined(USE_VALGRIND)
+	if (!RelationUsesLocalBuffers(scan->indexRelation))
+		VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ);
+#endif
 }
 
 /*
@@ -860,26 +856,23 @@ _bt_compare(Relation rel,
 }
 
 /*
- *	_bt_first() -- Find the first item in a scan.
+ *	_bt_first() -- Find the first batch in a scan.
  *
  *		We need to be clever about the direction of scan, the search
- *		conditions, and the tree ordering.  We find the first item (or,
- *		if backwards scan, the last item) in the tree that satisfies the
- *		qualifications in the scan key.  On success exit, data about the
- *		matching tuple(s) on the page has been loaded into so->currPos.  We'll
- *		drop all locks and hold onto a pin on page's buffer, except during
- *		so->dropPin scans, when we drop both the lock and the pin.
- *		_bt_returnitem sets the next item to return to scan on success exit.
+ *		conditions, and the tree ordering.  We find the first leaf page (or
+ *		the last leaf page, when scanning backwards) in the tree with at least
+ *		one tuple that satisfies the qualifications in the scan key.  On
+ *		success exit, we return a new batch with that page's matching items.
  *
- * If there are no matching items in the index, we return false, with no
- * pins or locks held.  so->currPos will remain invalid.
+ * If there are no matching items in the index (in the given scan direction),
+ * we just return NULL.
  *
  * Note that scan->keyData[], and the so->keyData[] scankey built from it,
  * are both search-type scankeys (see nbtree/README for more about this).
  * Within this routine, we build a temporary insertion-type scankey to use
  * in locating the scan start position.
  */
-bool
+IndexScanBatch
 _bt_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
@@ -892,8 +885,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat_total = InvalidStrategy;
 	BlockNumber blkno = InvalidBlockNumber,
 				lastcurrblkno;
+	IndexScanBatch firstbatch;
+	BTBatchData *btfirstbatch;
 
-	Assert(!BTScanPosIsValid(so->currPos));
+	/* Allocate space for first batch */
+	firstbatch = indexam_util_batch_alloc(scan);
+	btfirstbatch = BTBatchGetData(firstbatch);
 
 	/*
 	 * Examine the scan keys and eliminate any redundant keys; also mark the
@@ -909,6 +906,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		Assert(!so->needPrimScan);
 		_bt_parallel_done(scan);
+		indexam_util_batch_release(scan, firstbatch);
 		return false;
 	}
 
@@ -918,7 +916,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 */
 	if (scan->parallel_scan != NULL &&
 		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, true))
-		return false;
+	{
+		indexam_util_batch_release(scan, firstbatch);
+		return NULL;			/* definitely done (so->needPrimScan is unset) */
+	}
 
 	/*
 	 * Initialize the scan's arrays (if any) for the current scan direction
@@ -935,14 +936,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * _bt_readnextpage releases the scan for us (not _bt_readfirstpage).
 		 */
 		Assert(scan->parallel_scan != NULL);
-		Assert(!so->needPrimScan);
-		Assert(blkno != P_NONE);
 
-		if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir, true))
-			return false;
+		indexam_util_batch_release(scan, firstbatch);
 
-		_bt_returnitem(scan, so);
-		return true;
+		return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, true);
 	}
 
 	/*
@@ -1242,7 +1239,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * Note: calls _bt_readfirstpage for us, which releases the parallel scan.
 	 */
 	if (keysz == 0)
-		return _bt_endpoint(scan, dir);
+		return _bt_endpoint(scan, dir, firstbatch);
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -1510,9 +1507,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * position ourselves on the target leaf page.
 	 */
 	Assert(ScanDirectionIsBackward(dir) == inskey.backward);
-	_bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ, false);
+	_bt_search(rel, NULL, &inskey, &btfirstbatch->buf, BT_READ, false);
 
-	if (!BufferIsValid(so->currPos.buf))
+	if (unlikely(!BufferIsValid(btfirstbatch->buf)))
 	{
 		Assert(!so->needPrimScan);
 
@@ -1528,22 +1525,23 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		if (IsolationIsSerializable())
 		{
 			PredicateLockRelation(rel, scan->xs_snapshot);
-			_bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ, false);
+			_bt_search(rel, NULL, &inskey, &btfirstbatch->buf, BT_READ, false);
 		}
 
-		if (!BufferIsValid(so->currPos.buf))
+		if (!BufferIsValid(btfirstbatch->buf))
 		{
 			_bt_parallel_done(scan);
+			indexam_util_batch_release(scan, firstbatch);
 			return false;
 		}
 	}
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, &inskey, so->currPos.buf);
+	offnum = _bt_binsrch(rel, &inskey, btfirstbatch->buf);
 
 	/*
 	 * Now load data from the first page of the scan (usually the page
-	 * currently in so->currPos.buf).
+	 * currently in btfirstbatch->buf).
 	 *
 	 * If inskey.nextkey = false and inskey.backward = false, offnum is
 	 * positioned at the first non-pivot tuple >= inskey.scankeys.
@@ -1561,165 +1559,73 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * for the page.  For example, when inskey is both < the leaf page's high
 	 * key and > all of its non-pivot tuples, offnum will be "maxoff + 1".
 	 */
-	if (!_bt_readfirstpage(scan, offnum, dir))
-		return false;
-
-	_bt_returnitem(scan, so);
-	return true;
+	return _bt_readfirstpage(scan, firstbatch, offnum, dir);
 }
 
 /*
- *	_bt_next() -- Get the next item in a scan.
+ *	_bt_next() -- Get the next batch in a scan.
  *
- *		On entry, so->currPos describes the current page, which may be pinned
- *		but is not locked, and so->currPos.itemIndex identifies which item was
- *		previously returned.
+ *		On entry, priorbatch describes the batch that was last returned by
+ *		btgetbatch.  We'll use the prior batch's positioning information to
+ *		decide which leaf page to read next.
  *
- *		On success exit, so->currPos is updated as needed, and _bt_returnitem
- *		sets the next item to return to the scan.  so->currPos remains valid.
- *
- *		On failure exit (no more tuples), we invalidate so->currPos.  It'll
- *		still be possible for the scan to return tuples by changing direction,
- *		though we'll need to call _bt_first anew in that other direction.
+ *		On success exit, returns the next batch.  There must be at least one
+ *		matching tuple on any returned batch (else we'd just return NULL).
  */
-bool
-_bt_next(IndexScanDesc scan, ScanDirection dir)
-{
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
-	Assert(BTScanPosIsValid(so->currPos));
-
-	/*
-	 * Advance to next tuple on current page; or if there's no more, try to
-	 * step to the next page with data.
-	 */
-	if (ScanDirectionIsForward(dir))
-	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
-		{
-			if (!_bt_steppage(scan, dir))
-				return false;
-		}
-	}
-	else
-	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
-		{
-			if (!_bt_steppage(scan, dir))
-				return false;
-		}
-	}
-
-	_bt_returnitem(scan, so);
-	return true;
-}
-
-/*
- * Return the index item from so->currPos.items[so->currPos.itemIndex] to the
- * index scan by setting the relevant fields in caller's index scan descriptor
- */
-static inline void
-_bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
-{
-	BTScanPosItem *currItem = &so->currPos.items[so->currPos.itemIndex];
-
-	/* Most recent _bt_readpage must have succeeded */
-	Assert(BTScanPosIsValid(so->currPos));
-	Assert(so->currPos.itemIndex >= so->currPos.firstItem);
-	Assert(so->currPos.itemIndex <= so->currPos.lastItem);
-
-	/* Return next item, per amgettuple contract */
-	scan->xs_heaptid = currItem->heapTid;
-	if (so->currTuples)
-		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
-}
-
-/*
- *	_bt_steppage() -- Step to next page containing valid data for scan
- *
- * Wrapper on _bt_readnextpage that performs final steps for the current page.
- *
- * On entry, so->currPos must be valid.  Its buffer will be pinned, though
- * never locked. (Actually, when so->dropPin there won't even be a pin held,
- * though so->currPos.currPage must still be set to a valid block number.)
- */
-static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+_bt_next(IndexScanDesc scan, ScanDirection dir, IndexScanBatch priorbatch)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btpriorbatch = BTBatchGetData(priorbatch);
 	BlockNumber blkno,
 				lastcurrblkno;
-
-	Assert(BTScanPosIsValid(so->currPos));
-
-	/* Before leaving current page, deal with any killed items */
-	if (so->numKilled > 0)
-		_bt_killitems(scan);
-
-	/*
-	 * Before we modify currPos, make a copy of the page data if there was a
-	 * mark position that needs it.
-	 */
-	if (so->markItemIndex >= 0)
-	{
-		/* bump pin on current buffer for assignment to mark buffer */
-		if (BTScanPosIsPinned(so->currPos))
-			IncrBufferRefCount(so->currPos.buf);
-		memcpy(&so->markPos, &so->currPos,
-			   offsetof(BTScanPosData, items[1]) +
-			   so->currPos.lastItem * sizeof(BTScanPosItem));
-		if (so->markTuples)
-			memcpy(so->markTuples, so->currTuples,
-				   so->currPos.nextTupleOffset);
-		so->markPos.itemIndex = so->markItemIndex;
-		so->markItemIndex = -1;
-
-		/*
-		 * If we're just about to start the next primitive index scan
-		 * (possible with a scan that has arrays keys, and needs to skip to
-		 * continue in the current scan direction), moreLeft/moreRight only
-		 * indicate the end of the current primitive index scan.  They must
-		 * never be taken to indicate that the top-level index scan has ended
-		 * (that would be wrong).
-		 *
-		 * We could handle this case by treating the current array keys as
-		 * markPos state.  But depending on the current array state like this
-		 * would add complexity.  Instead, we just unset markPos's copy of
-		 * moreRight or moreLeft (whichever might be affected), while making
-		 * btrestrpos reset the scan's arrays to their initial scan positions.
-		 * In effect, btrestrpos leaves advancing the arrays up to the first
-		 * _bt_readpage call (that takes place after it has restored markPos).
-		 */
-		if (so->needPrimScan)
-		{
-			if (ScanDirectionIsForward(so->currPos.dir))
-				so->markPos.moreRight = true;
-			else
-				so->markPos.moreLeft = true;
-		}
-
-		/* mark/restore not supported by parallel scans */
-		Assert(!scan->parallel_scan);
-	}
-
-	BTScanPosUnpinIfPinned(so->currPos);
+	bool		moreInDir;
 
 	/* Walk to the next page with data */
 	if (ScanDirectionIsForward(dir))
-		blkno = so->currPos.nextPage;
+		blkno = btpriorbatch->nextPage;
 	else
-		blkno = so->currPos.prevPage;
-	lastcurrblkno = so->currPos.currPage;
+		blkno = btpriorbatch->prevPage;
+	lastcurrblkno = btpriorbatch->currPage;
+	moreInDir = ScanDirectionIsForward(dir) ?
+		btpriorbatch->moreRight : btpriorbatch->moreLeft;
 
 	/*
-	 * Cancel primitive index scans that were scheduled when the call to
-	 * _bt_readpage for currPos happened to use the opposite direction to the
-	 * one that we're stepping in now.  (It's okay to leave the scan's array
-	 * keys as-is, since the next _bt_readpage will advance them.)
+	 * Cancel primitive index scans that were scheduled when priorbatch's call
+	 * to _bt_readpage happened to use the opposite direction to the one that
+	 * we're stepping in now.  (It's okay to leave the scan's array keys
+	 * as-is, since the next _bt_readpage will advance them.)
 	 */
-	if (so->currPos.dir != dir)
+	if (priorbatch->dir != dir)
 		so->needPrimScan = false;
 
+	/*
+	 * For bitmap scan callers, release the prior batch now so that
+	 * _bt_readnextpage can reuse its memory.  This way bitmap scans never
+	 * need more than one batch allocation.
+	 */
+	if (!scan->usebatchring)
+		indexam_util_batch_release(scan, priorbatch);
+
+	if (blkno == P_NONE || !moreInDir)
+	{
+		/*
+		 * priorbatch's page is known to be the final leaf page with matches
+		 * in this scan direction (its _bt_readpage call figured that out).
+		 *
+		 * Note: if so->needPrimScan is set, then priorbatch's leaf page is
+		 * actually just the final page for the current primitive index scan
+		 * in this scan direction (the scan will continue in _bt_first).
+		 */
+		_bt_parallel_done(scan);
+		return NULL;
+	}
+
+	/* parallel scan must seize the scan to get next blkno */
+	if (scan->parallel_scan != NULL &&
+		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		return NULL;			/* done iff so->needPrimScan wasn't set */
+
 	return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
 }
 
@@ -1732,73 +1638,90 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  * to stop the scan on this page by calling _bt_checkkeys against the high
  * key.  See _bt_readpage for full details.
  *
- * On entry, so->currPos must be pinned and locked (so offnum stays valid).
+ * On entry, firstbatch must be pinned and locked (so offnum stays valid).
  * Parallel scan callers must have seized the scan before calling here.
  *
- * On exit, we'll have updated so->currPos and retained locks and pins
+ * On exit, we'll have updated firstbatch and retained locks and pins
  * according to the same rules as those laid out for _bt_readnextpage exit.
- * Like _bt_readnextpage, our return value indicates if there are any matching
- * records in the given direction.
  *
  * We always release the scan for a parallel scan caller, regardless of
  * success or failure; we'll call _bt_parallel_release as soon as possible.
  */
-static bool
-_bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
+static IndexScanBatch
+_bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch,
+				  OffsetNumber offnum, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btfirstbatch = BTBatchGetData(firstbatch);
+	BlockNumber blkno,
+				lastcurrblkno;
 
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
-
-	/* Initialize so->currPos for the first page (page in so->currPos.buf) */
+	/* Initialize firstbatch's position for the first page */
 	if (so->needPrimScan)
 	{
 		Assert(so->numArrayKeys);
 
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = true;
+		btfirstbatch->moreLeft = true;
+		btfirstbatch->moreRight = true;
 		so->needPrimScan = false;
 	}
 	else if (ScanDirectionIsForward(dir))
 	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
+		btfirstbatch->moreLeft = false;
+		btfirstbatch->moreRight = true;
 	}
 	else
 	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
+		btfirstbatch->moreLeft = true;
+		btfirstbatch->moreRight = false;
 	}
 
 	/*
 	 * Attempt to load matching tuples from the first page.
 	 *
-	 * Note that _bt_readpage will finish initializing the so->currPos fields.
+	 * Note that _bt_readpage will finish initializing the firstbatch fields.
 	 * _bt_readpage also releases parallel scan (even when it returns false).
 	 */
-	if (_bt_readpage(scan, dir, offnum, true))
+	if (_bt_readpage(scan, firstbatch, dir, offnum, true))
 	{
-		Relation	rel = scan->indexRelation;
-
-		/*
-		 * _bt_readpage succeeded.  Drop the lock (and maybe the pin) on
-		 * so->currPos.buf in preparation for btgettuple returning tuples.
-		 */
-		Assert(BTScanPosIsPinned(so->currPos));
-		_bt_drop_lock_and_maybe_pin(rel, so);
-		return true;
+		/* _bt_readpage saved one or more matches in firstbatch.items[] */
+		_bt_batch_unlock(scan, firstbatch, btfirstbatch->buf);
+		return firstbatch;
 	}
 
-	/* There's no actually-matching data on the page in so->currPos.buf */
-	_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+	/* There's no actually-matching data on the page */
+	_bt_relbuf(scan->indexRelation, btfirstbatch->buf);
 
-	/* Call _bt_readnextpage using its _bt_steppage wrapper function */
-	if (!_bt_steppage(scan, dir))
-		return false;
+	/* Walk to the next page with data */
+	if (ScanDirectionIsForward(dir))
+		blkno = btfirstbatch->nextPage;
+	else
+		blkno = btfirstbatch->prevPage;
+	lastcurrblkno = btfirstbatch->currPage;
 
-	/* _bt_readpage for a later page (now in so->currPos) succeeded */
-	return true;
+	Assert(firstbatch->dir == dir);
+
+	if (blkno == P_NONE ||
+		(ScanDirectionIsForward(dir) ?
+		 !btfirstbatch->moreRight : !btfirstbatch->moreLeft))
+	{
+		/*
+		 * firstbatch's _bt_readpage call ended scan in this direction (but
+		 * if so->needPrimScan was set, the scan will continue in _bt_first)
+		 */
+		indexam_util_batch_release(scan, firstbatch);
+		_bt_parallel_done(scan);
+		return NULL;
+	}
+
+	indexam_util_batch_release(scan, firstbatch);
+
+	/* parallel scan must seize the scan to get next blkno */
+	if (scan->parallel_scan != NULL &&
+		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		return NULL;			/* done iff so->needPrimScan wasn't set */
+
+	return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
 }
 
 /*
@@ -1808,102 +1731,67 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
  * previously-saved right link or left link.  lastcurrblkno is the page that
  * was current at the point where the blkno link was saved, which we use to
  * reason about concurrent page splits/page deletions during backwards scans.
- * In the common case where seized=false, blkno is either so->currPos.nextPage
- * or so->currPos.prevPage, and lastcurrblkno is so->currPos.currPage.
+ * blkno is the prior batch's nextPage or prevPage (depending on the current
+ * scan direction), and lastcurrblkno is the prior batch's currPage.
  *
- * On entry, so->currPos shouldn't be locked by caller.  so->currPos.buf must
- * be InvalidBuffer/unpinned as needed by caller (note that lastcurrblkno
- * won't need to be read again in almost all cases).  Parallel scan callers
- * that seized the scan before calling here should pass seized=true; such a
- * caller's blkno and lastcurrblkno arguments come from the seized scan.
- * seized=false callers just pass us the blkno/lastcurrblkno taken from their
- * so->currPos, which (along with so->currPos itself) can be used to end the
- * scan.  A seized=false caller's blkno can never be assumed to be the page
- * that must be read next during a parallel scan, though.  We must figure that
- * part out for ourselves by seizing the scan (the correct page to read might
- * already be beyond the seized=false caller's blkno during a parallel scan,
- * unless blkno/so->currPos.nextPage/so->currPos.prevPage is already P_NONE,
- * or unless so->currPos.moreRight/so->currPos.moreLeft is already unset).
+ * On entry, no page should be locked by caller.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page, and we return true.  We hold a pin on the buffer on
- * success exit (except during so->dropPin index scans, when we drop the pin
- * eagerly to avoid blocking VACUUM).
+ * On success exit, returns batch containing data from the next page that has
+ * at least one matching item.  If there are no more matching items in the
+ * given scan direction, we just return NULL.
  *
- * If there are no more matching records in the given direction, we invalidate
- * so->currPos (while ensuring it retains no locks or pins), and return false.
- *
- * We always release the scan for a parallel scan caller, regardless of
- * success or failure; we'll call _bt_parallel_release as soon as possible.
+ * Parallel scan callers must seize the scan before calling here.  blkno and
+ * lastcurrblkno should come from the seized scan.  We'll release the scan as
+ * soon as possible.
  */
-static bool
+static IndexScanBatch
 _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
-				 BlockNumber lastcurrblkno, ScanDirection dir, bool seized)
+				 BlockNumber lastcurrblkno, ScanDirection dir, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	IndexScanBatch newbatch;
+	BTBatchData *btnewbatch;
 
-	Assert(so->currPos.currPage == lastcurrblkno || seized);
-	Assert(!(blkno == P_NONE && seized));
-	Assert(!BTScanPosIsPinned(so->currPos));
+	/* Allocate space for next batch */
+	newbatch = indexam_util_batch_alloc(scan);
+	btnewbatch = BTBatchGetData(newbatch);
 
 	/*
-	 * Remember that the scan already read lastcurrblkno, a page to the left
-	 * of blkno (or remember reading a page to the right, for backwards scans)
+	 * newbatch will be the batch for blkno, a page to the right of
+	 * lastcurrblkno (or to the left, when the scan is moving backwards)
 	 */
-	if (ScanDirectionIsForward(dir))
-		so->currPos.moreLeft = true;
-	else
-		so->currPos.moreRight = true;
+	btnewbatch->moreLeft = true;
+	btnewbatch->moreRight = true;
 
 	for (;;)
 	{
 		Page		page;
 		BTPageOpaque opaque;
 
-		if (blkno == P_NONE ||
-			(ScanDirectionIsForward(dir) ?
-			 !so->currPos.moreRight : !so->currPos.moreLeft))
-		{
-			/* most recent _bt_readpage call (for lastcurrblkno) ended scan */
-			Assert(so->currPos.currPage == lastcurrblkno && !seized);
-			BTScanPosInvalidate(so->currPos);
-			_bt_parallel_done(scan);	/* iff !so->needPrimScan */
-			return false;
-		}
-
-		Assert(!so->needPrimScan);
-
-		/* parallel scan must never actually visit so->currPos blkno */
-		if (!seized && scan->parallel_scan != NULL &&
-			!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
-		{
-			/* whole scan is now done (or another primitive scan required) */
-			BTScanPosInvalidate(so->currPos);
-			return false;
-		}
+		Assert(!((BTScanOpaque) scan->opaque)->needPrimScan);
+		Assert(blkno != P_NONE && lastcurrblkno != P_NONE);
 
 		if (ScanDirectionIsForward(dir))
 		{
 			/* read blkno, but check for interrupts first */
 			CHECK_FOR_INTERRUPTS();
-			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			btnewbatch->buf = _bt_getbuf(rel, blkno, BT_READ);
 		}
 		else
 		{
 			/* read blkno, avoiding race (also checks for interrupts) */
-			so->currPos.buf = _bt_lock_and_validate_left(rel, &blkno,
+			btnewbatch->buf = _bt_lock_and_validate_left(rel, &blkno,
 														 lastcurrblkno);
-			if (so->currPos.buf == InvalidBuffer)
+			if (btnewbatch->buf == InvalidBuffer)
 			{
 				/* must have been a concurrent deletion of leftmost page */
-				BTScanPosInvalidate(so->currPos);
 				_bt_parallel_done(scan);
-				return false;
+				indexam_util_batch_release(scan, newbatch);
+				return NULL;
 			}
 		}
 
-		page = BufferGetPage(so->currPos.buf);
+		page = BufferGetPage(btnewbatch->buf);
 		opaque = BTPageGetOpaque(page);
 		lastcurrblkno = blkno;
 		if (likely(!P_IGNORE(opaque)))
@@ -1911,17 +1799,17 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 			/* see if there are any matches on this page */
 			if (ScanDirectionIsForward(dir))
 			{
-				/* note that this will clear moreRight if we can stop */
-				if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque), seized))
+				if (_bt_readpage(scan, newbatch, dir,
+								 P_FIRSTDATAKEY(opaque), firstpage))
 					break;
-				blkno = so->currPos.nextPage;
+				blkno = btnewbatch->nextPage;
 			}
 			else
 			{
-				/* note that this will clear moreLeft if we can stop */
-				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), seized))
+				if (_bt_readpage(scan, newbatch, dir,
+								 PageGetMaxOffsetNumber(page), firstpage))
 					break;
-				blkno = so->currPos.prevPage;
+				blkno = btnewbatch->prevPage;
 			}
 		}
 		else
@@ -1936,19 +1824,38 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 		}
 
 		/* no matching tuples on this page */
-		_bt_relbuf(rel, so->currPos.buf);
-		seized = false;			/* released by _bt_readpage (or by us) */
+		_bt_relbuf(rel, btnewbatch->buf);
+
+		/* Continue the scan in this direction? */
+		if (blkno == P_NONE ||
+			(ScanDirectionIsForward(dir) ?
+			 !btnewbatch->moreRight : !btnewbatch->moreLeft))
+		{
+			/*
+			 * _bt_readpage call for lastcurrblkno ended scan in this direction
+			 * (if so->needPrimScan was set, scan will continue in _bt_first)
+			 */
+			_bt_parallel_done(scan);
+			indexam_util_batch_release(scan, newbatch);
+			return NULL;
+		}
+
+		/* parallel scan must seize the scan to get next blkno */
+		if (scan->parallel_scan != NULL &&
+			!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		{
+			indexam_util_batch_release(scan, newbatch);
+			return NULL;		/* done iff so->needPrimScan wasn't set */
+		}
+
+		firstpage = false;		/* next page cannot be first */
 	}
 
-	/*
-	 * _bt_readpage succeeded.  Drop the lock (and maybe the pin) on
-	 * so->currPos.buf in preparation for btgettuple returning tuples.
-	 */
-	Assert(so->currPos.currPage == blkno);
-	Assert(BTScanPosIsPinned(so->currPos));
-	_bt_drop_lock_and_maybe_pin(rel, so);
+	/* _bt_readpage saved one or more matches in newbatch.items[] */
+	Assert(btnewbatch->currPage == blkno);
+	_bt_batch_unlock(scan, newbatch, btnewbatch->buf);
 
-	return true;
+	return newbatch;
 }
 
 /*
@@ -2174,25 +2081,24 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost)
  * Parallel scan callers must have seized the scan before calling here.
  * Exit conditions are the same as for _bt_first().
  */
-static bool
-_bt_endpoint(IndexScanDesc scan, ScanDirection dir)
+static IndexScanBatch
+_bt_endpoint(IndexScanDesc scan, ScanDirection dir, IndexScanBatch firstbatch)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btfirstbatch = BTBatchGetData(firstbatch);
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber start;
 
-	Assert(!BTScanPosIsValid(so->currPos));
-	Assert(!so->needPrimScan);
+	Assert(!((BTScanOpaque) scan->opaque)->needPrimScan);
 
 	/*
 	 * Scan down to the leftmost or rightmost leaf page.  This is a simplified
 	 * version of _bt_search().
 	 */
-	so->currPos.buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
+	btfirstbatch->buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
 
-	if (!BufferIsValid(so->currPos.buf))
+	if (!BufferIsValid(btfirstbatch->buf))
 	{
 		/*
 		 * Empty index. Lock the whole relation, as nothing finer to lock
@@ -2203,7 +2109,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 		return false;
 	}
 
-	page = BufferGetPage(so->currPos.buf);
+	page = BufferGetPage(btfirstbatch->buf);
 	opaque = BTPageGetOpaque(page);
 	Assert(P_ISLEAF(opaque));
 
@@ -2229,9 +2135,5 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * Now load data from the first page of the scan.
 	 */
-	if (!_bt_readfirstpage(scan, start, dir))
-		return false;
-
-	_bt_returnitem(scan, so);
-	return true;
+	return _bt_readfirstpage(scan, firstbatch, start, dir);
 }
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b0918589..76b38301a 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -19,17 +19,13 @@
 
 #include "access/nbtree.h"
 #include "access/reloptions.h"
-#include "access/relscan.h"
 #include "commands/progress.h"
-#include "common/int.h"
-#include "lib/qunique.h"
 #include "miscadmin.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 
 
-static int	_bt_compare_int(const void *va, const void *vb);
 static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
 						   IndexTuple firstright, BTScanInsert itup_key);
 
@@ -144,247 +140,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	return key;
 }
 
-/*
- * qsort comparison function for int arrays
- */
-static int
-_bt_compare_int(const void *va, const void *vb)
-{
-	int			a = *((const int *) va);
-	int			b = *((const int *) vb);
-
-	return pg_cmp_s32(a, b);
-}
-
-/*
- * _bt_killitems - set LP_DEAD state for items an indexscan caller has
- * told us were killed
- *
- * scan->opaque, referenced locally through so, contains information about the
- * current page and killed tuples thereon (generally, this should only be
- * called if so->numKilled > 0).
- *
- * Caller should not have a lock on the so->currPos page, but must hold a
- * buffer pin when !so->dropPin.  When we return, it still won't be locked.
- * It'll continue to hold whatever pins were held before calling here.
- *
- * We match items by heap TID before assuming they are the right ones to set
- * LP_DEAD.  If the scan is one that holds a buffer pin on the target page
- * continuously from initially reading the items until applying this function
- * (if it is a !so->dropPin scan), VACUUM cannot have deleted any items on the
- * page, so the page's TIDs can't have been recycled by now.  There's no risk
- * that we'll confuse a new index tuple that happens to use a recycled TID
- * with a now-removed tuple with the same TID (that used to be on this same
- * page).  We can't rely on that during scans that drop buffer pins eagerly
- * (so->dropPin scans), though, so we must condition setting LP_DEAD bits on
- * the page LSN having not changed since back when _bt_readpage saw the page.
- * We totally give up on setting LP_DEAD bits when the page LSN changed.
- *
- * We give up much less often during !so->dropPin scans, but it still happens.
- * We cope with cases where items have moved right due to insertions.  If an
- * item has moved off the current page due to a split, we'll fail to find it
- * and just give up on it.
- */
-void
-_bt_killitems(IndexScanDesc scan)
-{
-	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Page		page;
-	BTPageOpaque opaque;
-	OffsetNumber minoff;
-	OffsetNumber maxoff;
-	int			numKilled = so->numKilled;
-	bool		killedsomething = false;
-	Buffer		buf;
-
-	Assert(numKilled > 0);
-	Assert(BTScanPosIsValid(so->currPos));
-	Assert(scan->heapRelation != NULL); /* can't be a bitmap index scan */
-
-	/* Always invalidate so->killedItems[] before leaving so->currPos */
-	so->numKilled = 0;
-
-	/*
-	 * We need to iterate through so->killedItems[] in leaf page order; the
-	 * loop below expects this (when marking posting list tuples, at least).
-	 * so->killedItems[] is now in whatever order the scan returned items in.
-	 * Scrollable cursor scans might have even saved the same item/TID twice.
-	 *
-	 * Sort and unique-ify so->killedItems[] to deal with all this.
-	 */
-	if (numKilled > 1)
-	{
-		qsort(so->killedItems, numKilled, sizeof(int), _bt_compare_int);
-		numKilled = qunique(so->killedItems, numKilled, sizeof(int),
-							_bt_compare_int);
-	}
-
-	if (!so->dropPin)
-	{
-		/*
-		 * We have held the pin on this page since we read the index tuples,
-		 * so all we need to do is lock it.  The pin will have prevented
-		 * concurrent VACUUMs from recycling any of the TIDs on the page.
-		 */
-		Assert(BTScanPosIsPinned(so->currPos));
-		buf = so->currPos.buf;
-		_bt_lockbuf(rel, buf, BT_READ);
-	}
-	else
-	{
-		XLogRecPtr	latestlsn;
-
-		Assert(!BTScanPosIsPinned(so->currPos));
-		buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
-
-		latestlsn = BufferGetLSNAtomic(buf);
-		Assert(so->currPos.lsn <= latestlsn);
-		if (so->currPos.lsn != latestlsn)
-		{
-			/* Modified, give up on hinting */
-			_bt_relbuf(rel, buf);
-			return;
-		}
-
-		/* Unmodified, hinting is safe */
-	}
-
-	page = BufferGetPage(buf);
-	opaque = BTPageGetOpaque(page);
-	minoff = P_FIRSTDATAKEY(opaque);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	/* Iterate through so->killedItems[] in leaf page order */
-	for (int i = 0; i < numKilled; i++)
-	{
-		int			itemIndex = so->killedItems[i];
-		BTScanPosItem *kitem = &so->currPos.items[itemIndex];
-		OffsetNumber offnum = kitem->indexOffset;
-
-		Assert(itemIndex >= so->currPos.firstItem &&
-			   itemIndex <= so->currPos.lastItem);
-		Assert(i == 0 ||
-			   offnum >= so->currPos.items[so->killedItems[i - 1]].indexOffset);
-
-		if (offnum < minoff)
-			continue;			/* pure paranoia */
-		while (offnum <= maxoff)
-		{
-			ItemId		iid = PageGetItemId(page, offnum);
-			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
-			bool		killtuple = false;
-
-			if (BTreeTupleIsPosting(ituple))
-			{
-				int			pi = i + 1;
-				int			nposting = BTreeTupleGetNPosting(ituple);
-				int			j;
-
-				/*
-				 * Note that the page may have been modified in almost any way
-				 * since we first read it (in the !so->dropPin case), so it's
-				 * possible that this posting list tuple wasn't a posting list
-				 * tuple when we first encountered its heap TIDs.
-				 */
-				for (j = 0; j < nposting; j++)
-				{
-					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
-
-					if (!ItemPointerEquals(item, &kitem->heapTid))
-						break;	/* out of posting list loop */
-
-					/*
-					 * kitem must have matching offnum when heap TIDs match,
-					 * though only in the common case where the page can't
-					 * have been concurrently modified
-					 */
-					Assert(kitem->indexOffset == offnum || !so->dropPin);
-
-					/*
-					 * Read-ahead to later kitems here.
-					 *
-					 * We rely on the assumption that not advancing kitem here
-					 * will prevent us from considering the posting list tuple
-					 * fully dead by not matching its next heap TID in next
-					 * loop iteration.
-					 *
-					 * If, on the other hand, this is the final heap TID in
-					 * the posting list tuple, then tuple gets killed
-					 * regardless (i.e. we handle the case where the last
-					 * kitem is also the last heap TID in the last index tuple
-					 * correctly -- posting tuple still gets killed).
-					 */
-					if (pi < numKilled)
-						kitem = &so->currPos.items[so->killedItems[pi++]];
-				}
-
-				/*
-				 * Don't bother advancing the outermost loop's int iterator to
-				 * avoid processing killed items that relate to the same
-				 * offnum/posting list tuple.  This micro-optimization hardly
-				 * seems worth it.  (Further iterations of the outermost loop
-				 * will fail to match on this same posting list's first heap
-				 * TID instead, so we'll advance to the next offnum/index
-				 * tuple pretty quickly.)
-				 */
-				if (j == nposting)
-					killtuple = true;
-			}
-			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
-				killtuple = true;
-
-			/*
-			 * Mark index item as dead, if it isn't already.  Since this
-			 * happens while holding a buffer lock possibly in shared mode,
-			 * it's possible that multiple processes attempt to do this
-			 * simultaneously, leading to multiple full-page images being sent
-			 * to WAL (if wal_log_hints or data checksums are enabled), which
-			 * is undesirable.
-			 */
-			if (killtuple && !ItemIdIsDead(iid))
-			{
-				if (!killedsomething)
-				{
-					/*
-					 * Use the hint bit infrastructure to check if we can
-					 * update the page while just holding a share lock. If we
-					 * are not allowed, there's no point continuing.
-					 */
-					if (!BufferBeginSetHintBits(buf))
-						goto unlock_page;
-				}
-
-				/* found the item/all posting list items */
-				ItemIdMarkDead(iid);
-				killedsomething = true;
-				break;			/* out of inner search loop */
-			}
-			offnum = OffsetNumberNext(offnum);
-		}
-	}
-
-	/*
-	 * Since this can be redone later if needed, mark as dirty hint.
-	 *
-	 * Whenever we mark anything LP_DEAD, we also set the page's
-	 * BTP_HAS_GARBAGE flag, which is likewise just a hint.  (Note that we
-	 * only rely on the page-level flag in !heapkeyspace indexes.)
-	 */
-	if (killedsomething)
-	{
-		opaque->btpo_flags |= BTP_HAS_GARBAGE;
-		BufferFinishSetHintBits(buf, true, true);
-	}
-
-unlock_page:
-	if (!so->dropPin)
-		_bt_unlockbuf(rel, buf);
-	else
-		_bt_relbuf(rel, buf);
-}
-
-
 /*
  * The following routines manage a shared-memory area in which we track
  * assignment of "vacuum cycle IDs" to currently-active btree vacuuming
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dff7d286f..3bc5e5ccd 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -1095,15 +1095,15 @@ btree_mask(char *pagedata, BlockNumber blkno)
 		/*
 		 * In btree leaf pages, it is possible to modify the LP_FLAGS without
 		 * emitting any WAL record. Hence, mask the line pointer flags. See
-		 * _bt_killitems(), _bt_check_unique() for details.
+		 * btkillitemsbatch(), _bt_check_unique() for details.
 		 */
 		mask_lp_flags(page);
 	}
 
 	/*
 	 * BTP_HAS_GARBAGE is just an un-logged hint bit. So, mask it. See
-	 * _bt_delete_or_dedup_one_page(), _bt_killitems(), and _bt_check_unique()
-	 * for details.
+	 * _bt_delete_or_dedup_one_page(), btkillitemsbatch(), and
+	 * _bt_check_unique() for details.
 	 */
 	maskopaq->btpo_flags &= ~BTP_HAS_GARBAGE;
 
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 9f5379b87..dc99fc087 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -88,10 +88,12 @@ spghandler(PG_FUNCTION_ARGS)
 		.ambeginscan = spgbeginscan,
 		.amrescan = spgrescan,
 		.amgettuple = spggettuple,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amreleasebatch = NULL,
 		.amgetbitmap = spggetbitmap,
 		.amendscan = spgendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index bed6587c8..4e05bd770 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -135,7 +135,7 @@ static void show_recursive_union_info(RecursiveUnionState *rstate,
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
-static void show_indexsearches_info(PlanState *planstate, ExplainState *es);
+static void show_indexscan_info(PlanState *planstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
@@ -1972,7 +1972,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_IndexOnlyScan:
 			show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1986,15 +1986,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
-			if (es->analyze)
-				ExplainPropertyFloat("Heap Fetches", NULL,
-									 planstate->instrument->ntuples2, 0, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_BitmapIndexScan:
 			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
 						   "Index Cond", planstate, ancestors, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_BitmapHeapScan:
 			show_scan_qual(((BitmapHeapScan *) plan)->bitmapqualorig,
@@ -3858,15 +3855,16 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 }
 
 /*
- * Show the total number of index searches for a
+ * Show index scan related executor instrumentation for an
  * IndexScan/IndexOnlyScan/BitmapIndexScan node
  */
 static void
-show_indexsearches_info(PlanState *planstate, ExplainState *es)
+show_indexscan_info(PlanState *planstate, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	SharedIndexScanInstrumentation *SharedInfo = NULL;
-	uint64		nsearches = 0;
+	uint64		nsearches = 0,
+				nheapfetches = 0;
 
 	if (!es->analyze)
 		return;
@@ -3887,6 +3885,7 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 				IndexOnlyScanState *indexstate = ((IndexOnlyScanState *) planstate);
 
 				nsearches = indexstate->ioss_Instrument->nsearches;
+				nheapfetches = indexstate->ioss_Instrument->nheapfetches;
 				SharedInfo = indexstate->ioss_SharedInfo;
 				break;
 			}
@@ -3910,9 +3909,13 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 			IndexScanInstrumentation *winstrument = &SharedInfo->winstrument[i];
 
 			nsearches += winstrument->nsearches;
+			nheapfetches += winstrument->nheapfetches;
 		}
 	}
 
+	if (nodeTag(plan) == T_IndexOnlyScan)
+		ExplainPropertyUInteger("Heap Fetches", NULL, nheapfetches, es);
+
 	ExplainPropertyUInteger("Index Searches", NULL, nsearches, es);
 }
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 635679cc1..54c2403da 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -884,7 +884,7 @@ DefineIndex(ParseState *pstate,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("access method \"%s\" does not support multicolumn indexes",
 						accessMethodName)));
-	if (exclusion && amRoutine->amgettuple == NULL)
+	if (exclusion && amRoutine->amgettuple == NULL && amRoutine->amgetbatch == NULL)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("access method \"%s\" does not support exclusion constraints",
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 90a68c0d1..0dfb01337 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -428,7 +428,7 @@ ExecSupportsMarkRestore(Path *pathnode)
 		case T_IndexOnlyScan:
 
 			/*
-			 * Not all index types support mark/restore.
+			 * Not all index types support restoring a mark.
 			 */
 			return castNode(IndexPath, pathnode)->indexinfo->amcanmarkpos;
 
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9d071e495..3f0c8453d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -815,10 +815,12 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index, false, &DirtySnapshot, NULL,
+								 indnkeyatts, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
-	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
+	while (table_index_getnext_slot(index_scan, ForwardScanDirection,
+									existing_slot))
 	{
 		TransactionId xwait;
 		XLTW_Oper	reason_wait;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 2497ee7ed..2f636ba3e 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -205,7 +205,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
 	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel, false, &snap, NULL, skey_attoff, 0);
 
 retry:
 	found = false;
@@ -213,7 +213,7 @@ retry:
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
 	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, outslot))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, outslot))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
@@ -666,12 +666,12 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid,
 	 * not yet committed or those just committed prior to the scan are
 	 * excluded in update_most_recent_deletion_info().
 	 */
-	scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel, false, SnapshotAny, NULL, skey_attoff, 0);
 
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
 	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, scanslot))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 2ca822cf8..a8977ccac 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -202,6 +202,7 @@ ExecEndBitmapIndexScan(BitmapIndexScanState *node)
 		 * which will have a new BitmapIndexScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->biss_Instrument->nsearches;
+		Assert(node->biss_Instrument->nheapfetches == 0);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index f84db0476..84bff60ce 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -34,14 +34,12 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/tupdesc.h"
-#include "access/visibilitymap.h"
 #include "catalog/pg_type.h"
 #include "executor/executor.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/predicate.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
@@ -65,7 +63,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
-	ItemPointer tid;
 
 	/*
 	 * extract necessary information from index scan node
@@ -90,18 +87,14 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->ioss_RelationDesc,
+								   node->ioss_RelationDesc, true,
 								   estate->es_snapshot,
 								   node->ioss_Instrument,
 								   node->ioss_NumScanKeys,
 								   node->ioss_NumOrderByKeys);
 
 		node->ioss_ScanDesc = scandesc;
-
-
-		/* Set it up for index-only scan */
-		node->ioss_ScanDesc->xs_want_itup = true;
-		node->ioss_VMBuffer = InvalidBuffer;
+		Assert(node->ioss_ScanDesc->xs_want_itup);
 
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
@@ -118,78 +111,10 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while (table_index_getnext_slot(scandesc, direction, node->ioss_TableSlot))
 	{
-		bool		tuple_from_heap = false;
-
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We can skip the heap fetch if the TID references a heap page on
-		 * which all tuples are known visible to everybody.  In any case,
-		 * we'll use the index tuple not the heap tuple as the data source.
-		 *
-		 * Note on Memory Ordering Effects: visibilitymap_get_status does not
-		 * lock the visibility map buffer, and therefore the result we read
-		 * here could be slightly stale.  However, it can't be stale enough to
-		 * matter.
-		 *
-		 * We need to detect clearing a VM bit due to an insert right away,
-		 * because the tuple is present in the index page but not visible. The
-		 * reading of the TID by this scan (using a shared lock on the index
-		 * buffer) is serialized with the insert of the TID into the index
-		 * (using an exclusive lock on the index buffer). Because the VM bit
-		 * is cleared before updating the index, and locking/unlocking of the
-		 * index page acts as a full memory barrier, we are sure to see the
-		 * cleared bit if we see a recently-inserted TID.
-		 *
-		 * Deletes do not update the index page (only VACUUM will clear out
-		 * the TID), so the clearing of the VM bit by a delete is not
-		 * serialized with this test below, and we may see a value that is
-		 * significantly stale. However, we don't care about the delete right
-		 * away, because the tuple is still visible until the deleting
-		 * transaction commits or the statement ends (if it's our
-		 * transaction). In either case, the lock on the VM buffer will have
-		 * been released (acting as a write barrier) after clearing the bit.
-		 * And for us to have a snapshot that includes the deleting
-		 * transaction (making the tuple invisible), we must have acquired
-		 * ProcArrayLock after that time, acting as a read barrier.
-		 *
-		 * It's worth going through this complexity to avoid needing to lock
-		 * the VM buffer, which could cause significant contention.
-		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
-							ItemPointerGetBlockNumber(tid),
-							&node->ioss_VMBuffer))
-		{
-			/*
-			 * Rats, we have to visit the heap to check visibility.
-			 */
-			InstrCountTuples2(node, 1);
-			if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
-				continue;		/* no visible tuple, try next index entry */
-
-			ExecClearTuple(node->ioss_TableSlot);
-
-			/*
-			 * Only MVCC snapshots are supported here, so there should be no
-			 * need to keep following the HOT chain once a visible entry has
-			 * been found.  If we did want to allow that, we'd need to keep
-			 * more state to remember not to call index_getnext_tid next time.
-			 */
-			if (scandesc->xs_heap_continue)
-				elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
-
-			/*
-			 * Note: at this point we are holding a pin on the heap page, as
-			 * recorded in scandesc->xs_cbuf.  We could release that pin now,
-			 * but it's not clear whether it's a win to do so.  The next index
-			 * entry might require a visit to the same heap page.
-			 */
-
-			tuple_from_heap = true;
-		}
-
 		/*
 		 * Fill the scan tuple slot with data from the index.  This might be
 		 * provided in either HeapTuple or IndexTuple format.  Conceivably an
@@ -238,16 +163,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("lossy distance functions are not supported in index-only scans")));
-
-		/*
-		 * If we didn't access the heap, then we'll need to take a predicate
-		 * lock explicitly, as if we had.  For now we do that at page level.
-		 */
-		if (!tuple_from_heap)
-			PredicateLockPage(scandesc->heapRelation,
-							  ItemPointerGetBlockNumber(tid),
-							  estate->es_snapshot);
-
 		return slot;
 	}
 
@@ -407,13 +322,6 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 	indexRelationDesc = node->ioss_RelationDesc;
 	indexScanDesc = node->ioss_ScanDesc;
 
-	/* Release VM buffer pin, if any. */
-	if (node->ioss_VMBuffer != InvalidBuffer)
-	{
-		ReleaseBuffer(node->ioss_VMBuffer);
-		node->ioss_VMBuffer = InvalidBuffer;
-	}
-
 	/*
 	 * When ending a parallel worker, copy the statistics gathered by the
 	 * worker back into shared memory so that it can be picked up by the main
@@ -433,6 +341,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->ioss_Instrument->nsearches;
+		winstrument->nheapfetches += node->ioss_Instrument->nheapfetches;
 	}
 
 	/*
@@ -788,13 +697,12 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->ioss_RelationDesc,
+								 node->ioss_RelationDesc, true,
 								 node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan);
-	node->ioss_ScanDesc->xs_want_itup = true;
-	node->ioss_VMBuffer = InvalidBuffer;
+	Assert(node->ioss_ScanDesc->xs_want_itup);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -854,12 +762,12 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->ioss_RelationDesc,
+								 node->ioss_RelationDesc, true,
 								 node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan);
-	node->ioss_ScanDesc->xs_want_itup = true;
+	Assert(node->ioss_ScanDesc->xs_want_itup);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 36320d7d2..67822947a 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -107,7 +107,7 @@ IndexNext(IndexScanState *node)
 		 * serially executing an index scan that was planned to be parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->iss_RelationDesc,
+								   node->iss_RelationDesc, false,
 								   estate->es_snapshot,
 								   node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -128,7 +128,7 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while (table_index_getnext_slot(scandesc, direction, slot))
 	{
 		CHECK_FOR_INTERRUPTS();
 
@@ -203,7 +203,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * serially executing an index scan that was planned to be parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->iss_RelationDesc,
+								   node->iss_RelationDesc, false,
 								   estate->es_snapshot,
 								   node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -260,7 +260,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * Fetch next tuple from the index.
 		 */
 next_indextuple:
-		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+		if (!table_index_getnext_slot(scandesc, ForwardScanDirection, slot))
 		{
 			/*
 			 * No more tuples from the index.  But we still need to drain any
@@ -812,6 +812,7 @@ ExecEndIndexScan(IndexScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->iss_Instrument->nsearches;
+		Assert(node->iss_Instrument->nheapfetches == 0);
 	}
 
 	/*
@@ -1723,7 +1724,7 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->iss_RelationDesc,
+								 node->iss_RelationDesc, false,
 								 node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
@@ -1787,7 +1788,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->iss_RelationDesc,
+								 node->iss_RelationDesc, false,
 								 node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index cbcae4c70..9b43ed8b4 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -54,8 +54,8 @@
  *		the inner "5's". This requires repositioning the inner "cursor"
  *		to point at the first inner "5". This is done by "marking" the
  *		first inner 5 so we can restore the "cursor" to it before joining
- *		with the second outer 5. The access method interface provides
- *		routines to mark and restore to a tuple.
+ *		with the second outer 5. The indexbatch.c interface provides
+ *		routines to mark and restore to a tuple during index scans.
  *
  *
  *		Essential operation of the merge join algorithm is as follows:
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 67d9dc35f..d61c0b6f3 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -43,7 +43,7 @@
 /* Whether we are looking for plain indexscan, bitmap scan, or either */
 typedef enum
 {
-	ST_INDEXSCAN,				/* must support amgettuple */
+	ST_INDEXSCAN,				/* must support amgettuple or amgetbatch */
 	ST_BITMAPSCAN,				/* must support amgetbitmap */
 	ST_ANYSCAN,					/* either is okay */
 } ScanTypeControl;
@@ -747,7 +747,7 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexPath  *ipath = (IndexPath *) lfirst(lc);
 
-		if (index->amhasgettuple)
+		if (index->amhasgetbatch)
 			add_path(rel, (Path *) ipath);
 
 		if (index->amhasgetbitmap &&
@@ -835,7 +835,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	switch (scantype)
 	{
 		case ST_INDEXSCAN:
-			if (!index->amhasgettuple)
+			if (!index->amhasgetbatch)
 				return NIL;
 			break;
 		case ST_BITMAPSCAN:
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index b2fbd6a08..665ddca53 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -310,11 +310,11 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 				info->amsearcharray = amroutine->amsearcharray;
 				info->amsearchnulls = amroutine->amsearchnulls;
 				info->amcanparallel = amroutine->amcanparallel;
-				info->amhasgettuple = (amroutine->amgettuple != NULL);
+				info->amhasgetbatch = (amroutine->amgetbatch != NULL ||
+									   amroutine->amgettuple != NULL);
 				info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
 					relation->rd_tableam->scan_bitmap_next_tuple != NULL;
-				info->amcanmarkpos = (amroutine->ammarkpos != NULL &&
-									  amroutine->amrestrpos != NULL);
+				info->amcanmarkpos = amroutine->amgetbatch != NULL;
 				info->amcostestimate = amroutine->amcostestimate;
 				Assert(info->amcostestimate != NULL);
 
@@ -411,7 +411,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 				info->amsearcharray = false;
 				info->amsearchnulls = false;
 				info->amcanparallel = false;
-				info->amhasgettuple = false;
+				info->amhasgetbatch = false;
 				info->amhasgetbitmap = false;
 				info->amcanmarkpos = false;
 				info->amcostestimate = NULL;
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 0b1d80b5b..76b0d035f 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -890,7 +890,8 @@ IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap)
 	 * The given index access method must implement "amgettuple", which will
 	 * be used later to fetch the tuples.  See RelationFindReplTupleByIndex().
 	 */
-	if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL)
+	if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL &&
+		GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgetbatch == NULL)
 		return false;
 
 	return true;
diff --git a/src/backend/utils/adt/amutils.c b/src/backend/utils/adt/amutils.c
index c81fb61a0..ddfd1b55c 100644
--- a/src/backend/utils/adt/amutils.c
+++ b/src/backend/utils/adt/amutils.c
@@ -363,10 +363,11 @@ indexam_property(FunctionCallInfo fcinfo,
 				PG_RETURN_BOOL(routine->amclusterable);
 
 			case AMPROP_INDEX_SCAN:
-				PG_RETURN_BOOL(routine->amgettuple ? true : false);
+				PG_RETURN_BOOL(routine->amgettuple != NULL ||
+							   routine->amgetbatch != NULL);
 
 			case AMPROP_BITMAP_SCAN:
-				PG_RETURN_BOOL(routine->amgetbitmap ? true : false);
+				PG_RETURN_BOOL(routine->amgetbitmap != NULL);
 
 			case AMPROP_BACKWARD_SCAN:
 				PG_RETURN_BOOL(routine->amcanbackward);
@@ -392,7 +393,8 @@ indexam_property(FunctionCallInfo fcinfo,
 			PG_RETURN_BOOL(routine->amcanmulticol);
 
 		case AMPROP_CAN_EXCLUDE:
-			PG_RETURN_BOOL(routine->amgettuple ? true : false);
+			PG_RETURN_BOOL(routine->amgettuple != NULL ||
+						   routine->amgetbatch != NULL);
 
 		case AMPROP_CAN_INCLUDE:
 			PG_RETURN_BOOL(routine->amcaninclude);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d4da0e8de..6d80ae003 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -102,7 +102,6 @@
 #include "access/gin.h"
 #include "access/table.h"
 #include "access/tableam.h"
-#include "access/visibilitymap.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_statistic.h"
@@ -7104,10 +7103,6 @@ get_actual_variable_endpoint(Relation heapRel,
 	bool		have_data = false;
 	SnapshotData SnapshotNonVacuumable;
 	IndexScanDesc index_scan;
-	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber last_heap_block = InvalidBlockNumber;
-	int			n_visited_heap_pages = 0;
-	ItemPointer tid;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 	MemoryContext oldcontext;
@@ -7155,60 +7150,26 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * a huge amount of time here, so we give up once we've read too many heap
 	 * pages.  When we fail for that reason, the caller will end up using
 	 * whatever extremal value is recorded in pg_statistic.
+	 *
+	 * XXX This can't work with the new table_index_getnext_slot interface,
+	 * which simply won't return a tuple that isn't visible to our snapshot.
+	 * table_index_getnext_slot will need some kind of callback that provides
+	 * a way for the scan to give up when the costs start to get out of hand.
 	 */
 	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
 							  GlobalVisTestFor(heapRel));
 
-	index_scan = index_beginscan(heapRel, indexRel,
+	index_scan = index_beginscan(heapRel, indexRel, true,
 								 &SnapshotNonVacuumable, NULL,
 								 1, 0);
-	/* Set it up for index-only scan */
-	index_scan->xs_want_itup = true;
+	Assert(index_scan->xs_want_itup);
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
+	while (table_index_getnext_slot(index_scan, indexscandir, tableslot))
 	{
-		BlockNumber block = ItemPointerGetBlockNumber(tid);
-
-		if (!VM_ALL_VISIBLE(heapRel,
-							block,
-							&vmbuffer))
-		{
-			/* Rats, we have to visit the heap to check visibility */
-			if (!index_fetch_heap(index_scan, tableslot))
-			{
-				/*
-				 * No visible tuple for this index entry, so we need to
-				 * advance to the next entry.  Before doing so, count heap
-				 * page fetches and give up if we've done too many.
-				 *
-				 * We don't charge a page fetch if this is the same heap page
-				 * as the previous tuple.  This is on the conservative side,
-				 * since other recently-accessed pages are probably still in
-				 * buffers too; but it's good enough for this heuristic.
-				 */
-#define VISITED_PAGES_LIMIT 100
-
-				if (block != last_heap_block)
-				{
-					last_heap_block = block;
-					n_visited_heap_pages++;
-					if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
-						break;
-				}
-
-				continue;		/* no visible tuple, try next index entry */
-			}
-
-			/* We don't actually need the heap tuple for anything */
-			ExecClearTuple(tableslot);
-
-			/*
-			 * We don't care whether there's more than one visible tuple in
-			 * the HOT chain; if any are visible, that's good enough.
-			 */
-		}
+		/* We don't actually need the heap tuple for anything */
+		ExecClearTuple(tableslot);
 
 		/*
 		 * We expect that the index will return data in IndexTuple not
@@ -7241,8 +7202,6 @@ get_actual_variable_endpoint(Relation heapRel,
 		break;
 	}
 
-	if (vmbuffer != InvalidBuffer)
-		ReleaseBuffer(vmbuffer);
 	index_endscan(index_scan);
 
 	return have_data;
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index b74ab5f7a..06553609b 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -393,7 +393,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 			 RelationGetRelationName(rel));
 
 	/*
-	 * This assertion matches the one in index_getnext_tid().  See page
+	 * This assertion matches the one in heapam_batch_getnext().  See page
 	 * recycling/"visible to everyone" notes in nbtree README.
 	 */
 	Assert(TransactionIdIsValid(RecentXmin));
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 5111cdc6d..457c45de7 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -146,10 +146,12 @@ blhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = blbeginscan,
 		.amrescan = blrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amreleasebatch = NULL,
 		.amgetbitmap = blgetbitmap,
 		.amendscan = blendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index f48da3185..219fb73e6 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -167,10 +167,12 @@ typedef struct IndexAmRoutine
     ambeginscan_function ambeginscan;
     amrescan_function amrescan;
     amgettuple_function amgettuple;     /* can be NULL */
+    amgetbatch_function amgetbatch;     /* can be NULL */
+    amkillitemsbatch_function amkillitemsbatch; /* can be NULL */
+    amreleasebatch_function amreleasebatch; /* can be NULL */
     amgetbitmap_function amgetbitmap;   /* can be NULL */
     amendscan_function amendscan;
-    ammarkpos_function ammarkpos;       /* can be NULL */
-    amrestrpos_function amrestrpos;     /* can be NULL */
+    amposreset_function amposreset;     /* can be NULL */
 
     /* interface functions to support parallel index scans */
     amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
@@ -678,6 +680,38 @@ ambeginscan (Relation indexRelation,
    <function>ambeginscan</function> does little beyond making that call and perhaps
    acquiring locks;
    the interesting parts of index-scan startup are in <function>amrescan</function>.
+   Index access methods that use the <function>amgetbatch</function> interface
+   must also set the following fields in the scan descriptor:
+   <itemizedlist>
+    <listitem>
+     <para>
+      <literal>scan-&gt;maxitemsbatch</literal>: the maximum number of items
+      that can appear in a single batch (typically derived from the index page
+      size, e.g., <literal>MaxIndexTuplesPerPage</literal>).
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      <literal>scan-&gt;batch_index_opaque_size</literal>: the
+      <function>MAXALIGN</function>'d size of the index AM's per-batch opaque
+      area.  Each batch allocation reserves this much space immediately before
+      the <structname>IndexScanBatchData</structname> pointer, for use by the
+      index AM to store per-page navigation state (e.g., sibling page links).
+      The index AM should provide an inline accessor function to retrieve a
+      pointer to this area from an <type>IndexScanBatch</type> (for example,
+      B-tree provides <function>bt_batch_data()</function>).
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      <literal>scan-&gt;batch_tuples_workspace</literal>: the size in bytes
+      of the per-batch tuple storage workspace used for index-only scans
+      (typically <literal>BLCKSZ</literal>), or 0 if the index AM does not
+      support index-only scans.  The workspace is accessible via
+      <structfield>batch-&gt;currTuples</structfield>.
+     </para>
+    </listitem>
+   </itemizedlist>
   </para>
 
   <para>
@@ -749,6 +783,213 @@ amgettuple (IndexScanDesc scan,
    <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
    struct must be set to NULL.
   </para>
+  <note>
+   <para>
+    As of <productname>PostgreSQL</productname> version 19, position marking
+    and restoration of scans is no longer supported for the
+    <function>amgettuple</function> interface; only the
+    <function>amgetbatch</function> interface supports this feature.
+   </para>
+  </note>
+
+  <para>
+<programlisting>
+IndexScanBatch
+amgetbatch (IndexScanDesc scan,
+            IndexScanBatch priorbatch,
+            ScanDirection direction);
+</programlisting>
+   Return the next batch of index tuples in the given scan, moving in the
+   given direction (forward or backward in the index).  Returns an instance of
+   <type>IndexScanBatch</type> with index tuples loaded, or
+   <literal>NULL</literal> if there are no more index tuples in the given
+   scan direction.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> interface is an alternative to
+   <function>amgettuple</function> that returns matching index entries in batches
+   rather than one at a time.  By returning all matching index entries from a
+   single index page together, the table AM gains visibility into which table
+   blocks will be needed in the near future.
+  </para>
+
+  <para>
+   The table AM passes the batch most recently returned by
+   <function>amgetbatch</function> for the given scan as
+   <literal>priorbatch</literal> (or <literal>NULL</literal> on the first call
+   for the scan).  The index AM uses information from <literal>priorbatch</literal>
+   to determine which index page to read next.
+  </para>
+
+  <para>
+   A batch returned by <function>amgetbatch</function> is associated with a
+   pinned index page containing at least one matching item/tuple.  The buffer
+   pin can be held onto by the table AM as an interlock against concurrent TID
+   recycling by <command>VACUUM</command>.  See <xref linkend="index-locking"/>
+   for details on buffer pin management during index scans.
+  </para>
+
+  <para>
+   An <type>IndexScanBatch</type> that is returned by
+   <function>amgetbatch</function> is no longer managed by the access method.
+   It is up to the table AM caller to decide when it should be freed (via
+   <function>tableam_util_free_batch</function>).  Note also that
+   <function>amgetbatch</function> functions must never modify the
+   <structfield>priorbatch</structfield> parameter.  The core
+   <filename>src/backend/access/nbtree/</filename> and
+   <filename>src/backend/access/hash/</filename> implementations provide
+   reference examples of the <function>amgetbatch</function> interface.
+  </para>
+
+  <para>
+   The same caveats described for <function>amgettuple</function> apply here
+   too: an entry in the returned batch means only that the index contains
+   an entry that matches the scan keys, not that the tuple necessarily still
+   exists in the heap or will pass the caller's snapshot test.
+  </para>
+
+  <para>
+   Index access methods using <function>amgetbatch</function> must set
+   <literal>scan-&gt;xs_recheck</literal> to indicate whether rechecking of
+   scan keys is required, in the same way as <function>amgettuple</function>
+   does.  However, <literal>scan-&gt;xs_recheck</literal> must be set consistently
+   for an entire scan rather than varying on a per-tuple basis.  This is a key
+   difference from <function>amgettuple</function>, which can set
+   <literal>scan-&gt;xs_recheck</literal> independently for each tuple it returns.
+   Index access methods that require granular control over
+   <literal>scan-&gt;xs_recheck</literal> must use the <function>amgettuple</function>
+   interface instead of <function>amgetbatch</function>.
+  </para>
+
+  <para>
+   Similarly, the <function>amgetbatch</function> interface does not currently
+   support index-only scans that return data in the form of a
+   <structname>HeapTuple</structname> pointer.  Index-only scans work by
+   copying <structname>IndexTuple</structname> records from index pages into a
+   local buffer associated with each batch.  <literal>xs_itupdesc</literal>
+   works in the same way as already described for <function>amgettuple</function>.
+   The access method must not set the <literal>scan-&gt;xs_itup</literal> or
+   <literal>scan-&gt;xs_hitup</literal> fields itself.
+   With <function>amgettuple</function>, the index AM sets
+   <literal>scan-&gt;xs_hitup</literal> to point to a reconstructed
+   <structname>HeapTuple</structname> whose lifetime extends until the next
+   <function>amgettuple</function> call &mdash; only one tuple is valid at a
+   time.  With <function>amgetbatch</function>, multiple batches are held open
+   simultaneously and items are consumed asynchronously by the table AM, so
+   there is no equivalent single-tuple lifetime for per-item
+   <structname>HeapTuple</structname> pointers.  The batch infrastructure
+   provides per-batch storage for <structname>IndexTuple</structname> copies,
+   but has no analogous mechanism for <structname>HeapTuple</structname> data
+   (used by index AMs such as <acronym>GiST</acronym> and
+   <acronym>SP-GiST</acronym> for reconstructed tuples that might not fit in
+   <structname>IndexTuple</structname> format).  This limitation could be
+   addressed in a future version of <productname>PostgreSQL</productname>.
+  </para>
+
+  <para>
+   The index access method must provide either <function>amgettuple</function>
+   or <function>amgetbatch</function>, but not both.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> function need only be provided if the
+   access method supports <quote>plain</quote> index scans.  If it doesn't,
+   the <function>amgetbatch</function> field in its
+   <structname>IndexAmRoutine</structname> struct must be set to NULL.
+  </para>
+
+  <para>
+<programlisting>
+void
+amkillitemsbatch (IndexScanDesc scan,
+                  IndexScanBatch batch);
+</programlisting>
+   Called by the table AM when it has finished processing a batch that
+   contains dead items, to set <literal>LP_DEAD</literal> bits in the
+   batch's index page.  The batch's index page will never be locked or pinned
+   when this function is called.
+  </para>
+
+  <para>
+   While implementing <function>amkillitemsbatch</function> is optional,
+   doing so is recommended for performance, as it allows future scans to skip
+   known-dead index entries.  Both core index access methods that currently
+   support <function>amgetbatch</function> (B-tree and hash) implement
+   <literal>LP_DEAD</literal> marking, though third-party index access methods
+   are free to choose whether to implement this feature.
+   The table AM may call
+   <function>tableam_util_kill_scanpositem</function> to mark dead items as
+   the scan progresses.  If the batch contains any such dead items, the batch's
+   <structfield>deadItems</structfield> array will have been sorted and
+   deduplicated before <function>amkillitemsbatch</function> is called, with
+   item offsets appearing in ascending order (that is, in index page order,
+   which is also batch order) and no offset appearing more than once.  Index
+   access methods can rely on this ordering when processing dead items: the
+   <structfield>deadItems</structfield> array can be walked in lockstep with
+   the index page's item pointers, since both are in ascending offset order.
+   This also means the table AM need not call
+   <function>tableam_util_kill_scanpositem</function> in any particular order.
+   (Index access methods using <function>amgettuple</function> rely on the
+   <structfield>kill_prior_tuple</structfield> mechanism instead to mark dead
+   tuples; the <filename>src/backend/access/gist/</filename> implementation
+   provides a reference example.)
+  </para>
+
+  <para>
+   When implementing <function>amkillitemsbatch</function>, the index AM
+   should verify that the index page has not been modified since the batch was
+   originally read.  The batch's <structfield>lsn</structfield> field records
+   the page LSN at the time the index page lock was released (set
+   automatically by the core code).  The index AM should re-read the page,
+   compare the current page LSN against <structfield>batch-&gt;lsn</structfield>,
+   and give up on setting <literal>LP_DEAD</literal> bits if the LSN has
+   advanced.  An advanced LSN indicates that the page was modified &mdash;
+   possibly by <command>VACUUM</command> recycling heap TIDs &mdash; so it
+   would be unsafe to assume that index entries still point to the same heap
+   tuples.  Since <literal>LP_DEAD</literal> marking is only an optimization
+   hint, it is always safe to skip it.  See the B-tree and hash index
+   implementations for reference examples of this technique.
+  </para>
+
+  <para>
+   The index AM may choose to retain its own buffer pins when this serves an
+   internal purpose (for example, maintaining a descent stack of pinned index
+   pages for reuse across <function>amgetbatch</function> calls).  However,
+   any scheme that retains buffer pins managed by the index AM must be sure to
+   release the pins at an opportune point (for example, when <function>amrescan</function>
+   and/or <function>amendscan</function> are called).  It must also keep the
+   number of retained pins fixed and small, to avoid exhausting the backend's
+   buffer pin limit.
+  </para>
+
+  <para>
+   The <function>amkillitemsbatch</function> function is optional.  Index
+   access methods that want to mark dead index tuples with
+   <literal>LP_DEAD</literal> bits should provide it; those that don't can
+   leave it set to <literal>NULL</literal> even when they provide
+   <function>amgetbatch</function>.
+  </para>
+
+  <para>
+<programlisting>
+void
+amreleasebatch (IndexScanDesc scan,
+                IndexScanBatch batch);
+</programlisting>
+   Called by the table AM (via <function>tableam_util_release_batch</function>)
+   when it is safe to release whatever resources the index AM holds to prevent
+   concurrent TID recycling by <command>VACUUM</command>.  For
+   <productname>PostgreSQL</productname>'s built-in B-tree and hash index
+   access methods, this means releasing a buffer pin on the batch's index leaf
+   page; other index access methods may hold different resources (or multiple
+   resources) in their per-batch opaque area.
+  </para>
+
+  <para>
+   The <function>amreleasebatch</function> function is required for any index
+   access method that provides <function>amgetbatch</function>.
+  </para>
 
   <para>
 <programlisting>
@@ -768,8 +1009,8 @@ amgetbitmap (IndexScanDesc scan,
    itself, and therefore callers recheck both the scan conditions and the
    partial index predicate (if any) for recheckable tuples.  That might not
    always be true, however.
-   <function>amgetbitmap</function> and
-   <function>amgettuple</function> cannot be used in the same index scan; there
+   Only one of <function>amgetbitmap</function>, <function>amgettuple</function>,
+   or <function>amgetbatch</function> can be used in any given index scan; there
    are other restrictions too when using <function>amgetbitmap</function>, as explained
    in <xref linkend="index-scanning"/>.
   </para>
@@ -781,6 +1022,29 @@ amgetbitmap (IndexScanDesc scan,
    struct must be set to NULL.
   </para>
 
+  <para>
+   Index access methods that use the <function>amgetbatch</function> interface
+   will generally also want to use the batch allocation infrastructure
+   (<function>indexam_util_batch_alloc</function> and
+   <function>indexam_util_batch_release</function>) within their
+   <function>amgetbitmap</function> implementation.  The convention is that only
+   one batch is allocated at a time during <function>amgetbitmap</function>,
+   unlike <function>amgetbatch</function> where several batches may be
+   outstanding in the batch ring buffer concurrently.  To maintain this
+   one-batch-at-a-time invariant, the index AM itself releases its prior batch
+   via <function>indexam_util_batch_release</function> just as the scan leaves
+   that batch's index page and is about to generate the next batch &mdash; the
+   same point where it extracts navigation state (such as sibling-page links)
+   from <literal>priorbatch</literal>.  This early release is specific to
+   <function>amgetbitmap</function> scans; during <function>amgetbatch</function>
+   scans the <literal>priorbatch</literal> is strictly owned by the table AM
+   and core code, and the index AM must never release it.  See
+   <function>_bt_next</function> and <function>_hash_next</function> for
+   reference examples.  The released batch is cached internally and reused by
+   the next <function>indexam_util_batch_alloc</function> call, avoiding
+   repeated memory allocation during the bitmap scan.
+  </para>
+
   <para>
 <programlisting>
 void
@@ -795,32 +1059,41 @@ amendscan (IndexScanDesc scan);
   <para>
 <programlisting>
 void
-ammarkpos (IndexScanDesc scan);
+amposreset (IndexScanDesc scan,
+            IndexScanBatch batch);
 </programlisting>
-   Mark current scan position.  The access method need only support one
-   remembered scan position per scan.
+   Notify the index AM that the table AM is about to change the scan's
+   logical position in a way that requires the index AM to reset any state
+   that independently tracks the scan's progress.  For example, nbtree must
+   reset the array keys used by <literal>ScalarArrayOpExpr</literal> qual
+   evaluation when the scan position changes.  This callback is invoked when
+   the table AM is about to process a batch in a different direction than
+   was used when the batch was originally returned by
+   <function>amgetbatch</function>, and also when a marked scan position is
+   about to be restored.
   </para>
 
   <para>
-   The <function>ammarkpos</function> function need only be provided if the access
-   method supports ordered scans.  If it doesn't,
-   the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
-   struct may be set to NULL.
+   When <function>amposreset</function> is called due to a cross-batch
+   direction change, the core system will have already flipped the batch's
+   <structfield>dir</structfield> field to reflect the new scan direction
+   before making the call.  The index AM should use this updated direction
+   when resetting any state that depends on knowing which way the scan is
+   proceeding.  When called to restore a marked position, the batch's
+   <structfield>dir</structfield> is not modified; it retains the direction
+   from when the batch was originally returned.  In both cases, the batch
+   passed to <function>amposreset</function> is the batch that will be used
+   to continue the scan.
   </para>
 
   <para>
-<programlisting>
-void
-amrestrpos (IndexScanDesc scan);
-</programlisting>
-   Restore the scan to the most recently marked position.
-  </para>
-
-  <para>
-   The <function>amrestrpos</function> function need only be provided if the access
-   method supports ordered scans.  If it doesn't,
-   the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
-   struct may be set to NULL.
+   Index access methods that have private state which must be reset when the
+   scan position changes must provide an <function>amposreset</function>
+   implementation.  Index AMs with no such state may set
+   <structfield>amposreset</structfield> to NULL.
+   The <function>amposreset</function> function can only be provided when the
+   access method supports ordered scans through the <function>amgetbatch</function>
+   interface.
   </para>
 
   <para>
@@ -975,6 +1248,8 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
        Access methods that always return entries in the natural ordering
        of their data (such as btree) should set
        <structfield>amcanorder</structfield> to true.
+       Both <function>amgettuple</function> and <function>amgetbatch</function>
+       scans support this capability.
        Currently, such access methods must use btree-compatible strategy
        numbers for their equality and ordering operators.
       </para>
@@ -987,41 +1262,56 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
        an order satisfying <literal>ORDER BY</literal> <replaceable>index_key</replaceable>
        <replaceable>operator</replaceable> <replaceable>constant</replaceable>.  Scan modifiers
        of that form can be passed to <function>amrescan</function> as described
-       previously.
+       previously.  Note that <function>amgetbatch</function> scans do not
+       currently support ordering operators.  The core executor expects
+       <function>amgettuple</function> to set
+       <structfield>xs_orderbyvals</structfield> for each returned tuple, but
+       there is currently no mechanism to associate per-item ordering values
+       with individual items within a batch.  This would require an additional
+       layer of indirection that does not yet exist, but could be added in a
+       future version of <productname>PostgreSQL</productname>.
       </para>
      </listitem>
     </itemizedlist>
   </para>
 
   <para>
-   The <function>amgettuple</function> function has a <literal>direction</literal> argument,
+   The <function>amgettuple</function> and <function>amgetbatch</function>
+   functions have a <literal>direction</literal> argument,
    which can be either <literal>ForwardScanDirection</literal> (the normal case)
    or  <literal>BackwardScanDirection</literal>.  If the first call after
    <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
    set of matching index entries is to be scanned back-to-front rather than in
-   the normal front-to-back direction, so <function>amgettuple</function> must return
-   the last matching tuple in the index, rather than the first one as it
-   normally would.  (This will only occur for access
-   methods that set <structfield>amcanorder</structfield> to true.)  After the
-   first call, <function>amgettuple</function> must be prepared to advance the scan in
+   the normal front-to-back direction.  In this case,
+   <function>amgettuple</function> must return the last matching tuple in the
+   index, rather than the first one as it normally would.  Similarly,
+   <function>amgetbatch</function> must return the last matching batch of items
+   when either the first call after <function>amrescan</function> specifies
+   <literal>BackwardScanDirection</literal>, or a subsequent call has
+   <literal>NULL</literal> as its <structfield>priorbatch</structfield> argument
+   (indicating a backward scan restart).  (This backward-scan behavior will
+   only occur for access methods that set <structfield>amcanorder</structfield>
+   to true.)  After the first call, both <function>amgettuple</function> and
+   <function>amgetbatch</function> must be prepared to advance the scan in
    either direction from the most recently returned entry.  (But if
    <structfield>amcanbackward</structfield> is false, all subsequent
    calls will have the same direction as the first one.)
   </para>
 
   <para>
-   Access methods that support ordered scans must support <quote>marking</quote> a
-   position in a scan and later returning to the marked position.  The same
-   position might be restored multiple times.  However, only one position need
-   be remembered per scan; a new <function>ammarkpos</function> call overrides the
-   previously marked position.  An access method that does not support ordered
-   scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
-   functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
-   instead.
+   Access methods using the <function>amgetbatch</function> interface
+   support <quote>marking</quote> a position in a scan and later returning to
+   the marked position.  When a batch is processed in a different direction
+   than it was originally fetched, or when a marked position is restored, the
+   index AM is notified via the <function>amposreset</function> callback (if
+   provided) so it can reset any private state that independently tracks the
+   scan's progress (such as array key state).  See the description of
+   <function>amposreset</function> in <xref linkend="index-functions"/> for
+   details.
   </para>
 
   <para>
-   Both the scan position and the mark position (if any) must be maintained
+   The scan position (if any) must be maintained by the table AM and index AM
    consistently in the face of concurrent insertions or deletions in the
    index.  It is OK if a freshly-inserted entry is not returned by a scan that
    would have found the entry if it had existed when the scan started, or for
@@ -1044,12 +1334,14 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
   </para>
 
   <para>
-   Instead of using <function>amgettuple</function>, an index scan can be done with
-   <function>amgetbitmap</function> to fetch all tuples in one call.  This can be
-   noticeably more efficient than <function>amgettuple</function> because it allows
-   avoiding lock/unlock cycles within the access method.  In principle
-   <function>amgetbitmap</function> should have the same effects as repeated
-   <function>amgettuple</function> calls, but we impose several restrictions to
+   Instead of using <function>amgettuple</function> or
+   <function>amgetbatch</function>, an index scan can be done with
+   <function>amgetbitmap</function> to fetch all tuples in one call.  This can
+   be noticeably more efficient than an <quote>ordered</quote> scan
+   because it allows efficient sequential access to table AM pages containing
+   matches.  In principle <function>amgetbitmap</function> should have the
+   same effects as repeated <function>amgettuple</function> or
+   <function>amgetbatch</function> calls, but we impose several restrictions to
    simplify matters.  First of all, <function>amgetbitmap</function> returns all
    tuples at once and marking or restoring scan positions isn't
    supported. Secondly, the tuples are returned in a bitmap which doesn't
@@ -1066,8 +1358,8 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
 
   <para>
    Note that it is permitted for an access method to implement only
-   <function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa,
-   if its internal implementation is unsuited to one API or the other.
+   <function>amgetbitmap</function> and not <function>amgettuple</function>/<function>amgetbatch</function>,
+   or vice versa, if its internal implementation is unsuited to one API or the other.
   </para>
 
  </sect1>
@@ -1123,11 +1415,17 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
      </listitem>
      <listitem>
       <para>
-       An index scan must maintain a pin
-       on the index page holding the item last returned by
-       <function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete
-       entries from pages that are pinned by other backends.  The need
-       for this rule is explained below.
+       A pin must be held on any index page whose items might still need to
+       be followed, and <function>ambulkdelete</function> must acquire a
+       cleanup lock on each index page, which will block if any other
+       backend holds a pin on that page.
+       For <function>amgettuple</function> scans, the index access method
+       manages this pin directly.
+       For <function>amgetbatch</function> scans, the index AM owns the
+       resources (e.g. buffer pins) in its per-batch opaque area, while the
+       table AM controls when they are released via
+       <function>amreleasebatch</function>.
+       The need for this rule is explained below.
       </para>
      </listitem>
     </itemizedlist>
@@ -1156,23 +1454,41 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
   </para>
 
   <para>
-   This solution requires that index scans be <quote>synchronous</quote>: we have
-   to fetch each heap tuple immediately after scanning the corresponding index
-   entry.  This is expensive for a number of reasons.  An
-   <quote>asynchronous</quote> scan in which we collect many TIDs from the index,
-   and only visit the heap tuples sometime later, requires much less index
-   locking overhead and can allow a more efficient heap access pattern.
+   This solution requires that <function>amgettuple</function> index scans be
+   <quote>synchronous</quote>: the table AM must fetch each heap tuple
+   immediately after scanning the corresponding index entry.  This is
+   expensive for a number of reasons.  An
+   <quote>asynchronous</quote> scan in which we collect many TIDs from the
+   index, and only visit the heap tuples sometime later, requires much less
+   index locking overhead and can allow a more efficient heap access pattern.
    Per the above analysis, we must use the synchronous approach for
    non-MVCC-compliant snapshots, but an asynchronous scan is workable
    for a query using an MVCC snapshot.
   </para>
 
   <para>
-   In an <function>amgetbitmap</function> index scan, the access method does not
-   keep an index pin on any of the returned tuples.  Therefore
+   Index page resources held by <function>amgetbatch</function> batches
+   (typically buffer pins, stored in the index AM's per-batch opaque area)
+   are owned by the index AM but released under the table AM's control via
+   the <function>amreleasebatch</function> callback.  See the
+   <function>amgetbatch</function>, <function>amreleasebatch</function>, and
+   <function>amkillitemsbatch</function> descriptions in
+   <xref linkend="index-functions"/> for details.
+  </para>
+
+  <para>
+   In an <function>amgetbitmap</function> index scan, the access method does
+   not keep an index pin on any of the returned tuples.  Therefore
    it is only safe to use such scans with MVCC-compliant snapshots.
   </para>
 
+  <para>
+   Index access methods that use <function>amgettuple</function> must manage
+   pin lifetime themselves, since there is no table AM intermediary (unlike
+   with <function>amgetbatch</function>).  The index AM must hold a pin on the
+   current index page until the scan moves to a different page or ends.
+  </para>
+
   <para>
    When the <structfield>ampredlocks</structfield> flag is not set, any scan using that
    index access method within a serializable transaction will acquire a
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 982532fe7..f58c28815 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1153,12 +1153,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      </para>
 
      <para>
-      The access method must support <literal>amgettuple</literal> (see <xref
-      linkend="indexam"/>); at present this means <acronym>GIN</acronym>
-      cannot be used.  Although it's allowed, there is little point in using
-      B-tree or hash indexes with an exclusion constraint, because this
-      does nothing that an ordinary unique constraint doesn't do better.
-      So in practice the access method will always be <acronym>GiST</acronym> or
+      The access method must support either <literal>amgettuple</literal>
+      or <literal>amgetbatch</literal> (see <xref linkend="indexam"/>); at
+      present this means <acronym>GIN</acronym> cannot be used.  Although
+      it's allowed, there is little point in using B-tree or hash indexes
+      with an exclusion constraint, because this does nothing that an
+      ordinary unique constraint doesn't do better.  So in practice the
+      access method will always be <acronym>GiST</acronym> or
       <acronym>SP-GiST</acronym>.
      </para>
 
diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c
index 31f8d2b81..389281b51 100644
--- a/src/test/modules/dummy_index_am/dummy_index_am.c
+++ b/src/test/modules/dummy_index_am/dummy_index_am.c
@@ -334,10 +334,12 @@ dihandler(PG_FUNCTION_ARGS)
 		.ambeginscan = dibeginscan,
 		.amrescan = direscan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amreleasebatch = NULL,
 		.amgetbitmap = NULL,
 		.amendscan = diendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0de551837..d331aedfa 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -227,8 +227,6 @@ BTScanInsertData
 BTScanKeyPreproc
 BTScanOpaque
 BTScanOpaqueData
-BTScanPosData
-BTScanPosItem
 BTShared
 BTSortArrayContext
 BTSpool
@@ -257,6 +255,9 @@ BaseBackupCmd
 BaseBackupTargetHandle
 BaseBackupTargetType
 BatchMVCCState
+BatchMatchingItem
+BatchRingBuffer
+BatchRingItemPos
 BeginDirectModify_function
 BeginForeignInsert_function
 BeginForeignModify_function
@@ -1290,6 +1291,8 @@ IndexOrderByDistance
 IndexPath
 IndexRuntimeKeyInfo
 IndexScan
+IndexScanBatch
+IndexScanBatchData
 IndexScanDesc
 IndexScanDescData
 IndexScanInstrumentation
@@ -3486,18 +3489,17 @@ amcanreturn_function
 amcostestimate_function
 amendscan_function
 amestimateparallelscan_function
+amgetbatch_function
 amgetbitmap_function
 amgettreeheight_function
 amgettuple_function
 aminitparallelscan_function
 aminsert_function
 aminsertcleanup_function
-ammarkpos_function
 amoptions_function
 amparallelrescan_function
 amproperty_function
 amrescan_function
-amrestrpos_function
 amtranslate_cmptype_function
 amtranslate_strategy_function
 amvacuumcleanup_function
-- 
2.53.0


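Illustrative aside (not part of the patch): the indexam.sgml text above says that each batch allocation reserves batch_index_opaque_size bytes immediately before the IndexScanBatchData pointer, and that the index AM should provide an inline accessor such as bt_batch_data(). A minimal standalone sketch of that layout might look like the following. Every Demo*/demo_* name is hypothetical, DEMO_MAXALIGN assumes 8-byte maximum alignment, and neither the real allocator (indexam_util_batch_alloc) nor the real struct contents are reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for MAXALIGN(): round up to a multiple of 8. */
#define DEMO_MAXALIGN(len) (((size_t) (len) + 7) & ~((size_t) 7))

/* Hypothetical, heavily reduced stand-in for IndexScanBatchData. */
typedef struct DemoBatchData
{
    int         nitems;         /* number of matching items in the batch */
} DemoBatchData;

typedef DemoBatchData *DemoBatch;

/* Hypothetical per-batch opaque state for an imaginary index AM. */
typedef struct DemoAmBatchOpaque
{
    uint32_t    prev_page;      /* left sibling link */
    uint32_t    next_page;      /* right sibling link */
} DemoAmBatchOpaque;

/*
 * Allocate one zeroed chunk laid out as [opaque area][batch struct] and
 * return a pointer to the batch struct, so that the opaque area sits
 * immediately before the returned pointer.  Returns NULL on failure.
 */
static DemoBatch
demo_batch_alloc(size_t opaque_size)
{
    char       *base;

    /* The caller is expected to pass an already max-aligned size. */
    assert(opaque_size == DEMO_MAXALIGN(opaque_size));

    base = calloc(1, opaque_size + sizeof(DemoBatchData));
    if (base == NULL)
        return NULL;
    return (DemoBatch) (base + opaque_size);
}

/* Accessor: step back from the batch pointer to the AM's opaque area. */
static inline DemoAmBatchOpaque *
demo_am_batch_data(DemoBatch batch, size_t opaque_size)
{
    return (DemoAmBatchOpaque *) ((char *) batch - opaque_size);
}

/* Free the whole chunk, starting from the opaque area's address. */
static void
demo_batch_free(DemoBatch batch, size_t opaque_size)
{
    free((char *) batch - opaque_size);
}
```

A caller would compute the size once, as DEMO_MAXALIGN(sizeof(DemoAmBatchOpaque)), and pass the same value to all three functions. Keeping the size max-aligned is what keeps the batch struct itself properly aligned after the opaque area.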

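Illustrative aside (not part of the patch): the amkillitemsbatch documentation above states that deadItems is sorted and deduplicated before the callback runs, so it can be walked in lockstep with the page's items. The sketch below shows that idea only. It assumes deadItems holds plain integer positions within the batch, uses a bool array in place of LP_DEAD bits on a real index page, and omits the page LSN check the docs call for. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* qsort comparator for ascending int order. */
static int
demo_cmp_int(const void *a, const void *b)
{
    int         ia = *(const int *) a;
    int         ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

/*
 * Sort items[] ascending and drop duplicates in place, mimicking what the
 * docs say is done before amkillitemsbatch is called.  Returns the new
 * element count.
 */
static int
demo_sort_dedup(int *items, int n)
{
    int         out = 0;

    if (n <= 0)
        return 0;

    qsort(items, (size_t) n, sizeof(int), demo_cmp_int);
    for (int i = 1; i < n; i++)
    {
        if (items[i] != items[out])
            items[++out] = items[i];
    }
    return out + 1;
}

/*
 * Walk batch positions 0..nitems-1 in lockstep with the sorted, unique
 * dead[] array, flagging each matching position.  Because both sequences
 * ascend, a single forward pass suffices.  Returns the number flagged.
 */
static int
demo_mark_dead(bool *dead_flags, int nitems, const int *dead, int ndead)
{
    int         d = 0;
    int         marked = 0;

    for (int i = 0; i < nitems && d < ndead; i++)
    {
        if (dead[d] == i)
        {
            dead_flags[i] = true;
            marked++;
            d++;
        }
    }
    return marked;
}
```

Since the sort happens before the walk, the order in which positions were originally recorded does not matter, which matches the documented point that the table AM need not call tableam_util_kill_scanpositem in any particular order.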
  [application/octet-stream] v13-0004-Make-IndexScanInstrumentation-a-pointer-in-execu.patch (9.1K, 17-v13-0004-Make-IndexScanInstrumentation-a-pointer-in-execu.patch)
From dd5f2dae5f10e0c709db86db77aad41fff5532e2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 10 Mar 2026 14:22:25 -0400
Subject: [PATCH v13 04/19] Make IndexScanInstrumentation a pointer in executor
 scan nodes.

Change the IndexScanInstrumentation fields in IndexScanState,
IndexOnlyScanState, and BitmapIndexScanState from inline structs to
pointers.  This avoids adding additional overhead as new fields are
added to IndexScanInstrumentation, at least in the common case where the
instrumentation isn't used (i.e. when the executor node isn't being run
through an EXPLAIN ANALYZE).
---
 src/include/nodes/execnodes.h              |  6 +++---
 src/backend/commands/explain.c             |  6 +++---
 src/backend/executor/nodeBitmapIndexscan.c |  8 ++++++--
 src/backend/executor/nodeIndexonlyscan.c   | 12 ++++++++----
 src/backend/executor/nodeIndexscan.c       | 14 +++++++++-----
 5 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 63c067d5a..51782d1fc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1723,7 +1723,7 @@ typedef struct IndexScanState
 	ExprContext *iss_RuntimeContext;
 	Relation	iss_RelationDesc;
 	struct IndexScanDescData *iss_ScanDesc;
-	IndexScanInstrumentation iss_Instrument;
+	IndexScanInstrumentation *iss_Instrument;
 	SharedIndexScanInstrumentation *iss_SharedInfo;
 
 	/* These are needed for re-checking ORDER BY expr ordering */
@@ -1774,7 +1774,7 @@ typedef struct IndexOnlyScanState
 	ExprContext *ioss_RuntimeContext;
 	Relation	ioss_RelationDesc;
 	struct IndexScanDescData *ioss_ScanDesc;
-	IndexScanInstrumentation ioss_Instrument;
+	IndexScanInstrumentation *ioss_Instrument;
 	SharedIndexScanInstrumentation *ioss_SharedInfo;
 	TupleTableSlot *ioss_TableSlot;
 	Buffer		ioss_VMBuffer;
@@ -1815,7 +1815,7 @@ typedef struct BitmapIndexScanState
 	ExprContext *biss_RuntimeContext;
 	Relation	biss_RelationDesc;
 	struct IndexScanDescData *biss_ScanDesc;
-	IndexScanInstrumentation biss_Instrument;
+	IndexScanInstrumentation *biss_Instrument;
 	SharedIndexScanInstrumentation *biss_SharedInfo;
 } BitmapIndexScanState;
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 93918a223..bed6587c8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3878,7 +3878,7 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 			{
 				IndexScanState *indexstate = ((IndexScanState *) planstate);
 
-				nsearches = indexstate->iss_Instrument.nsearches;
+				nsearches = indexstate->iss_Instrument->nsearches;
 				SharedInfo = indexstate->iss_SharedInfo;
 				break;
 			}
@@ -3886,7 +3886,7 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 			{
 				IndexOnlyScanState *indexstate = ((IndexOnlyScanState *) planstate);
 
-				nsearches = indexstate->ioss_Instrument.nsearches;
+				nsearches = indexstate->ioss_Instrument->nsearches;
 				SharedInfo = indexstate->ioss_SharedInfo;
 				break;
 			}
@@ -3894,7 +3894,7 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 			{
 				BitmapIndexScanState *indexstate = ((BitmapIndexScanState *) planstate);
 
-				nsearches = indexstate->biss_Instrument.nsearches;
+				nsearches = indexstate->biss_Instrument->nsearches;
 				SharedInfo = indexstate->biss_SharedInfo;
 				break;
 			}
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 058a59ef5..2ca822cf8 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -201,7 +201,7 @@ ExecEndBitmapIndexScan(BitmapIndexScanState *node)
 		 * shutdown on the workers.  On rescan it will spin up new workers
 		 * which will have a new BitmapIndexScanState and zeroed stats.
 		 */
-		winstrument->nsearches += node->biss_Instrument.nsearches;
+		winstrument->nsearches += node->biss_Instrument->nsearches;
 	}
 
 	/*
@@ -272,6 +272,10 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
 	if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
 		return indexstate;
 
+	/* Set up instrumentation of bitmap index scans if requested */
+	if (estate->es_instrument)
+		indexstate->biss_Instrument = palloc0_object(IndexScanInstrumentation);
+
 	/* Open the index relation. */
 	lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
 	indexstate->biss_RelationDesc = index_open(node->indexid, lockmode);
@@ -323,7 +327,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
 	indexstate->biss_ScanDesc =
 		index_beginscan_bitmap(indexstate->biss_RelationDesc,
 							   estate->es_snapshot,
-							   &indexstate->biss_Instrument,
+							   indexstate->biss_Instrument,
 							   indexstate->biss_NumScanKeys);
 
 	/*
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index c2d093745..f84db0476 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -92,7 +92,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->ioss_RelationDesc,
 								   estate->es_snapshot,
-								   &node->ioss_Instrument,
+								   node->ioss_Instrument,
 								   node->ioss_NumScanKeys,
 								   node->ioss_NumOrderByKeys);
 
@@ -432,7 +432,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 		 * shutdown on the workers.  On rescan it will spin up new workers
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
-		winstrument->nsearches += node->ioss_Instrument.nsearches;
+		winstrument->nsearches += node->ioss_Instrument->nsearches;
 	}
 
 	/*
@@ -604,6 +604,10 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
 		return indexstate;
 
+	/* Set up instrumentation of index-only scans if requested */
+	if (estate->es_instrument)
+		indexstate->ioss_Instrument = palloc0_object(IndexScanInstrumentation);
+
 	/* Open the index relation. */
 	lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
 	indexRelation = index_open(node->indexid, lockmode);
@@ -785,7 +789,7 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->ioss_RelationDesc,
-								 &node->ioss_Instrument,
+								 node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan);
@@ -851,7 +855,7 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->ioss_RelationDesc,
-								 &node->ioss_Instrument,
+								 node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index a616abff0..36320d7d2 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -109,7 +109,7 @@ IndexNext(IndexScanState *node)
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->iss_RelationDesc,
 								   estate->es_snapshot,
-								   &node->iss_Instrument,
+								   node->iss_Instrument,
 								   node->iss_NumScanKeys,
 								   node->iss_NumOrderByKeys);
 
@@ -205,7 +205,7 @@ IndexNextWithReorder(IndexScanState *node)
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->iss_RelationDesc,
 								   estate->es_snapshot,
-								   &node->iss_Instrument,
+								   node->iss_Instrument,
 								   node->iss_NumScanKeys,
 								   node->iss_NumOrderByKeys);
 
@@ -811,7 +811,7 @@ ExecEndIndexScan(IndexScanState *node)
 		 * shutdown on the workers.  On rescan it will spin up new workers
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
-		winstrument->nsearches += node->iss_Instrument.nsearches;
+		winstrument->nsearches += node->iss_Instrument->nsearches;
 	}
 
 	/*
@@ -971,6 +971,10 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
 		return indexstate;
 
+	/* Set up instrumentation of index scans if requested */
+	if (estate->es_instrument)
+		indexstate->iss_Instrument = palloc0_object(IndexScanInstrumentation);
+
 	/* Open the index relation. */
 	lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
 	indexstate->iss_RelationDesc = index_open(node->indexid, lockmode);
@@ -1720,7 +1724,7 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->iss_RelationDesc,
-								 &node->iss_Instrument,
+								 node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
 								 piscan);
@@ -1784,7 +1788,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->iss_RelationDesc,
-								 &node->iss_Instrument,
+								 node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
 								 piscan);
-- 
2.53.0
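
The inline-struct-to-pointer change in the patch above can be illustrated with a small standalone sketch. This is not code from the patch: ScanInstr, ScanNode, and the three functions are hypothetical stand-ins for IndexScanInstrumentation and the executor scan nodes, and calloc stands in for palloc0_object. It only shows the general pattern, assuming the counters are allocated when instrumentation is requested and that every update checks for NULL first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for IndexScanInstrumentation */
typedef struct ScanInstr
{
	uint64_t	nsearches;		/* number of index searches performed */
} ScanInstr;

/* Hypothetical stand-in for an executor scan node */
typedef struct ScanNode
{
	ScanInstr  *instr;			/* NULL unless instrumentation was requested */
} ScanNode;

/* Allocate zeroed counters only when instrumentation is wanted */
void
scan_node_init(ScanNode *node, int instrument)
{
	node->instr = NULL;
	if (instrument)
	{
		node->instr = calloc(1, sizeof(ScanInstr));
		assert(node->instr != NULL);
	}
}

/* Count one search; a no-op when instrumentation is off */
void
scan_count_search(ScanNode *node)
{
	if (node->instr != NULL)
		node->instr->nsearches++;
}

/* Report the count, treating "no instrumentation" as zero */
uint64_t
scan_reported_searches(const ScanNode *node)
{
	return node->instr ? node->instr->nsearches : 0;
}
```

With this layout the node itself grows by only one pointer, however many counters are later added to the instrumentation struct.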



  [application/octet-stream] v13-0005-heapam-Track-heap-block-in-IndexFetchHeapData-us.patch (3.7K, 18-v13-0005-heapam-Track-heap-block-in-IndexFetchHeapData-us.patch)
From c3fbd90e7bf49c8deaea2692b0c80c954c69690d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 10 Mar 2026 14:40:35 -0400
Subject: [PATCH v13 05/19] heapam: Track heap block in IndexFetchHeapData
 using xs_blk

Add an explicit BlockNumber field (xs_blk) to IndexFetchHeapData that
tracks which heap block is currently pinned in xs_cbuf.

heapam_index_fetch_tuple now uses xs_blk to determine when buffer
switching is needed, replacing the previous approach that compared
buffer identities via ReleaseAndReadBuffer on every non-HOT-chain call.
The new approach skips the buffer-switching path entirely when the next
TID is on the same heap page, which also means that heap_page_prune_opt
is called exactly once per block (when the block is first pinned).

This is preparatory work for an upcoming commit that will need xs_blk
to manage buffer pin transfers between the scan and the executor slot.
---
 src/include/access/heapam.h              |  1 +
 src/backend/access/heap/heapam_handler.c | 28 +++++++++++++++---------
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ad993c073..a859c90f4 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -117,6 +117,7 @@ typedef struct IndexFetchHeapData
 	IndexFetchTableData xs_base;	/* AM independent part of the descriptor */
 
 	Buffer		xs_cbuf;		/* current heap buffer in scan, if any */
+	BlockNumber xs_blk;			/* xs_cbuf's block number, if any */
 	/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
 } IndexFetchHeapData;
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5137d2510..d7b05aa14 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -85,6 +85,7 @@ heapam_index_fetch_begin(Relation rel)
 
 	hscan->xs_base.rel = rel;
 	hscan->xs_cbuf = InvalidBuffer;
+	hscan->xs_blk = InvalidBlockNumber;
 
 	return &hscan->xs_base;
 }
@@ -99,6 +100,7 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
 		ReleaseBuffer(hscan->xs_cbuf);
 		hscan->xs_cbuf = InvalidBuffer;
 	}
+	hscan->xs_blk = InvalidBlockNumber;
 }
 
 static void
@@ -124,23 +126,29 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
-	if (!*call_again)
+	/* We can skip the buffer-switching logic if we're on the same page. */
+	if (hscan->xs_blk != ItemPointerGetBlockNumber(tid))
 	{
-		/* Switch to correct buffer if we don't have it already */
-		Buffer		prev_buf = hscan->xs_cbuf;
+		Assert(!*call_again);
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
+		/* Remember this buffer's block number for next time */
+		hscan->xs_blk = ItemPointerGetBlockNumber(tid);
+
+		if (BufferIsValid(hscan->xs_cbuf))
+			ReleaseBuffer(hscan->xs_cbuf);
+
+		hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * Prune page when it is pinned for the first time
 		 */
-		if (prev_buf != hscan->xs_cbuf)
-			heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf);
+		heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf);
 	}
 
+	Assert(BufferIsValid(hscan->xs_cbuf));
+	Assert(BufferGetBlockNumber(hscan->xs_cbuf) == hscan->xs_blk);
+	Assert(hscan->xs_blk == ItemPointerGetBlockNumber(tid));
+
 	/* Obtain share-lock on the buffer so we can examine visibility */
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
 	got_heap_tuple = heap_hot_search_buffer(tid,
-- 
2.53.0
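
The effect of tracking the current block number, as the patch above does with xs_blk, can be seen in a toy model. Nothing here comes from the patch: FetchState, fetch_begin, and fetch_tid are hypothetical names, and the "pin" and "prune" steps are reduced to counters. The sketch only assumes that a new block is pinned, and pruning attempted, when the incoming TID's block differs from the remembered one.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/* Hypothetical, heavily simplified stand-in for IndexFetchHeapData */
typedef struct FetchState
{
	uint32_t	cur_blk;		/* block currently "pinned", if any */
	int			npins;			/* how many times a block was pinned */
	int			nprunes;		/* how many prune attempts were made */
} FetchState;

void
fetch_begin(FetchState *fs)
{
	fs->cur_blk = INVALID_BLOCK;
	fs->npins = 0;
	fs->nprunes = 0;
}

/* Fetch the tuple for a TID that lives on block tid_blk */
void
fetch_tid(FetchState *fs, uint32_t tid_blk)
{
	if (fs->cur_blk != tid_blk)
	{
		/* Different page: remember it, "pin" it, and "prune" it once */
		fs->cur_blk = tid_blk;
		fs->npins++;
		fs->nprunes++;
	}

	/* Same page as last time: the whole buffer-switching path is skipped */
	assert(fs->cur_blk == tid_blk);
}
```

For the block sequence 7, 7, 7, 9, 9, 7 this model pins and prunes three times, once per run of TIDs on the same page.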



  [application/octet-stream] v13-0003-Use-fake-LSNs-to-improve-nbtree-dropPin-behavior.patch (15.1K, 19-v13-0003-Use-fake-LSNs-to-improve-nbtree-dropPin-behavior.patch)
From 804b99906b7bfb6b2519074d1adc1077c0a2071b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sun, 18 Jan 2026 11:14:36 -0500
Subject: [PATCH v13 03/19] Use fake LSNs to improve nbtree dropPin behavior.

Previously, scans of unlogged nbtree indexes had to hold on to a leaf
page buffer pin while stopped on that leaf page, purely so that
_bt_killitems could be sure that there had been no unsafe concurrent
TID recycling by VACUUM.  _bt_killitems' dropPin strategy couldn't be
used before now: it works by checking whether the page LSN changed
between _bt_readpage reading the page's items and _bt_killitems being
called, and page modifications in unlogged indexes never advanced the
page LSN.  nbtree now stamps modified pages of unlogged indexes with a
fake LSN, so the same LSN trick works there too.  This brings the same
benefits to these scans that commit 2ed5b87f brought to scans of logged
relations.

This is preparation for an upcoming commit that will add the amgetbatch
interface and switch nbtree over to it (from amgettuple).  That will go
further by completely obviating the need for amgetbatch scans to hang on
to buffer pins (barring scans involving a non-MVCC snapshot).

Author: Peter Geoghegan <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
---
 src/backend/access/nbtree/README      |  5 +-
 src/backend/access/nbtree/nbtdedup.c  |  8 ++-
 src/backend/access/nbtree/nbtinsert.c | 48 +++++++++-------
 src/backend/access/nbtree/nbtpage.c   | 82 +++++++++++++++------------
 src/backend/access/nbtree/nbtree.c    |  8 ---
 src/backend/access/nbtree/nbtsearch.c |  1 -
 src/backend/access/nbtree/nbtutils.c  |  1 -
 7 files changed, 80 insertions(+), 73 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc..cb921ca2e 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -485,9 +485,8 @@ We handle this kill_prior_tuple race condition by having affected index
 scans conservatively assume that any change to the leaf page at all
 implies that it was reached by btbulkdelete in the interim period when no
 buffer pin was held.  This is implemented by not setting any LP_DEAD bits
-on the leaf page at all when the page's LSN has changed.  (That won't work
-with an unlogged index, so for now we don't ever apply the "don't hold
-onto pin" optimization there.)
+on the leaf page at all when the page's LSN has changed.  (This is why we
+implement "fake" LSNs for unlogged index relations.)
 
 Fastpath For Index Insertion
 ----------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 95be0b179..af7affdf4 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -69,6 +69,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 	Size		pagesaving PG_USED_FOR_ASSERTS_ONLY = 0;
 	bool		singlevalstrat = false;
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	XLogRecPtr	recptr;
 
 	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
 	newitemsz += sizeof(ItemIdData);
@@ -245,7 +246,6 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 	/* XLOG stuff */
 	if (RelationNeedsWAL(rel))
 	{
-		XLogRecPtr	recptr;
 		xl_btree_dedup xlrec_dedup;
 
 		xlrec_dedup.nintervals = state->nintervals;
@@ -263,9 +263,11 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 							state->nintervals * sizeof(BTDedupInterval));
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
-
-		PageSetLSN(page, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(page, recptr);
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 91bb37d66..c8af97dd2 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1137,6 +1137,7 @@ _bt_insertonpg(Relation rel,
 	IndexTuple	oposting = NULL;
 	IndexTuple	origitup = NULL;
 	IndexTuple	nposting = NULL;
+	XLogRecPtr	recptr;
 
 	page = BufferGetPage(buf);
 	opaque = BTPageGetOpaque(page);
@@ -1334,7 +1335,6 @@ _bt_insertonpg(Relation rel,
 			xl_btree_insert xlrec;
 			xl_btree_metadata xlmeta;
 			uint8		xlinfo;
-			XLogRecPtr	recptr;
 			uint16		upostingoff;
 
 			xlrec.offnum = newitemoff;
@@ -1407,14 +1407,16 @@ _bt_insertonpg(Relation rel,
 			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
-
-			if (BufferIsValid(metabuf))
-				PageSetLSN(metapg, recptr);
-			if (!isleaf)
-				PageSetLSN(BufferGetPage(cbuf), recptr);
-
-			PageSetLSN(page, recptr);
 		}
+		else
+			recptr = XLogGetFakeLSN(rel);
+
+		if (BufferIsValid(metabuf))
+			PageSetLSN(metapg, recptr);
+		if (!isleaf)
+			PageSetLSN(BufferGetPage(cbuf), recptr);
+
+		PageSetLSN(page, recptr);
 
 		END_CRIT_SECTION();
 
@@ -1516,6 +1518,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
 	bool		newitemonleft,
 				isleaf,
 				isrightmost;
+	XLogRecPtr	recptr;
 
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
@@ -1995,7 +1998,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
 	{
 		xl_btree_split xlrec;
 		uint8		xlinfo;
-		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo_level;
 		/* See comments below on newitem, orignewitem, and posting lists */
@@ -2079,14 +2081,16 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
 
 		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
-
-		PageSetLSN(origpage, recptr);
-		PageSetLSN(rightpage, recptr);
-		if (!isrightmost)
-			PageSetLSN(spage, recptr);
-		if (!isleaf)
-			PageSetLSN(BufferGetPage(cbuf), recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(origpage, recptr);
+	PageSetLSN(rightpage, recptr);
+	if (!isrightmost)
+		PageSetLSN(spage, recptr);
+	if (!isleaf)
+		PageSetLSN(BufferGetPage(cbuf), recptr);
 
 	END_CRIT_SECTION();
 
@@ -2504,6 +2508,7 @@ _bt_newlevel(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	XLogRecPtr	recptr;
 
 	lbkno = BufferGetBlockNumber(lbuf);
 	rbkno = BufferGetBlockNumber(rbuf);
@@ -2599,7 +2604,6 @@ _bt_newlevel(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
 	if (RelationNeedsWAL(rel))
 	{
 		xl_btree_newroot xlrec;
-		XLogRecPtr	recptr;
 		xl_btree_metadata md;
 
 		xlrec.rootblk = rootblknum;
@@ -2633,11 +2637,13 @@ _bt_newlevel(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
 							((PageHeader) rootpage)->pd_upper);
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT);
-
-		PageSetLSN(lpage, recptr);
-		PageSetLSN(rootpage, recptr);
-		PageSetLSN(metapg, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(lpage, recptr);
+	PageSetLSN(rootpage, recptr);
+	PageSetLSN(metapg, recptr);
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9aa78068a..cc9c45dc4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -235,6 +235,7 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	Buffer		metabuf;
 	Page		metapg;
 	BTMetaPageData *metad;
+	XLogRecPtr	recptr;
 
 	/*
 	 * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -286,7 +287,6 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	if (RelationNeedsWAL(rel))
 	{
 		xl_btree_metadata md;
-		XLogRecPtr	recptr;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
@@ -303,9 +303,11 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 		XLogRegisterBufData(0, &md, sizeof(xl_btree_metadata));
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_META_CLEANUP);
-
-		PageSetLSN(metapg, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(metapg, recptr);
 
 	END_CRIT_SECTION();
 
@@ -351,6 +353,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 	BlockNumber rootblkno;
 	uint32		rootlevel;
 	BTMetaPageData *metad;
+	XLogRecPtr	recptr;
 
 	Assert(access == BT_READ || heaprel != NULL);
 
@@ -473,7 +476,6 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 		if (RelationNeedsWAL(rel))
 		{
 			xl_btree_newroot xlrec;
-			XLogRecPtr	recptr;
 			xl_btree_metadata md;
 
 			XLogBeginInsert();
@@ -497,10 +499,12 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 			XLogRegisterData(&xlrec, SizeOfBtreeNewroot);
 
 			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT);
-
-			PageSetLSN(rootpage, recptr);
-			PageSetLSN(metapg, recptr);
 		}
+		else
+			recptr = XLogGetFakeLSN(rel);
+
+		PageSetLSN(rootpage, recptr);
+		PageSetLSN(metapg, recptr);
 
 		END_CRIT_SECTION();
 
@@ -1162,6 +1166,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	XLogRecPtr	recptr;
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1226,7 +1231,6 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	/* XLOG stuff */
 	if (needswal)
 	{
-		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
@@ -1248,9 +1252,11 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
-
-		PageSetLSN(page, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(page, recptr);
 
 	END_CRIT_SECTION();
 
@@ -1292,6 +1298,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	char	   *updatedbuf = NULL;
 	Size		updatedbuflen = 0;
 	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
+	XLogRecPtr	recptr;
 
 	/* Shouldn't be called unless there's something to do */
 	Assert(ndeletable > 0 || nupdatable > 0);
@@ -1342,7 +1349,6 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	/* XLOG stuff */
 	if (needswal)
 	{
-		XLogRecPtr	recptr;
 		xl_btree_delete xlrec_delete;
 
 		xlrec_delete.snapshotConflictHorizon = snapshotConflictHorizon;
@@ -1366,9 +1372,11 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE);
-
-		PageSetLSN(page, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	PageSetLSN(page, recptr);
 
 	END_CRIT_SECTION();
 
@@ -2103,6 +2111,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
 	OffsetNumber nextoffset;
 	IndexTuple	itup;
 	IndexTupleData trunctuple;
+	XLogRecPtr	recptr;
 
 	page = BufferGetPage(leafbuf);
 	opaque = BTPageGetOpaque(page);
@@ -2253,7 +2262,6 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
 	if (RelationNeedsWAL(rel))
 	{
 		xl_btree_mark_page_halfdead xlrec;
-		XLogRecPtr	recptr;
 
 		xlrec.poffset = poffset;
 		xlrec.leafblk = leafblkno;
@@ -2274,12 +2282,14 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
 		XLogRegisterData(&xlrec, SizeOfBtreeMarkPageHalfDead);
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_MARK_PAGE_HALFDEAD);
-
-		page = BufferGetPage(subtreeparent);
-		PageSetLSN(page, recptr);
-		page = BufferGetPage(leafbuf);
-		PageSetLSN(page, recptr);
 	}
+	else
+		recptr = XLogGetFakeLSN(rel);
+
+	page = BufferGetPage(subtreeparent);
+	PageSetLSN(page, recptr);
+	page = BufferGetPage(leafbuf);
+	PageSetLSN(page, recptr);
 
 	END_CRIT_SECTION();
 
@@ -2337,6 +2347,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	uint32		targetlevel;
 	IndexTuple	leafhikey;
 	BlockNumber leaftopparent;
+	XLogRecPtr	recptr;
 
 	page = BufferGetPage(leafbuf);
 	opaque = BTPageGetOpaque(page);
@@ -2676,7 +2687,6 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
 		uint8		xlinfo;
-		XLogRecPtr	recptr;
 
 		XLogBeginInsert();
 
@@ -2720,25 +2730,25 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 			xlinfo = XLOG_BTREE_UNLINK_PAGE;
 
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+	}
+	else
+		recptr = XLogGetFakeLSN(rel);
 
-		if (BufferIsValid(metabuf))
-		{
-			PageSetLSN(metapg, recptr);
-		}
-		page = BufferGetPage(rbuf);
+	if (BufferIsValid(metabuf))
+		PageSetLSN(metapg, recptr);
+	page = BufferGetPage(rbuf);
+	PageSetLSN(page, recptr);
+	page = BufferGetPage(buf);
+	PageSetLSN(page, recptr);
+	if (BufferIsValid(lbuf))
+	{
+		page = BufferGetPage(lbuf);
 		PageSetLSN(page, recptr);
-		page = BufferGetPage(buf);
+	}
+	if (target != leafblkno)
+	{
+		page = BufferGetPage(leafbuf);
 		PageSetLSN(page, recptr);
-		if (BufferIsValid(lbuf))
-		{
-			page = BufferGetPage(lbuf);
-			PageSetLSN(page, recptr);
-		}
-		if (target != leafblkno)
-		{
-			page = BufferGetPage(leafbuf);
-			PageSetLSN(page, recptr);
-		}
 	}
 
 	END_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 6d0a6f27f..0da48b42a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -407,13 +407,6 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	 * race condition involving VACUUM setting pages all-visible in the VM.
 	 * It's also unsafe for plain index scans that use a non-MVCC snapshot.
 	 *
-	 * When we drop pins eagerly, the mechanism that marks so->killedItems[]
-	 * index tuples LP_DEAD has to deal with concurrent TID recycling races.
-	 * The scheme used to detect unsafe TID recycling won't work when scanning
-	 * unlogged relations (since it involves saving an affected page's LSN).
-	 * Opt out of eager pin dropping during unlogged relation scans for now
-	 * (this is preferable to opting out of kill_prior_tuple LP_DEAD setting).
-	 *
 	 * Also opt out of dropping leaf page pins eagerly during bitmap scans.
 	 * Pins cannot be held for more than an instant during bitmap scans either
 	 * way, so we might as well avoid wasting cycles on acquiring page LSNs.
@@ -424,7 +417,6 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	 */
 	so->dropPin = (!scan->xs_want_itup &&
 				   IsMVCCSnapshot(scan->xs_snapshot) &&
-				   RelationNeedsWAL(scan->indexRelation) &&
 				   scan->heapRelation != NULL);
 
 	so->markItemIndex = -1;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index ab452c7b0..aae6acb7f 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -67,7 +67,6 @@ _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
 	 * Have to set so->currPos.lsn so that _bt_killitems has a way to detect
 	 * when concurrent heap TID recycling by VACUUM might have taken place.
 	 */
-	Assert(RelationNeedsWAL(rel));
 	so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
 	_bt_relbuf(rel, so->currPos.buf);
 	so->currPos.buf = InvalidBuffer;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 32d94e236..9b0918589 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -236,7 +236,6 @@ _bt_killitems(IndexScanDesc scan)
 		XLogRecPtr	latestlsn;
 
 		Assert(!BTScanPosIsPinned(so->currPos));
-		Assert(RelationNeedsWAL(rel));
 		buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
 
 		latestlsn = BufferGetLSNAtomic(buf);
-- 
2.53.0
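
The LSN check that the patch above extends to unlogged indexes can be modelled in a few lines. This is a toy sketch, not nbtree code: ToyPage, toy_modify_page, and toy_can_set_lp_dead are hypothetical, real WAL positions are replaced by a simple counter, and the fake LSN source is assumed to be a plain incrementing counter. The use_fake_lsn flag switches between the old behavior (unlogged pages keep their LSN) and the new one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy page: just an LSN and whether its relation is WAL-logged */
typedef struct ToyPage
{
	uint64_t	lsn;
	bool		logged;
} ToyPage;

static uint64_t toy_wal_lsn = 1000; /* stand-in for the WAL insert position */
static uint64_t toy_fake_lsn = 1;	/* stand-in for a fake LSN counter */

/* Modify a page, stamping it with a new LSN where one is available */
void
toy_modify_page(ToyPage *page, bool use_fake_lsn)
{
	assert(page != NULL);

	if (page->logged)
	{
		toy_wal_lsn += 64;		/* pretend a WAL record was inserted */
		page->lsn = toy_wal_lsn;
	}
	else if (use_fake_lsn)
		page->lsn = ++toy_fake_lsn;
	/* else: old behavior for unlogged pages, LSN left untouched */
}

/*
 * A scan that dropped its pin may set LP_DEAD bits only if the page LSN is
 * unchanged since the scan read the page's items.
 */
bool
toy_can_set_lp_dead(const ToyPage *page, uint64_t saved_lsn)
{
	return page->lsn == saved_lsn;
}
```

In this model a modification to an unlogged page goes unnoticed when use_fake_lsn is false, which is why such scans previously had to keep their pin instead.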



  [application/octet-stream] v13-0002-Use-GetXLogInsertEndRecPtr-in-XLogGetFakeLSN.patch (3.1K, 20-v13-0002-Use-GetXLogInsertEndRecPtr-in-XLogGetFakeLSN.patch)
From 4ded292fbfb661b6f20e5c0d44dcb67fc9ace2bf Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 5 Mar 2026 16:51:28 +0100
Subject: [PATCH v13 02/19] Use GetXLogInsertEndRecPtr in XLogGetFakeLSN

XLogGetFakeLSN() used GetXLogInsertRecPtr() to generate the fake LSN.
Most of the time this is the same as what XLogInsert() would return, and
so it works fine with the XLogFlush() call. But if the last record ends
at a page boundary, GetXLogInsertRecPtr() returns an LSN pointing after
the page header. In such a case XLogFlush() fails with errors like this:

  ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000

Such failures are very hard to trigger, particularly outside aggressive
test scenarios.

Fixed by introducing GetXLogInsertEndRecPtr(), returning the correct LSN
without skipping the header. This is the same as GetXLogInsertRecPtr(),
except that it calls XLogBytePosToEndRecPtr().

This is a long-standing bug in gistGetFakeLSN(), probably introduced by
c6b92041d38. The fake LSN approach was generalized to other index types,
inheriting the same bug.

Discussion: https://postgr.es/m/vf4hbwrotvhbgcnknrqmfbqlu75oyjkmausvy66ic7x7vuhafx@e4rvwavtjswo
---
 src/include/access/xlog.h               |  1 +
 src/backend/access/transam/xlog.c       | 16 ++++++++++++++++
 src/backend/access/transam/xloginsert.c |  2 +-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 553d6fc9c..dcc12eb8c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -238,6 +238,7 @@ extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
 extern bool XLogInsertAllowed(void);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
+extern XLogRecPtr GetXLogInsertEndRecPtr(void);
 extern XLogRecPtr GetXLogWriteRecPtr(void);
 
 extern uint64 GetSystemIdentifier(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92e44a501..2b6e61201 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9623,6 +9623,22 @@ GetXLogInsertRecPtr(void)
 	return XLogBytePosToRecPtr(current_bytepos);
 }
 
+/*
+ * Get latest WAL insert pointer
+ */
+XLogRecPtr
+GetXLogInsertEndRecPtr(void)
+{
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+	uint64		current_bytepos;
+
+	SpinLockAcquire(&Insert->insertpos_lck);
+	current_bytepos = Insert->CurrBytePos;
+	SpinLockRelease(&Insert->insertpos_lck);
+
+	return XLogBytePosToEndRecPtr(current_bytepos);
+}
+
 /*
  * Get latest WAL write pointer
  */
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 4e049982f..14a55c7ad 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -598,7 +598,7 @@ XLogGetFakeLSN(Relation rel)
 		 * last call.
 		 */
 		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
-		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
+		XLogRecPtr	currlsn = GetXLogInsertEndRecPtr();
 
 		Assert(!RelationNeedsWAL(rel));
 		Assert(RelationIsPermanent(rel));
-- 
2.53.0
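
The page-boundary case described in the patch above comes down to simple arithmetic, sketched here as a simplified model. These are not the real XLogBytePosToRecPtr() and XLogBytePosToEndRecPtr(): the toy_ functions are hypothetical, they assume 8192-byte WAL pages with a 24-byte header on every page, and they ignore the longer header at the start of each segment. The 0x18 gap between the two LSNs in the quoted error message is consistent with that header size.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLCKSZ		8192	/* assumed WAL page size */
#define TOY_PAGE_HDR	24		/* assumed per-page header size */
#define TOY_USABLE		(TOY_BLCKSZ - TOY_PAGE_HDR)

/*
 * Convert a "usable byte position" into a start-of-record position.  At a
 * page boundary this points past the header of the new page, since that is
 * where the next record would begin.
 */
uint64_t
toy_bytepos_to_recptr(uint64_t bytepos)
{
	uint64_t	fullpages = bytepos / TOY_USABLE;
	uint64_t	bytesleft = bytepos % TOY_USABLE;

	return fullpages * TOY_BLCKSZ + TOY_PAGE_HDR + bytesleft;
}

/*
 * Convert a "usable byte position" into an end-of-record position.  At a
 * page boundary this points at the boundary itself, without skipping the
 * header, since that is where the previous record ended.
 */
uint64_t
toy_bytepos_to_endrecptr(uint64_t bytepos)
{
	uint64_t	fullpages = bytepos / TOY_USABLE;
	uint64_t	bytesleft = bytepos % TOY_USABLE;

	if (bytesleft == 0)
		return fullpages * TOY_BLCKSZ;

	return fullpages * TOY_BLCKSZ + TOY_PAGE_HDR + bytesleft;
}
```

Away from a page boundary the two conversions agree. Exactly on a boundary the first one is ahead by the header size, so a flush request for that position asks for bytes that have not been written yet.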
