public inbox for [email protected]  
From: Peter Geoghegan <[email protected]>
To: Andres Freund <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Alexandre Felipe <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Konstantin Knizhnik <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Wed, 1 Apr 2026 18:50:15 -0400
Message-ID: <CAH2-Wz=t3G53xKGYEWqm_QV35ExRgT2k=qhw_VHe5oGjdFRwtA@mail.gmail.com> (raw)
In-Reply-To: <CAH2-Wz=nz-zr=gaXL1od_F7dcr=7d+3jEEveqY-bgcAKF6wZJQ@mail.gmail.com>
References: <vbb4naf2tvm2tm7yoml54pzvrmn77p4nvq4awfa4wufc3hn7qx@mof5q6li3xzv>
	<CAH2-Wzn1j2a0p3OqmqrV6zADtWA_QpG82U6F9yCYG1Uschm_fA@mail.gmail.com>
	<CAH2-WzmCH+N2-H2oGSQcbn2fArbk7GXyD6rQN6kn5P=FX9R-_g@mail.gmail.com>
	<CAH2-WzkyG01682zwqyUTwV=Zq+M_qGgi1NbXwp1H-piRSfJsgQ@mail.gmail.com>
	<CAH2-Wz=HJc+QV2AZ9mUY43aKL+n+a1JQ-7OGE=MOkqSAtoKJug@mail.gmail.com>
	<t6mtqbv2mbfhjni4bvwdgoecppjmxvbyfwl6utovzv76xc2672@k3o5ryevaeqv>
	<CAH2-Wz=D4Lru9BkvqaRnFRPDaZbfTOdWcxw13zyG6GVFTtz_vw@mail.gmail.com>
	<CAH2-Wz=Vxsgas35ZzOJJW1ceqp9TJ2DFhKmXULwUAcVpfD73xA@mail.gmail.com>
	<CAH2-Wz=kMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G=zrw@mail.gmail.com>
	<CAH2-WzkFRoTjD9T8ykYDzOMxzGiWFqcAkbK8B=HjfpoMdM4E8A@mail.gmail.com>
	<chsvntdxvsiyigxq4nng36gne4natvxwvsqnkvbjlpaw6bu7co@a6togdo4wbrj>
	<CAH2-Wz=nz-zr=gaXL1od_F7dcr=7d+3jEEveqY-bgcAKF6wZJQ@mail.gmail.com>

On Wed, Apr 1, 2026 at 11:51 AM Peter Geoghegan <[email protected]> wrote:
> I plan on clarifying things in this area for v20. It will fully
> embrace the idea that IndexFetchTableData is just the table AM
> specific part of IndexScanDescData. This will allow me to move most of
> the batch-related calls currently in indexam.c over to heapam/the
> table AM. That will make it very clear that the table AM fully owns
> the batch ring buffer (there'll continue to be indexam.c calls to do
> things like take a mark, since that is implemented in a way that's
> already table AM agnostic).

Attached is v20, which does things this way. Now it's completely clear
that the scan's batch ring buffer is fully controlled by the table AM.

There are now only 2 remaining indexbatch.c functions that get called
from indexam.c. Since these are both trivial, highly generic
functions, restructuring things so that heapam would call these two
seemed unnecessary.

> One problem with the current design is that it still treats
> IndexFetchTableData as not just the table AM piece of
> IndexScanDescData. I now realize that it works that way purely so we
> can avoid changing anything about index_fetch_tuple. But, as you say,
> we really *should* do that anyway -- it shouldn't need its own
> IndexFetchTableData (or its own IndexScanDescData), since it literally
> doesn't have anything to do with index scans.

v20 also repurposes and renames index_fetch_tuple (the new name is
fetch_tid). This clarifies that it's a specialized function used only
by callers that inherently need to pass their own TID (the 2 remaining
constraint enforcement callers). These callers don't really perform an
index scan at all (the TID might come from an index, but clearly
_bt_check_unique isn't performing an index scan, by any reasonable
definition).

IndexFetchTableData is no longer used by fetch_tid or its callers. As
discussed, this change makes it possible to pass an IndexScanDescData
pointer (not an IndexFetchTableData pointer) through to heapam
functions like heapam_index_fetch_reset and heapam_index_fetch_end
(since doing so won't break index_fetch_tuple/fetch_tid, now that
their dependency on IndexFetchTableData has been broken).

One consequence of this refactoring is that we can no longer just call
heapam_index_fetch_reset from indexam.c when restoring a mark -- that
will reset the scan's batchringbuf, which isn't appropriate here (we
usually don't change the batchringbuf at all; in the happy path we
only need to override scanPos with markPos). indexam.c now calls a new
heapam_index_fetch_restrpos function (through its table AM shim)
instead -- heapam handles restoring the mark on its behalf.

The addition of this new heapam_index_fetch_restrpos function seems
strictly better to me. Restoring a mark in v19 meant resetting
xs_vm_items back to 1 -- which is only appropriate during a true
rescan. As in many other areas, the table AM can fully see what's
going on (it has all the relevant context), which allows it to do
exactly the right thing (layering that obscures useful information
about what's really going on is generally something we want to avoid
IMV).
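
A minimal sketch of the reset-versus-restore distinction described above,
assuming a heavily simplified scan state. Every name here (ToyScan, ToyPos,
toy_fetch_reset, toy_fetch_markpos, toy_fetch_restrpos, vm_items) is a
hypothetical stand-in for illustration and is not taken from the patch; only
the behavior "reset empties the ring buffer and puts xs_vm_items back to 1,
restore just overrides scanPos with markPos" comes from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; these are not the patch's structs. */
typedef struct ToyPos
{
	int			batch;			/* which batch in the ring buffer */
	int			item;			/* which item within that batch */
} ToyPos;

typedef struct ToyScan
{
	int			nbatches;		/* batches currently held in the ring buffer */
	ToyPos		scanPos;		/* current scan position */
	ToyPos		markPos;		/* position saved by the last mark */
	bool		markValid;		/* has a mark been taken? */
	int			vm_items;		/* stand-in for xs_vm_items */
} ToyScan;

/* True rescan: drop every batch and start the adaptive state over. */
static void
toy_fetch_reset(ToyScan *scan)
{
	scan->nbatches = 0;
	scan->scanPos = (ToyPos) {0, 0};
	scan->vm_items = 1;
}

/* Take a mark: remember the current position. */
static void
toy_fetch_markpos(ToyScan *scan)
{
	scan->markPos = scan->scanPos;
	scan->markValid = true;
}

/*
 * Happy-path restore: only scanPos changes.  The ring buffer contents and
 * vm_items are deliberately left alone.
 */
static void
toy_fetch_restrpos(ToyScan *scan)
{
	assert(scan->markValid);
	scan->scanPos = scan->markPos;
}
```

With this model, marking at (batch 2, item 5), advancing, and then calling
toy_fetch_restrpos() puts the position back without touching nbatches or
vm_items, whereas toy_fetch_reset() discards both.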

--
Peter Geoghegan


Attachments:

  [application/octet-stream] v20-0016-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch (6.3K, 2-v20-0016-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch)
  download | inline diff:
From 31eeb94f302c2e141c1cafc62e506443c5816c80 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Sun, 22 Mar 2026 15:12:41 -0400
Subject: [PATCH v20 16/17] aio: Fix pgaio_io_wait() for staged IOs (B).

Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
PGAIO_HS_STAGED fell through to waiting for completion.  The owner only
promises to advance it to PGAIO_HS_SUBMITTED.  The waiter needs to be
prepared to call ->wait_one() itself once the IO is submitted in order
to guarantee progress and avoid deadlocks on IO methods that provide
->wait_one().

Introduce a new per-backend condition variable submit_cv, woken by
pgaio_submit_staged(), and use it to wait for the state to advance.  The
new broadcast doesn't seem to cause any measurable slowdown, so ideas
for optimizing the common no-waiters case were abandoned for now.

It may not be possible to reach any real deadlock with existing AIO
users, but that situation could change.  There's also no reason the
waiter shouldn't begin to wait via the IO method as soon as possible
even without a deadlock.

Picked up by testing a proposed IO method that has ->wait_one(), like
io_method=io_uring, and code review.

Backpatch-through: 18
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2BmZYrSdnhk-XrBYO18H829K77S9gMKUsykOiTJtqB43g%40mail.gmail.com
---
 src/include/storage/aio_internal.h            |  7 +++
 src/backend/storage/aio/aio.c                 | 50 ++++++++++++++++---
 src/backend/storage/aio/aio_init.c            |  1 +
 .../utils/activity/wait_event_names.txt       |  1 +
 4 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 33e1e2dc0..deec83bd0 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -214,6 +214,13 @@ typedef struct PgAioBackend
 	uint16		num_staged_ios;
 	PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
 
+	/*
+	 * Other backends sometimes need to wait for the owning backend to submit.
+	 * The per-IO CV would work for that purpose, but a per-backend CV allows
+	 * for just one broadcast per submitted batch.
+	 */
+	ConditionVariable submit_cv;
+
 	/*
 	 * List of in-flight IOs. Also contains IOs that aren't strictly speaking
 	 * in-flight anymore, but have been waited-for and completed by another
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 1a96a8a9a..f0b4a6522 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -571,6 +571,16 @@ pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState
 	return ioh->generation != ref_generation;
 }
 
+/*
+ * Whether we need to wait via the IO method. Don't check via the IO method if
+ * the issuing backend is executing the IO synchronously.
+ */
+static bool
+pgaio_io_needs_wait_one(PgAioHandle *ioh)
+{
+	return pgaio_method_ops->wait_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS);
+}
+
 /*
  * Wait for IO to complete. External code should never use this, outside of
  * the AIO subsystem waits are only allowed via pgaio_wref_wait().
@@ -610,23 +620,38 @@ pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
 				elog(ERROR, "IO in wrong state: %d", state);
 				break;
 
-			case PGAIO_HS_SUBMITTED:
+			case PGAIO_HS_DEFINED:
+			case PGAIO_HS_STAGED:
 
 				/*
-				 * If we need to wait via the IO method, do so now. Don't
-				 * check via the IO method if the issuing backend is executing
-				 * the IO synchronously.
+				 * The owner hasn't submitted the IO yet. If we need to wait
+				 * via the IO method, wait for submission, giving this backend
+				 * the chance to call ->wait_one().
 				 */
-				if (pgaio_method_ops->wait_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS))
+				if (pgaio_io_needs_wait_one(ioh))
+				{
+					PgAioBackend *backend = &pgaio_ctl->backend_state[ioh->owner_procno];
+
+					ConditionVariablePrepareToSleep(&backend->submit_cv);
+					while (!pgaio_io_was_recycled(ioh, ref_generation, &state) &&
+						   (state == PGAIO_HS_DEFINED ||
+							state == PGAIO_HS_STAGED))
+						ConditionVariableSleep(&backend->submit_cv, WAIT_EVENT_AIO_IO_SUBMIT);
+					ConditionVariableCancelSleep();
+					continue;
+				}
+				pg_fallthrough;
+
+			case PGAIO_HS_SUBMITTED:
+
+				/* If we need to wait via the IO method, do so now. */
+				if (pgaio_io_needs_wait_one(ioh))
 				{
 					pgaio_method_ops->wait_one(ioh, ref_generation);
 					continue;
 				}
 				pg_fallthrough;
 
-				/* waiting for owner to submit */
-			case PGAIO_HS_DEFINED:
-			case PGAIO_HS_STAGED:
 				/* waiting for reaper to complete */
 				/* fallthrough */
 			case PGAIO_HS_COMPLETED_IO:
@@ -1179,6 +1204,15 @@ pgaio_submit_staged(void)
 
 	pgaio_my_backend->num_staged_ios = 0;
 
+	/*
+	 * Wake any backend that started waiting for any of these IOs before
+	 * submission, if it is necessary to call ->wait_one() to guarantee
+	 * progress with the configured IO method.  On its side, pgaio_io_wait()
+	 * only waits for submit_cv on IO methods needing that.
+	 */
+	if (pgaio_method_ops->wait_one)
+		ConditionVariableBroadcast(&pgaio_my_backend->submit_cv);
+
 	pgaio_debug(DEBUG4,
 				"aio: submitted %d IOs",
 				total_submitted);
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index d3c68d8b0..c2095ef77 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -184,6 +184,7 @@ AioShmemInit(void)
 
 		dclist_init(&bs->idle_ios);
 		memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+		ConditionVariableInit(&bs->submit_cv);
 		dclist_init(&bs->in_flight_ios);
 
 		/* initialize per-backend IOs */
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 6be80d2da..947135a4b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -197,6 +197,7 @@ ABI_compatibility:
 Section: ClassName - WaitEventIO
 
 AIO_IO_COMPLETION	"Waiting for another process to complete IO."
+AIO_IO_SUBMIT	"Waiting for another process to submit IO."
 AIO_IO_URING_SUBMIT	"Waiting for IO submission via io_uring."
 AIO_IO_URING_EXECUTION	"Waiting for IO execution via io_uring."
 BASEBACKUP_READ	"Waiting for base backup to read from a file."
-- 
2.53.0
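
The waiter-side decision made by the patch above can be sketched as a pure
function. This is a simplified model under stated assumptions, not the real
pgaio_io_wait(): the Toy* enums and function names are hypothetical, the
recycling check and the actual condition variable sleeps are left out, and
"wait for completion" stands in for whatever the remaining cases do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the handle states discussed in the patch. */
typedef enum ToyState
{
	TOY_DEFINED,
	TOY_STAGED,
	TOY_SUBMITTED,
	TOY_COMPLETED_IO
} ToyState;

/* What the waiter does next (names are illustrative only). */
typedef enum ToyAction
{
	TOY_SLEEP_ON_SUBMIT_CV,		/* wait for the owner to submit */
	TOY_CALL_WAIT_ONE,			/* drive the IO via the IO method */
	TOY_WAIT_FOR_COMPLETION		/* wait for someone else to complete it */
} ToyAction;

/* Mirrors the condition factored out as pgaio_io_needs_wait_one(). */
static bool
toy_needs_wait_one(bool method_has_wait_one, bool io_is_synchronous)
{
	return method_has_wait_one && !io_is_synchronous;
}

static ToyAction
toy_wait_action(ToyState state, bool method_has_wait_one,
				bool io_is_synchronous)
{
	bool		needs = toy_needs_wait_one(method_has_wait_one,
										   io_is_synchronous);

	switch (state)
	{
		case TOY_DEFINED:
		case TOY_STAGED:
			/* Not submitted yet: wait for submission, then re-check. */
			if (needs)
				return TOY_SLEEP_ON_SUBMIT_CV;
			return TOY_WAIT_FOR_COMPLETION;
		case TOY_SUBMITTED:
			if (needs)
				return TOY_CALL_WAIT_ONE;
			return TOY_WAIT_FOR_COMPLETION;
		case TOY_COMPLETED_IO:
			return TOY_WAIT_FOR_COMPLETION;
	}
	return TOY_WAIT_FOR_COMPLETION;
}
```

The point of the model is the first case: before the fix, a staged IO on a
method with ->wait_one() went straight to waiting for completion, so nobody
was guaranteed to call ->wait_one() once the IO was submitted.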



  [application/octet-stream] v20-0015-WIP-read-stream-Split-decision-about-look-ahead-.patch (14.8K, 3-v20-0015-WIP-read-stream-Split-decision-about-look-ahead-.patch)
  download | inline diff:
From b0a807d25435ebe72a12ab543b5b7257d867f2ad Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Mon, 30 Mar 2026 12:25:07 -0400
Subject: [PATCH v20 15/17] WIP: read stream: Split decision about look ahead
 for AIO and combining

Previous commits caused a regression due to this conflation. This is a
first attempt at fixing the problem.  Needs significant reordering and
splitting if it works out.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 242 +++++++++++++++++++++-----
 1 file changed, 195 insertions(+), 47 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 144b3613c..1c375edad 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -98,10 +98,14 @@ struct ReadStream
 	int16		max_pinned_buffers;
 	int16		forwarded_buffers;
 	int16		pinned_buffers;
-	int16		distance;
+	/* limit of how far to read ahead for IO combining */
+	int16		combine_distance;
+	/* limit of how far to read ahead for starting IO early */
+	int16		readahead_distance;
 	uint16		distance_decay_holdoff;
 	int16		initialized_buffers;
-	int16		resume_distance;
+	int16		resume_readahead_distance;
+	int16		resume_combine_distance;
 	int			read_buffers_flags;
 	bool		sync_mode;		/* using io_method=sync */
 	bool		batch_mode;		/* READ_STREAM_USE_BATCHING */
@@ -332,8 +336,8 @@ read_stream_start_pending_read(ReadStream *stream)
 
 		/* Shrink distance: no more look-ahead until buffers are released. */
 		new_distance = stream->pinned_buffers + buffer_limit;
-		if (stream->distance > new_distance)
-			stream->distance = new_distance;
+		if (stream->readahead_distance > new_distance)
+			stream->readahead_distance = new_distance;
 
 		/* Unless we have nothing to give the consumer, stop here. */
 		if (stream->pinned_buffers > 0)
@@ -374,12 +378,23 @@ read_stream_start_pending_read(ReadStream *stream)
 		 * perform IO asynchronously when starting out with a small look-ahead
 		 * distance.
 		 */
-		if (stream->distance > 1 && stream->ios_in_progress == 0)
+		if (stream->ios_in_progress == 0)
 		{
-			if (stream->distance_decay_holdoff == 0)
-				stream->distance--;
-			else
+			if (stream->distance_decay_holdoff > 0)
 				stream->distance_decay_holdoff--;
+			else
+			{
+				if (stream->readahead_distance > 1)
+					stream->readahead_distance--;
+
+				/*
+				 * XXX: Should we actually reduce this at any time other than
+				 * a reset? For now we have to, as this is also a condition
+				 * for re-enabling fast_path.
+				 */
+				if (stream->combine_distance > 1)
+					stream->combine_distance--;
+			}
 		}
 	}
 	else
@@ -440,6 +455,101 @@ read_stream_start_pending_read(ReadStream *stream)
 	return true;
 }
 
+/*
+ * Should we continue to perform look ahead?  The look ahead may allow us to
+ * make the pending IO larger via IO combining or to issue more read ahead.
+ */
+static bool
+read_stream_should_look_ahead(ReadStream *stream)
+{
+	/* never start more IOs than our cap */
+	if (stream->ios_in_progress >= stream->max_ios)
+		return false;
+
+	/* If the callback has signaled end-of-stream, we're done */
+	if (stream->readahead_distance <= 0)
+		return false;
+
+	/* never pin more buffers than allowed */
+	if (stream->pinned_buffers + stream->pending_read_nblocks >= stream->max_pinned_buffers)
+		return false;
+
+	/*
+	 * Allow looking further ahead if we are in the process of building a
+	 * larger IO, the IO is not yet big enough and we don't yet have IO in
+	 * flight.  Note that this is allowed even if we are reaching the
+	 * readahead limit (but not the buffer pin limit).
+	 *
+	 * This is important for cases where either effective_io_concurrency is
+	 * low or we never need to wait for IO and thus are not increasing the
+	 * distance. Without this we would end up with lots of small IOs.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		stream->pinned_buffers == 0 &&
+		stream->pending_read_nblocks < stream->combine_distance)
+		return true;
+
+	/*
+	 * Don't start more readahead if that'd put us over the limit for doing
+	 * readahead.
+	 */
+	if (stream->pinned_buffers + stream->pending_read_nblocks >= stream->readahead_distance)
+		return false;
+
+	return true;
+}
+
+
+/*
+ * We don't start the pending read just because we've hit the distance limit,
+ * preferring to give it another chance to grow to full io_combine_limit size
+ * once more buffers have been consumed.  But this is not desirable in all
+ * situations - see below.
+ */
+static bool
+read_stream_should_issue_now(ReadStream *stream)
+{
+	int16		pending_read_nblocks = stream->pending_read_nblocks;
+
+	/* no IO to issue */
+	if (pending_read_nblocks == 0)
+		return false;
+
+	/* never start more IOs than our cap */
+	if (stream->ios_in_progress >= stream->max_ios)
+		return false;
+
+	/*
+	 * If the callback has signaled end-of-stream, start the read
+	 * immediately. There's no deferring it for later.
+	 */
+	if (stream->readahead_distance <= 0)
+		return true;
+
+	/*
+	 * If we've already reached io_combine_limit, there's no chance of growing
+	 * the read further.
+	 */
+	if (pending_read_nblocks >= stream->io_combine_limit)
+		return true;
+
+	/* same if capped not by io_combine_limit but combine_distance */
+	if (stream->combine_distance > 0 &&
+		pending_read_nblocks >= stream->combine_distance)
+		return true;
+
+	/*
+	 * If we currently have no reads in flight or prepared, issue the IO once
+	 * we stopped looking ahead. This ensures there's always at least one IO
+	 * prepared.
+	 */
+	if (stream->pinned_buffers == 0 &&
+		!read_stream_should_look_ahead(stream))
+		return true;
+
+	return false;
+}
+
 static void
 read_stream_look_ahead(ReadStream *stream)
 {
@@ -452,14 +562,13 @@ read_stream_look_ahead(ReadStream *stream)
 	if (stream->batch_mode)
 		pgaio_enter_batchmode();
 
-	while (stream->ios_in_progress < stream->max_ios &&
-		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	while (read_stream_should_look_ahead(stream))
 	{
 		BlockNumber blocknum;
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == stream->io_combine_limit)
+		if (read_stream_should_issue_now(stream))
 		{
 			read_stream_start_pending_read(stream);
 			continue;
@@ -479,7 +588,8 @@ read_stream_look_ahead(ReadStream *stream)
 		if (blocknum == InvalidBlockNumber)
 		{
 			/* End of stream. */
-			stream->distance = 0;
+			stream->readahead_distance = 0;
+			stream->combine_distance = 0;
 			break;
 		}
 
@@ -511,21 +621,13 @@ read_stream_look_ahead(ReadStream *stream)
 	}
 
 	/*
-	 * We don't start the pending read just because we've hit the distance
-	 * limit, preferring to give it another chance to grow to full
-	 * io_combine_limit size once more buffers have been consumed.  However,
-	 * if we've already reached io_combine_limit, or we've reached the
-	 * distance limit and there isn't anything pinned yet, or the callback has
-	 * signaled end-of-stream, we start the read immediately.  Note that the
-	 * pending read can exceed the distance goal, if the latter was reduced
-	 * after hitting the per-backend buffer limit.
+	 * Check if the pending read should be issued now, or if we should give it
+	 * another chance to grow to the full size.
+	 *
+	 * Note that the pending read can exceed the distance goal, if the latter
+	 * was reduced after hitting the per-backend buffer limit.
 	 */
-	if (stream->pending_read_nblocks > 0 &&
-		(stream->pending_read_nblocks == stream->io_combine_limit ||
-		 (stream->pending_read_nblocks >= stream->distance &&
-		  stream->pinned_buffers == 0) ||
-		 stream->distance <= 0) &&
-		stream->ios_in_progress < stream->max_ios)
+	if (read_stream_should_issue_now(stream))
 		read_stream_start_pending_read(stream);
 
 	/*
@@ -534,7 +636,7 @@ read_stream_look_ahead(ReadStream *stream)
 	 * stream.  In the worst case we can always make progress one buffer at a
 	 * time.
 	 */
-	Assert(stream->pinned_buffers > 0 || stream->distance <= 0);
+	Assert(stream->pinned_buffers > 0 || stream->readahead_distance <= 0);
 
 	if (stream->batch_mode)
 		pgaio_exit_batchmode();
@@ -724,10 +826,17 @@ read_stream_begin_impl(int flags,
 	 * doing full io_combine_limit sized reads.
 	 */
 	if (flags & READ_STREAM_FULL)
-		stream->distance = Min(max_pinned_buffers, stream->io_combine_limit);
+	{
+		stream->readahead_distance = Min(max_pinned_buffers, stream->io_combine_limit);
+		stream->combine_distance = stream->io_combine_limit;
+	}
 	else
-		stream->distance = 1;
-	stream->resume_distance = stream->distance;
+	{
+		stream->readahead_distance = 1;
+		stream->combine_distance = 1;
+	}
+	stream->resume_readahead_distance = stream->readahead_distance;
+	stream->resume_combine_distance = stream->combine_distance;
 
 	/*
 	 * Since we always access the same relation, we can initialize parts of
@@ -826,7 +935,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		Assert(stream->ios_in_progress == 0);
 		Assert(stream->forwarded_buffers == 0);
 		Assert(stream->pinned_buffers == 1);
-		Assert(stream->distance == 1);
+		Assert(stream->readahead_distance == 1);
 		Assert(stream->pending_read_nblocks == 0);
 		Assert(stream->per_buffer_data_size == 0);
 		Assert(stream->initialized_buffers > stream->oldest_buffer_index);
@@ -900,7 +1009,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No more blocks, end of stream. */
-			stream->distance = 0;
+			stream->readahead_distance = 0;
 			stream->oldest_buffer_index = stream->next_buffer_index;
 			stream->pinned_buffers = 0;
 			stream->buffers[oldest_buffer_index] = InvalidBuffer;
@@ -916,7 +1025,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
 
 		/* End of stream reached?  */
-		if (stream->distance <= 0)
+		if (stream->readahead_distance <= 0)
 			return InvalidBuffer;
 
 		/*
@@ -930,7 +1039,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		/* End of stream reached? */
 		if (stream->pinned_buffers == 0)
 		{
-			Assert(stream->distance <= 0);
+			Assert(stream->readahead_distance <= 0);
 			return InvalidBuffer;
 		}
 	}
@@ -951,7 +1060,6 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
 	{
 		int16		io_index = stream->oldest_io_index;
-		int32		distance;	/* wider temporary value, clamped below */
 		bool		needed_wait;
 
 		/* Sanity check that we still agree on the buffers. */
@@ -962,7 +1070,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		 * If the stream has been reset, don't even wait for the IO, just
 		 * discard it.
 		 */
-		if (stream->distance < 0)
+		if (stream->readahead_distance < 0)
 		{
 			if (pgaio_wref_valid(&stream->ios[io_index].op.io_wref) &&
 				!stream->ios[io_index].op.foreign_io)
@@ -1011,11 +1119,38 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		 * the stream, as stream->distance == 0 is used to keep track of
 		 * having reached the end.
 		 */
-		if (stream->distance > 0 && needed_wait)
+		if (stream->readahead_distance > 0 && needed_wait)
 		{
-			distance = stream->distance * 2;
-			distance = Min(distance, stream->max_pinned_buffers);
-			stream->distance = distance;
+			/* wider temporary value, due to overflow risk */
+			int32		readahead_distance;
+
+			readahead_distance = stream->readahead_distance * 2;
+			readahead_distance = Min(readahead_distance, stream->max_pinned_buffers);
+			stream->readahead_distance = readahead_distance;
+		}
+
+		/*
+		 * Whether we needed to wait or not, allow for more IO combining if we
+		 * needed to do IO. The reason to do so independent of needing to wait
+		 * is that when the data is resident in the kernel page cache, IO
+		 * combining reduces the syscall / dispatch overhead, making it
+		 * worthwhile regardless of needing to wait.
+		 *
+		 * It is also important with io_uring as it will never signal the need
+		 * to wait for reads if all the data is in the page cache. There are
+		 * heuristics to deal with that in method_io_uring.c, but they only
+		 * work when the IO gets large enough.
+		 */
+		if (stream->combine_distance > 0 &&
+			stream->combine_distance < stream->io_combine_limit)
+		{
+			/* wider temporary value, due to overflow risk */
+			int32		combine_distance;
+
+			combine_distance = stream->combine_distance * 2;
+			combine_distance = Min(combine_distance, stream->io_combine_limit);
+			combine_distance = Min(combine_distance, stream->max_pinned_buffers);
+			stream->combine_distance = combine_distance;
 		}
 
 		/*
@@ -1094,10 +1229,18 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 
 #ifndef READ_STREAM_DISABLE_FAST_PATH
 	/* See if we can take the fast path for all-cached scans next time. */
+	/*
+	 * FIXME: It's way too easy to wrongly fast path. I'm pretty sure there's
+	 * several pre-existing cases where it triggers because we are not issuing
+	 * additional prefetching (e.g. because of a small
+	 * effective_io_concurrency) and thus stream->pinned_buffers stays at 1
+	 * after read_stream_look_ahead().
+	 */
 	if (stream->ios_in_progress == 0 &&
 		stream->forwarded_buffers == 0 &&
 		stream->pinned_buffers == 1 &&
-		stream->distance == 1 &&
+		stream->readahead_distance == 1 &&
+		stream->combine_distance == 1 &&
 		stream->pending_read_nblocks == 0 &&
 		stream->per_buffer_data_size == 0)
 	{
@@ -1143,8 +1286,9 @@ read_stream_next_block(ReadStream *stream, BufferAccessStrategy *strategy)
 BlockNumber
 read_stream_pause(ReadStream *stream)
 {
-	stream->resume_distance = stream->distance;
-	stream->distance = 0;
+	stream->resume_readahead_distance = stream->readahead_distance;
+	stream->resume_combine_distance = stream->combine_distance;
+	stream->readahead_distance = 0;
 	return InvalidBlockNumber;
 }
 
@@ -1156,7 +1300,8 @@ read_stream_pause(ReadStream *stream)
 void
 read_stream_resume(ReadStream *stream)
 {
-	stream->distance = stream->resume_distance;
+	stream->readahead_distance = stream->resume_readahead_distance;
+	stream->combine_distance = stream->resume_combine_distance;
 }
 
 /*
@@ -1172,7 +1317,8 @@ read_stream_reset(ReadStream *stream)
 	Buffer		buffer;
 
 	/* Stop looking ahead. */
-	stream->distance = -1;
+	stream->readahead_distance = -1;
+	stream->combine_distance = -1;
 
 	/* Forget buffered block number and fast path state. */
 	stream->buffered_blocknum = InvalidBlockNumber;
@@ -1204,8 +1350,10 @@ read_stream_reset(ReadStream *stream)
 	Assert(stream->ios_in_progress == 0);
 
 	/* Start off assuming data is cached. */
-	stream->distance = 1;
-	stream->resume_distance = stream->distance;
+	stream->readahead_distance = 1;
+	stream->combine_distance = 1;
+	stream->resume_readahead_distance = stream->readahead_distance;
+	stream->resume_combine_distance = stream->combine_distance;
 	stream->distance_decay_holdoff = 0;
 }
 
-- 
2.53.0
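
The two growth rules introduced by the patch above can be sketched in
isolation. This is an illustrative model only: the toy_* functions are
hypothetical helpers that take the relevant fields as plain arguments, and
they ignore the decay path and the rest of the ReadStream state. The rules
themselves (readahead distance doubles only when a wait was needed, combine
distance doubles after any IO, each with its own cap) follow the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static int32_t
toy_min(int32_t a, int32_t b)
{
	return a < b ? a : b;
}

/*
 * Readahead distance: doubled only if we actually had to wait for the IO,
 * capped at max_pinned_buffers.  A distance <= 0 (end of stream or reset)
 * is left untouched.
 */
static int16_t
toy_grow_readahead(int16_t readahead_distance, int16_t max_pinned_buffers,
				   bool needed_wait)
{
	if (readahead_distance > 0 && needed_wait)
	{
		/* wider temporary value, since doubling can exceed int16 */
		int32_t		d = (int32_t) readahead_distance * 2;

		d = toy_min(d, max_pinned_buffers);
		return (int16_t) d;
	}
	return readahead_distance;
}

/*
 * Combine distance: doubled after any IO, whether or not a wait was
 * needed, capped at both io_combine_limit and max_pinned_buffers.
 */
static int16_t
toy_grow_combine(int16_t combine_distance, int16_t io_combine_limit,
				 int16_t max_pinned_buffers)
{
	if (combine_distance > 0 && combine_distance < io_combine_limit)
	{
		int32_t		d = (int32_t) combine_distance * 2;

		d = toy_min(d, io_combine_limit);
		d = toy_min(d, max_pinned_buffers);
		return (int16_t) d;
	}
	return combine_distance;
}
```

Under this model, a stream whose reads are always satisfied from the kernel
page cache never grows its readahead distance, but its combine distance
still climbs toward io_combine_limit, which is the case the commit message's
comment about syscall overhead is concerned with.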



  [application/octet-stream] v20-0001-Rename-heapam_index_fetch_tuple-argument-for-cla.patch (2.6K, 4-v20-0001-Rename-heapam_index_fetch_tuple-argument-for-cla.patch)
  download | inline diff:
From d750c07e532221884e29a177e6779a2a4dc9f986 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Wed, 1 Apr 2026 17:44:54 -0400
Subject: [PATCH v20 01/17] Rename heapam_index_fetch_tuple argument for
 clarity.

Rename heapam_index_fetch_tuple's call_again argument to heap_continue,
for consistency with the pointed-to variable name (IndexScanDescData's
xs_heap_continue field).

Preparation for an upcoming commit that will move index scan related
heapam functions into their own file.

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh
---
 src/backend/access/heap/heapam_handler.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1be8ea484..dc7db5885 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -126,7 +126,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 						 ItemPointer tid,
 						 Snapshot snapshot,
 						 TupleTableSlot *slot,
-						 bool *call_again, bool *all_dead)
+						 bool *heap_continue, bool *all_dead)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
@@ -135,7 +135,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
 	/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
-	if (!*call_again)
+	if (!*heap_continue)
 	{
 		/* Switch to correct buffer if we don't have it already */
 		Buffer		prev_buf = hscan->xs_cbuf;
@@ -161,7 +161,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 											snapshot,
 											&bslot->base.tupdata,
 											all_dead,
-											!*call_again);
+											!*heap_continue);
 	bslot->base.tupdata.t_self = *tid;
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_UNLOCK);
 
@@ -171,7 +171,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		 * Only in a non-MVCC snapshot can more than one member of the HOT
 		 * chain be visible.
 		 */
-		*call_again = !IsMVCCLikeSnapshot(snapshot);
+		*heap_continue = !IsMVCCLikeSnapshot(snapshot);
 
 		slot->tts_tableOid = RelationGetRelid(scan->rel);
 		ExecStoreBufferHeapTuple(&bslot->base.tupdata, slot, hscan->xs_cbuf);
@@ -179,7 +179,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 	else
 	{
 		/* We've reached the end of the HOT chain. */
-		*call_again = false;
+		*heap_continue = false;
 	}
 
 	return got_heap_tuple;
-- 
2.53.0



  [application/octet-stream] v20-0017-WIP-aio-bufmgr-Fix-race-condition-leading-to-dea.patch (3.1K, 5-v20-0017-WIP-aio-bufmgr-Fix-race-condition-leading-to-dea.patch)
  download | inline diff:
From 5914896082179c4961f6cb985eefba2c978d3826 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Sun, 22 Mar 2026 15:19:08 -0400
Subject: [PATCH v20 17/17] WIP: aio: bufmgr: Fix race condition leading to
 deadlocks with io_uring

If backend A is in the process of starting IO for a buffer, there is a short
period in which the buffer is marked as IO_IN_PROGRESS without having an
associated AIO wait reference. If a backend B does WaitIO() on that buffer,
it'll wait for the buffer's IO condition variable to be set. Most of the time
that is OK: when the IO on the buffer finishes, the CV will be signalled.
However, with io_uring, it is possible that the issuer (A) of the IO never
gets around to doing so, e.g. because it is waiting for something done by B.

To fix that, we need to signal the CV when staging IO. That's annoying as CV
broadcasts are not cheap. So we at least avoid it for the common case of IO
being executed synchronously.

I hope that eventually we can get away from needing multiple systems for
signalling IO completion, but we are clearly not there yet.
---
 src/backend/storage/buffer/bufmgr.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5c6457002..5dbc50542 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -8278,7 +8278,7 @@ MarkDirtyAllUnpinnedBuffers(int32 *buffers_dirtied,
  * replaced while IO is ongoing.
  */
 static pg_attribute_always_inline void
-buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp)
+buffer_stage_common(PgAioHandle *ioh, uint8 cb_data, bool is_write, bool is_temp)
 {
 	uint64	   *io_data;
 	uint8		handle_data_len;
@@ -8377,7 +8377,23 @@ buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp)
 		 * keeps track.
 		 */
 		if (!is_temp)
+		{
 			ResourceOwnerForgetBufferIO(CurrentResourceOwner, buffer);
+
+			/*
+			 * A backend might have started waiting for the IO using the
+			 * buffer's condition variable, but once the IO is submitted, it
+			 * should wait via the AIO subsystem, as a waiter might need to
+			 * complete the IO.
+			 *
+			 * However, doing broadcasts is not free, so we like to avoid it
+			 * when not necessary. If the IO is being executed synchronously,
+			 * this backend will always end up signalling the IOCV without
+			 * further waiting, therefore avoid doing so here.
+			 */
+			if (!(cb_data & READ_BUFFERS_SYNCHRONOUSLY))
+				ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf_hdr));
+		}
 	}
 }
 
@@ -8866,7 +8882,7 @@ buffer_readv_report(PgAioResult result, const PgAioTargetData *td,
 static void
 shared_buffer_readv_stage(PgAioHandle *ioh, uint8 cb_data)
 {
-	buffer_stage_common(ioh, false, false);
+	buffer_stage_common(ioh, cb_data, false, false);
 }
 
 static PgAioResult
@@ -8917,7 +8933,7 @@ shared_buffer_readv_complete_local(PgAioHandle *ioh, PgAioResult prior_result,
 static void
 local_buffer_readv_stage(PgAioHandle *ioh, uint8 cb_data)
 {
-	buffer_stage_common(ioh, false, true);
+	buffer_stage_common(ioh, cb_data, false, true);
 }
 
 static PgAioResult
-- 
2.53.0



  [application/octet-stream] v20-0014-Hacky-implementation-of-making-read_stream_reset.patch (5.1K, 6-v20-0014-Hacky-implementation-of-making-read_stream_reset.patch)
From dcaaaf70f5a3ea7849d28b204ba2b9757d048587 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Thu, 19 Mar 2026 23:06:58 -0400
Subject: [PATCH v20 14/17] Hacky implementation of making
 read_stream_reset()/end() not wait for IO

Not waiting for IO during read_stream_reset() can be important for performance
in cases where read streams are frequently reset before the end is
reached. Current users do not commonly do that, but the upcoming work to use a
read stream to prefetch table blocks as part of index scans can do so
frequently in some query patterns, e.g. if there is an index scan on the inner
side of a nested loop.

FIXME: This implementation is problematic though, as there is nothing forcing
the discarded IOs to ever be completed if the backend goes idle. That's at the
very least problematic because it leads to the underlying buffers continuing
to be pinned and the IOs showing up in the pg_aios view.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/aio.h             |  1 +
 src/backend/storage/aio/aio.c         | 27 ++++++++++++++++++++++
 src/backend/storage/aio/read_stream.c | 32 ++++++++++++++++++++++-----
 3 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index ec543b784..c184e97a9 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -328,6 +328,7 @@ extern int	pgaio_wref_get_id(PgAioWaitRef *iow);
 extern void pgaio_wref_wait(PgAioWaitRef *iow);
 extern bool pgaio_wref_check_done(PgAioWaitRef *iow);
 
+extern void pgaio_wref_discard_result(PgAioWaitRef *iow);
 
 
 /* --------------------------------------------------------------------------------
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 8f7e26607..1a96a8a9a 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -1050,6 +1050,33 @@ pgaio_wref_check_done(PgAioWaitRef *iow)
 	return false;
 }
 
+void
+pgaio_wref_discard_result(PgAioWaitRef *iow)
+{
+	uint64		ref_generation;
+	bool		am_owner;
+	PgAioHandle *ioh;
+	PgAioHandleState state;
+
+	ioh = pgaio_io_from_wref(iow, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+	if (!am_owner)
+		elog(ERROR, "not you");
+
+	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+		return;
+
+	pgaio_debug_io(DEBUG2, ioh,
+				   "discarding result %p",
+				   ioh->report_return);
+
+	if (ioh->resowner)
+		pgaio_io_release_resowner(&ioh->resowner_node, false);
+}
+
+
 
 
 /* --------------------------------------------------------------------------------
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 49971833d..144b3613c 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -524,7 +524,7 @@ read_stream_look_ahead(ReadStream *stream)
 		(stream->pending_read_nblocks == stream->io_combine_limit ||
 		 (stream->pending_read_nblocks >= stream->distance &&
 		  stream->pinned_buffers == 0) ||
-		 stream->distance == 0) &&
+		 stream->distance <= 0) &&
 		stream->ios_in_progress < stream->max_ios)
 		read_stream_start_pending_read(stream);
 
@@ -534,7 +534,7 @@ read_stream_look_ahead(ReadStream *stream)
 	 * stream.  In the worst case we can always make progress one buffer at a
 	 * time.
 	 */
-	Assert(stream->pinned_buffers > 0 || stream->distance == 0);
+	Assert(stream->pinned_buffers > 0 || stream->distance <= 0);
 
 	if (stream->batch_mode)
 		pgaio_exit_batchmode();
@@ -916,7 +916,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
 
 		/* End of stream reached?  */
-		if (stream->distance == 0)
+		if (stream->distance <= 0)
 			return InvalidBuffer;
 
 		/*
@@ -930,7 +930,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		/* End of stream reached? */
 		if (stream->pinned_buffers == 0)
 		{
-			Assert(stream->distance == 0);
+			Assert(stream->distance <= 0);
 			return InvalidBuffer;
 		}
 	}
@@ -958,7 +958,27 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		Assert(stream->ios[io_index].op.buffers ==
 			   &stream->buffers[oldest_buffer_index]);
 
-		needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
+		/*
+		 * If the stream has been reset, don't even wait for the IO, just
+		 * discard it.
+		 */
+		if (stream->distance < 0)
+		{
+			if (pgaio_wref_valid(&stream->ios[io_index].op.io_wref) &&
+				!stream->ios[io_index].op.foreign_io)
+			{
+				pgaio_wref_discard_result(&stream->ios[io_index].op.io_wref);
+				pgaio_wref_clear(&stream->ios[io_index].op.io_wref);
+			}
+			else
+				WaitReadBuffers(&stream->ios[io_index].op);
+
+			needed_wait = false;
+		}
+		else
+		{
+			needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
+		}
 
 		Assert(stream->ios_in_progress > 0);
 		stream->ios_in_progress--;
@@ -1152,7 +1172,7 @@ read_stream_reset(ReadStream *stream)
 	Buffer		buffer;
 
 	/* Stop looking ahead. */
-	stream->distance = 0;
+	stream->distance = -1;
 
 	/* Forget buffered block number and fast path state. */
 	stream->buffered_blocknum = InvalidBlockNumber;
-- 
2.53.0
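
The sign convention that the patch above gives stream->distance can be
sketched in isolation. This is a toy model under stated assumptions, not code
from read_stream.c: the names toy_reset, toy_stream_ended and toy_io_action
are hypothetical, and the two boolean parameters merely stand in for the
pgaio_wref_valid() and foreign_io checks in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: what to do with the oldest in-progress IO. */
typedef enum ToyIoAction
{
	TOY_IO_WAIT,				/* wait for the IO as before */
	TOY_IO_DISCARD				/* drop the result without waiting */
} ToyIoAction;

/*
 * Mark the stream as reset.  The patch stores -1 instead of 0 so that
 * "reset" can be told apart from "end of stream reached" (0).
 */
void
toy_reset(int *distance)
{
	assert(distance != NULL);
	*distance = -1;
}

/* Both 0 (end reached) and negative (reset) mean no more look-ahead. */
bool
toy_stream_ended(int distance)
{
	return distance <= 0;
}

/*
 * Only a reset stream may discard an IO, and only if this backend holds a
 * valid wait reference for an IO that it issued itself.
 */
ToyIoAction
toy_io_action(int distance, bool has_wait_ref, bool foreign_io)
{
	if (distance < 0 && has_wait_ref && !foreign_io)
		return TOY_IO_DISCARD;
	return TOY_IO_WAIT;
}
```

In this model a stream with distance 8 always waits, a stream that hit its end
(distance 0) also waits, and only a reset stream skips the wait for its own
IOs.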



  [application/octet-stream] v20-0013-read_stream-Only-increase-distance-when-waiting-.patch (4.2K, 7-v20-0013-read_stream-Only-increase-distance-when-waiting-.patch)
From bdc674ad32f14fa90b10db5fb3327e767b541219 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 18:00:53 -0500
Subject: [PATCH v20 13/17] read_stream: Only increase distance when waiting
 for IO

This avoids increasing the distance to the maximum in cases where the IO
subsystem is already keeping up. This turns out to be important for
performance for two reasons:

- Pinning a lot of buffers is not cheap. If additional pins allow us to avoid
  IO waits, it's definitely worth it, but if we can already do all the
  necessary readahead at a distance of 16, reading ahead 512 buffers can
  increase the CPU overhead substantially.  This is particularly noticeable
  when the to-be-read blocks are already in the kernel page cache.

- If the read stream is read to completion, reading in data earlier than
  needed is of limited consequence, leaving aside the CPU costs mentioned
  above. But if the read stream will not be fully consumed, e.g. because it is
  on the inner side of a nested loop join, the additional IO can be a serious
  performance issue. This is not that commonly a problem for current read
  stream users, but the upcoming work, to use a read stream to fetch table
  pages as part of an index scan, frequently encounters this.

Note that this commit would have substantial performance downsides without
earlier commits. In particular the earlier commit to avoid decreasing the
readahead distance when there was recent IO is crucial, as otherwise we very
often would end up not reading ahead aggressively enough anymore with this
commit, due to increasing the distance less often.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 39 +++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index fa27ec792..49971833d 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -952,22 +952,51 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	{
 		int16		io_index = stream->oldest_io_index;
 		int32		distance;	/* wider temporary value, clamped below */
+		bool		needed_wait;
 
 		/* Sanity check that we still agree on the buffers. */
 		Assert(stream->ios[io_index].op.buffers ==
 			   &stream->buffers[oldest_buffer_index]);
 
-		WaitReadBuffers(&stream->ios[io_index].op);
+		needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
 
 		Assert(stream->ios_in_progress > 0);
 		stream->ios_in_progress--;
 		if (++stream->oldest_io_index == stream->max_ios)
 			stream->oldest_io_index = 0;
 
-		/* Look-ahead distance ramps up rapidly after we do I/O. */
-		distance = stream->distance * 2;
-		distance = Min(distance, stream->max_pinned_buffers);
-		stream->distance = distance;
+		/*
+		 * If the IO was executed synchronously, we will never see
+		 * WaitReadBuffers() block. This is particularly crucial when
+		 * effective_io_concurrency=0 is used, as all IO will be
+		 * synchronous. Without treating synchronous IO as having waited, we'd
+		 * never allow the distance to get large enough to allow for IO
+		 * combining, resulting in bad performance.
+		 */
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_SYNCHRONOUSLY)
+			needed_wait = true;
+
+		/*
+		 * Have the look-ahead distance ramp up rapidly after we needed to
+		 * wait for IO. We only increase the distance when we needed to wait,
+		 * to avoid increasing the distance further than necessary, as looking
+		 * ahead too far can be costly, both due to the cost of unnecessarily
+		 * pinning many buffers and due to doing IOs that may never be
+		 * consumed if the stream is ended/reset before completion.
+		 *
+		 * If we did not need to wait, the current distance was evidently
+		 * sufficient.
+		 *
+		 * NB: May not increase the distance if we already reached the end of
+		 * the stream, as stream->distance == 0 is used to keep track of
+		 * having reached the end.
+		 */
+		if (stream->distance > 0 && needed_wait)
+		{
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
 
 		/*
 		 * As we needed IO, prevent distance from being reduced within our
-- 
2.53.0
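
The distance update rule from the patch above, pulled out into a standalone
function so it can be exercised by hand. This is only a sketch: the function
toy_next_distance and its parameters are hypothetical, with max_distance
standing in for stream->max_pinned_buffers and io_was_synchronous for the
READ_BUFFERS_SYNCHRONOUSLY flag test.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return the look-ahead distance to use after one IO has been consumed.
 *
 * needed_wait:        did the consumer actually block on the IO?
 * io_was_synchronous: was the IO executed synchronously?  Such IO never
 *                     blocks at wait time, so it is counted as a wait.
 */
int
toy_next_distance(int distance, int max_distance,
				  bool needed_wait, bool io_was_synchronous)
{
	assert(max_distance >= 1);

	if (io_was_synchronous)
		needed_wait = true;

	/*
	 * distance == 0 records that the end of the stream was reached, so it
	 * must never be increased again.
	 */
	if (distance > 0 && needed_wait)
	{
		distance *= 2;
		if (distance > max_distance)
			distance = max_distance;
	}

	return distance;
}
```

With a maximum of 64, a distance of 4 becomes 8 after a wait, stays at 4 when
the IO had already completed, and a distance of 48 is clamped to 64.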



  [application/octet-stream] v20-0010-read_stream-Issue-IO-synchronously-while-in-fast.patch (2.7K, 8-v20-0010-read_stream-Issue-IO-synchronously-while-in-fast.patch)
From 2fa7b7b7f92836f400bea35bc908990efe5fa2db Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 16:25:41 -0500
Subject: [PATCH v20 10/17] read_stream: Issue IO synchronously while in fast
 path

While in fast-path, execute any IO that we might encounter synchronously.
Because we are, in that moment, not reading ahead, dispatching any occasional
IO to workers incurs the dispatch overhead, without any realistic chance of the
IO completing before we need it.

This helps io_method=worker performance for workloads that have only
occasional cache misses, but where those occasional misses still take long
enough to matter.  It is likely this is only measurable with fast local
storage or workloads with the data in the kernel page cache, as with remote
storage the IO latency, not the dispatch-to-worker latency, is the determining
factor.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index cd54c1a74..7893fdf03 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -833,6 +833,21 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			if (stream->advice_enabled)
 				flags |= READ_BUFFERS_ISSUE_ADVICE;
 
+			/*
+			 * While in fast-path, execute any IO that we might encounter
+			 * synchronously. Because we are, right now, only looking one
+			 * block ahead, dispatching any occasional IO to workers would
+			 * have the overhead of dispatching to workers, without any
+			 * realistic chance of the IO completing before we need it. We
+			 * will switch to non-synchronous IO after this.
+			 *
+			 * Arguably we should do so only for worker, as there's far less
+			 * dispatch overhead with io_uring. However, tests so far have not
+			 * shown a clear downside and additional io_method awareness here
+			 * seems not great from an abstraction POV.
+			 */
+			flags |= READ_BUFFERS_SYNCHRONOUSLY;
+
 			/*
 			 * Pin a buffer for the next call.  Same buffer entry, and
 			 * arbitrary I/O entry (they're all free).  We don't have to
@@ -860,6 +875,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			stream->ios_in_progress = 1;
 			stream->ios[0].buffer_index = oldest_buffer_index;
 			stream->seq_blocknum = next_blocknum + 1;
+
+			/*
+			 * XXX: It might be worth triggering additional readahead here, to
+			 * avoid having to effectively do another synchronous IO for the
+			 * next block (if it were also a miss).
+			 */
 		}
 		else
 		{
-- 
2.53.0



  [application/octet-stream] v20-0012-aio-io_uring-Trigger-async-processing-for-large-.patch (7.2K, 9-v20-0012-aio-io_uring-Trigger-async-processing-for-large-.patch)
From 95bf614fe55cf509c03ec2cbe7e191bf0e270a51 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 3 Mar 2026 20:23:55 -0500
Subject: [PATCH v20 12/17] aio: io_uring: Trigger async processing for large
 IOs

io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
once the IO queue depth is high enough. That heuristic is important when doing
buffered IO from the kernel page cache, to allow parallelizing of the memory
copy, as otherwise io_method=io_uring would be a lot slower than
io_method=worker in that case.

An upcoming commit will make read_stream.c only increase the readahead
distance if we needed to wait for IO to complete. If to-be-read data is in the
kernel page cache, io_uring will synchronously execute IO, unless the IO is
flagged as async.  Therefore the aforementioned change in read_stream.c
heuristic would lead to a substantial performance regression with io_uring
when data is in the page cache, as we would never reach a deep enough queue to
actually trigger the existing heuristic.

Parallelizing the copy from the page cache is mainly important when doing a
lot of IO, which commonly is only possible when doing largely sequential IO.

The reason we don't just mark all io_uring IOs as asynchronous is that the
dispatch to a kernel thread has overhead. This overhead is mostly noticeable
with small random IOs with a low queue depth, as in that case the gain from
parallelizing the memory copy is small and the latency cost high.

The facts from the two prior paragraphs show a way out: Use the size of the IO
in addition to the depth of the queue to trigger asynchronous processing.

One might think that just using the IO size might be enough, but
experimentation has shown that not to be the case: with deep look-ahead
distances, being able to parallelize the memory copy is important even with
smaller IOs.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/method_io_uring.c | 90 +++++++++++++++++------
 1 file changed, 68 insertions(+), 22 deletions(-)

diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
index 39984df31..0c8fe4598 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -409,7 +409,6 @@ static int
 pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	struct io_uring *uring_instance = &pgaio_my_uring_context->io_uring_ring;
-	int			in_flight_before = dclist_count(&pgaio_my_backend->in_flight_ios);
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -425,27 +424,6 @@ pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 
 		pgaio_io_prepare_submit(ioh);
 		pgaio_uring_sq_from_io(ioh, sqe);
-
-		/*
-		 * io_uring executes IO in process context if possible. That's
-		 * generally good, as it reduces context switching. When performing a
-		 * lot of buffered IO that means that copying between page cache and
-		 * userspace memory happens in the foreground, as it can't be
-		 * offloaded to DMA hardware as is possible when using direct IO. When
-		 * executing a lot of buffered IO this causes io_uring to be slower
-		 * than worker mode, as worker mode parallelizes the copying. io_uring
-		 * can be told to offload work to worker threads instead.
-		 *
-		 * If an IO is buffered IO and we already have IOs in flight or
-		 * multiple IOs are being submitted, we thus tell io_uring to execute
-		 * the IO in the background. We don't do so for the first few IOs
-		 * being submitted as executing in this process' context has lower
-		 * latency.
-		 */
-		if (in_flight_before > 4 && (ioh->flags & PGAIO_HF_BUFFERED))
-			io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
-
-		in_flight_before++;
 	}
 
 	while (true)
@@ -701,10 +679,64 @@ pgaio_uring_check_one(PgAioHandle *ioh, uint64 ref_generation)
 	LWLockRelease(&owner_context->completion_lock);
 }
 
+/*
+ * io_uring executes IO in process context if possible. That's generally good,
+ * as it reduces context switching. When performing a lot of buffered IO that
+ * means that copying between page cache and userspace memory happens in the
+ * foreground, as it can't be offloaded to DMA hardware as is possible when
+ * using direct IO. When executing a lot of buffered IO this causes io_uring
+ * to be slower than worker mode, as worker mode parallelizes the
+ * copying. io_uring can be told to offload work to worker threads instead.
+ *
+ * If the IOs are small, we only benefit from forcing things into the
+ * background if there is a lot of IO, as otherwise the overhead from context
+ * switching is higher than the gain.
+ *
+ * If IOs are large, there is benefit from asynchronous processing at lower
+ * queue depths, as IO latency is less of a crucial factor and parallelizing
+ * memory copies is more important.  In addition, it is important to trigger
+ * asynchronous processing even at low queue depth, as with foreground
+ * processing we might never actually reach deep enough IO depths to trigger
+ * asynchronous processing, which in turn would deprive readahead control
+ * logic of information about whether a deeper look-ahead distance would be
+ * advantageous.
+ *
+ * We have done some basic benchmarking to validate the thresholds used, but
+ * it's quite plausible that there are better values.
+ */
+static bool
+pgaio_uring_should_use_async(PgAioHandle *ioh, size_t io_size)
+{
+	/*
+	 * With DIO there's no benefit from forcing asynchronous processing, as
+	 * io_uring will never process IO completions in the foreground. The
+	 * kernel will use worker threads where appropriate.
+	 */
+	if (!(ioh->flags & PGAIO_HF_BUFFERED))
+		return false;
+
+	/*
+	 * Once the IO queue depth is not that shallow anymore, the overhead of
+	 * dispatching to the background is a less significant factor.
+	 */
+	if (dclist_count(&pgaio_my_backend->in_flight_ios) > 4)
+		return true;
+
+	/*
+	 * If the IO is larger, the gains from parallelizing the memory copy are
+	 * larger and typically the impact of the latency is smaller.
+	 */
+	if (io_size >= (BLCKSZ * 4))
+		return true;
+
+	return false;
+}
+
 static void
 pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 {
 	struct iovec *iov;
+	size_t		io_size = 0;
 
 	switch ((PgAioOp) ioh->op)
 	{
@@ -717,6 +749,8 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 								   iov->iov_base,
 								   iov->iov_len,
 								   ioh->op_data.read.offset);
+
+				io_size = iov->iov_len;
 			}
 			else
 			{
@@ -726,7 +760,13 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 									ioh->op_data.read.iov_length,
 									ioh->op_data.read.offset);
 
+				for (int i = 0; i < ioh->op_data.read.iov_length; i++, iov++)
+					io_size += iov->iov_len;
 			}
+
+			if (pgaio_uring_should_use_async(ioh, io_size))
+				io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
+
 			break;
 
 		case PGAIO_OP_WRITEV:
@@ -747,6 +787,12 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
 									 ioh->op_data.write.iov_length,
 									 ioh->op_data.write.offset);
 			}
+
+			/*
+			 * For now don't trigger use of IOSQE_ASYNC for writes, it's not
+			 * clear there is a performance benefit in doing so.
+			 */
+
 			break;
 
 		case PGAIO_OP_INVALID:
-- 
2.53.0
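
To make the two-part heuristic from the patch above easier to play with, here
is a self-contained sketch that needs neither liburing nor the PostgreSQL
headers. The toy_ names and TOY_ constants are hypothetical. TOY_BLCKSZ assumes
the default 8kB block size, and the in_flight argument stands in for
dclist_count(&pgaio_my_backend->in_flight_ios).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

#define TOY_BLCKSZ			8192	/* assumption: default block size */
#define TOY_DEPTH_THRESHOLD	4		/* same cutoff as in the patch */
#define TOY_SIZE_THRESHOLD	(TOY_BLCKSZ * 4)

/* Sum the lengths of all segments of a vectored IO. */
size_t
toy_iov_total(const struct iovec *iov, int iovcnt)
{
	size_t		total = 0;

	assert(iovcnt >= 0);
	for (int i = 0; i < iovcnt; i++)
		total += iov[i].iov_len;
	return total;
}

/* Decide whether a read should be forced into the background. */
bool
toy_should_use_async(bool buffered, int in_flight, size_t io_size)
{
	/* Direct IO gains nothing from being forced asynchronous. */
	if (!buffered)
		return false;

	/* Deep queue: the dispatch overhead matters less. */
	if (in_flight > TOY_DEPTH_THRESHOLD)
		return true;

	/* Large IO: parallelizing the memory copy pays off. */
	if (io_size >= TOY_SIZE_THRESHOLD)
		return true;

	return false;
}
```

A single-block buffered read with nothing in flight stays in the foreground,
while a four-block read, or any buffered read issued with more than four IOs
already in flight, would be flagged for asynchronous processing.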



  [application/octet-stream] v20-0011-read_stream-Prevent-distance-from-decaying-too-q.patch (5.0K, 10-v20-0011-read_stream-Prevent-distance-from-decaying-too-q.patch)
From 66bbe25ccd7f2af335970a11e808da9cc9ac9b85 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Sat, 28 Mar 2026 10:51:44 -0400
Subject: [PATCH v20 11/17] read_stream: Prevent distance from decaying too
 quickly

Until now we reduced the look-ahead distance by 1 on every hit, and doubled it
on every miss. That is problematic because there are very common IO patterns
where this prevents us from ever reaching a sufficiently high distance (e.g. a
miss followed by a hit will never have the distance grow beyond 2). In many
such cases, if we had ever reached a sufficient look-ahead distance, things
would have been fine, because we grow the distance faster than we decrease it.

One might think that the most obvious answer to this problem would be to never
reduce the distance. However, that would not work well, as (particularly with
upcoming users of read streams), it is reasonably common to at first have a
lot of misses and then to transition to a fully cached workload, e.g. because
the same blocks are needed repeatedly within one stream. Doing unnecessarily
deep readahead can be costly, due to having to pin a lot more buffers, which
increases CPU overhead.

Because the cost of a synchronously handled miss can be very high (multiple
milliseconds for every IO with commonly used storage) compared to the CPU
overhead of keeping the distance too high, we want to err on the side of not
reducing the distance too early.

The insight that a decrease of the distance by 1 at every hit may be ok at
large distances, but not at low distances, shows a way out: If we only allow
decreasing the distance once there were no misses for our maximum look-ahead
distance, we will keep the distance high as long as readahead has a chance to
do IO asynchronously, but usually not once it no longer does.

Several folks have written variants of this patch, including at least Thomas
Munro, Melanie Plageman and me.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/read_stream.c | 36 ++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 7893fdf03..fa27ec792 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -99,6 +99,7 @@ struct ReadStream
 	int16		forwarded_buffers;
 	int16		pinned_buffers;
 	int16		distance;
+	uint16		distance_decay_holdoff;
 	int16		initialized_buffers;
 	int16		resume_distance;
 	int			read_buffers_flags;
@@ -364,9 +365,22 @@ read_stream_start_pending_read(ReadStream *stream)
 	/* Remember whether we need to wait before returning this buffer. */
 	if (!need_wait)
 	{
-		/* Look-ahead distance decays, no I/O necessary. */
-		if (stream->distance > 1)
-			stream->distance--;
+		/*
+		 * If there currently is no IO in progress, and we have not needed to
+		 * issue IO recently, decay the look-ahead distance.  We detect if we
+		 * had to issue IO recently by having a decay holdoff that's set to
+		 * the max look-ahead distance whenever we need to do IO.  This is
+		 * important to ensure we eventually reach a high enough distance to
+		 * perform IO asynchronously when starting out with a small look-ahead
+		 * distance.
+		 */
+		if (stream->distance > 1 && stream->ios_in_progress == 0)
+		{
+			if (stream->distance_decay_holdoff == 0)
+				stream->distance--;
+			else
+				stream->distance_decay_holdoff--;
+		}
 	}
 	else
 	{
@@ -702,6 +716,7 @@ read_stream_begin_impl(int flags,
 	stream->seq_blocknum = InvalidBlockNumber;
 	stream->seq_until_processed = InvalidBlockNumber;
 	stream->temporary = SmgrIsTemp(smgr);
+	stream->distance_decay_holdoff = 0;
 
 	/*
 	 * Skip the initial ramp-up phase if the caller says we're going to be
@@ -954,6 +969,20 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		distance = Min(distance, stream->max_pinned_buffers);
 		stream->distance = distance;
 
+		/*
+		 * As we needed IO, prevent distance from being reduced within our
+		 * maximum look-ahead window. This avoids having distance collapse too
+		 * quickly in workloads where most of the required blocks are cached,
+		 * but where the remaining IOs are a sufficient enough factor to cause
+		 * a substantial slowdown if executed synchronously.
+		 *
+		 * There are valid arguments for preventing decay for max_ios or for
+		 * max_pinned_buffers.  But the argument for max_pinned_buffers seems
+		 * clearer - if we can't see any misses within the maximum look-ahead
+		 * distance, we can't do any useful readahead.
+		 */
+		stream->distance_decay_holdoff = stream->max_pinned_buffers;
+
 		/*
 		 * If we've reached the first block of a sequential region we're
 		 * issuing advice for, cancel that until the next jump.  The kernel
@@ -1128,6 +1157,7 @@ read_stream_reset(ReadStream *stream)
 	/* Start off assuming data is cached. */
 	stream->distance = 1;
 	stream->resume_distance = stream->distance;
+	stream->distance_decay_holdoff = 0;
 }
 
 /*
-- 
2.53.0
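
The miss-then-hit pattern from the commit message above can be reproduced with
a small toy model that compares the old decay rule with the holdoff rule. It is
a rough sketch, not read_stream.c: ToyStream and the toy_ functions are
hypothetical, max_distance stands in for max_pinned_buffers, and the model
leaves out the ios_in_progress == 0 condition and all buffer pinning.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; not the real ReadStream struct. */
typedef struct ToyStream
{
	int			distance;		/* current look-ahead distance, >= 1 */
	int			max_distance;	/* upper bound for distance */
	int			holdoff;		/* hits left before decay may resume */
	bool		use_holdoff;	/* false: old "decay on every hit" rule */
} ToyStream;

void
toy_init(ToyStream *s, int max_distance, bool use_holdoff)
{
	assert(max_distance >= 1);
	s->distance = 1;
	s->max_distance = max_distance;
	s->holdoff = 0;
	s->use_holdoff = use_holdoff;
}

/* The block was cached: no IO was necessary. */
void
toy_on_hit(ToyStream *s)
{
	if (s->distance <= 1)
		return;

	if (!s->use_holdoff || s->holdoff == 0)
		s->distance--;
	else
		s->holdoff--;
}

/* The block needed IO: double the distance and re-arm the holdoff. */
void
toy_on_miss(ToyStream *s)
{
	int			d = s->distance * 2;

	s->distance = (d < s->max_distance) ? d : s->max_distance;
	if (s->use_holdoff)
		s->holdoff = s->max_distance;
}

/* Feed "miss, hit" pairs into a fresh stream, return the final distance. */
int
toy_run_alternating(int max_distance, bool use_holdoff, int pairs)
{
	ToyStream	s;

	toy_init(&s, max_distance, use_holdoff);
	for (int i = 0; i < pairs; i++)
	{
		toy_on_miss(&s);
		toy_on_hit(&s);
	}
	return s.distance;
}
```

Under the old rule the distance bounces between 1 and 2 for this pattern. With
the holdoff, each miss doubles the distance until it reaches the maximum, and
decay only resumes after max_distance consecutive hits.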



  [application/octet-stream] v20-0009-Make-hash-index-AM-use-amgetbatch-interface.patch (47.3K, 11-v20-0009-Make-hash-index-AM-use-amgetbatch-interface.patch)
From efca9a1164090d6e69b63ba238731f9e72a329f3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 25 Nov 2025 18:03:15 -0500
Subject: [PATCH v20 09/17] Make hash index AM use amgetbatch interface.

Replace hashgettuple with hashgetbatch, a function that implements the
new amgetbatch interface added by commit FIXME.  Plain index scans of
hash indexes now return matching items in batches consisting of all of
the matches from a given bucket or overflow page.  This gives the table
AM the ability to perform optimizations like index prefetching during
hash index scans.

batchImmediateUnguard is always true during hash index scans (since hash
doesn't support index-only scans and always uses an MVCC snapshot), so
the buffer pin interlock held on each batch is always dropped at unlock
time -- before the batch is even returned to the table AM.  This differs
from hash's previous approach of holding on to a pin for at least as
long as the scan was stopped on the pinned index page.

Guaranteeing that returned batches hold no buffer pins on index pages
greatly simplifies resource management during index prefetching, where
the read stream is expected to hold many pins on heap pages.  The
amgetbatch interface requires that index AMs take the same standardized
approach to pin management for pins that are used to prevent unsafe
concurrent TID recycling by VACUUM (that way prefetching can hold open
multiple batches without it affecting the read stream).  Note, however,
that hash still holds on to pins needed for its own internal purposes
(e.g., it'll still hold onto a pin during a bucket split).

hashkillitemsbatch (the hash implementation of the new amkillitemsbatch
interface) performs LP_DEAD marking of dead index entries, while
following slightly different rules to the old approach.  It relies on
comparing the batch's saved LSN against the current page LSN to detect
concurrent page modifications, which in turn requires fake LSN support
for unlogged relations.  Preparatory commit e5836f7b added that support
to the hash index AM.

hashunguardbatch is also provided, even though it is not currently
exercised: hash does not support index-only scans (and always uses an
MVCC snapshot), so batchImmediateUnguard is always true for hash index
scans with heapam (as noted already, in practice every batch's lock and
pin are released at unlock time during hash scans).  The callback is a
requirement for amgetbatch index AMs.

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com
---
 src/include/access/hash.h            |  81 ++-----
 src/backend/access/hash/README       |  31 +--
 src/backend/access/hash/hash.c       | 210 ++++++++++------
 src/backend/access/hash/hash_xlog.c  |   4 +-
 src/backend/access/hash/hashpage.c   |  21 +-
 src/backend/access/hash/hashsearch.c | 345 ++++++++++++---------------
 src/backend/access/hash/hashutil.c   | 129 +---------
 doc/src/sgml/indexam.sgml            |  21 +-
 src/tools/pgindent/typedefs.list     |   3 +-
 9 files changed, 339 insertions(+), 506 deletions(-)

diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index a8702f0e5..e9ed795f4 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -18,6 +18,7 @@
 #define HASH_H
 
 #include "access/amapi.h"
+#include "access/indexbatch.h"
 #include "access/itup.h"
 #include "access/sdir.h"
 #include "catalog/pg_am_d.h"
@@ -100,57 +101,18 @@ typedef HashPageOpaqueData *HashPageOpaque;
  */
 #define HASHO_PAGE_ID		0xFF80
 
-typedef struct HashScanPosItem	/* what we remember about each match */
+/* Per-batch data private to the hash index AM */
+typedef struct HashBatchData
 {
-	ItemPointerData heapTid;	/* TID of referenced heap item */
-	OffsetNumber indexOffset;	/* index item's location within page */
-} HashScanPosItem;
+	Buffer		buf;			/* index page buffer pin */
+	BlockNumber currPage;		/* index page with matching items */
+	BlockNumber prevPage;		/* currPage's left link */
+	BlockNumber nextPage;		/* currPage's right link */
+} HashBatchData;
 
-typedef struct HashScanPosData
-{
-	Buffer		buf;			/* if valid, the buffer is pinned */
-	BlockNumber currPage;		/* current hash index page */
-	BlockNumber nextPage;		/* next overflow page */
-	BlockNumber prevPage;		/* prev overflow or bucket page */
-
-	/*
-	 * The items array is always ordered in index order (ie, increasing
-	 * indexoffset).  When scanning backwards it is convenient to fill the
-	 * array back-to-front, so we start at the last slot and fill downwards.
-	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
-	 * itemIndex is a cursor showing which entry was last returned to caller.
-	 */
-	int			firstItem;		/* first valid index in items[] */
-	int			lastItem;		/* last valid index in items[] */
-	int			itemIndex;		/* current index in items[] */
-
-	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
-} HashScanPosData;
-
-#define HashScanPosIsPinned(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BufferIsValid((scanpos).buf) \
-)
-
-#define HashScanPosIsValid(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BlockNumberIsValid((scanpos).currPage) \
-)
-
-#define HashScanPosInvalidate(scanpos) \
-	do { \
-		(scanpos).buf = InvalidBuffer; \
-		(scanpos).currPage = InvalidBlockNumber; \
-		(scanpos).nextPage = InvalidBlockNumber; \
-		(scanpos).prevPage = InvalidBlockNumber; \
-		(scanpos).firstItem = 0; \
-		(scanpos).lastItem = 0; \
-		(scanpos).itemIndex = 0; \
-	} while (0)
+/* Access the hash-private per-batch data from an IndexScanBatch pointer */
+#define HashBatchGetData(scan, batch) \
+	indexam_util_batch_get_amdata(scan, batch, HashBatchData)
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -178,15 +140,6 @@ typedef struct HashScanOpaqueData
 	 * referred only when hashso_buc_populated is true.
 	 */
 	bool		hashso_buc_split;
-	/* info about killed items if any (killedItems is NULL if never used) */
-	int		   *killedItems;	/* currPos.items indexes of killed items */
-	int			numKilled;		/* number of currently stored items */
-
-	/*
-	 * Identify all the matching items on a page and save them in
-	 * HashScanPosData
-	 */
-	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -368,11 +321,15 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
 					   IndexUniqueCheck checkUnique,
 					   bool indexUnchanged,
 					   struct IndexInfo *indexInfo);
-extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch hashgetbatch(IndexScanDesc scan,
+								   IndexScanBatch priorbatch,
+								   ScanDirection dir);
 extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
+extern void hashkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch);
+extern void hashunguardbatch(IndexScanDesc scan, IndexScanBatch batch);
 extern void hashendscan(IndexScanDesc scan);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
@@ -445,8 +402,9 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 							   uint32 lowmask);
 
 /* hashsearch.c */
-extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _hash_next(IndexScanDesc scan, ScanDirection dir,
+								 IndexScanBatch priorbatch);
+extern IndexScanBatch _hash_first(IndexScanDesc scan, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
@@ -476,7 +434,6 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
 extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
 extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 												 uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan);
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index fc9031117..972bb666b 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -255,28 +255,29 @@ The reader algorithm is:
 		retake the buffer content lock on new bucket
 		arrange to scan the old bucket normally and the new bucket for
          tuples which are not moved-by-split
--- then, per read request:
+-- then, per batch (page) request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
 	the pin on the primary bucket throughout the scan)
-	save all the matching tuples from current index page into an items array
-	release pin and content lock (but if it is primary bucket page retain
-	its pin till the end of the scan)
-	get tuple from an item array
+	save all the matching tuples from current index page into a batch
+	release content lock on current page
+	return batch to table AM (table AM will drop batch's buffer pin, though
+	primary bucket page pin is kept until the end of the scan)
 -- at scan shutdown:
-	release all pins still held
+	release scan-owned pins (e.g., primary bucket page pin) as needed
 
 Holding the buffer pin on the primary bucket page for the whole scan prevents
-the reader's current-tuple pointer from being invalidated by splits or
-compactions.  (Of course, other buckets can still be split or compacted.)
+the bucket from being reorganized by splits or compactions while the scan is
+in progress.  (Of course, other buckets can still be split or compacted.)
 
-To minimize lock/unlock traffic, hash index scan always searches the entire
-hash page to identify all the matching items at once, copying their heap tuple
-IDs into backend-local storage. The heap tuple IDs are then processed while not
-holding any page lock within the index thereby, allowing concurrent insertion
-to happen on the same index page without any requirement of re-finding the
-current scan position for the reader. We do continue to hold a pin on the
-bucket page, to protect against concurrent deletions and bucket split.
+To minimize lock/unlock traffic, hash index scans always search the entire
+hash page to identify all the matching items at once, returning them in
+batches to the table AM.  The table AM processes batches while no page lock
+is held within the index, allowing concurrent insertion to happen on the
+same index page without any requirement of re-finding the current scan
+position for the reader.  The table AM controls when batch buffer pins are
+dropped.  We do continue to hold a pin on the primary bucket page, to
+protect against concurrent bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
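
The per-batch reader steps in the README hunk above can be sketched as a toy model. This is only an illustration under stated assumptions, not code from the patch: ToyPage, ToyBatch, toy_read_page and toy_get_batch are hypothetical names, pages live in a plain array, and locking and pinning are left out entirely. It shows just how the prevPage/nextPage links saved in the prior batch pick the next page to read, and how pages without matches are skipped.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_INVALID_BLOCK (-1)
#define TOY_MAX_ITEMS 4

/* Hypothetical stand-in for one page of a bucket chain */
typedef struct ToyPage
{
	int			prev;		/* left link, invalid on the bucket page */
	int			next;		/* right link, invalid on the last page */
	int			nitems;
	unsigned	hashes[TOY_MAX_ITEMS];
} ToyPage;

/* Hypothetical stand-in for the per-batch links kept in HashBatchData */
typedef struct ToyBatch
{
	int			currPage;
	int			prevPage;
	int			nextPage;
	int			nmatches;
} ToyBatch;

/*
 * Starting at blkno, find the first page holding at least one match and
 * describe it in *batch.  Pages without matches are skipped in the scan
 * direction.  Returns 1 if a batch was filled, 0 at the end of the chain.
 */
static int
toy_read_page(const ToyPage *pages, int blkno, unsigned key, int forward,
			  ToyBatch *batch)
{
	while (blkno != TOY_INVALID_BLOCK)
	{
		const ToyPage *p = &pages[blkno];
		int			n = 0;

		for (int i = 0; i < p->nitems; i++)
			if (p->hashes[i] == key)
				n++;

		if (n > 0)
		{
			batch->currPage = blkno;
			batch->prevPage = p->prev;
			batch->nextPage = p->next;
			batch->nmatches = n;
			return 1;
		}
		blkno = forward ? p->next : p->prev;
	}
	return 0;
}

/*
 * prior == NULL asks for the first batch.  Simplification: the toy always
 * starts at the bucket page, so only forward scans get a sensible first
 * batch here.  Otherwise step from the prior batch's saved link.
 */
static int
toy_get_batch(const ToyPage *pages, int bucket_blkno, unsigned key,
			  const ToyBatch *prior, int forward, ToyBatch *out)
{
	int			blkno;

	if (prior == NULL)
		blkno = bucket_blkno;
	else
		blkno = forward ? prior->nextPage : prior->prevPage;

	return toy_read_page(pages, blkno, key, forward, out);
}
```

Because the links are copied into the batch while the page is still being examined, the next request never needs to revisit the prior page to find out where to go.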
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 2e32be233..65f38c93e 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -113,10 +113,10 @@ hashhandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = hashadjustmembers,
 		.ambeginscan = hashbeginscan,
 		.amrescan = hashrescan,
-		.amgettuple = hashgettuple,
-		.amgetbatch = NULL,
-		.amkillitemsbatch = NULL,
-		.amunguardbatch = NULL,
+		.amgettuple = NULL,
+		.amgetbatch = hashgetbatch,
+		.amkillitemsbatch = hashkillitemsbatch,
+		.amunguardbatch = hashunguardbatch,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
 		.amposreset = NULL,
@@ -299,53 +299,28 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
 
 
 /*
- *	hashgettuple() -- Get the next tuple in the scan.
+ *	hashgetbatch() -- Get the first or next batch of tuples in the scan
  */
-bool
-hashgettuple(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+hashgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
-	/*
-	 * If we've already initialized this scan, we can just advance it in the
-	 * appropriate direction.  If we haven't done so yet, we call a routine to
-	 * get the first item in the scan.
-	 */
-	if (!HashScanPosIsValid(so->currPos))
-		res = _hash_first(scan, dir);
-	else
+	if (priorbatch == NULL)
 	{
-		/*
-		 * Check to see if we should kill the previously-fetched tuple.
-		 */
-		if (scan->kill_prior_tuple)
-		{
-			/*
-			 * Yes, so remember it for later. (We'll deal with all such tuples
-			 * at once right after leaving the index page or at end of scan.)
-			 * In case if caller reverses the indexscan direction it is quite
-			 * possible that the same item might get entered multiple times.
-			 * But, we don't detect that; instead, we just forget any excess
-			 * entries.
-			 */
-			if (so->killedItems == NULL)
-				so->killedItems = palloc_array(int, MaxIndexTuplesPerPage);
+		Relation	rel = scan->indexRelation;
 
-			if (so->numKilled < MaxIndexTuplesPerPage)
-				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
-		}
+		_hash_dropscanbuf(rel, so);
 
-		/*
-		 * Now continue the scan.
-		 */
-		res = _hash_next(scan, dir);
+		/* Initialize the scan, and return first batch of matching items */
+		return _hash_first(scan, dir);
 	}
 
-	return res;
+	/* Return batch positioned after caller's batch (in direction 'dir') */
+	return _hash_next(scan, dir, priorbatch);
 }
 
 
@@ -355,26 +330,26 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 int64
 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	bool		res;
+	IndexScanBatch batch;
 	int64		ntids = 0;
-	HashScanPosItem *currItem;
 
-	res = _hash_first(scan, ForwardScanDirection);
+	batch = _hash_first(scan, ForwardScanDirection);
 
-	while (res)
+	while (batch != NULL)
 	{
-		currItem = &so->currPos.items[so->currPos.itemIndex];
+		for (int itemIndex = batch->firstItem;
+			 itemIndex <= batch->lastItem;
+			 itemIndex++)
+		{
+			tbm_add_tuples(tbm, &batch->items[itemIndex].tableTid, 1, true);
+			ntids++;
+		}
 
 		/*
-		 * _hash_first and _hash_next handle eliminate dead index entries
-		 * whenever scan->ignore_killed_tuples is true.  Therefore, there's
-		 * nothing to do here except add the results to the TIDBitmap.
+		 * _hash_next releases the prior batch for bitmap callers before
+		 * allocating the next one, so only one batch is ever used at a time
 		 */
-		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
-		ntids++;
-
-		res = _hash_next(scan, ForwardScanDirection);
+		batch = _hash_next(scan, ForwardScanDirection, batch);
 	}
 
 	return ntids;
@@ -396,17 +371,16 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc_object(HashScanOpaqueData);
-	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
 
-	so->killedItems = NULL;
-	so->numKilled = 0;
-
 	scan->opaque = so;
+	scan->maxitemsbatch = MaxIndexTuplesPerPage;
+	scan->batch_index_opaque_size = MAXALIGN(sizeof(HashBatchData));
+	scan->batch_tuples_workspace = 0;
 
 	return scan;
 }
@@ -421,18 +395,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	if (HashScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_hash_kill_items(scan);
-	}
-
 	_hash_dropscanbuf(rel, so);
 
-	/* set position invalid (this will cause _hash_first call) */
-	HashScanPosInvalidate(so->currPos);
-
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
@@ -441,6 +405,111 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	so->hashso_buc_split = false;
 }
 
+/*
+ *	hashkillitemsbatch() -- Mark dead items' index tuples LP_DEAD
+ */
+void
+hashkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	Relation	rel = scan->indexRelation;
+	HashBatchData *hashbatch = HashBatchGetData(scan, batch);
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum,
+				maxoff;
+	bool		killedsomething = false;
+	XLogRecPtr	latestlsn;
+
+	Assert(batch->numDead > 0);
+
+	buf = _hash_getbuf(rel, hashbatch->currPage, HASH_READ,
+					   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+
+	latestlsn = BufferGetLSNAtomic(buf);
+	Assert(batch->lsn <= latestlsn);
+	if (batch->lsn != latestlsn)
+	{
+		/* Modified, give up on hinting */
+		_hash_relbuf(rel, buf);
+		return;
+	}
+
+	page = BufferGetPage(buf);
+	opaque = HashPageGetOpaque(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate through batch->deadItems[] in index page order */
+	for (int i = 0; i < batch->numDead; i++)
+	{
+		int			itemIndex = batch->deadItems[i];
+		BatchMatchingItem *currItem = &batch->items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= batch->firstItem &&
+			   itemIndex <= batch->lastItem);
+
+		while (offnum <= maxoff)
+		{
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->tableTid))
+			{
+				if (!killedsomething)
+				{
+					/*
+					 * Use the hint bit infrastructure to check if we can
+					 * update the page while just holding a share lock. If we
+					 * are not allowed, there's no point continuing.
+					 */
+					if (!BufferBeginSetHintBits(buf))
+						goto unlock_page;
+				}
+
+				/* found the item */
+				ItemIdMarkDead(iid);
+				killedsomething = true;
+				break;			/* out of inner search loop */
+			}
+			offnum = OffsetNumberNext(offnum);
+		}
+	}
+
+	/*
+	 * Since this can be redone later if needed, mark as dirty hint. Whenever
+	 * we mark anything LP_DEAD, we also set the page's
+	 * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+	 */
+	if (killedsomething)
+	{
+		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+		BufferFinishSetHintBits(buf, true, true);
+	}
+
+unlock_page:
+	_hash_relbuf(rel, buf);
+}
+
+/*
+ *	hashunguardbatch() -- Drop batch's TID recycling interlock (buffer pin)
+ *
+ * Called by the table AM when it's safe to drop the buffer pin held to
+ * prevent concurrent TID recycling by VACUUM.
+ */
+void
+hashunguardbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	HashBatchData *hashbatch = HashBatchGetData(scan, batch);
+
+	/* Should be called exactly once iff !batchImmediateUnguard */
+	Assert(!scan->batchImmediateUnguard);
+	Assert(batch->isGuarded);
+
+	ReleaseBuffer(hashbatch->buf);
+}
+
 /*
  *	hashendscan() -- close down a scan
  */
@@ -450,17 +519,8 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	if (HashScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_hash_kill_items(scan);
-	}
-
 	_hash_dropscanbuf(rel, so);
 
-	if (so->killedItems != NULL)
-		pfree(so->killedItems);
 	pfree(so);
 	scan->opaque = NULL;
 }
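
The LSN check at the top of hashkillitemsbatch() can be illustrated with a small standalone model. This is a hedged sketch, not the patch's code: ToyIndexPage, ToyDeadBatch and toy_kill_items are hypothetical names, the page is a fixed array of dead flags, and the per-item TID re-check plus the BufferBeginSetHintBits()/BufferFinishSetHintBits() protocol are omitted. It shows only the rule that hinting is abandoned when the page's LSN no longer matches the LSN saved in the batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_ITEMS 8

/* Hypothetical stand-in for an index page */
typedef struct ToyIndexPage
{
	uint64_t	lsn;			/* bumped by any logged modification */
	bool		dead[TOY_PAGE_ITEMS];
	bool		has_dead_hint;	/* models LH_PAGE_HAS_DEAD_TUPLES */
} ToyIndexPage;

/* Hypothetical stand-in for the dead-item bookkeeping in a batch */
typedef struct ToyDeadBatch
{
	uint64_t	lsn;			/* page LSN when the batch was read */
	int			numDead;
	int			deadOffsets[TOY_PAGE_ITEMS];
} ToyDeadBatch;

/*
 * Mark the batch's dead items on the page, but only if the page is
 * unchanged since the batch was read.  Returns the number of items newly
 * marked dead (0 when the page was modified and hinting is skipped).
 */
static int
toy_kill_items(ToyIndexPage *page, const ToyDeadBatch *batch)
{
	int			killed = 0;

	if (batch->lsn != page->lsn)
		return 0;				/* modified, give up on hinting */

	for (int i = 0; i < batch->numDead; i++)
	{
		int			off = batch->deadOffsets[i];

		if (off >= 0 && off < TOY_PAGE_ITEMS && !page->dead[off])
		{
			page->dead[off] = true;
			killed++;
		}
	}

	/* The page-level flag is set only when something was actually marked */
	if (killed > 0)
		page->has_dead_hint = true;

	return killed;
}
```

Giving up is safe in this model because the marks are only hints: skipping them costs a later scan some extra work but never affects correctness.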
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 2060620c7..e26ee8bb9 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1141,14 +1141,14 @@ hash_mask(char *pagedata, BlockNumber blkno)
 		/*
 		 * In hash bucket and overflow pages, it is possible to modify the
 		 * LP_FLAGS without emitting any WAL record. Hence, mask the line
-		 * pointer flags. See hashgettuple(), _hash_kill_items() for details.
+		 * pointer flags. See hashkillitemsbatch() for details.
 		 */
 		mask_lp_flags(page);
 	}
 
 	/*
 	 * It is possible that the hint bit LH_PAGE_HAS_DEAD_TUPLES may remain
-	 * unlogged. So, mask it. See _hash_kill_items() for details.
+	 * unlogged. So, mask it. See hashkillitemsbatch() for details.
 	 */
 	opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
 }
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 8099b0d02..11e3db472 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -280,31 +280,24 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
- *	_hash_dropscanbuf() -- release buffers used in scan.
+ *	_hash_dropscanbuf() -- release buffers owned by scan.
  *
- * This routine unpins the buffers used during scan on which we
- * hold no lock.
+ * This routine unpins the buffers for the primary bucket page and for the
+ * bucket page of a bucket being split as needed.
  */
 void
 _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
-	/* release pin we hold on primary bucket page */
-	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->currPos.buf)
+	/* release pin held on primary bucket page */
+	if (BufferIsValid(so->hashso_bucket_buf))
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
-	/* release pin we hold on primary bucket page  of bucket being split */
-	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->currPos.buf)
+	/* release pin held on primary bucket page of bucket being split */
+	if (BufferIsValid(so->hashso_split_bucket_buf))
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->currPos.buf))
-		_hash_dropbuf(rel, so->currPos.buf);
-	so->currPos.buf = InvalidBuffer;
-
 	/* reset split scan */
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
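
The reason _hash_dropscanbuf() can now drop the scan-owned pins unconditionally may be easier to see with a reference-count model. This is a sketch under assumptions, not PostgreSQL's buffer manager: ToyBufferPool, ToyScan and the toy_* functions are hypothetical, and a pin is reduced to a counter per buffer. The idea modelled is that a batch returned for the primary bucket page takes its own extra reference, so the scan and the batch can each release theirs independently and in either order.

```c
#include <assert.h>

#define TOY_NBUFFERS 4
#define TOY_INVALID_BUFFER (-1)

/* Hypothetical pool: one pin count per buffer */
typedef struct ToyBufferPool
{
	int			refcount[TOY_NBUFFERS];
} ToyBufferPool;

/* Hypothetical scan state: the pin kept on the primary bucket page */
typedef struct ToyScan
{
	int			bucket_buf;
} ToyScan;

static void
toy_pin(ToyBufferPool *pool, int buf)
{
	pool->refcount[buf]++;
}

static void
toy_unpin(ToyBufferPool *pool, int buf)
{
	assert(pool->refcount[buf] > 0);
	pool->refcount[buf]--;
}

/*
 * Called when a batch is about to be returned for batch_buf.  If that
 * buffer is the scan's own bucket buffer, take an additional reference so
 * that the batch owns a pin of its own.
 */
static void
toy_batch_take_ref(ToyBufferPool *pool, const ToyScan *scan, int batch_buf)
{
	if (batch_buf == scan->bucket_buf)
		toy_pin(pool, batch_buf);
}

/* Release the scan-owned pin, without caring about any batch's pin */
static void
toy_dropscanbuf(ToyBufferPool *pool, ToyScan *scan)
{
	if (scan->bucket_buf != TOY_INVALID_BUFFER)
		toy_unpin(pool, scan->bucket_buf);
	scan->bucket_buf = TOY_INVALID_BUFFER;
}
```

With separate references there is no need to compare the scan's buffers against a batch's buffer before releasing them, which is the comparison the removed currPos.buf checks used to make.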
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 89d1c5bc6..a6812c6a2 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -22,105 +22,94 @@
 #include "storage/predicate.h"
 #include "utils/rel.h"
 
-static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
-						   ScanDirection dir);
+static bool _hash_readpage(IndexScanDesc scan, Buffer buf, ScanDirection dir,
+						   IndexScanBatch batch);
 static int	_hash_load_qualified_items(IndexScanDesc scan, Page page,
-									   OffsetNumber offnum, ScanDirection dir);
-static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ScanDirection dir,
+									   IndexScanBatch batch);
+static inline void _hash_saveitem(IndexScanBatch batch, int itemIndex,
 								  OffsetNumber offnum, IndexTuple itup);
 static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
 						   Page *pagep, HashPageOpaque *opaquep);
 
 /*
- *	_hash_next() -- Get the next item in a scan.
+ *	_hash_next() -- Get the next batch of items in a scan.
  *
- *		On entry, so->currPos describes the current page, which may
- *		be pinned but not locked, and so->currPos.itemIndex identifies
- *		which item was previously returned.
+ *		On entry, priorbatch describes the current page batch with items
+ *		already returned.
  *
- *		On successful exit, scan->xs_heaptid is set to the TID of the next
- *		heap tuple.  so->currPos is updated as needed.
+ *		On successful exit, returns a batch containing matching items from
+ *		next page.  Otherwise returns NULL, indicating that there are no
+ *		further matches.  No locks are ever held when we return.
  *
- *		On failure exit (no more tuples), we return false with pin
- *		held on bucket page but no pins or locks held on overflow
- *		page.
+ *		Retains pins according to the same rules as _hash_first.
  */
-bool
-_hash_next(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+_hash_next(IndexScanDesc scan, ScanDirection dir, IndexScanBatch priorbatch)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	HashScanPosItem *currItem;
+	HashBatchData *hashpriorbatch = HashBatchGetData(scan, priorbatch);
 	BlockNumber blkno;
 	Buffer		buf;
-	bool		end_of_scan = false;
+	IndexScanBatch batch;
 
 	/*
-	 * Advance to the next tuple on the current page; or if done, try to read
-	 * data from the next or previous page based on the scan direction. Before
-	 * moving to the next or previous page make sure that we deal with all the
-	 * killed items.
+	 * The core code must deal with cross-batch scan direction changes for us.
+	 * A batch management routine that flips priorbatch's scan direction is
+	 * used for this.
+	 */
+	Assert(priorbatch->dir == dir);
+
+	/*
+	 * Determine which page to read next based on scan direction and details
+	 * taken from the prior batch
 	 */
 	if (ScanDirectionIsForward(dir))
-	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
-		{
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
+		blkno = hashpriorbatch->nextPage;
+	else
+		blkno = hashpriorbatch->prevPage;
 
-			blkno = so->currPos.nextPage;
-			if (BlockNumberIsValid(blkno))
-			{
-				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
-				if (!_hash_readpage(scan, &buf, dir))
-					end_of_scan = true;
-			}
-			else
-				end_of_scan = true;
-		}
-	}
+	/*
+	 * For bitmap scan callers, release the prior batch now so that the
+	 * allocation below can reuse its memory.  That way bitmap scans never
+	 * need more than one batch allocation.
+	 */
+	if (!scan->usebatchring)
+		indexam_util_batch_release(scan, priorbatch);
+
+	if (!BlockNumberIsValid(blkno))
+		return NULL;
+
+	/* Allocate space for next batch */
+	batch = indexam_util_batch_alloc(scan);
+
+	/* Get the buffer for next batch */
+	if (ScanDirectionIsForward(dir))
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
 	else
 	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
-		{
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
+		buf = _hash_getbuf(rel, blkno, HASH_READ,
+						   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 
-			blkno = so->currPos.prevPage;
-			if (BlockNumberIsValid(blkno))
-			{
-				buf = _hash_getbuf(rel, blkno, HASH_READ,
-								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-
-				/*
-				 * We always maintain the pin on bucket page for whole scan
-				 * operation, so releasing the additional pin we have acquired
-				 * here.
-				 */
-				if (buf == so->hashso_bucket_buf ||
-					buf == so->hashso_split_bucket_buf)
-					_hash_dropbuf(rel, buf);
-
-				if (!_hash_readpage(scan, &buf, dir))
-					end_of_scan = true;
-			}
-			else
-				end_of_scan = true;
-		}
+		/*
+		 * We always maintain the pin on bucket page for whole scan operation,
+		 * so releasing the additional pin we have acquired here.
+		 */
+		if (buf == so->hashso_bucket_buf ||
+			buf == so->hashso_split_bucket_buf)
+			_hash_dropbuf(rel, buf);
 	}
 
-	if (end_of_scan)
+	/* Read the next page and load items into allocated batch */
+	if (!_hash_readpage(scan, buf, dir, batch))
 	{
-		_hash_dropscanbuf(rel, so);
-		HashScanPosInvalidate(so->currPos);
-		return false;
+		indexam_util_batch_release(scan, batch);
+		return NULL;
 	}
 
-	/* OK, itemIndex says what to return */
-	currItem = &so->currPos.items[so->currPos.itemIndex];
-	scan->xs_heaptid = currItem->heapTid;
-
-	return true;
+	/* Return the batch containing matched items from next page */
+	return batch;
 }
 
 /*
@@ -270,22 +259,20 @@ _hash_readprev(IndexScanDesc scan,
 }
 
 /*
- *	_hash_first() -- Find the first item in a scan.
+ *	_hash_first() -- Find the first batch of items in a scan.
  *
- *		We find the first item (or, if backward scan, the last item) in the
- *		index that satisfies the qualification associated with the scan
- *		descriptor.
+ *		We find the first batch of items (or, if backward scan, the last
+ *		batch) in the index that satisfies the qualification associated with
+ *		the scan descriptor.
  *
- *		On successful exit, if the page containing current index tuple is an
- *		overflow page, both pin and lock are released whereas if it is a bucket
- *		page then it is pinned but not locked and data about the matching
- *		tuple(s) on the page has been loaded into so->currPos,
- *		scan->xs_heaptid is set to the heap TID of the current tuple.
+ *		On successful exit, returns a batch containing matching items.
+ *		Otherwise returns NULL, indicating that there are no further matches.
+ *		No locks are ever held when we return.
  *
- *		On failure exit (no more tuples), we return false, with pin held on
- *		bucket page but no pins or locks held on overflow page.
+ *		We always retain our own pin on the bucket page.  When we return a
+ *		batch with a bucket page, it will retain its own reference pin.
  */
-bool
+IndexScanBatch
 _hash_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
@@ -296,7 +283,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	HashScanPosItem *currItem;
+	IndexScanBatch batch;
 
 	pgstat_count_index_scan(rel);
 	if (scan->instrument)
@@ -326,7 +313,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	 * items in the index.
 	 */
 	if (cur->sk_flags & SK_ISNULL)
-		return false;
+		return NULL;
 
 	/*
 	 * Okay to compute the hash key.  We want to do this before acquiring any
@@ -419,191 +406,152 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* remember which buffer we have pinned, if any */
-	Assert(BufferIsInvalid(so->currPos.buf));
-	so->currPos.buf = buf;
+	/* Allocate space for first batch */
+	batch = indexam_util_batch_alloc(scan);
 
-	/* Now find all the tuples satisfying the qualification from a page */
-	if (!_hash_readpage(scan, &buf, dir))
-		return false;
+	/* Read the first page and load items into allocated batch */
+	if (!_hash_readpage(scan, buf, dir, batch))
+	{
+		indexam_util_batch_release(scan, batch);
+		return NULL;
+	}
 
-	/* OK, itemIndex says what to return */
-	currItem = &so->currPos.items[so->currPos.itemIndex];
-	scan->xs_heaptid = currItem->heapTid;
-
-	/* if we're here, _hash_readpage found a valid tuples */
-	return true;
+	/* Return the batch containing matched items */
+	return batch;
 }
 
 /*
- *	_hash_readpage() -- Load data from current index page into so->currPos
+ *	_hash_readpage() -- Load data from current index page into batch
  *
  *	We scan all the items in the current index page and save them into
- *	so->currPos if it satisfies the qualification. If no matching items
+ *	the batch if they satisfy the qualification. If no matching items
  *	are found in the current page, we move to the next or previous page
  *	in a bucket chain as indicated by the direction.
  *
  *	Return true if any matching items are found else return false.
  */
 static bool
-_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+_hash_readpage(IndexScanDesc scan, Buffer buf, ScanDirection dir,
+			   IndexScanBatch batch)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Buffer		buf;
+	HashBatchData *hashbatch = HashBatchGetData(scan, batch);
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum;
 	uint16		itemIndex;
 
-	buf = *bufP;
 	Assert(BufferIsValid(buf));
 	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 	page = BufferGetPage(buf);
 	opaque = HashPageGetOpaque(page);
 
-	so->currPos.buf = buf;
-	so->currPos.currPage = BufferGetBlockNumber(buf);
+	hashbatch->buf = buf;
+	hashbatch->currPage = BufferGetBlockNumber(buf);
+	batch->dir = dir;
 
 	if (ScanDirectionIsForward(dir))
 	{
-		BlockNumber prev_blkno = InvalidBlockNumber;
-
 		for (;;)
 		{
 			/* new page, locate starting position by binary search */
 			offnum = _hash_binsearch(page, so->hashso_sk_hash);
 
-			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir,
+												   batch);
 
 			if (itemIndex != 0)
 				break;
 
 			/*
-			 * Could not find any matching tuples in the current page, move to
-			 * the next page. Before leaving the current page, deal with any
-			 * killed items.
+			 * Could not find any matching tuples in the current page, try to
+			 * move to the next page
 			 */
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			/*
-			 * If this is a primary bucket page, hasho_prevblkno is not a real
-			 * block number.
-			 */
-			if (so->currPos.buf == so->hashso_bucket_buf ||
-				so->currPos.buf == so->hashso_split_bucket_buf)
-				prev_blkno = InvalidBlockNumber;
-			else
-				prev_blkno = opaque->hasho_prevblkno;
-
 			_hash_readnext(scan, &buf, &page, &opaque);
-			if (BufferIsValid(buf))
-			{
-				so->currPos.buf = buf;
-				so->currPos.currPage = BufferGetBlockNumber(buf);
-			}
-			else
-			{
-				/*
-				 * Remember next and previous block numbers for scrollable
-				 * cursors to know the start position and return false
-				 * indicating that no more matching tuples were found. Also,
-				 * don't reset currPage or lsn, because we expect
-				 * _hash_kill_items to be called for the old page after this
-				 * function returns.
-				 */
-				so->currPos.prevPage = prev_blkno;
-				so->currPos.nextPage = InvalidBlockNumber;
-				so->currPos.buf = buf;
+			if (!BufferIsValid(buf))
 				return false;
-			}
+
+			hashbatch->buf = buf;
+			hashbatch->currPage = BufferGetBlockNumber(buf);
 		}
 
-		so->currPos.firstItem = 0;
-		so->currPos.lastItem = itemIndex - 1;
-		so->currPos.itemIndex = 0;
+		batch->firstItem = 0;
+		batch->lastItem = itemIndex - 1;
 	}
 	else
 	{
-		BlockNumber next_blkno = InvalidBlockNumber;
-
 		for (;;)
 		{
 			/* new page, locate starting position by binary search */
 			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 
-			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir,
+												   batch);
 
 			if (itemIndex != MaxIndexTuplesPerPage)
 				break;
 
 			/*
-			 * Could not find any matching tuples in the current page, move to
-			 * the previous page. Before leaving the current page, deal with
-			 * any killed items.
+			 * Could not find any matching tuples in the current page, try to
+			 * move to the previous page
 			 */
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			if (so->currPos.buf == so->hashso_bucket_buf ||
-				so->currPos.buf == so->hashso_split_bucket_buf)
-				next_blkno = opaque->hasho_nextblkno;
-
 			_hash_readprev(scan, &buf, &page, &opaque);
-			if (BufferIsValid(buf))
-			{
-				so->currPos.buf = buf;
-				so->currPos.currPage = BufferGetBlockNumber(buf);
-			}
-			else
-			{
-				/*
-				 * Remember next and previous block numbers for scrollable
-				 * cursors to know the start position and return false
-				 * indicating that no more matching tuples were found. Also,
-				 * don't reset currPage or lsn, because we expect
-				 * _hash_kill_items to be called for the old page after this
-				 * function returns.
-				 */
-				so->currPos.prevPage = InvalidBlockNumber;
-				so->currPos.nextPage = next_blkno;
-				so->currPos.buf = buf;
+			if (!BufferIsValid(buf))
 				return false;
-			}
+
+			hashbatch->buf = buf;
+			hashbatch->currPage = BufferGetBlockNumber(buf);
 		}
 
-		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		batch->firstItem = itemIndex;
+		batch->lastItem = MaxIndexTuplesPerPage - 1;
 	}
 
-	if (so->currPos.buf == so->hashso_bucket_buf ||
-		so->currPos.buf == so->hashso_split_bucket_buf)
+	/*
+	 * Saved at least one match in batch->items[].  Prepare for hashgetbatch
+	 * to return it by initializing remaining uninitialized fields.
+	 */
+	if (hashbatch->buf == so->hashso_bucket_buf ||
+		hashbatch->buf == so->hashso_split_bucket_buf)
 	{
-		so->currPos.prevPage = InvalidBlockNumber;
-		so->currPos.nextPage = opaque->hasho_nextblkno;
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		/*
+		 * Batch's buffer is either the primary bucket, or a bucket being
+		 * populated due to a split.
+		 *
+		 * Increment local reference count so that batch gets an independent
+		 * buffer reference that can be released (by the core code/table AM)
+		 * before the hashso_bucket_buf/hashso_split_bucket_buf references are
+		 * released.
+		 */
+		IncrBufferRefCount(hashbatch->buf);
+
+		/* Can only use opaque->hasho_nextblkno */
+		hashbatch->prevPage = InvalidBlockNumber;
+		hashbatch->nextPage = opaque->hasho_nextblkno;
 	}
 	else
 	{
-		so->currPos.prevPage = opaque->hasho_prevblkno;
-		so->currPos.nextPage = opaque->hasho_nextblkno;
-		_hash_relbuf(rel, so->currPos.buf);
-		so->currPos.buf = InvalidBuffer;
+		/* Can use opaque->hasho_prevblkno and opaque->hasho_nextblkno */
+		hashbatch->prevPage = opaque->hasho_prevblkno;
+		hashbatch->nextPage = opaque->hasho_nextblkno;
 	}
 
-	Assert(so->currPos.firstItem <= so->currPos.lastItem);
+	/* we saved one or more matches in batch.items[] */
+	indexam_util_batch_unlock(scan, batch, hashbatch->buf);
+
+	Assert(batch->firstItem <= batch->lastItem);
 	return true;
 }
 
 /*
  * Load all the qualified items from a current index page
- * into so->currPos. Helper function for _hash_readpage.
+ * into batch. Helper function for _hash_readpage.
  */
 static int
 _hash_load_qualified_items(IndexScanDesc scan, Page page,
-						   OffsetNumber offnum, ScanDirection dir)
+						   OffsetNumber offnum, ScanDirection dir,
+						   IndexScanBatch batch)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	IndexTuple	itup;
@@ -640,7 +588,7 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 				_hash_checkqual(scan, itup))
 			{
 				/* tuple is qualified, so remember it */
-				_hash_saveitem(so, itemIndex, offnum, itup);
+				_hash_saveitem(batch, itemIndex, offnum, itup);
 				itemIndex++;
 			}
 			else
@@ -687,7 +635,7 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 			{
 				itemIndex--;
 				/* tuple is qualified, so remember it */
-				_hash_saveitem(so, itemIndex, offnum, itup);
+				_hash_saveitem(batch, itemIndex, offnum, itup);
 			}
 			else
 			{
@@ -706,13 +654,14 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 	}
 }
 
-/* Save an index item into so->currPos.items[itemIndex] */
+/* Save an index item into batch->items[itemIndex] */
 static inline void
-_hash_saveitem(HashScanOpaque so, int itemIndex,
+_hash_saveitem(IndexScanBatch batch, int itemIndex,
 			   OffsetNumber offnum, IndexTuple itup)
 {
-	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *currItem = &batch->items[itemIndex];
 
-	currItem->heapTid = itup->t_tid;
+	currItem->tableTid = itup->t_tid;
 	currItem->indexOffset = offnum;
+	currItem->tupleOffset = 0;
 }
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 081adbc88..331d5f4da 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -16,7 +16,6 @@
 
 #include "access/hash.h"
 #include "access/reloptions.h"
-#include "access/relscan.h"
 #include "port/pg_bitutils.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
@@ -33,7 +32,7 @@ _hash_checkqual(IndexScanDesc scan, IndexTuple itup)
 	/*
 	 * Currently, we can't check any of the scan conditions since we do not
 	 * have the original index entry value to supply to the sk_func. Always
-	 * return true; we expect that hashgettuple already set the recheck flag
+	 * return true; we expect that hashgetbatch already set the recheck flag
 	 * to make the main indexscan code do it.
 	 */
 #ifdef NOT_USED
@@ -505,129 +504,3 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 
 	return new_bucket;
 }
-
-/*
- * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
- * told us were killed.
- *
- * scan->opaque, referenced locally through so, contains information about the
- * current page and killed tuples thereon (generally, this should only be
- * called if so->numKilled > 0).
- *
- * The caller does not have a lock on the page and may or may not have the
- * page pinned in a buffer.  Note that read-lock is sufficient for setting
- * LP_DEAD status (which is only a hint).
- *
- * The caller must have pin on bucket buffer, but may or may not have pin
- * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
- *
- * We match items by heap TID before assuming they are the right ones to
- * delete.
- *
- * There are never any scans active in a bucket at the time VACUUM begins,
- * because VACUUM takes a cleanup lock on the primary bucket page and scans
- * hold a pin.  A scan can begin after VACUUM leaves the primary bucket page
- * but before it finishes the entire bucket, but it can never pass VACUUM,
- * because VACUUM always locks the next page before releasing the lock on
- * the previous one.  Therefore, we don't have to worry about accidentally
- * killing a TID that has been reused for an unrelated tuple.
- */
-void
-_hash_kill_items(IndexScanDesc scan)
-{
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	BlockNumber blkno;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber offnum,
-				maxoff;
-	int			numKilled = so->numKilled;
-	int			i;
-	bool		killedsomething = false;
-	bool		havePin = false;
-
-	Assert(so->numKilled > 0);
-	Assert(so->killedItems != NULL);
-	Assert(HashScanPosIsValid(so->currPos));
-
-	/*
-	 * Always reset the scan state, so we don't look for same items on other
-	 * pages.
-	 */
-	so->numKilled = 0;
-
-	blkno = so->currPos.currPage;
-	if (HashScanPosIsPinned(so->currPos))
-	{
-		/*
-		 * We already have pin on this buffer, so, all we need to do is
-		 * acquire lock on it.
-		 */
-		havePin = true;
-		buf = so->currPos.buf;
-		LockBuffer(buf, BUFFER_LOCK_SHARE);
-	}
-	else
-		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
-
-	page = BufferGetPage(buf);
-	opaque = HashPageGetOpaque(page);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (i = 0; i < numKilled; i++)
-	{
-		int			itemIndex = so->killedItems[i];
-		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
-
-		offnum = currItem->indexOffset;
-
-		Assert(itemIndex >= so->currPos.firstItem &&
-			   itemIndex <= so->currPos.lastItem);
-
-		while (offnum <= maxoff)
-		{
-			ItemId		iid = PageGetItemId(page, offnum);
-			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
-
-			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
-			{
-				if (!killedsomething)
-				{
-					/*
-					 * Use the hint bit infrastructure to check if we can
-					 * update the page while just holding a share lock. If we
-					 * are not allowed, there's no point continuing.
-					 */
-					if (!BufferBeginSetHintBits(buf))
-						goto unlock_page;
-				}
-
-				/* found the item */
-				ItemIdMarkDead(iid);
-				killedsomething = true;
-				break;			/* out of inner search loop */
-			}
-			offnum = OffsetNumberNext(offnum);
-		}
-	}
-
-	/*
-	 * Since this can be redone later if needed, mark as dirty hint. Whenever
-	 * we mark anything LP_DEAD, we also set the page's
-	 * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
-	 */
-	if (killedsomething)
-	{
-		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		BufferFinishSetHintBits(buf, true, true);
-	}
-
-unlock_page:
-	if (so->hashso_bucket_buf == so->currPos.buf ||
-		havePin)
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
-	else
-		_hash_relbuf(rel, buf);
-}
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 37b75361e..8dd08d256 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -845,7 +845,8 @@ amgetbatch (IndexScanDesc scan,
    <function>tableam_util_free_batch</function>).  Note also that
    <function>amgetbatch</function> functions must never modify the
    <structfield>priorbatch</structfield> parameter.  The core
-   <filename>src/backend/access/nbtree/</filename> implementation provides a
+   <filename>src/backend/access/nbtree/</filename> and
+   <filename>src/backend/access/hash/</filename> implementations provide
    reference examples of the <function>amgetbatch</function> interface.
   </para>
 
@@ -921,8 +922,8 @@ amkillitemsbatch (IndexScanDesc scan,
   <para>
    While implementing <function>amkillitemsbatch</function> is optional,
    doing so is recommended for performance, as it allows future scans to skip
-   known-dead index entries.  The core index access method that currently
-   support <function>amgetbatch</function> (B-tree) implements
+   known-dead index entries.  Both core index access methods that currently
+   support <function>amgetbatch</function> (B-tree and hash) implement
    <literal>LP_DEAD</literal> marking, though third-party index access methods
    are free to choose whether to implement this feature.
    The table AM may call
@@ -965,8 +966,8 @@ amkillitemsbatch (IndexScanDesc scan,
    always safe to skip it.  Note that this LSN comparison technique requires
    the index AM to use fake (monotonically increasing) LSNs on its pages for
    relations where WAL is not generated, since real LSNs are not available in
-   that case.  See the B-tree index implementation for a reference
-   example of this technique.  An index AM that does not implement fake LSNs
+   that case.  See the B-tree and hash index implementations for reference
+   examples of this technique.  An index AM that does not implement fake LSNs
    can still provide <function>amkillitemsbatch</function>, but should simply
    do nothing when the relation does not generate WAL (i.e., when
    <function>RelationNeedsWAL()</function> is false), since the LSN comparison
@@ -993,9 +994,9 @@ amunguardbatch (IndexScanDesc scan,
    leaf page, which prevents concurrent TID recycling by
    <command>VACUUM</command>.
    Formally, an index AM may hold a different kind of interlock, or multiple
-   interlocks, in its per-batch opaque area, but in practice the built-in
-   index AM that supports <function>amgetbatch</function> &mdash; B-tree
-   &mdash; holds a single buffer pin.  See <xref linkend="index-locking"/>
+   interlocks, in its per-batch opaque area, but in practice both built-in
+   index AMs that support <function>amgetbatch</function> &mdash; B-tree and
+   hash &mdash; hold a single buffer pin.  See <xref linkend="index-locking"/>
    for details on buffer pin management during index scans.  This function
    will be called exactly once for each guarded batch.
   </para>
@@ -1065,8 +1066,8 @@ amgetbitmap (IndexScanDesc scan,
    <function>amgetbitmap</function> scans; during <function>amgetbatch</function>
    scans the <literal>priorbatch</literal> is strictly owned by the table AM
    and core code, and the index AM must never release it.  See
-   <function>_bt_next</function> for a
-   reference example.  The released batch is cached internally and reused by
+   <function>_bt_next</function> and <function>_hash_next</function> for
+   reference examples.  The released batch is cached internally and reused by
    the next <function>indexam_util_batch_alloc</function> call, avoiding
    repeated memory allocation during the bitmap scan.
   </para>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23e043cc0..2ee6d580e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1210,6 +1210,7 @@ Hash
 HashAggBatch
 HashAggSpill
 HashAllocFunc
+HashBatchData
 HashBuildState
 HashBulkDeleteStreamPrivate
 HashCompareFunc
@@ -1231,8 +1232,6 @@ HashPageStat
 HashPath
 HashScanOpaque
 HashScanOpaqueData
-HashScanPosData
-HashScanPosItem
 HashSkewBucket
 HashState
 HashValueFunc
-- 
2.53.0



  [application/octet-stream] v20-0008-heapam-Add-index-scan-I-O-prefetching.patch (45.1K, 12-v20-0008-heapam-Add-index-scan-I-O-prefetching.patch)
From 798512b7972e6719b69422350fb96a82b55d639c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Wed, 25 Mar 2026 16:58:09 -0400
Subject: [PATCH v20 08/17] heapam: Add index scan I/O prefetching.

This commit implements I/O prefetching for index scans (and index-only
scans that require heap fetches). This was made possible by the recent
addition of batching interfaces to both the table AM and index AM APIs.

The amgetbatch index AM interface provides batches of matching TIDs
(rather than one tuple at a time).  Every TID in a batch must come from
index tuples that appear together on a single index page.  This allows
multiple batches to be held open simultaneously.  Giving the table AM an
explicit understanding of index AM concepts/index page boundaries allows
it to consider all of the relevant costs and benefits.

Prefetching is implemented using a prefetching position under the
control of the table AM and core code.  This is closely related to the
scan position added by commit FIXME, which introduced the amgetbatch
interface.  A read stream callback advances the prefetch position as
needed to provide enough heap block numbers to maintain the read
stream's target prefetch distance.

Testing has shown that index prefetching can make index scans much
faster.  Large range scans that return many tuples can be as much as 30x
faster with local SSDs when buffered I/O is used, and 50x faster or more
with higher-latency storage such as network-attached block devices,
where the benefit of hiding I/O latency through prefetching is even
greater.

A new GUC (enable_indexscan_prefetch) controls the use of index
prefetching.  The default setting is 'on', so all plain index scans use
prefetching where support exists.  All index-only scans will also use
prefetching automatically where supported (once the scan starts to
require a significant number of heap fetches).

An important goal of the amgetbatch design is to enable the table AM's
read stream callback to advance its prefetch position using TIDs that
appear on a leaf page that's ahead of the current scan position's leaf
page.  This is crucial with scans of indexes where each leaf page
happens to have relatively few distinct heap blocks among its matching
TIDs (as well as with scans with leaf pages that have relatively few
total matching items).  Index scans can have as many as 64 open batches,
which testing has shown to be about the maximum number that can ever be
useful.  Batches are maintained in scan order using a simple ring buffer
data structure.
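
As a rough illustration only (not code from this patch), a ring buffer
of open batches could be sketched as below.  All names are hypothetical,
the slots hold plain ints rather than batch pointers, and the use of
free-running 8-bit counters is an assumption made for the sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_BATCHES 64	/* limit of 64 open batches, per the text */

typedef struct SketchRing
{
	uint8_t		head;			/* counter of oldest batch still in the ring */
	uint8_t		next;			/* counter the next appended batch will get */
	int			slots[SKETCH_MAX_BATCHES];	/* stand-in for batch pointers */
} SketchRing;

/* Number of open batches; the uint8_t cast handles counter wraparound */
static inline int
sketch_ring_count(const SketchRing *ring)
{
	return (uint8_t) (ring->next - ring->head);
}

static inline bool
sketch_ring_full(const SketchRing *ring)
{
	return sketch_ring_count(ring) == SKETCH_MAX_BATCHES;
}

/* Append a batch at the tail, in scan order */
static inline void
sketch_ring_append(SketchRing *ring, int batch)
{
	assert(!sketch_ring_full(ring));
	ring->slots[ring->next % SKETCH_MAX_BATCHES] = batch;
	ring->next++;
}

/* Remove and return the oldest batch */
static inline int
sketch_ring_remove_head(SketchRing *ring)
{
	int			batch;

	assert(sketch_ring_count(ring) > 0);
	batch = ring->slots[ring->head % SKETCH_MAX_BATCHES];
	ring->head++;
	return batch;
}

/*
 * Order two batch counters: negative if a is behind b, positive if a is
 * ahead.  Only meaningful while they are fewer than 128 batches apart.
 */
static inline int
sketch_batch_cmp(uint8_t a, uint8_t b)
{
	return (int8_t) (a - b);
}
```

Because 256 is a multiple of 64, the counters in this sketch can wrap
around without disturbing the mapping from counter to slot.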

In rare cases where the scan would exceed this quasi-arbitrary limit of
64 open batches, the read stream is temporarily paused using the pausing
mechanism added by commit 38229cb9.  Prefetching (via the read stream)
is resumed only after the scan position advances beyond its current open
batch and then frees and removes the batch from the scan's batch ring
buffer.  Testing has shown that it isn't very common for scans to hold
open more than about 10 batches to get the desired I/O prefetch
distance.
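
The pause/resume handshake described above can be modeled with a small
sketch.  This is a simplified model, not the patch's code: the struct
and function names are hypothetical, and the read stream itself is
reduced to a single "paused" flag:

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_OPEN 64		/* limit of 64 open batches, per the text */

typedef struct SketchPrefetch
{
	int			open_batches;	/* batches currently held in the ring */
	bool		paused;			/* prefetching paused for lack of a slot? */
} SketchPrefetch;

/*
 * Prefetch side: try to load one more batch ahead of the scan position.
 * Returns false (and pauses) when every slot is already in use.
 */
static bool
sketch_prefetch_load_batch(SketchPrefetch *st)
{
	if (st->open_batches == SKETCH_MAX_OPEN)
	{
		st->paused = true;
		return false;
	}
	st->open_batches++;
	return true;
}

/*
 * Scan side: the scan position moved past the oldest batch, so free it.
 * Freeing a slot is the only event that lifts a pause.
 */
static void
sketch_scan_release_head(SketchPrefetch *st)
{
	assert(st->open_batches > 0);
	st->open_batches--;
	if (st->paused)
		st->paused = false;		/* real code would also resume the stream */
}
```

In this model, a pause can only be lifted by the scan side, which
mirrors the rule that prefetching resumes only after a batch has been
freed and removed from the ring buffer.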

The heuristic used to decide when to begin prefetching delays
initialization of the scan's read stream until the scan transitions from
its first batch to its second batch.  Each batch corresponds to matching
TIDs from a single index leaf page, so prefetching only begins once the
scan reads from its second leaf page containing at least one matching
item.  A selective index scan that touches only one leaf page never
reaches the second batch, so the heuristic correctly avoids prefetching
overhead.  The picture for more complicated cases is mixed.
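
A minimal sketch of the second-batch heuristic follows.  It is not the
patch's code: the struct, its fields, and the function name are all
hypothetical, and the conditions are a simplified reading of what
heapam_index_consider_prefetching checks (MVCC snapshot, GUC enabled,
and at least one heap fetch already performed):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct SketchScan
{
	bool		mvcc_snapshot;	/* can batches be unguarded right away? */
	bool		prefetch_guc;	/* stand-in for enable_indexscan_prefetch */
	bool		did_heap_fetch; /* false for an index-only scan that has
								 * not needed a heap fetch so far */
	int			batches_read;	/* batches the scan position has read from */
	bool		stream_started; /* has prefetching begun? */
} SketchScan;

/*
 * Called each time the scan position moves onto a newly returned batch.
 * Returns true once prefetching is active for the scan.
 */
static bool
sketch_advance_to_next_batch(SketchScan *scan)
{
	scan->batches_read++;

	/* Never start on the first batch; reconsider on every later batch */
	if (!scan->stream_started &&
		scan->batches_read >= 2 &&
		scan->mvcc_snapshot &&
		scan->prefetch_guc &&
		scan->did_heap_fetch)
		scan->stream_started = true;

	return scan->stream_started;
}
```

In this sketch, a scan that never reaches a second batch never pays for
setting up prefetching, which is the property the heuristic is after.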

The same principle applies to nestloop inner index scans with very tight
limits (e.g., a correlated subquery with LIMIT 1), where each rescan
reads from only a single leaf page: the heuristic avoids the cost of
repeatedly resetting a read stream across many rescans.  On the other
hand, some selective scans that access randomly-ordered heap pages would
genuinely benefit from prefetching, but never get as far as reaching
their second batch -- a missed opportunity, where the heuristic is
overly cautious.

Conversely, the heuristic is not cautious enough with slightly less
selective nestloop inner scans (e.g., LIMIT 3 within a LATERAL join).
These rescans may span two leaf pages, just barely crossing the
second-batch threshold, while still only needing to fetch two or three
heap pages -- not enough for prefetching to realistically help or pay
for itself on any individual rescan.  Such queries regress with this
commit (relative to PostgreSQL 18), though only when the scan has to
read heap pages from storage.

Adding a smarter heuristic that addresses both shortcomings remains as
work for a future release.  Passing down an ExecSetTupleBound style hint
and using that hint to influence how the read stream ramps up its
distance seems like a promising approach.

Author: Tomas Vondra <[email protected]>
Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/include/access/heapam.h                   |  10 +
 src/include/access/indexbatch.h               |   8 +-
 src/include/access/relscan.h                  |  44 ++
 src/include/optimizer/cost.h                  |   1 +
 src/backend/access/heap/heapam_indexscan.c    | 463 +++++++++++++++++-
 src/backend/access/index/indexbatch.c         |  26 +-
 src/backend/optimizer/path/costsize.c         |   1 +
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml                      |  16 +
 doc/src/sgml/indexam.sgml                     |  66 ++-
 doc/src/sgml/tableam.sgml                     |   8 +
 src/test/regress/expected/sysviews.out        |   3 +-
 13 files changed, 641 insertions(+), 13 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index c3bb89538..7d7b07767 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -135,6 +135,16 @@ typedef struct IndexFetchHeapData
 	/* Plain index scan xs_lastinblock optimization */
 	bool		xs_lastinblock; /* last TID on this block in current batch? */
 
+	/*
+	 * The read stream is allocated early in the scan, and reset on rescan.
+	 * This reset process releases all pending pinned buffers.  The read
+	 * stream is also reset when we detect a scan direction change.
+	 */
+	bool		xs_paused;		/* paused until next batch is read? */
+	ScanDirection xs_read_stream_dir;	/* index scan direction */
+	BlockNumber xs_prefetch_block;	/* last block returned to xs_read_stream */
+	ReadStream *xs_read_stream; /* prefetching read stream */
+
 } IndexFetchHeapData;
 
 /*
diff --git a/src/include/access/indexbatch.h b/src/include/access/indexbatch.h
index 4265ad7de..265379288 100644
--- a/src/include/access/indexbatch.h
+++ b/src/include/access/indexbatch.h
@@ -48,16 +48,16 @@ extern void tableam_util_unguard_batch(IndexScanDesc scan, IndexScanBatch batch)
 /*
  * Fetch the next batch of matching items for the scan (or the first).
  *
- * Called when caller's current batch (passed to us as priorBatch) has no more
- * matching items in the given scan direction.  Caller passes a NULL
- * priorBatch on the first call here for the scan.
+ * Called when caller's current scanBatch/prefetchBatch (passed to us as
+ * priorBatch) has no more matching items in the given scan direction.  Caller
+ * passes a NULL priorBatch on the first call here for the scan.
  *
  * Returns the next batch to be processed by caller in the given scan
  * direction, or NULL when there are no more matches in that direction.
  *
  * This is where batches are appended to the scan's ring buffer.  We don't
  * free any batches here, though; that is left up to the caller.  The caller
- * is also responsible for advancing their position.
+ * is also responsible for advancing their scanPos/prefetchPos position.
  */
 static pg_attribute_always_inline IndexScanBatch
 tableam_util_fetch_next_batch(IndexScanDesc scan, ScanDirection direction,
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e23540851..ba3333ff2 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -214,6 +214,10 @@ typedef struct IndexScanBatchData
 	 * This allows table AMs to avoid redundant amgetbatch calls with the same
 	 * priorbatch -- the index AM might need to read additional index pages to
 	 * determine there are no more matching items beyond caller's priorbatch.
+	 * In particular, during prefetching the read stream callback discovers
+	 * the end-of-scan via prefetchBatch.  tableam_util_fetch_next_batch()
+	 * checks these flags so that the scan side doesn't repeat the same
+	 * amgetbatch call when it later reaches that batch as scanBatch.
 	 */
 	bool		knownEndBackward;
 	bool		knownEndForward;
@@ -268,11 +272,20 @@ typedef struct IndexScanBatchData *IndexScanBatch;
  * to fetch table tuples in whatever order is most convenient -- provided that
  * such reordering cannot affect the order that table_index_getnext_slot later
  * returns tuples in.
+ *
+ * This data structure also provides table AMs with a way to read ahead of the
+ * current read position by _multiple_ batches/index pages.  The further out
+ * the table AM reads ahead like this, the further it can see into the future.
+ * That way the table AM is able to reorder work as aggressively as desired.
+ * For example, index scans sometimes need to read ahead by as many as a few
+ * dozen amgetbatch batches in order to maintain an optimal I/O prefetch
+ * distance (distance for reading table blocks/fetching table tuples).
  */
 typedef struct BatchRingBuffer
 {
 	/* current positions in IndexScanDescData.batchbuf[] for scan */
 	BatchRingItemPos scanPos;	/* scan's read position */
+	BatchRingItemPos prefetchPos;	/* prefetching position */
 	BatchRingItemPos markPos;	/* mark/restore position */
 
 	/* markPos's batch (not in ring buffer when markBatch != scanBatch) */
@@ -508,6 +521,37 @@ index_scan_batch_append(IndexScanDescData *scan, IndexScanBatch batch)
 	ringbuf->nextBatch++;
 }
 
+/*
+ * Compare two batch ring positions in the given scan direction.
+ *
+ * Returns negative if pos1 is behind pos2, 0 if equal, positive if pos1 is
+ * ahead of pos2.
+ */
+static inline int
+index_scan_pos_cmp(BatchRingItemPos *pos1, BatchRingItemPos *pos2,
+				   ScanDirection direction)
+{
+	int8		batchdiff;
+
+	Assert(pos1->valid && pos2->valid);
+
+	batchdiff = (int8) (pos1->batch - pos2->batch);
+	if (batchdiff != 0)
+	{
+		/* Resolve comparison using differing batch offsets */
+		return batchdiff;
+	}
+
+	/*
+	 * Resolve comparison using items[]-wise indexes from caller's positions,
+	 * since both positions point to the same ring buffer batch
+	 */
+	if (ScanDirectionIsForward(direction))
+		return pos1->item - pos2->item;
+	else
+		return pos2->item - pos1->item;
+}
+
 /*
  * Advance position to its next item in the batch.
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index f2fd5d315..419300a6b 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -52,6 +52,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
 extern PGDLLIMPORT bool enable_seqscan;
 extern PGDLLIMPORT bool enable_indexscan;
 extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexscan_prefetch;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index 7a6b49ee5..9d4fa5583 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -19,6 +19,7 @@
 #include "access/indexbatch.h"
 #include "access/relscan.h"
 #include "access/visibilitymap.h"
+#include "optimizer/cost.h"
 #include "storage/predicate.h"
 #include "utils/pgstat_internal.h"
 
@@ -55,6 +56,14 @@ static void heapam_index_batch_pos_visibility(IndexScanDesc scan,
 											  IndexScanBatch batch,
 											  HeapBatchData *hbatch,
 											  BatchRingItemPos *pos);
+static pg_noinline void heapam_index_dirchange_reset(IndexFetchHeapData *hscan,
+													 ScanDirection direction,
+													 BatchRingBuffer *batchringbuf);
+static pg_attribute_always_inline void heapam_index_consider_prefetching(IndexScanDesc scan,
+																		 IndexFetchHeapData *hscan);
+static BlockNumber heapam_index_prefetch_next_block(ReadStream *stream,
+													void *callback_private_data,
+													void *per_buffer_data);
 
 /* ------------------------------------------------------------------------
  * Index Scan Callbacks for heap AM
@@ -81,6 +90,10 @@ heapam_index_fetch_begin(Relation rel, uint32 flags)
 	/* xs_lastinblock optimization state */
 	Assert(!hscan->xs_lastinblock);
 
+	/* Read stream state (other fields initialized by callback) */
+	Assert(hscan->xs_read_stream_dir == NoMovementScanDirection);
+	Assert(hscan->xs_read_stream == NULL);
+
 	/*
 	 * Return opaque state, which we'll access through the scan's xs_heapfetch
 	 * field later on.
@@ -98,6 +111,18 @@ heapam_index_fetch_reset(IndexScanDesc scan)
 	/* Rescans should avoid an excessive number of VM lookups */
 	hscan->xs_vm_items = 1;
 
+	/* Defensively do an unconditional read stream direction reset */
+	hscan->xs_read_stream_dir = NoMovementScanDirection;
+
+	/*
+	 * Reset read stream itself, and other associated state.
+	 */
+	if (hscan->xs_read_stream)
+	{
+		hscan->xs_paused = false;
+		read_stream_reset(hscan->xs_read_stream);
+	}
+
 	/* Reset batch ring buffer state */
 	if (scan->usebatchring)
 		tableam_util_batchscan_reset(scan, false);
@@ -114,7 +139,14 @@ heapam_index_fetch_restrpos(IndexScanDesc scan)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
 
-	(void) hscan;
+	/*
+	 * Reset read stream itself, and other associated state.
+	 */
+	if (hscan->xs_read_stream)
+	{
+		hscan->xs_paused = false;
+		read_stream_reset(hscan->xs_read_stream);
+	}
 
 	/* Restore batch ring to previously saved mark */
 	tableam_util_batchscan_restore_pos(scan);
@@ -133,6 +165,9 @@ heapam_index_fetch_end(IndexScanDesc scan)
 	if (BufferIsValid(hscan->xs_vmbuffer))
 		ReleaseBuffer(hscan->xs_vmbuffer);
 
+	if (hscan->xs_read_stream)
+		read_stream_end(hscan->xs_read_stream);
+
 	/* Free all batch related resources */
 	if (scan->usebatchring)
 		tableam_util_batchscan_end(scan);
@@ -459,7 +494,14 @@ heapam_index_fetch_tuple_impl(Relation rel,
 		if (BufferIsValid(hscan->xs_cbuf))
 			ReleaseBuffer(hscan->xs_cbuf);
 
-		hscan->xs_cbuf = ReadBuffer(rel, hscan->xs_blk);
+		/*
+		 * When using a read stream, the stream will already know which block
+		 * number comes next (though an assertion will verify a match below)
+		 */
+		if (hscan->xs_read_stream)
+			hscan->xs_cbuf = read_stream_next_buffer(hscan->xs_read_stream, NULL);
+		else
+			hscan->xs_cbuf = ReadBuffer(rel, hscan->xs_blk);
 
 		/*
 		 * Prune page when it is pinned for the first time
@@ -736,6 +778,12 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan
 	Assert(scanPos->valid || index_scan_batch_count(scan) == 0);
 	Assert(all_visible == NULL || scan->xs_want_itup);
 
+	/* Handle resetting the read stream when scan direction changes */
+	if (hscan->xs_read_stream_dir == NoMovementScanDirection)
+		hscan->xs_read_stream_dir = direction;	/* first call */
+	else if (unlikely(hscan->xs_read_stream_dir != direction))
+		heapam_index_dirchange_reset(hscan, direction, batchringbuf);
+
 	/*
 	 * Check if there's an existing loaded scanBatch for us to return the next
 	 * matching item's TID/index tuple from
@@ -745,7 +793,7 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan
 	{
 		/*
 		 * scanPos is valid, so scanBatch must already be loaded in batch ring
-		 * buffer.  We rely on that here.
+		 * buffer.  We rely on that here (can't do this with prefetchBatch).
 		 */
 		pg_assume(batchringbuf->headBatch == scanPos->batch);
 
@@ -776,6 +824,17 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan
 		return NULL;
 	}
 
+	if (hadExistingScanBatch && !hscan->xs_read_stream)
+	{
+		Assert(!scan->batchringbuf.prefetchPos.valid);
+
+		/*
+		 * Not using a read stream to do index prefetching.  Decide whether to
+		 * start one now.
+		 */
+		heapam_index_consider_prefetching(scan, hscan);
+	}
+
 	/*
 	 * Advanced scanBatch.  Now position scanPos to the start of new
 	 * scanBatch.
@@ -791,6 +850,7 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan
 	{
 		IndexScanBatch headBatch = index_scan_batch(scan,
 													batchringbuf->headBatch);
+		BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos;
 
 		Assert(headBatch != scanBatch);
 		Assert(batchringbuf->headBatch != scanPos->batch);
@@ -798,12 +858,47 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan
 		/* free obsolescent head batch (unless it is scan's markBatch) */
 		tableam_util_free_batch(scan, headBatch);
 
+		/*
+		 * If we're about to release the batch that prefetchPos currently
+		 * points to, just invalidate prefetchPos.  See the comments about
+		 * prefetchPos/scanPos within heapam_index_prefetch_next_block for an
+		 * explanation.
+		 *
+		 * This handling is approximately the opposite of resuming a paused
+		 * read stream: this helps the scan deal with prefetchPos falling
+		 * behind scanPos, whereas pausing is used when scanPos has fallen
+		 * behind (very far behind) prefetchPos.
+		 */
+		if (prefetchPos->valid &&
+			prefetchPos->batch == batchringbuf->headBatch)
+			prefetchPos->valid = false;
+
 		/* Remove the batch from the ring buffer (even if it's markBatch) */
 		batchringbuf->headBatch++;
+
+		if (unlikely(hscan->xs_paused))
+		{
+			/*
+			 * heapam_index_prefetch_next_block paused the scan's read stream
+			 * due to our running out of free batch slots.  Now that we've
+			 * freed up one such slot, we can resume the read stream (since
+			 * there's now space for heapam_index_prefetch_next_block to store
+			 * one more batch).
+			 */
+			Assert(prefetchPos->batch != scanPos->batch);
+			Assert(prefetchPos->valid &&
+				   index_scan_batch_loaded(scan, prefetchPos->batch));
+			Assert(index_scan_pos_cmp(prefetchPos, scanPos, direction) > 0);
+			Assert(!index_scan_batch_full(scan));
+
+			read_stream_resume(hscan->xs_read_stream);
+			hscan->xs_paused = false;
+		}
 	}
 
 	/* In practice scanBatch will always be the ring buffer's headBatch */
 	Assert(batchringbuf->headBatch == scanPos->batch);
+	Assert(!hscan->xs_paused);
 
 	return heapam_index_return_scanpos_tid(scan, hscan, direction,
 										   scanBatch, scanPos, all_visible);
@@ -910,6 +1005,13 @@ heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
  * (important for inner index scans of anti-joins and semi-joins), and the
  * need to unguard batches promptly.
  *
+ * In no event will the scan be allowed to guard more than one batch at a
+ * time.  The primary reason for this restriction is to avoid unintended
+ * interactions with the read stream, which has its own strategy for keeping
+ * the number of pins held by the backend under control.  (Unguarding via
+ * the amunguardbatch callback often means releasing a buffer pin on an
+ * index page, which counts against the same shared pin limit.)
+ *
  * Once we've resolved visibility for all items in a batch, we can safely
  * unguard it by calling amunguardbatch.  This is safe with respect to
  * concurrent VACUUM because the batch's guard (typically a buffer pin on the
@@ -1036,3 +1138,358 @@ heapam_index_batch_pos_visibility(IndexScanDesc scan, ScanDirection direction,
 	else
 		hscan->xs_vm_items = scan->maxitemsbatch;
 }
+
+/*
+ * Handle a change in index scan direction (at the tuple granularity).
+ *
+ * Resets the read stream, since we can't rely on scanPos continuing to agree
+ * with the blocks that read stream already consumed using prefetchPos.
+ *
+ * Note: iff the scan _continues_ in this new direction, and actually steps
+ * off scanBatch to an earlier index page, tableam_util_fetch_next_batch will
+ * deal with it.  But that might never happen; the scan might yet change
+ * direction again (or just end before returning more items).
+ */
+static pg_noinline void
+heapam_index_dirchange_reset(IndexFetchHeapData *hscan,
+							 ScanDirection direction,
+							 BatchRingBuffer *batchringbuf)
+{
+	/* Reset read stream state */
+	batchringbuf->prefetchPos.valid = false;
+	hscan->xs_paused = false;
+	hscan->xs_read_stream_dir = direction;
+
+	/* Reset read stream itself */
+	if (hscan->xs_read_stream)
+		read_stream_reset(hscan->xs_read_stream);
+}
+
+/*
+ * Decide whether to start a read stream for heap block prefetching during an
+ * index scan.
+ *
+ * Called each time a new batch is obtained from the index AM, barring the
+ * first time that happens.  We delay initializing the stream until reading
+ * from the scan's second batch.  This heuristic avoids wasting cycles on
+ * starting a read stream for very selective index scans.
+ *
+ * We avoid prefetching during scans where we're unable to unguard (unpin)
+ * each batch's buffers right away (non-MVCC snapshot scans).  We are not
+ * prepared to sensibly limit the total number of buffer pins held (read
+ * stream handles all pin resource management for us, and knows nothing
+ * about pins held on index pages/within batches).
+ *
+ * We also delay creating a read stream during index-only scans that haven't
+ * done any heap fetches yet.  We don't want to waste any cycles on
+ * allocating a read stream until we have a demonstrated need to perform
+ * heap fetches.
+ */
+static pg_attribute_always_inline void
+heapam_index_consider_prefetching(IndexScanDesc scan,
+								  IndexFetchHeapData *hscan)
+{
+	Assert(!hscan->xs_read_stream);
+	Assert(!scan->batchringbuf.prefetchPos.valid);
+
+	if (scan->MVCCScan && enable_indexscan_prefetch &&
+		hscan->xs_blk != InvalidBlockNumber)	/* for index-only scans */
+		hscan->xs_read_stream =
+			read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
+									   scan->heapRelation, MAIN_FORKNUM,
+									   heapam_index_prefetch_next_block,
+									   scan, 0);
+	/* else don't start a read stream for prefetching (not yet, at least) */
+}
+
+/*
+ * Return the next block to the read stream when performing index prefetching.
+ *
+ * The initial batch is always loaded by heapam_index_getnext_scanbatch_pos.
+ * We don't get called until the first read_stream_next_buffer call, when a
+ * heap block is requested from the scan's stream for the first time.
+ *
+ * The position of the read_stream is stored in prefetchPos.  It is typical
+ * for prefetchPos to consistently stay ahead of the scanPos position that's
+ * used to track the next TID heapam_index_getnext_scanbatch_pos will return
+ * to the scan (after the first time we get called).  However, that isn't a
+ * strict precondition (though as explained below we implement a scheme
+ * essentially equivalent to making it a strict precondition).  There is a
+ * true strict postcondition, though: when we return we'll always leave
+ * scanPos <= prefetchPos.
+ */
+static BlockNumber
+heapam_index_prefetch_next_block(ReadStream *stream,
+								 void *callback_private_data,
+								 void *per_buffer_data)
+{
+	IndexScanDesc scan = (IndexScanDesc) callback_private_data;
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &batchringbuf->scanPos;
+	BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos;
+	ScanDirection xs_read_stream_dir = hscan->xs_read_stream_dir;
+	IndexScanBatch prefetchBatch;
+	bool		fromScanPos = false;
+
+	/*
+	 * scanPos must always be valid when prefetching takes place.  There has
+	 * to be at least one batch, loaded as our scanBatch.  The scan direction
+	 * must be established, too.
+	 */
+	Assert(index_scan_batch_count(scan) > 0);
+	Assert(scanPos->valid && index_scan_batch_loaded(scan, scanPos->batch));
+	Assert(scan->MVCCScan);
+	Assert(!hscan->xs_paused);
+	Assert(xs_read_stream_dir != NoMovementScanDirection);
+
+	/*
+	 * Handle initialization on the first call here, when prefetchPos isn't
+	 * yet valid (also handles the prefetchPos < scanPos edge case).
+	 *
+	 * If prefetchPos has not been initialized yet, that typically indicates
+	 * that this is the first call here for the entire scan.  We initialize
+	 * prefetchPos using the current scanPos, since the current scanBatch
+	 * item's TID should have its block number returned by the read stream
+	 * first.  When this happens, it's likely that prefetchPos will get ahead
+	 * of scanPos very soon, after the _next_ call here returns.
+	 *
+	 * There's also an edge case that we handle using exactly the same steps.
+	 * It's possible for prefetchPos to "fall behind" scanPos, at least in a
+	 * trivial sense: if many adjacent matching items contain TIDs that all
+	 * point to the same heap block, scanPos can actually overtake prefetchPos
+	 * (prefetchPos can't advance until we're actually called).  A similar
+	 * issue arises during index-only scans that require only a few heap
+	 * fetches: we'll tend to be called far less often than we'd be called
+	 * during an equivalent plain index scan due to all-visible items.  An
+	 * all-visible item will advance scanPos, but can't trigger a call to here
+	 * (just like an item that points to the same heap block that the previous
+	 * item also pointed to).
+	 *
+	 * This scheme produces exactly the same block prefetch requests as a
+	 * scheme that requires heapam_index_getnext_scanbatch_pos to actively
+	 * ensure that "prefetchPos < scanPos" can never happen.  That isn't a
+	 * strict precondition for this function because making it explicit would
+	 * impose a performance penalty on heapam_index_getnext_scanbatch_pos.
+	 *
+	 * Note: when heapam_index_getnext_scanbatch_pos frees a batch that
+	 * prefetchPos points to, it'll at least invalidate prefetchPos for us.
+	 * This removes any danger of prefetchPos.batch falling so far behind
+	 * scanPos.batch that it wraps around (and appears to be ahead of scanPos
+	 * instead of behind it).  In other words, in a certain sense we actually
+	 * _can_ trust heapam_index_getnext_scanbatch_pos to not let prefetchPos
+	 * fall behind scanPos: it can't happen at the batch granularity (only at
+	 * the item/tuple granularity, which we can always cope with here).
+	 */
+	if (!prefetchPos->valid ||
+		index_scan_pos_cmp(prefetchPos, scanPos, xs_read_stream_dir) < 0)
+	{
+		hscan->xs_prefetch_block = InvalidBlockNumber;
+		*prefetchPos = *scanPos;
+		fromScanPos = true;
+
+		/*
+		 * We must avoid keeping any batch guarded for more than an instant,
+		 * to avoid undesirable interactions with the scan's read stream. See
+		 * comment and assertion at the top of the loop below.
+		 */
+		if (scan->xs_want_itup)
+		{
+			/*
+			 * Make heapam_index_batch_pos_visibility release resources
+			 * eagerly
+			 */
+			hscan->xs_vm_items = scan->maxitemsbatch;
+
+			/* Make sure that this new prefetchBatch is unguarded */
+			prefetchBatch = index_scan_batch(scan, prefetchPos->batch);
+			if (prefetchBatch->isGuarded)
+			{
+				HeapBatchData *hbatch = heap_batch_data(scan, prefetchBatch);
+
+				/* Set visibility info not set through scanBatch */
+				heapam_index_batch_pos_visibility(scan, xs_read_stream_dir,
+												  prefetchBatch, hbatch,
+												  prefetchPos);
+			}
+		}
+	}
+
+	prefetchBatch = index_scan_batch(scan, prefetchPos->batch);
+
+	/*
+	 * If prefetchPos wasn't just initialized using scanPos, we're directly
+	 * picking up prefetching where the last call here left off.  Assert that
+	 * xs_prefetch_block matches the last item we returned as expected.
+	 *
+	 * Note: we don't actually need a xs_prefetch_block field at all; we could
+	 * just take the last block we returned from prefetchPos directly instead.
+	 * But maintaining xs_prefetch_block explicitly is slightly more robust.
+	 * It gives us a way to make sure that the last call here left prefetchPos
+	 * in a consistent state (e.g., when the read stream had to be paused).
+	 */
+#ifdef USE_ASSERT_CHECKING
+	if (!fromScanPos)
+	{
+		BatchMatchingItem *lastitem = &prefetchBatch->items[prefetchPos->item];
+		BlockNumber last_block = ItemPointerGetBlockNumber(&lastitem->tableTid);
+
+		Assert(last_block == hscan->xs_prefetch_block);
+	}
+#endif
+
+	for (;;)
+	{
+		BatchMatchingItem *item;
+		BlockNumber prefetch_block;
+
+		/*
+		 * We never call amgetbatch without immediately unguarding the batch,
+		 * either within the index AM or here (when we eagerly load all of the
+		 * batch's visibility information during an index-only scan).  The
+		 * index AM won't hold onto TID interlock buffer pins, keeping the
+		 * absolute number of pins held to a minimum.
+		 *
+		 * This is defensive.  The read stream tries to be careful about not
+		 * pinning too many buffers, and that's harder to do reliably if there
+		 * are variable numbers of pins taken without such care.
+		 */
+		Assert(!prefetchBatch->isGuarded);
+		if (fromScanPos)
+		{
+			/*
+			 * Don't increment item when prefetchPos was just initialized
+			 * using scanPos.  We'll return the scanPos item's heap block
+			 * directly on the first call here.  In other words, we'll return
+			 * the heap block from TID passed to heapam_index_fetch_tuple_impl
+			 * at the point where it called read_stream_next_buffer for the
+			 * first time during the scan. (As explained above, we also end up
+			 * here during the first call to read_stream_next_buffer following
+			 * prefetchPos falling behind scanPos/being invalidated for us.)
+			 */
+			fromScanPos = false;
+		}
+		else if (!index_scan_pos_advance(xs_read_stream_dir,
+										 prefetchBatch, prefetchPos))
+		{
+			/*
+			 * Ran out of items from prefetchBatch.  Try to advance to the
+			 * scan's next batch.
+			 */
+			if (unlikely(index_scan_batch_full(scan)))
+			{
+				/*
+				 * Can't advance prefetchBatch because all available
+				 * batchringbuf batch slots are currently in use.
+				 *
+				 * Deal with this by momentarily pausing the read stream.
+				 * heapam_index_getnext_scanbatch_pos will resume the read
+				 * stream later, though only after scanPos has consumed all
+				 * remaining items from scanBatch (at which point the current
+				 * head batch will be freed, making a slot available for reuse
+				 * here by us).
+				 *
+				 * In practice we hardly ever need to do this.  It would be
+				 * possible to avoid the need to pause the read stream by
+				 * dynamically allocating slots, but that would add complexity
+				 * for no real benefit.  It also seems like a good idea to
+				 * impose some hard limit on the number of batches that
+				 * prefetchPos can get ahead of scanPos by (especially in the
+				 * case of index-only scans, where we often won't have any
+				 * heap block to return from most of the scan's batches).
+				 */
+				hscan->xs_paused = true;
+
+				/*
+				 * Before returning, advance prefetchPos in the opposite
+				 * direction to the one used by the scan.  This undoes the
+				 * effects of the most recent advance.  We're not going to
+				 * return any block, so it seems like a good idea to leave
+				 * prefetchPos in a state consistent with that.
+				 */
+				if (ScanDirectionIsForward(xs_read_stream_dir))
+				{
+					Assert(prefetchPos->item == prefetchBatch->lastItem + 1);
+					prefetchPos->item = prefetchBatch->lastItem;
+				}
+				else
+				{
+					Assert(prefetchPos->item == prefetchBatch->firstItem - 1);
+					prefetchPos->item = prefetchBatch->firstItem;
+				}
+
+				return read_stream_pause(stream);
+			}
+
+			prefetchBatch = tableam_util_fetch_next_batch(scan,
+														  xs_read_stream_dir,
+														  prefetchBatch,
+														  prefetchPos);
+			if (!prefetchBatch)
+			{
+				/*
+				 * No more batches in this direction, so all the batches that
+				 * the scan will ever require (barring a change in scan
+				 * direction) are now loaded
+				 */
+				return InvalidBlockNumber;
+			}
+
+			/* Position prefetchPos to the start of new prefetchBatch */
+			index_scan_pos_nextbatch(xs_read_stream_dir,
+									 prefetchBatch, prefetchPos);
+
+			if (scan->xs_want_itup)
+			{
+				HeapBatchData *hbatch = heap_batch_data(scan, prefetchBatch);
+
+				/* make sure we have visibility info for the entire batch */
+				Assert(hscan->xs_vm_items == scan->maxitemsbatch);
+				heapam_index_batch_pos_visibility(scan, xs_read_stream_dir,
+												  prefetchBatch, hbatch,
+												  prefetchPos);
+			}
+		}
+
+		/*
+		 * prefetchPos now points to the next item whose TID's heap block
+		 * number might need to be prefetched
+		 */
+		Assert(index_scan_batch(scan, prefetchPos->batch) == prefetchBatch);
+		Assert(prefetchPos->item >= prefetchBatch->firstItem &&
+			   prefetchPos->item <= prefetchBatch->lastItem);
+		/* scanPos is always <= prefetchPos when we return */
+		Assert(index_scan_pos_cmp(scanPos, prefetchPos, xs_read_stream_dir) <= 0);
+
+		if (scan->xs_want_itup)
+		{
+			HeapBatchData *hbatch = heap_batch_data(scan, prefetchBatch);
+
+			Assert(hbatch->visInfo[prefetchPos->item] & HEAP_BATCH_VIS_CHECKED);
+			if (hbatch->visInfo[prefetchPos->item] & HEAP_BATCH_VIS_ALL_VISIBLE)
+			{
+				/* item is known to be all-visible -- don't prefetch */
+				continue;
+			}
+		}
+
+		item = &prefetchBatch->items[prefetchPos->item];
+		prefetch_block = ItemPointerGetBlockNumber(&item->tableTid);
+
+		if (prefetch_block == hscan->xs_prefetch_block)
+		{
+			/*
+			 * prefetch_block matches the last prefetchPos item's TID's heap
+			 * block number; we must not return the same prefetch_block twice
+			 * in succession
+			 */
+			continue;
+		}
+
+		/* We have a new heap block number to return to read stream */
+		hscan->xs_prefetch_block = prefetch_block;
+		return prefetch_block;
+	}
+
+	return InvalidBlockNumber;
+}
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
index 46876344b..813c2890e 100644
--- a/src/backend/access/index/indexbatch.c
+++ b/src/backend/access/index/indexbatch.c
@@ -69,6 +69,7 @@ batchscan_init(IndexScanDesc scan)
 	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
 
 	scan->batchringbuf.scanPos.valid = false;
+	scan->batchringbuf.prefetchPos.valid = false;
 	scan->batchringbuf.markPos.valid = false;
 
 	scan->batchringbuf.markBatch = NULL;
@@ -149,7 +150,14 @@ batchscan_mark_pos(IndexScanDesc scan)
  * current scanBatch when needed.
  *
  * We just discard all batches (other than markBatch/restored scanBatch),
- * except when markBatch is already the scan's current scanBatch.
+ * except when markBatch is already the scan's current scanBatch.  We always
+ * invalidate prefetchPos.  The read stream and related prefetching state are
+ * reset by the table AM's index_fetch_restrpos callback (which calls this
+ * function after resetting its own state).  This approach keeps things simple
+ * for table AMs: most code that deals with batches is thereby able to assume
+ * that the common case where scan direction never changes is the only case
+ * (tableam_util_scanbatch_dirchange takes a similar approach to handling a
+ * cross-batch change in scan direction).
  */
 void
 tableam_util_batchscan_restore_pos(IndexScanDesc scan)
@@ -164,6 +172,14 @@ tableam_util_batchscan_restore_pos(IndexScanDesc scan)
 	Assert(scan->xs_heapfetch);
 	Assert(markPos->valid);
 
+	/*
+	 * Restoring a mark always requires stopping prefetching.  This is similar
+	 * to the handling table AMs implement to deal with a tuple-level change
+	 * in the scan's direction.  The read stream must have already been reset
+	 * by the caller (via table_index_fetch_reset).
+	 */
+	batchringbuf->prefetchPos.valid = false;
+
 	if (scanBatch == markBatch)
 	{
 		/* markBatch is already scanBatch; needn't change batchringbuf */
@@ -226,6 +242,7 @@ tableam_util_batchscan_reset(IndexScanDesc scan, bool endscan)
 	bool		markBatchFreed = false;
 
 	batchringbuf->scanPos.valid = false;
+	batchringbuf->prefetchPos.valid = false;
 	batchringbuf->markPos.valid = false;
 
 	/* Ensure batch_free won't skip the old markBatch in the loop below */
@@ -286,6 +303,13 @@ tableam_util_batchscan_end(IndexScanDesc scan)
  * to determine which batch comes next in the new scan direction.  This
  * approach isn't particularly efficient, but it works well enough for what
  * ought to be a relatively rare occurrence.
+ *
+ * Caller must have reset the scan's read stream before calling here.  That
+ * needs to happen as soon as the scan requests a tuple in the direction
+ * opposite to the current scan direction.  We only deal with the case where
+ * the scan backs up by enough items to cross a batch boundary (when the scan
+ * resumes scanning in its original direction, or ends, before crossing a
+ * boundary, there isn't any need to call here).
  */
 void
 tableam_util_scanbatch_dirchange(IndexScanDesc scan)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1c575e56f..6fcb815f7 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -146,6 +146,7 @@ int			max_parallel_workers_per_gather = 2;
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
+bool		enable_indexscan_prefetch = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 0a862693f..2b5620f9c 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -931,6 +931,13 @@
   boot_val => 'true',
 },
 
+{ name => 'enable_indexscan_prefetch', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
+  short_desc => 'Enables prefetching for index scans and index-only scans.',
+  flags => 'GUC_EXPLAIN',
+  variable => 'enable_indexscan_prefetch',
+  boot_val => 'true',
+},
+
 { name => 'enable_material', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
   short_desc => 'Enables the planner\'s use of materialization.',
   flags => 'GUC_EXPLAIN',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf1559738..2135ea524 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -420,6 +420,7 @@
 #enable_incremental_sort = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
+#enable_indexscan_prefetch = on
 #enable_material = on
 #enable_memoize = on
 #enable_mergejoin = on
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 229f41353..336c621d5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5712,6 +5712,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-indexscan-prefetch" xreflabel="enable_indexscan_prefetch">
+      <term><varname>enable_indexscan_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_indexscan_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables prefetching for index scan and index-only scan
+        plan types.  Prefetching can improve performance by reading table
+        pages ahead of when they are needed during index scans.  The default
+        is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-material" xreflabel="enable_material">
       <term><varname>enable_material</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 2b48728c5..37b75361e 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -807,9 +807,12 @@ amgetbatch (IndexScanDesc scan,
   <para>
    The <function>amgetbatch</function> interface is an alternative to
    <function>amgettuple</function> that returns matching index entries in batches
-   rather than one at a time.  By returning all matching index entries from a
-   single index page together, the table AM gains visibility into which table
-   blocks will be needed in the near future.
+   rather than one at a time.  This enables the table access method to
+   optimize table block access patterns and perform I/O prefetching.
+   By returning all matching index entries from a single index page together,
+   the table AM can read ahead through the index and identify which table
+   blocks will be needed, allowing prefetching of table pages during
+   ordered index scans.
   </para>
 
   <para>
@@ -1005,7 +1008,8 @@ amunguardbatch (IndexScanDesc scan,
    free the pins at an opportune point (for example when <function>amrescan</function>
    and/or <function>amendscan</function> are called).  It must also keep the
    number of retained pins fixed and small, to avoid exhausting the backend's
-   buffer pin limit.
+   buffer pin limit (which is shared with the table AM's read stream
+   for index scan prefetching).
   </para>
 
   <para>
@@ -1380,6 +1384,60 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    or vice versa, if its internal implementation is unsuited to one API or the other.
   </para>
 
+  <sect2 id="index-scanning-batches">
+   <title>Table AM Considerations for Batch Scanning</title>
+
+   <para>
+    This section is primarily relevant to
+    <link linkend="tableam">table access method</link> authors.
+    When an index scan uses the <function>amgetbatch</function> interface,
+    the table AM is responsible for managing position state within the
+    <structname>IndexScanDesc</structname>'s
+    <structfield>batchringbuf</structfield> and for controlling when
+    buffer pins on index pages are released.
+   </para>
+
+   <para>
+    The <structfield>scanPos</structfield> field within
+    <structfield>batchringbuf</structfield> tracks which batch and item within
+    that batch will be returned next to the executor.  The table AM must advance
+    <structfield>scanPos</structfield> as tuples are returned by
+    <function>table_index_getnext_slot</function>.  The core code may also
+    modify this field during operations such as mark/restore.
+   </para>
+
+   <para>
+    The <structfield>prefetchPos</structfield> field tracks the position used
+    for I/O prefetching.  It is generally initialized from
+    <structfield>scanPos</structfield> and then advanced within a read stream
+    callback, allowing the table AM to prefetch table blocks pointed to by
+    items that are well ahead of the current scan position.  Initially
+    <structfield>prefetchPos</structfield> starts at
+    <structfield>scanPos</structfield>, but as the read stream ramps up it can
+    get far ahead &mdash; spanning multiple index pages if necessary to
+    maintain an optimal I/O prefetch distance for table block reads.  A major
+    goal of the <function>amgetbatch</function> interface is to allow the
+    table AM to prefetch without being limited to items from the current
+    <structfield>scanPos</structfield> batch's index leaf page.
+   </para>
+
+   <para>
+    Both <structfield>scanPos</structfield> and
+    <structfield>prefetchPos</structfield> are controlled by the table AM and
+    core code; index access methods should not access or manipulate these
+    fields.  See the <filename>src/backend/access/heap/</filename>
+    implementation for a reference example.
+   </para>
+
+   <para>
+    For details on buffer pin management during batch scans, including the
+    <structfield>batchImmediateUnguard</structfield> policy and the
+    <function>amunguardbatch</function> callback, see
+    <xref linkend="index-locking"/>.
+   </para>
+
+  </sect2>
+
  </sect1>
 
  <sect1 id="index-locking">
diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml
index 9ccf5b739..8e70a6196 100644
--- a/doc/src/sgml/tableam.sgml
+++ b/doc/src/sgml/tableam.sgml
@@ -129,6 +129,14 @@ my_tableam_handler(PG_FUNCTION_ARGS)
   optional), the block number needs to provide locality.
  </para>
 
+ <para>
+  Table access methods can support ordered index scans using the
+  <function>amgetbatch</function> interface. See also
+  <xref linkend="index-scanning-batches"/> for details on interfacing with
+  <function>amgetbatch</function> index access methods, and managing the
+  scan's position.
+ </para>
+
  <para>
   For crash safety, an AM can use postgres' <link
   linkend="wal"><acronym>WAL</acronym></link>, or a custom implementation.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 132b56a58..32bc3dd3e 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -166,6 +166,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_incremental_sort        | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
+ enable_indexscan_prefetch      | on
  enable_material                | on
  enable_memoize                 | on
  enable_mergejoin               | on
@@ -180,7 +181,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(25 rows)
+(26 rows)
 
 -- There are always wait event descriptions for various types.  InjectionPoint
 -- may be present or absent, depending on history since last postmaster start.
-- 
2.53.0
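
To make the control flow of heapam_index_prefetch_next_block easier to
follow, here is a minimal standalone sketch of its per-item logic.  It is not
code from the patch: ToyItem, ToyScan, and toy_next_block are hypothetical
names, a single flat array stands in for the ring buffer of batches, and only
forward scans are modeled.  It shows three things described in the comments
above: (re)initializing the prefetch position from the scan position when it
is invalid or has fallen behind, skipping all-visible items, and never
returning the same block twice in succession.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_INVALID_BLOCK 0xFFFFFFFFu

/* Hypothetical stand-in for one matching index item */
typedef struct ToyItem
{
	unsigned	block;			/* heap block that the item's TID points to */
	bool		all_visible;	/* index-only scan needs no heap fetch */
} ToyItem;

/* Hypothetical stand-in for the scan's prefetching state */
typedef struct ToyScan
{
	const ToyItem *items;
	int			nitems;
	int			scan_pos;		/* next item to return to the executor */
	int			prefetch_pos;	/* last item considered for prefetching */
	bool		prefetch_valid;
	unsigned	last_block;		/* last block handed to the "stream" */
} ToyScan;

/* Return the next block to prefetch, or TOY_INVALID_BLOCK when done */
static unsigned
toy_next_block(ToyScan *scan)
{
	bool		from_scan_pos = false;

	assert(scan->nitems > 0);
	assert(scan->scan_pos >= 0 && scan->scan_pos < scan->nitems);

	/* (Re)initialize from scan_pos when invalid or when fallen behind */
	if (!scan->prefetch_valid || scan->prefetch_pos < scan->scan_pos)
	{
		scan->prefetch_pos = scan->scan_pos;
		scan->prefetch_valid = true;
		scan->last_block = TOY_INVALID_BLOCK;
		from_scan_pos = true;
	}

	for (;;)
	{
		const ToyItem *item;

		if (from_scan_pos)
			from_scan_pos = false;	/* consider the scan_pos item itself */
		else if (++scan->prefetch_pos >= scan->nitems)
		{
			/* Out of items: undo the advance so later calls stay in bounds */
			scan->prefetch_pos = scan->nitems - 1;
			return TOY_INVALID_BLOCK;
		}

		item = &scan->items[scan->prefetch_pos];
		if (item->all_visible)
			continue;			/* known all-visible, nothing to prefetch */
		if (item->block == scan->last_block)
			continue;			/* same block as the one returned last time */

		scan->last_block = item->block;
		return item->block;
	}
}
```

With items pointing to blocks 7, 7, 7 (all-visible), 9, 9, 12 (all-visible),
3, successive calls return 7, 9, 3, and then the invalid block.  If the scan
position overtakes the prefetch position between calls, the next call simply
restarts from the scan position, which is the "fall behind" edge case that
the real function's comments describe.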



  [application/octet-stream] v20-0005-heapam-Keep-buffer-pins-across-index-rescans.patch (3.9K, 13-v20-0005-heapam-Keep-buffer-pins-across-index-rescans.patch)
From ac49bc288043a28e17d1d0553f2a3f4388169fad Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Thu, 26 Mar 2026 18:45:27 -0400
Subject: [PATCH v20 05/17] heapam: Keep buffer pins across index rescans.

Avoid dropping the heap page pin (xs_cbuf) and visibility map pin
(xs_vmbuffer) during heapam_index_fetch_reset.  Retaining these pins
saves cycles during tight nested loop joins and merge joins that
frequently restore a saved mark, since the next tuple fetched after a
rescan often falls on the same heap page.  It can also avoid repeated
pinning and unpinning of the same buffer when rescans happen to revisit
the same page.

Note that not dropping xs_vmbuffer on a rescan isn't really a new
behavior (it worked this way until recently).  Recent commit XXX, which
added a new slot-based interface, changed that behavior when it moved VM
pin management out of the core executor.  This commit restores the old
behavior (and has heapam treat heap page pins in the same way, which
_is_ a new behavior).

Preparation for an upcoming patch that will add the amgetbatch
interface to enable optimizations such as I/O prefetching.

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
---
 src/backend/access/heap/heapam_indexscan.c | 26 ++++++++++++----------
 src/backend/access/index/indexam.c         |  6 ++---
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index 459b69eee..b269b802e 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -65,18 +65,14 @@ heapam_index_fetch_reset(IndexScanDesc scan)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
 
-	if (BufferIsValid(hscan->xs_cbuf))
-	{
-		ReleaseBuffer(hscan->xs_cbuf);
-		hscan->xs_cbuf = InvalidBuffer;
-		hscan->xs_blk = InvalidBlockNumber;
-	}
+	/* Resets are a no-op */
+	(void) hscan;
 
-	if (BufferIsValid(hscan->xs_vmbuffer))
-	{
-		ReleaseBuffer(hscan->xs_vmbuffer);
-		hscan->xs_vmbuffer = InvalidBuffer;
-	}
+	/*
+	 * Deliberately avoid dropping pins now held in xs_cbuf and xs_vmbuffer.
+	 * This saves cycles during certain tight nested loop joins (it can avoid
+	 * repeated pinning and unpinning of the same buffer across rescans).
+	 */
 }
 
 void
@@ -84,7 +80,13 @@ heapam_index_fetch_end(IndexScanDesc scan)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
 
-	heapam_index_fetch_reset(scan);
+	/* drop pin if there's a pinned heap page */
+	if (BufferIsValid(hscan->xs_cbuf))
+		ReleaseBuffer(hscan->xs_cbuf);
+
+	/* drop pin if there's a pinned visibility map page */
+	if (BufferIsValid(hscan->xs_vmbuffer))
+		ReleaseBuffer(hscan->xs_vmbuffer);
 
 	pfree(hscan);
 }
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 5d5e6b6a9..f08bc96bd 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -390,7 +390,7 @@ index_rescan(IndexScanDesc scan,
 	Assert(nkeys == scan->numberOfKeys);
 	Assert(norderbys == scan->numberOfOrderBys);
 
-	/* Release resources (like buffer pins) from table accesses */
+	/* reset table AM state for rescan */
 	if (scan->xs_heapfetch)
 		table_index_fetch_reset(scan);
 
@@ -467,7 +467,7 @@ index_restrpos(IndexScanDesc scan)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amrestrpos);
 
-	/* release resources (like buffer pins) from table accesses */
+	/* reset table AM state for restoring the marked position */
 	if (scan->xs_heapfetch)
 		table_index_fetch_reset(scan);
 
@@ -667,7 +667,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	/* If we're out of index entries, we're done */
 	if (!found)
 	{
-		/* release resources (like buffer pins) from table accesses */
+		/* reset table AM state */
 		if (scan->xs_heapfetch)
 			table_index_fetch_reset(scan);
 
-- 
2.53.0
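
As a rough illustration of why retaining pins across resets can pay off, the
following standalone sketch counts pin acquisitions in a toy model.  It is
not code from the patch: ToyFetch and the toy_* functions are hypothetical,
and a simple counter stands in for real buffer pins.  The keep_pins flag
switches between the old behavior (reset drops the pin) and the behavior this
patch adopts (reset is a no-op, and only the end-of-scan path drops the pin).

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_INVALID_BUFFER (-1)

/* Hypothetical stand-in for the table AM's per-scan fetch state */
typedef struct ToyFetch
{
	int			cbuf;			/* pinned "heap page", or TOY_INVALID_BUFFER */
	long		pins_taken;		/* total number of pin acquisitions */
	long		pins_held;		/* pins currently held */
} ToyFetch;

/* Drop the current pin, if any */
static void
toy_release(ToyFetch *f)
{
	if (f->cbuf != TOY_INVALID_BUFFER)
	{
		f->pins_held--;
		f->cbuf = TOY_INVALID_BUFFER;
	}
}

/* Fetch from a page, reusing the retained pin when the page matches */
static void
toy_fetch_page(ToyFetch *f, int page)
{
	if (f->cbuf == page)
		return;
	toy_release(f);
	f->cbuf = page;
	f->pins_taken++;
	f->pins_held++;
}

/* Rescan/restore: a no-op when pins are kept */
static void
toy_fetch_reset(ToyFetch *f, bool keep_pins)
{
	if (!keep_pins)
		toy_release(f);
}

/* End of scan: always drops whatever is still pinned */
static void
toy_fetch_end(ToyFetch *f)
{
	toy_release(f);
}

/* Count pin acquisitions when every rescan revisits the same page */
static long
toy_count_pins(int nrescans, bool keep_pins)
{
	ToyFetch	f = {TOY_INVALID_BUFFER, 0, 0};

	for (int i = 0; i < nrescans; i++)
	{
		toy_fetch_page(&f, 42);
		toy_fetch_reset(&f, keep_pins);
	}
	toy_fetch_end(&f);
	assert(f.pins_held == 0);	/* nothing leaks in either mode */
	return f.pins_taken;
}
```

In this model, 1000 rescans that all land on the same page take 1000 pins
when reset drops the pin, but only one pin when reset keeps it.  Real
workloads will fall somewhere in between, depending on how often consecutive
rescans revisit the same heap page.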



  [application/octet-stream] v20-0006-Add-interfaces-that-enable-index-prefetching.patch (239.7K, 14-v20-0006-Add-interfaces-that-enable-index-prefetching.patch)
From 928aa9664bc4deaca25fa840f8c67f0b3816feb3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Wed, 25 Mar 2026 16:48:43 -0400
Subject: [PATCH v20 06/17] Add interfaces that enable index prefetching.

Add a new amgetbatch index AM interface that allows index access methods
to implement plain index scans and index-only scans that return index
entries in batches comprising all matching items from an index page,
rather than one match at a time.  Also switch nbtree over from
amgettuple to the new amgetbatch interface.

This expands the slot-based table AM interface added by commit FIXME
from two callbacks (for amgettuple plain and index-only scans) to four,
adding amgetbatch variants for both scan types.  The amgetbatch
interface is tightly coupled with this approach to index scans: the
table AM can apply knowledge of which TIDs will be returned to the scan
in the near future to perform I/O prefetching.  Prefetching will be
added by an upcoming commit.

With amgetbatch, a scan-level policy determines whether each batch's
index page buffer pin is dropped eagerly by the index AM (for plain
scans with an MVCC snapshot, where the snapshot itself prevents TID
recycling problems) or retained as an interlock against concurrent TID
recycling by VACUUM.  The interlock is retained for non-MVCC scans and
for index-only scans, and is dropped by the table AM via the new
amunguardbatch callback when it is safe to do so. (Actually, index AMs
are usually able to drop the pin at the same time that they release the
lock.  In practice, the amunguardbatch callback is only really needed
during index-only scans, where dropping the pin interlock might need to
be delayed ever so slightly, as explained below.)

This extends the dropPin mechanism added to nbtree by commit 2ed5b87f,
and generalizes it to work with all index AMs that support the new
amgetbatch interface (LP_DEAD marking of index entries must be performed
by implementing the new amkillitemsbatch callback, which has a
documented contract describing how index AMs must reason about
concurrent TID recycling).  Scans can always safely drop index page pins
eagerly, provided the scan uses an MVCC snapshot (unlike the nbtree
dropPin optimization, which had no way of doing this safely during
index-only scans due to how amgettuple works, and only gained support
for scans of unlogged relations in recent commit 8a879119).

The old ammarkpos and amrestrpos index AM callbacks are removed.  With
amgetbatch, mark/restore of scan positions is managed by the table AM,
with cooperation from indexbatch.c utility functions, rather than being
delegated to the index AM.  All amgetbatch-capable index AMs inherently
support mark/restore without needing to implement it themselves.  Table
AMs are required to support the new table_index_fetch_restrpos interface.

An upcoming commit that will add index prefetching will use a read
stream to read heap pages during index scans.  Read stream is careful to
limit how many things it pins, lest we run into problems due to having
too many buffers pinned.  Simply never holding on to index page buffer
pins greatly simplifies resource management for index prefetching;
there's no risk of unintended interactions between the read stream and
index AM.  The only downside is that we cannot support prefetching
during scans that use a non-MVCC snapshot, which seems quite acceptable.

In practice, heapam doesn't drop each batch's index page buffer pin at
the earliest opportunity during index-only scans.  This was deemed
necessary to avoid regressing index-only scans with a LIMIT, in
particular with nestloop anti-joins and nestloop semi-joins; eagerly
loading all the visibility information up front regressed such queries.
The new amgetbatch interface gives table AMs the authority to decide
when to drop index page pins/unguard batches, so this can be considered
a heapam implementation detail (index AMs don't need to know about it).
This scheme still allows index prefetching to consistently hold no more
than one batch index page pin at a time, even when an index-only scan
(that must perform some heap fetches) holds open several index batches
at once in order to maintain an adequate prefetch distance.

Index access methods that support plain index scans must now implement
either the amgetbatch interface or the amgettuple interface (not both).
An upcoming patch will add support for amgetbatch to the hash index AM.
But the amgettuple interface will still be used by the GiST and SP-GiST
index AMs for now.  Both share a set of problems that make it unclear
how to go about adding support.

Both AMs reconstruct index data as HeapTuples via heap_form_tuple during
index-only scans, performing retail palloc allocations that are
incompatible with the flat, fixed-size, recyclable per-batch memory
model that amgetbatch's currTuples workspace requires.  Moreover, both
AMs have known bugs involving buffer pin management during index-only
scans: they release index leaf page pins immediately, rather than
holding them as an interlock against concurrent TID recycling by VACUUM,
creating a race condition in which VACUUM can remove a heap tuple and
then mark its page all-visible while the index-only scan still holds a
reference to the now-recycled TID [1].  These index AMs cannot adopt
amgetbatch without first fixing the pin-handling deficiency that they
already have under amgettuple (it's not clear how to fix the problem
within the confines of the current amgettuple design, let alone in a way
that's compatible with amgetbatch).

[1] https://postgr.es/m/CAH2-Wz%3DjjiNL9FCh8C1L-GUH15f4WFTWub2x%2B_NucngcDDcHKw%40mail.gmail.com

Author: Tomas Vondra <[email protected]>
Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/[email protected]
Discussion: https://postgr.es/m/efac3238-6f34-41ea-a393-26cc0441b506%40vondra.me
---
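Reviewer-style note (not part of the commit message): the scan-level pin
policy described above can be summarized with a small sketch.  All names
here (SketchScanType, sketch_drop_pin_eagerly) are hypothetical and do not
appear in the patch; the sketch only restates the rule from the commit
message, that the pin is dropped eagerly for plain scans with an MVCC
snapshot and is otherwise retained as an interlock until the table AM
unguards the batch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scan classification; not an identifier from the patch */
typedef enum SketchScanType
{
	SKETCH_PLAIN_SCAN,
	SKETCH_INDEX_ONLY_SCAN
} SketchScanType;

/*
 * Returns true when the index page pin may be dropped eagerly by the
 * index AM, false when it must be retained as a TID recycling interlock
 * (to be dropped later, via something like amunguardbatch).
 */
static bool
sketch_drop_pin_eagerly(SketchScanType scantype, bool mvcc_snapshot)
{
	/* Only plain scans with an MVCC snapshot qualify */
	return scantype == SKETCH_PLAIN_SCAN && mvcc_snapshot;
}

/* Convenience wrapper: does the batch start out guarded? */
static bool
sketch_batch_starts_guarded(SketchScanType scantype, bool mvcc_snapshot)
{
	bool		guarded = !sketch_drop_pin_eagerly(scantype, mvcc_snapshot);

	/* Non-MVCC scans must always keep the interlock */
	assert(mvcc_snapshot || guarded);
	return guarded;
}
```

The point of keeping this a pure function of scan-level properties is that
the decision is made once per scan, not once per batch.
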
 src/include/access/amapi.h                    |  27 +-
 src/include/access/genam.h                    |   1 +
 src/include/access/heapam.h                   |  44 +-
 src/include/access/indexbatch.h               | 175 ++++
 src/include/access/nbtree.h                   | 184 ++---
 src/include/access/relscan.h                  | 346 +++++++-
 src/include/access/tableam.h                  |  71 +-
 src/include/nodes/pathnodes.h                 |   6 +-
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/ginget.c               |   6 +-
 src/backend/access/gin/ginutil.c              |   6 +-
 src/backend/access/gist/gist.c                |   6 +-
 src/backend/access/hash/hash.c                |   6 +-
 src/backend/access/heap/heapam_handler.c      |   4 +
 src/backend/access/heap/heapam_indexscan.c    | 472 ++++++++++-
 src/backend/access/index/Makefile             |   3 +-
 src/backend/access/index/amapi.c              |   5 +
 src/backend/access/index/genam.c              |   5 +
 src/backend/access/index/indexam.c            |  54 +-
 src/backend/access/index/indexbatch.c         | 771 ++++++++++++++++++
 src/backend/access/index/meson.build          |   1 +
 src/backend/access/nbtree/README              |  74 +-
 src/backend/access/nbtree/nbtpage.c           |  13 +-
 src/backend/access/nbtree/nbtreadpage.c       | 207 +++--
 src/backend/access/nbtree/nbtree.c            | 469 ++++++-----
 src/backend/access/nbtree/nbtsearch.c         | 567 ++++++-------
 src/backend/access/nbtree/nbtutils.c          | 245 ------
 src/backend/access/nbtree/nbtxlog.c           |   6 +-
 src/backend/access/spgist/spgutils.c          |   6 +-
 src/backend/access/table/tableamapi.c         |   4 +
 src/backend/commands/indexcmds.c              |   2 +-
 src/backend/executor/execAmi.c                |   2 +-
 src/backend/executor/nodeMergejoin.c          |   4 +-
 src/backend/optimizer/path/indxpath.c         |   6 +-
 src/backend/optimizer/util/plancat.c          |   8 +-
 src/backend/replication/logical/relation.c    |   9 +-
 src/backend/utils/adt/amutils.c               |   8 +-
 contrib/bloom/blutils.c                       |   6 +-
 doc/src/sgml/indexam.sgml                     | 524 ++++++++++--
 doc/src/sgml/ref/create_table.sgml            |  13 +-
 .../modules/dummy_index_am/dummy_index_am.c   |   6 +-
 src/tools/pgindent/typedefs.list              |  12 +-
 42 files changed, 3092 insertions(+), 1298 deletions(-)
 create mode 100644 src/include/access/indexbatch.h
 create mode 100644 src/backend/access/index/indexbatch.c
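
Reviewer-style note: the per-batch memory layout documented in relscan.h
(table AM opaque area, then index AM opaque area, then the batch struct
itself) may be easier to follow with a standalone sketch.  This is a
minimal illustration under stated assumptions, not the patch's code: every
Sketch* name is hypothetical, malloc/free stand in for palloc/pfree,
SKETCH_MAXALIGN assumes 8-byte maximum alignment, and the items array,
per-item trailing data, and currTuples workspace are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed 8-byte maximum alignment (stand-in for MAXALIGN) */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Hypothetical stand-ins for the two opaque areas and the batch header */
typedef struct SketchTableOpaque { uint8_t *visInfo; } SketchTableOpaque;
typedef struct SketchIndexOpaque { int buf; uint32_t currPage; } SketchIndexOpaque;
typedef struct SketchBatch { int firstItem; int lastItem; } SketchBatch;

#define SKETCH_TABLE_SZ SKETCH_MAXALIGN(sizeof(SketchTableOpaque))
#define SKETCH_INDEX_SZ SKETCH_MAXALIGN(sizeof(SketchIndexOpaque))

/* Allocate [table opaque][index opaque][batch]; return the batch pointer */
static SketchBatch *
sketch_batch_alloc(void)
{
	char	   *base = calloc(1, SKETCH_TABLE_SZ + SKETCH_INDEX_SZ +
							  sizeof(SketchBatch));

	if (base == NULL)
		return NULL;
	return (SketchBatch *) (base + SKETCH_TABLE_SZ + SKETCH_INDEX_SZ);
}

/* True start of the allocation: both opaque sizes below the batch pointer */
static char *
sketch_alloc_base(SketchBatch *batch)
{
	return (char *) batch - SKETCH_INDEX_SZ - SKETCH_TABLE_SZ;
}

/* Table AM area sits at the very start of the allocation */
static SketchTableOpaque *
sketch_table_opaque(SketchBatch *batch)
{
	return (SketchTableOpaque *) sketch_alloc_base(batch);
}

/* Index AM area sits immediately below the batch pointer */
static SketchIndexOpaque *
sketch_index_opaque(SketchBatch *batch)
{
	return (SketchIndexOpaque *) ((char *) batch - SKETCH_INDEX_SZ);
}

/* Freeing must use the allocation base, never the batch pointer itself */
static void
sketch_batch_free(SketchBatch *batch)
{
	free(sketch_alloc_base(batch));
}
```

Since the index AM's offset depends only on sizeof() of its own struct, its
accessor compiles down to a constant subtraction, which matches the
efficiency argument made in the relscan.h comment.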

diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index ecfbd017d..9bd3141fc 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -198,6 +198,19 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 									 ScanDirection direction);
 
+/* next batch of valid tuples */
+typedef IndexScanBatch (*amgetbatch_function) (IndexScanDesc scan,
+											   IndexScanBatch priorbatch,
+											   ScanDirection direction);
+
+/* mark dead items in index page */
+typedef void (*amkillitemsbatch_function) (IndexScanDesc scan,
+										   IndexScanBatch batch);
+
+/* drop interlock held to prevent concurrent TID recycling by VACUUM */
+typedef void (*amunguardbatch_function) (IndexScanDesc scan,
+										 IndexScanBatch batch);
+
 /* fetch all valid tuples */
 typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 									   TIDBitmap *tbm);
@@ -205,11 +218,9 @@ typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 /* end index scan */
 typedef void (*amendscan_function) (IndexScanDesc scan);
 
-/* mark current scan position */
-typedef void (*ammarkpos_function) (IndexScanDesc scan);
-
-/* restore marked scan position */
-typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* invalidate index AM state that independently tracks scan's position */
+typedef void (*amposreset_function) (IndexScanDesc scan,
+									 IndexScanBatch batch);
 
 /*
  * Callback function signatures - for parallel index scans.
@@ -309,10 +320,12 @@ typedef struct IndexAmRoutine
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
 	amgettuple_function amgettuple; /* can be NULL */
+	amgetbatch_function amgetbatch; /* can be NULL */
+	amkillitemsbatch_function amkillitemsbatch; /* can be NULL */
+	amunguardbatch_function amunguardbatch; /* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
-	ammarkpos_function ammarkpos;	/* can be NULL */
-	amrestrpos_function amrestrpos; /* can be NULL */
+	amposreset_function amposreset; /* can be NULL */
 
 	/* interface functions to support parallel index scans */
 	amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index db62e0ca1..ae587f8de 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -96,6 +96,7 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 
 /* struct definitions appear in relscan.h */
 typedef struct IndexScanDescData *IndexScanDesc;
+typedef struct IndexScanBatchData *IndexScanBatch;
 typedef struct SysScanDescData *SysScanDesc;
 
 typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a78fc0df2..1c5570ac0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -128,10 +128,40 @@ typedef struct IndexFetchHeapData
 	Buffer		xs_cbuf;
 	BlockNumber xs_blk;
 
-	/* Current heap block's corresponding page in the visibility map */
-	Buffer		xs_vmbuffer;
+	/* For visibility map checks (index-only scans and on-access pruning) */
+	Buffer		xs_vmbuffer;	/* visibility map buffer */
+	int			xs_vm_items;	/* # items to resolve visibility info for */
+
 } IndexFetchHeapData;
 
+/*
+ * Per-batch data private to the heap table AM.
+ *
+ * Stored at a negative offset from the IndexScanBatch pointer, in the
+ * fixed-size table AM opaque area of each batch allocation.
+ */
+typedef struct HeapBatchData
+{
+	uint8	   *visInfo;		/* per-item visibility flags, or NULL */
+} HeapBatchData;
+
+/*
+ * Per-item visibility flags stored in HeapBatchData.visInfo array
+ */
+#define HEAP_BATCH_VIS_CHECKED		0x01	/* checked item in VM? */
+#define HEAP_BATCH_VIS_ALL_VISIBLE	0x02	/* block is known all-visible? */
+
+/*
+ * Access the heap-private fixed-size data from the beginning of an allocated
+ * IndexScanBatch, using caller's IndexScanBatch pointer
+ */
+static inline HeapBatchData *
+heap_batch_data(IndexScanDesc scan, IndexScanBatch batch)
+{
+	/* heapam's fixed-size space is at the start of the palloc'd area */
+	return (HeapBatchData *) batch_alloc_base(scan, batch);
+}
+
 /* Result codes for HeapTupleSatisfiesVacuum */
 typedef enum
 {
@@ -432,10 +462,20 @@ extern TransactionId heap_index_delete_tuples(Relation rel,
 /* in heap/heapam_indexscan.c */
 extern IndexFetchTableData *heapam_index_fetch_begin(Relation rel, uint32 flags);
 extern void heapam_index_fetch_reset(IndexScanDesc scan);
+extern void heapam_index_fetch_restrpos(IndexScanDesc scan);
 extern void heapam_index_fetch_end(IndexScanDesc scan);
+extern void heapam_index_fetch_batch_init(IndexScanDesc scan,
+										  IndexScanBatch batch,
+										  bool new_alloc);
 extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
 								   Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
 								   bool *all_dead, bool first_call);
+extern bool heapam_index_plain_amgetbatch_next(IndexScanDesc scan,
+											   ScanDirection direction,
+											   TupleTableSlot *slot);
+extern bool heapam_index_only_amgetbatch_next(IndexScanDesc scan,
+											  ScanDirection direction,
+											  TupleTableSlot *slot);
 extern bool heapam_index_plain_amgettuple_next(IndexScanDesc scan,
 											   ScanDirection direction,
 											   TupleTableSlot *slot);
diff --git a/src/include/access/indexbatch.h b/src/include/access/indexbatch.h
new file mode 100644
index 000000000..4265ad7de
--- /dev/null
+++ b/src/include/access/indexbatch.h
@@ -0,0 +1,175 @@
+/*-------------------------------------------------------------------------
+ *
+ * indexbatch.h
+ *	  Batch-based index scan infrastructure for the amgetbatch interface.
+ *
+ * Provides functions used by table AMs to manage an index scan's positional
+ * state (stored in IndexScanDesc.batchringbuf), and to manage underlying
+ * resources such as memory and buffer pins.  Also provides various utility
+ * functions used by index AMs for batch resource management.
+ *
+ * This module does not provide elementary operations for manipulating the
+ * scan's ring buffer (e.g., for appending a batch).  Those are implemented as
+ * inline functions defined beside IndexScanDesc and IndexScanBatch.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/indexbatch.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef INDEXBATCH_H
+#define INDEXBATCH_H
+
+#include "access/amapi.h"
+#include "access/genam.h"
+#include "access/relscan.h"
+#include "storage/buf.h"
+#include "utils/rel.h"
+
+/*
+ * utilities called by indexam.c on behalf of table AMs
+ */
+extern void batchscan_init(IndexScanDesc scan);
+extern void batchscan_mark_pos(IndexScanDesc scan);
+
+/*
+ * utilities called by table AMs
+ */
+extern void tableam_util_batchscan_restore_pos(IndexScanDesc scan);
+extern void tableam_util_batchscan_reset(IndexScanDesc scan, bool endscan);
+extern void tableam_util_batchscan_end(IndexScanDesc scan);
+extern void tableam_util_scanbatch_dirchange(IndexScanDesc scan);
+extern void tableam_util_scanpos_killitem(IndexScanDesc scan);
+extern void tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch);
+extern void tableam_util_unguard_batch(IndexScanDesc scan, IndexScanBatch batch);
+
+/*
+ * Fetch the next batch of matching items for the scan (or the first).
+ *
+ * Called when caller's current batch (passed to us as priorBatch) has no more
+ * matching items in the given scan direction.  Caller passes a NULL
+ * priorBatch on the first call here for the scan.
+ *
+ * Returns the next batch to be processed by caller in the given scan
+ * direction, or NULL when there are no more matches in that direction.
+ *
+ * This is where batches are appended to the scan's ring buffer.  We don't
+ * free any batches here, though; that is left up to the caller.  The caller
+ * is also responsible for advancing their position.
+ */
+static pg_attribute_always_inline IndexScanBatch
+tableam_util_fetch_next_batch(IndexScanDesc scan, ScanDirection direction,
+							  IndexScanBatch priorBatch, BatchRingItemPos *pos)
+{
+	IndexScanBatch batch = NULL;
+	BatchRingBuffer *batchringbuf PG_USED_FOR_ASSERTS_ONLY = &scan->batchringbuf;
+
+	if (!priorBatch)
+	{
+		/* First call for the scan */
+		Assert(pos == &batchringbuf->scanPos);
+	}
+	else if (unlikely(priorBatch->dir != direction))
+	{
+		/*
+		 * We detected a change in scan direction across batches.  Prepare
+		 * scan's batchringbuf state for us to get the next batch for the
+		 * opposite scan direction to the one used when priorBatch was
+		 * returned by amgetbatch.
+		 */
+		tableam_util_scanbatch_dirchange(scan);
+
+		/* priorBatch is now batchringbuf's only batch */
+		Assert(pos->batch == batchringbuf->headBatch);
+		Assert(index_scan_batch_count(scan) == 1);
+	}
+	else if (index_scan_batch_loaded(scan, pos->batch + 1))
+	{
+		/* Next batch already loaded for us */
+		batch = index_scan_batch(scan, pos->batch + 1);
+
+		Assert(priorBatch->dir == direction);
+		Assert(batch->dir == direction);
+		Assert(batch->firstItem <= batch->lastItem);
+		return batch;
+	}
+
+	/*
+	 * Assert preconditions for calling amgetbatch.
+	 *
+	 * priorBatch had better be the last valid batch currently in the ring
+	 * buffer (batches must stay in scan order).  If it isn't then we should
+	 * have already returned some existing loaded batch earlier.
+	 */
+	Assert(!index_scan_batch_full(scan));
+	Assert(!priorBatch ||
+		   (index_scan_batch_count(scan) > 0 && priorBatch->dir == direction &&
+			index_scan_batch(scan, batchringbuf->nextBatch - 1) == priorBatch));
+
+	/*
+	 * Before we call amgetbatch again, check if priorBatch is already known
+	 * to be the last batch with matching items in this scan direction
+	 */
+	if (priorBatch &&
+		(ScanDirectionIsForward(direction) ?
+		 priorBatch->knownEndForward :
+		 priorBatch->knownEndBackward))
+		return NULL;
+
+	batch = scan->indexRelation->rd_indam->amgetbatch(scan, priorBatch,
+													  direction);
+	if (batch)
+	{
+		/* We got the batch from the index AM */
+		Assert(batch->dir == direction);
+		Assert(batch->firstItem <= batch->lastItem);
+
+		/* Append batch to the end of ring buffer/write it to buffer index */
+		index_scan_batch_append(scan, batch);
+	}
+	else
+	{
+		/* amgetbatch returned NULL */
+		if (priorBatch)
+		{
+			/*
+			 * There are no further matches to be found in the current scan
+			 * direction, following priorBatch.  Remember that priorBatch is
+			 * the last batch with matching items.
+			 */
+			if (ScanDirectionIsForward(direction))
+				priorBatch->knownEndForward = true;
+			else
+				priorBatch->knownEndBackward = true;
+		}
+	}
+
+	/* xs_hitup isn't currently supported by amgetbatch scans */
+	Assert(!scan->xs_hitup);
+
+	return batch;
+}
+
+/*
+ * utilities called by index AMs
+ */
+extern void indexam_util_batch_unlock(IndexScanDesc scan, IndexScanBatch batch,
+									  Buffer buf);
+extern IndexScanBatch indexam_util_batch_alloc(IndexScanDesc scan);
+extern void indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch);
+
+/*
+ * Utility macro for accessing the index AM's per-batch opaque data.
+ *
+ * Each batch allocation places the index AM opaque area at a fixed negative
+ * offset from the IndexScanBatch pointer (see indexam_util_batch_alloc).
+ * This macro returns a typed pointer to that area.  In passing, it asserts
+ * that everybody has the same idea about where the index AM opaque area is.
+ */
+#define indexam_util_batch_get_amdata(scan, batch, type) \
+	(AssertMacro((scan)->batch_index_opaque_size == MAXALIGN(sizeof(type))), \
+	 ((type *) ((char *) (batch) - MAXALIGN(sizeof(type)))))
+
+#endif							/* INDEXBATCH_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index da7503c57..2cdcdefa2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -15,6 +15,7 @@
 #define NBTREE_H
 
 #include "access/amapi.h"
+#include "access/indexbatch.h"
 #include "access/itup.h"
 #include "access/sdir.h"
 #include "catalog/pg_am_d.h"
@@ -924,111 +925,20 @@ typedef struct BTVacuumPostingData
 
 typedef BTVacuumPostingData *BTVacuumPosting;
 
-/*
- * BTScanOpaqueData is the btree-private state needed for an indexscan.
- * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
- * details of the preprocessing), information about the current location
- * of the scan, and information about the marked location, if any.  (We use
- * BTScanPosData to represent the data needed for each of current and marked
- * locations.)	In addition we can remember some known-killed index entries
- * that must be marked before we can move off the current page.
- *
- * Index scans work a page at a time: we pin and read-lock the page, identify
- * all the matching items on the page and save them in BTScanPosData, then
- * release the read-lock while returning the items to the caller for
- * processing.  This approach minimizes lock/unlock traffic.  We must always
- * drop the lock to make it okay for caller to process the returned items.
- * Whether or not we can also release the pin during this window will vary.
- * We drop the pin (when so->dropPin) to avoid blocking progress by VACUUM
- * (see nbtree/README section about making concurrent TID recycling safe).
- * We'll always release both the lock and the pin on the current page before
- * moving on to its sibling page.
- *
- * If we are doing an index-only scan, we save the entire IndexTuple for each
- * matched item, otherwise only its heap TID and offset.  The IndexTuples go
- * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.  Posting list tuples store a "base" tuple once,
- * allowing the same key to be returned for each TID in the posting list
- * tuple.
- */
-
-typedef struct BTScanPosItem	/* what we remember about each match */
+/* Per-batch data private to the btree index AM */
+typedef struct BTBatchData
 {
-	ItemPointerData heapTid;	/* TID of referenced heap item */
-	OffsetNumber indexOffset;	/* index item's location within page */
-	LocationIndex tupleOffset;	/* IndexTuple's offset in workspace, if any */
-} BTScanPosItem;
+	Buffer		buf;			/* index page buffer pin */
+	BlockNumber currPage;		/* index page with matching items */
+	BlockNumber prevPage;		/* currPage's left sibling */
+	BlockNumber nextPage;		/* currPage's right sibling */
+	bool		moreLeft;		/* more matching pages to the left? */
+	bool		moreRight;		/* more matching pages to the right? */
+} BTBatchData;
 
-typedef struct BTScanPosData
-{
-	Buffer		buf;			/* currPage buf (invalid means unpinned) */
-
-	/* page details as of the saved position's call to _bt_readpage */
-	BlockNumber currPage;		/* page referenced by items array */
-	BlockNumber prevPage;		/* currPage's left link */
-	BlockNumber nextPage;		/* currPage's right link */
-	XLogRecPtr	lsn;			/* currPage's LSN (when so->dropPin) */
-
-	/* scan direction for the saved position's call to _bt_readpage */
-	ScanDirection dir;
-
-	/*
-	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
-	 */
-	int			nextTupleOffset;
-
-	/*
-	 * moreLeft and moreRight track whether we think there may be matching
-	 * index entries to the left and right of the current page, respectively.
-	 */
-	bool		moreLeft;
-	bool		moreRight;
-
-	/*
-	 * The items array is always ordered in index order (ie, increasing
-	 * indexoffset).  When scanning backwards it is convenient to fill the
-	 * array back-to-front, so we start at the last slot and fill downwards.
-	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
-	 * itemIndex is a cursor showing which entry was last returned to caller.
-	 */
-	int			firstItem;		/* first valid index in items[] */
-	int			lastItem;		/* last valid index in items[] */
-	int			itemIndex;		/* current index in items[] */
-
-	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
-} BTScanPosData;
-
-typedef BTScanPosData *BTScanPos;
-
-#define BTScanPosIsPinned(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BufferIsValid((scanpos).buf) \
-)
-#define BTScanPosUnpin(scanpos) \
-	do { \
-		ReleaseBuffer((scanpos).buf); \
-		(scanpos).buf = InvalidBuffer; \
-	} while (0)
-#define BTScanPosUnpinIfPinned(scanpos) \
-	do { \
-		if (BTScanPosIsPinned(scanpos)) \
-			BTScanPosUnpin(scanpos); \
-	} while (0)
-
-#define BTScanPosIsValid(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BlockNumberIsValid((scanpos).currPage) \
-)
-#define BTScanPosInvalidate(scanpos) \
-	do { \
-		(scanpos).buf = InvalidBuffer; \
-		(scanpos).currPage = InvalidBlockNumber; \
-	} while (0)
+/* Access the btree-private per-batch data from an IndexScanBatch pointer */
+#define BTBatchGetData(scan, batch) \
+	indexam_util_batch_get_amdata(scan, batch, BTBatchData)
 
 /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
 typedef struct BTArrayKeyInfo
@@ -1050,6 +960,28 @@ typedef struct BTArrayKeyInfo
 	ScanKey		high_compare;	/* array's < or <= upper bound */
 } BTArrayKeyInfo;
 
+/*
+ * BTScanOpaqueData is the btree-private state needed for an indexscan.
+ * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
+ * details of the preprocessing), and information about the current array
+ * keys.  There are assumptions about how the current array keys track the
+ * progress of the index scan through the index's key space (see _bt_readpage
+ * and _bt_advance_array_keys), but we don't actually track anything about the
+ * current scan position in this opaque struct.
+ *
+ * Index scans work a page at a time, as required by the amgetbatch contract:
+ * we pin and read-lock the page, identify all the matching items on the page
+ * and return them in a newly allocated batch.  We then release the read-lock
+ * using amgetbatch utility routines.  This approach minimizes lock/unlock
+ * traffic.  _bt_next is passed priorbatch, which contains details of which
+ * page is next in line to be read (priorbatch is provided as an argument to
+ * btgetbatch by core code).
+ *
+ * If we are doing an index-only scan, we save the entire IndexTuple for each
+ * matched item, otherwise only its heap TID and offset.  This is also per the
+ * amgetbatch contract.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each TID in the posting list.
+ */
 typedef struct BTScanOpaqueData
 {
 	/* these fields are set by _bt_preprocess_keys(): */
@@ -1066,32 +998,6 @@ typedef struct BTScanOpaqueData
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	FmgrInfo   *orderProcs;		/* ORDER procs for required equality keys */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
-
-	/* info about killed items if any (killedItems is NULL if never used) */
-	int		   *killedItems;	/* currPos.items indexes of killed items */
-	int			numKilled;		/* number of currently stored items */
-	bool		dropPin;		/* drop leaf pin before btgettuple returns? */
-
-	/*
-	 * If we are doing an index-only scan, these are the tuple storage
-	 * workspaces for the currPos and markPos respectively.  Each is of size
-	 * BLCKSZ, so it can hold as much as a full page's worth of tuples.
-	 */
-	char	   *currTuples;		/* tuple storage for currPos */
-	char	   *markTuples;		/* tuple storage for markPos */
-
-	/*
-	 * If the marked position is on the same page as current position, we
-	 * don't use markPos, but just keep the marked itemIndex in markItemIndex
-	 * (all the rest of currPos is valid for the mark position). Hence, to
-	 * determine if there is a mark, first look at markItemIndex, then at
-	 * markPos.
-	 */
-	int			markItemIndex;	/* itemIndex, or -1 if not valid */
-
-	/* keep these last in struct for efficiency */
-	BTScanPosData currPos;		/* current position data */
-	BTScanPosData markPos;		/* marked position, if any */
 } BTScanOpaqueData;
 
 typedef BTScanOpaqueData *BTScanOpaque;
@@ -1160,14 +1066,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
 extern Size btestimateparallelscan(Relation rel, int nkeys, int norderbys);
 extern void btinitparallelscan(void *target);
-extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch btgetbatch(IndexScanDesc scan,
+								 IndexScanBatch priorbatch,
+								 ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					 ScanKey orderbys, int norderbys);
+extern void btkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch);
+extern void btunguardbatch(IndexScanDesc scan, IndexScanBatch batch);
 extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
-extern void btmarkpos(IndexScanDesc scan);
-extern void btrestrpos(IndexScanDesc scan);
+extern void btposreset(IndexScanDesc scan, IndexScanBatch batch);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
@@ -1271,8 +1180,9 @@ extern void _bt_preprocess_keys(IndexScanDesc scan);
 /*
  * prototypes for functions in nbtreadpage.c
  */
-extern bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
-						 OffsetNumber offnum, bool firstpage);
+extern bool _bt_readpage(IndexScanDesc scan, IndexScanBatch newbatch,
+						 ScanDirection dir, OffsetNumber offnum,
+						 bool firstpage);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
 extern int	_bt_binsrch_array_skey(FmgrInfo *orderproc,
 								   bool cur_elem_trig, ScanDirection dir,
@@ -1287,15 +1197,15 @@ extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access, bool returnstack);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
-extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _bt_first(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _bt_next(IndexScanDesc scan, ScanDirection dir,
+							   IndexScanBatch priorbatch);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
 
 /*
  * prototypes for functions in nbtutils.c
  */
 extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
-extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
 extern void _bt_end_vacuum(Relation rel);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 0ff158d5d..e23540851 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -126,6 +126,12 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
  */
 typedef struct IndexFetchTableData
 {
+	/* Table AM per-batch opaque area size (MAXALIGN'd), set by AM */
+	uint16		batch_opaque_size;
+
+	/* Per-item trailing data size in each batch */
+	uint16		batch_per_item_size;
+
 	/*
 	 * Bitmask of ScanOptions affecting the relation. No SO_INTERNAL_FLAGS are
 	 * permitted.
@@ -133,13 +139,186 @@ typedef struct IndexFetchTableData
 	uint32		flags;
 } IndexFetchTableData;
 
+/*
+ * Location of a BatchMatchingItem within the scan's ring buffer
+ */
+typedef struct BatchRingItemPos
+{
+	/* Position references a valid IndexScanDescData.batchbuf[] entry? */
+	bool		valid;
+
+	/* IndexScanDescData.batchbuf[]-wise index to relevant IndexScanBatch */
+	uint8		batch;
+
+	/* IndexScanBatch.items[]-wise index to relevant BatchMatchingItem */
+	int			item;
+
+} BatchRingItemPos;
+
+/*
+ * Matching item returned by amgetbatch (in returned IndexScanBatch) during an
+ * index scan.  Used by table AM to locate relevant matching table tuple.
+ */
+typedef struct BatchMatchingItem
+{
+	ItemPointerData tableTid;	/* TID of referenced table item */
+	OffsetNumber indexOffset;	/* index item's location within page */
+	LocationIndex tupleOffset;	/* index tuple's currTuples offset, if any */
+} BatchMatchingItem;
+
+/*
+ * Data about one batch of items returned by (and passed to) amgetbatch during
+ * index scans.
+ *
+ * Each batch allocation has the following memory layout:
+ *
+ *   [table AM opaque area]    <- fixed-size, -(batch_table_offset) from base
+ *   [index AM opaque area]    <- at -(batch_index_opaque_size) from base
+ *   [IndexScanBatchData]      <- base pointer, returned by amgetbatch
+ *   [items[maxitemsbatch]]
+ *   [table AM trailing data]  <- per-item area (e.g., for visibility info)
+ *   [currTuples workspace]    <- index AM stores index tuples here for
+ *                                index-only scans (batch_tuples_workspace)
+ *
+ * batch_table_offset combines both AM opaque sizes into a single offset from
+ * the batch pointer to the true allocation base.  We use batch_alloc_base to
+ * pfree a batch.  We rely on the assumption that batches have a fixed layout
+ * for the duration of an index scan (since batches are cached for reuse).
+ *
+ * The table AM can overlay a small fixed-size struct at the start of the
+ * allocated space, which it accesses using a batch_alloc_base shim accessor
+ * function.  Convention for table AMs is to store a pointer to the per-item
+ * area in this fixed-size area (e.g., heapam stores a visInfo pointer here),
+ * in addition to anything else that gets tracked at the batch level.
+ *
+ * The index AM opaque area is accessed via a custom accessor that uses a
+ * fixed compile-time constant offset for efficiency (a constant that is
+ * tracked in the scan descriptor as batch_index_opaque_size).
+ */
+typedef struct IndexScanBatchData
+{
+	/* Index page's LSN, optionally used by amkillitemsbatch routines */
+	XLogRecPtr	lsn;
+
+	/* scan direction when the index page was read */
+	ScanDirection dir;
+
+	/*
+	 * knownEndBackward and knownEndForward indicate that this batch is the
+	 * last one with matching items in the relevant scan direction.  When
+	 * amgetbatch returns NULL for a given direction, the corresponding flag
+	 * is set on the priorbatch that was passed to that call.  We cannot know
+	 * this when a batch is first returned by amgetbatch; it only becomes
+	 * apparent when we try and fail to continue the scan past it.
+	 *
+	 * This allows table AMs to avoid redundant amgetbatch calls with the same
+	 * priorbatch -- the index AM might need to read additional index pages to
+	 * determine there are no more matching items beyond caller's priorbatch.
+	 */
+	bool		knownEndBackward;
+	bool		knownEndForward;
+
+	/*
+	 * Batch still holds TID recycling interlock?
+	 */
+	bool		isGuarded;
+
+	/*
+	 * Matching items state for this batch.  Output by index AM for table AM.
+	 *
+	 * The items array is always ordered in index order (ie, by increasing
+	 * indexOffset).  When scanning backwards it is convenient for index AMs
+	 * to fill the array back-to-front, starting at the last item slot and
+	 * filling downwards.  This is why we need both a first-valid-entry and a
+	 * last-valid-entry counter.
+	 *
+	 * Note: these are signed because it's sometimes convenient to use -1 to
+	 * represent an out-of-bounds space just before firstItem (when it's 0).
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+
+	/* info about dead items, if any (palloc'd separately, NULL if unused) */
+	int			numDead;		/* number of currently stored items */
+	int		   *deadItems;		/* items[]-wise indexes of dead items */
+
+	/*
+	 * If we are doing an index-only scan, this is the tuple storage workspace
+	 * for the matching tuples (tuples referenced by items[]).  The workspace
+	 * size is determined by the index AM (batch_tuples_workspace).
+	 *
+	 * currTuples points into the trailing portion of this allocation, past
+	 * items[] and any table AM trailing data.  It is NULL for plain index
+	 * scans.
+	 */
+	char	   *currTuples;		/* tuple storage for items[] */
+	BatchMatchingItem items[FLEXIBLE_ARRAY_MEMBER]; /* matching items */
+} IndexScanBatchData;
+
+typedef struct IndexScanBatchData *IndexScanBatch;
+
+/*
+ * State used by table AMs to manage an index scan that uses the amgetbatch
+ * interface.  Scans use a ring buffer of batches returned by amgetbatch.
+ *
+ * Batches are kept in the order that they were returned in by amgetbatch,
+ * which is the natural order for the index AM and the order that we require
+ * matches to be returned in.  This is also the order that
+ * table_index_getnext_slot returns matches in.  However, table AMs are free
+ * to fetch table tuples in whatever order is most convenient -- provided that
+ * such reordering cannot affect the order that table_index_getnext_slot later
+ * returns tuples in.
+ */
+typedef struct BatchRingBuffer
+{
+	/* current positions in IndexScanDescData.batchbuf[] for scan */
+	BatchRingItemPos scanPos;	/* scan's read position */
+	BatchRingItemPos markPos;	/* mark/restore position */
+
+	/* markPos's batch (not in ring buffer when markBatch != scanBatch) */
+	IndexScanBatch markBatch;
+
+	/*
+	 * headBatch is an index to the earliest still-valid ring buffer batch
+	 * slot in batchbuf[].  The actual array position for its IndexScanBatch
+	 * is headBatch & (INDEX_SCAN_MAX_BATCHES - 1), since these indexes use
+	 * unsigned wrapping arithmetic.  headBatch must be the scan's current
+	 * scanBatch (i.e. the current scanPos batch).
+	 */
+	uint8		headBatch;
+
+	/*
+	 * nextBatch is an index to the next _empty_ ring buffer batch slot in
+	 * batchbuf[].  As with headBatch, the actual batchbuf[] array position is
+	 * nextBatch & (INDEX_SCAN_MAX_BATCHES - 1).  A new batch can only be
+	 * appended to this position/slot when !index_scan_batch_full().
+	 *
+	 * Note: the scan's most recently appended batch (its tail batch) is
+	 * always located at (nextBatch - 1) & (INDEX_SCAN_MAX_BATCHES - 1).
+	 */
+	uint8		nextBatch;
+} BatchRingBuffer;
+
 struct IndexScanInstrumentation;
 
 /*
  * We use the same IndexScanDescData structure for both amgettuple-based
  * and amgetbitmap-based index scans.  Some fields are only relevant in
- * amgettuple-based scans.
+ * amgettuple-based scans.  Others are only used in amgetbatch-based scans.
+ *
+ * The ring buffer used by amgetbatch scans is stored here as a fixed array of
+ * pointers to batches.  We need a minimum of two (but use
+ * INDEX_SCAN_MAX_BATCHES), since we'll only consider releasing one batch
+ * when another is read.
  */
+#define INDEX_SCAN_CACHE_BATCHES	2
+#define INDEX_SCAN_MAX_BATCHES		64
+
+StaticAssertDecl(INDEX_SCAN_MAX_BATCHES <= PG_INT8_MAX + 1,
+				 "INDEX_SCAN_MAX_BATCHES must allow int8 ring buffer index comparisons");
+StaticAssertDecl((INDEX_SCAN_MAX_BATCHES & (INDEX_SCAN_MAX_BATCHES - 1)) == 0,
+				 "INDEX_SCAN_MAX_BATCHES must be a power of 2");
+
 typedef struct IndexScanDescData
 {
 	/* scan parameters */
@@ -150,6 +329,26 @@ typedef struct IndexScanDescData
 	int			numberOfOrderBys;	/* number of ordering operators */
 	struct ScanKeyData *keyData;	/* array of index qualifier descriptors */
 	struct ScanKeyData *orderByData;	/* array of ordering op descriptors */
+
+	/* index access method's private state */
+	void	   *opaque;			/* access-method-specific info */
+
+	/* scan's amgetbatch state (only used by amgetbatch/usebatchring scans) */
+	BatchRingBuffer batchringbuf;
+
+	/*
+	 * Array of pointers to recyclable batches, used by all amgetbatch scans
+	 * and by amgetbitmap scans of an index AM that supports amgetbatch
+	 */
+	IndexScanBatch batchcache[INDEX_SCAN_CACHE_BATCHES];
+
+	/* Array of pointers to batches, referenced within batchringbuf */
+	IndexScanBatch batchbuf[INDEX_SCAN_MAX_BATCHES];
+
+	bool		usebatchring;	/* scan uses amgetbatch/batchringbuf? */
+	bool		batchImmediateUnguard;	/* eagerly drop TID recycling
+										 * interlock? */
+
 	bool		xs_want_itup;	/* caller requests index tuples */
 	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
 
@@ -158,9 +357,8 @@ typedef struct IndexScanDescData
 	bool		ignore_killed_tuples;	/* do not return killed entries */
 	bool		xactStartedInRecovery;	/* prevents killing/seeing killed
 										 * tuples */
-
-	/* index access method's private state */
-	void	   *opaque;			/* access-method-specific info */
+	/* xs_snapshot is an MVCC snapshot? */
+	bool		MVCCScan;
 
 	/*
 	 * Instrumentation counters maintained by all index AMs during both
@@ -191,6 +389,14 @@ typedef struct IndexScanDescData
 
 	bool		xs_recheck;		/* T means scan keys must be rechecked */
 
+	/* batch size information, set once by index AM in ambeginscan */
+	uint16		maxitemsbatch;	/* size of each batch's items[] array */
+	uint16		batch_index_opaque_size;	/* MAXALIGN'd index AM opaque size */
+	uint16		batch_tuples_workspace; /* currTuples workspace size */
+
+	/* Computed offset, used to get table AM's opaque area from a batch */
+	uint16		batch_table_offset;
+
 	/*
 	 * When fetching with an ordering operator, the values of the ORDER BY
 	 * expressions of the last returned tuple, according to the index.  If
@@ -234,4 +440,136 @@ typedef struct SysScanDescData
 	struct TupleTableSlot *slot;
 } SysScanDescData;
 
+/*
+ * Return the true allocation base of a batch (accounting for AM opaque areas
+ * stored before the IndexScanBatchData pointer).
+ */
+static inline void *
+batch_alloc_base(IndexScanDescData *scan, IndexScanBatch batch)
+{
+	return (char *) batch - scan->batch_table_offset;
+}
+
+/*
+ * Count how many batches are currently loaded in the ring buffer.
+ */
+static inline uint8
+index_scan_batch_count(IndexScanDescData *scan)
+{
+	return (uint8) (scan->batchringbuf.nextBatch -
+					scan->batchringbuf.headBatch);
+}
+
+/*
+ * Do we already have a batch loaded at 'idx' offset in scan's ring buffer?
+ *
+ * NOTE: a stale batch idx can alias a currently-loaded range after uint8
+ * overflow, producing a false positive.  False negatives are not possible.
+ */
+static inline bool
+index_scan_batch_loaded(IndexScanDescData *scan, uint8 idx)
+{
+	return (int8) (idx - scan->batchringbuf.headBatch) >= 0 &&
+		(int8) (idx - scan->batchringbuf.nextBatch) < 0;
+}
+
+/*
+ * Have we loaded the maximum number of batches?
+ */
+static inline bool
+index_scan_batch_full(IndexScanDescData *scan)
+{
+	return index_scan_batch_count(scan) == INDEX_SCAN_MAX_BATCHES;
+}
+
+/*
+ * Return batch for the provided index.
+ */
+static inline IndexScanBatch
+index_scan_batch(IndexScanDescData *scan, uint8 idx)
+{
+	Assert(index_scan_batch_loaded(scan, idx));
+
+	return scan->batchbuf[idx & (INDEX_SCAN_MAX_BATCHES - 1)];
+}
+
+/*
+ * Append given batch to scan's batch ring buffer.
+ */
+static inline void
+index_scan_batch_append(IndexScanDescData *scan, IndexScanBatch batch)
+{
+	BatchRingBuffer *ringbuf = &scan->batchringbuf;
+	uint8		nextBatch = ringbuf->nextBatch;
+
+	Assert(!index_scan_batch_full(scan));
+
+	scan->batchbuf[nextBatch & (INDEX_SCAN_MAX_BATCHES - 1)] = batch;
+	ringbuf->nextBatch++;
+}
+
+/*
+ * Advance position to its next item in the batch.
+ *
+ * Advance to the next item within the provided batch (or to the previous item,
+ * when scanning backwards).
+ *
+ * Returns true if the position could be advanced.  Returns false when there
+ * are no more items from the batch remaining in the given scan direction.
+ */
+static inline bool
+index_scan_pos_advance(ScanDirection direction,
+					   IndexScanBatch batch, BatchRingItemPos *pos)
+{
+	Assert(pos->valid);
+
+	if (ScanDirectionIsForward(direction))
+	{
+		if (++pos->item > batch->lastItem)
+			return false;
+	}
+	else						/* ScanDirectionIsBackward */
+	{
+		if (--pos->item < batch->firstItem)
+			return false;
+	}
+
+	/* Advanced within batch */
+	return true;
+}
+
+/*
+ * Advance batch position to the start of its new batch.
+ *
+ * When we're called, this position should point to a batch that caller just
+ * finished consuming from.  When we return, this position will point to
+ * nextBatch, the next batch from the ring buffer.  We'll have also set the
+ * position's item offset to nextBatch's first item in the given direction
+ * (which is actually nextBatch's _last_ item when scanning backwards).
+ *
+ * nextBatch doesn't have to be (and often isn't) the most recently appended
+ * batch in the scan's ring buffer.  It is merely the next batch in line to be
+ * consumed from the point of view of our caller.
+ */
+static inline void
+index_scan_pos_nextbatch(ScanDirection direction,
+						 IndexScanBatch nextBatch, BatchRingItemPos *pos)
+{
+	Assert(nextBatch->dir == direction);
+	Assert(nextBatch->firstItem <= nextBatch->lastItem);
+
+	/* Increment batch (might wrap), or initialize it to zero */
+	if (pos->valid)
+		pos->batch++;
+	else
+		pos->batch = 0;
+
+	pos->valid = true;
+
+	if (ScanDirectionIsForward(direction))
+		pos->item = nextBatch->firstItem;
+	else
+		pos->item = nextBatch->lastItem;
+}
+
 #endif							/* RELSCAN_H */
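
The ring buffer helpers above rely on free-running uint8 counters that are
masked only when batchbuf[] is indexed.  As a minimal standalone sketch of
that arithmetic (not code from the patch: DemoRing, RING_SLOTS and the
demo_ring_* functions are hypothetical names, and plain ints stand in for
batch pointers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 64			/* assumed: a power of 2, at most 128 */

typedef struct DemoRing
{
	uint8_t		head;			/* oldest loaded position (free-running) */
	uint8_t		next;			/* next empty position (free-running) */
	int			slots[RING_SLOTS];	/* stand-in for batch pointers */
} DemoRing;

/* Number of loaded entries; unsigned subtraction survives wraparound */
static inline uint8_t
demo_ring_count(const DemoRing *ring)
{
	return (uint8_t) (ring->next - ring->head);
}

/*
 * Is idx within [head, next)?  The signed 8-bit casts assume the usual
 * modulo-256 narrowing conversion, and only work while the loaded range
 * spans at most 128 positions.
 */
static inline bool
demo_ring_loaded(const DemoRing *ring, uint8_t idx)
{
	return (int8_t) (idx - ring->head) >= 0 &&
		(int8_t) (idx - ring->next) < 0;
}

/* Append at 'next'; only the array subscript is masked */
static inline void
demo_ring_append(DemoRing *ring, int value)
{
	assert(demo_ring_count(ring) < RING_SLOTS);
	ring->slots[ring->next & (RING_SLOTS - 1)] = value;
	ring->next++;
}

/* Fetch the entry for a loaded free-running position */
static inline int
demo_ring_get(const DemoRing *ring, uint8_t idx)
{
	assert(demo_ring_loaded(ring, idx));
	return ring->slots[idx & (RING_SLOTS - 1)];
}

/* Drop the oldest entry by advancing 'head' */
static inline void
demo_ring_release_head(DemoRing *ring)
{
	assert(demo_ring_count(ring) > 0);
	ring->head++;
}
```

Starting both counters near 255 and appending a few entries exercises the
wraparound case: the counters wrap to small values while the count and the
loaded-range test keep giving the expected answers.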
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4875d70ad..532923dbb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -443,16 +443,22 @@ typedef struct TableAmRoutine
 	 */
 
 	/*
-	 * Prepare to fetch tuples from the relation, as needed when fetching
-	 * tuples for an index scan.  The callback has to return an
-	 * IndexFetchTableData, which the AM will typically embed in a larger
-	 * structure with additional information.
+	 * Prepare to fetch tuples, as needed when fetching tuples for an index
+	 * scan.  The callback has to return an IndexFetchTableData, which the AM
+	 * will typically embed in a larger structure with additional information.
 	 *
 	 * flags is a bitmask of ScanOptions affecting underlying table scan
 	 * behavior. See scan_begin() for more information on passing these.
 	 *
-	 * Tuples for an index scan can then be fetched via one of the
+	 * Tuples for an index scan can then be fetched via one of the four
 	 * slot-based callbacks called through table_index_getnext_slot.
+	 *
+	 * Callback must initialize the batch_opaque_size and batch_per_item_size
+	 * fields in the returned struct, to let the core code know how much
+	 * memory will be required in the opaque table AM portions of each batch
+	 * allocation.  These are the batches used during amgetbatch index scans,
+	 * which table AMs can use to cache things like per-item visibility
+	 * information.
 	 */
 	struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, uint32 flags);
 
@@ -467,14 +473,34 @@ typedef struct TableAmRoutine
 	 */
 	void		(*index_fetch_end) (IndexScanDesc scan);
 
+	/*
+	 * Initialize table AM's per-batch opaque area within a batch allocation.
+	 *
+	 * Called by indexam_util_batch_alloc for each new or recycled batch.
+	 * Table AM should set up its opaque area (at a negative offset from the
+	 * batch pointer) and any trailing per-item data (e.g. visibility flags).
+	 *
+	 * 'new_alloc' is true for freshly palloc'd batches, false for batches
+	 * recycled from the cache.
+	 */
+	void		(*index_fetch_batch_init) (IndexScanDesc scan,
+										   IndexScanBatch batch,
+										   bool new_alloc);
+
 	/*
 	 * Fetch the next tuple from an index scan, scanning in the specified
 	 * direction, and return true if a tuple was found, false otherwise.
 	 *
-	 * Two variants cover {plain, index-only} index scans that use amgettuple.
-	 * index_beginscan resolves which variant to use.  Callers use
+	 * Four variants cover the {plain, index-only} x {amgetbatch, amgettuple}
+	 * matrix.  index_beginscan resolves which variant to use.  Callers use
 	 * table_index_getnext_slot(), which calls through that pointer directly.
 	 */
+	bool		(*index_plain_amgetbatch_next) (IndexScanDesc scan,
+												ScanDirection direction,
+												TupleTableSlot *slot);
+	bool		(*index_only_amgetbatch_next) (IndexScanDesc scan,
+											   ScanDirection direction,
+											   TupleTableSlot *slot);
 	bool		(*index_plain_amgettuple_next) (IndexScanDesc scan,
 												ScanDirection direction,
 												TupleTableSlot *slot);
@@ -505,6 +531,11 @@ typedef struct TableAmRoutine
 							  TupleTableSlot *slot,
 							  bool *all_dead);
 
+	/*
+	 * Restore a previously marked scan position
+	 */
+	void		(*index_fetch_restrpos) (IndexScanDesc scan);
+
 
 	/* ------------------------------------------------------------------------
 	 * Callbacks for non-modifying operations on individual tuples
@@ -1279,6 +1310,17 @@ table_index_fetch_reset(IndexScanDesc scan)
 	scan->heapRelation->rd_tableam->index_fetch_reset(scan);
 }
 
+/*
+ * Restore a previously marked scan position
+ */
+static inline void
+table_index_fetch_restrpos(IndexScanDesc scan)
+{
+	Assert(scan->xs_heapfetch);
+
+	scan->heapRelation->rd_tableam->index_fetch_restrpos(scan);
+}
+
 /*
  * Release resources and deallocate index fetch held in the scan's underlying
  * IndexFetchTableData.
@@ -1291,6 +1333,21 @@ table_index_fetch_end(IndexScanDesc scan)
 	scan->heapRelation->rd_tableam->index_fetch_end(scan);
 }
 
+/*
+ * Initialize table AM's per-batch opaque area within a batch allocation.
+ *
+ * Called by indexam_util_batch_alloc for each new or recycled batch.
+ */
+static inline void
+table_index_fetch_batch_init(IndexScanDesc scan, IndexScanBatch batch,
+							 bool new_alloc)
+{
+	Assert(scan->xs_heapfetch);
+
+	scan->heapRelation->rd_tableam->index_fetch_batch_init(scan, batch,
+														   new_alloc);
+}
+
 /*
  * Fetch the next tuple from an index scan into `slot`, scanning in the
  * specified direction.  Returns true if a tuple satisfying the scan keys and
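
The batch_opaque_size/batch_per_item_size contract described above, together
with the batch layout comment in relscan.h, can be sketched as a standalone
allocation routine.  This is only an illustration under stated assumptions:
every Demo* name is hypothetical, calloc stands in for palloc, the
currTuples workspace is left out, and DEMO_MAXALIGN assumes 8-byte maximum
alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

typedef struct DemoItem
{
	uint32_t	block;			/* stand-in for the table TID */
	uint16_t	offset;
	uint16_t	indexOffset;
	uint16_t	tupleOffset;
} DemoItem;

typedef struct DemoBatch
{
	int			firstItem;
	int			lastItem;
	DemoItem	items[];		/* sized by maxitems */
} DemoBatch;

/* Table AM's fixed-size opaque struct, kept at the allocation's start */
typedef struct DemoTableOpaque
{
	uint8_t    *visInfo;		/* points at the trailing per-item area */
} DemoTableOpaque;

typedef struct DemoLayout
{
	size_t		table_offset;	/* both opaque sizes, each MAXALIGN'd */
	size_t		per_item_size;	/* table AM bytes per item */
	size_t		maxitems;		/* items[] capacity */
} DemoLayout;

static inline DemoLayout
demo_layout(size_t table_opaque, size_t index_opaque,
			size_t per_item_size, size_t maxitems)
{
	DemoLayout	layout;

	layout.table_offset = DEMO_MAXALIGN(table_opaque) +
		DEMO_MAXALIGN(index_opaque);
	layout.per_item_size = per_item_size;
	layout.maxitems = maxitems;
	return layout;
}

/* Batch pointer -> true allocation base (where the table opaque lives) */
static inline void *
demo_batch_alloc_base(const DemoLayout *layout, DemoBatch *batch)
{
	return (char *) batch - layout->table_offset;
}

static inline DemoTableOpaque *
demo_table_opaque(const DemoLayout *layout, DemoBatch *batch)
{
	return (DemoTableOpaque *) demo_batch_alloc_base(layout, batch);
}

/* One allocation: [table opaque][index opaque][batch][items][per-item] */
static inline DemoBatch *
demo_batch_alloc(const DemoLayout *layout)
{
	size_t		items_end = DEMO_MAXALIGN(offsetof(DemoBatch, items) +
										  sizeof(DemoItem) * layout->maxitems);
	size_t		total = layout->table_offset + items_end +
		layout->per_item_size * layout->maxitems;
	char	   *base = calloc(1, total);
	DemoBatch  *batch;

	if (base == NULL)
		return NULL;

	batch = (DemoBatch *) (base + layout->table_offset);
	demo_table_opaque(layout, batch)->visInfo = (uint8_t *) batch + items_end;
	return batch;
}

/* Free through the allocation base, never through the batch pointer */
static inline void
demo_batch_free(const DemoLayout *layout, DemoBatch *batch)
{
	free(demo_batch_alloc_base(layout, batch));
}
```

Keeping both opaque areas at negative offsets means the batch pointer handed
to the index AM is unaffected by how much space the table AM asks for.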
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 693b879f7..85991d447 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1437,12 +1437,12 @@ typedef struct IndexOptInfo
 	bool		amoptionalkey;
 	bool		amsearcharray;
 	bool		amsearchnulls;
-	/* does AM have amgettuple interface? */
-	bool		amhasgettuple;
+	/* does AM have amgetbatch (or amgettuple) interface? */
+	bool		amcanplainscan;
 	/* does AM have amgetbitmap interface? */
 	bool		amhasgetbitmap;
 	bool		amcanparallel;
-	/* does AM have ammarkpos interface? */
+	/* is AM prepared for us to restore a mark? */
 	bool		amcanmarkpos;
 	/* AM's cost estimator */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index bdb30752e..62a826d4f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -298,10 +298,12 @@ brinhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = brinbeginscan,
 		.amrescan = brinrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amunguardbatch = NULL,
 		.amgetbitmap = bringetbitmap,
 		.amendscan = brinendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 6b148e69a..8f7033d62 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -1953,9 +1953,9 @@ gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	 * into the main index, and so we might visit it a second time during the
 	 * main scan.  This is okay because we'll just re-set the same bit in the
 	 * bitmap.  (The possibility of duplicate visits is a major reason why GIN
-	 * can't support the amgettuple API, however.) Note that it would not do
-	 * to scan the main index before the pending list, since concurrent
-	 * cleanup could then make us miss entries entirely.
+	 * can't support either the amgettuple or amgetbatch API.) Note that it
+	 * would not do to scan the main index before the pending list, since
+	 * concurrent cleanup could then make us miss entries entirely.
 	 */
 	scanPendingInsert(scan, tbm, &ntids);
 
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index fe7b984ff..710f3f9c2 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -82,10 +82,12 @@ ginhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = ginbeginscan,
 		.amrescan = ginrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amunguardbatch = NULL,
 		.amgetbitmap = gingetbitmap,
 		.amendscan = ginendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8565e225b..a484c8b2a 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -103,10 +103,12 @@ gisthandler(PG_FUNCTION_ARGS)
 		.ambeginscan = gistbeginscan,
 		.amrescan = gistrescan,
 		.amgettuple = gistgettuple,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amunguardbatch = NULL,
 		.amgetbitmap = gistgetbitmap,
 		.amendscan = gistendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 8d8cd30dc..2e32be233 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -114,10 +114,12 @@ hashhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = hashbeginscan,
 		.amrescan = hashrescan,
 		.amgettuple = hashgettuple,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amunguardbatch = NULL,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f96b42709..a9439d02e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2557,8 +2557,12 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_begin = heapam_index_fetch_begin,
 	.index_fetch_reset = heapam_index_fetch_reset,
 	.index_fetch_end = heapam_index_fetch_end,
+	.index_fetch_batch_init = heapam_index_fetch_batch_init,
+	.index_plain_amgetbatch_next = heapam_index_plain_amgetbatch_next,
+	.index_only_amgetbatch_next = heapam_index_only_amgetbatch_next,
 	.index_plain_amgettuple_next = heapam_index_plain_amgettuple_next,
 	.index_only_amgettuple_next = heapam_index_only_amgettuple_next,
+	.index_fetch_restrpos = heapam_index_fetch_restrpos,
 	.fetch_tid = heapam_fetch_tid,
 
 	.tuple_insert = heapam_tuple_insert,
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index b269b802e..885c25c67 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/heapam.h"
+#include "access/indexbatch.h"
 #include "access/relscan.h"
 #include "access/visibilitymap.h"
 #include "storage/predicate.h"
@@ -32,11 +33,28 @@ static pg_attribute_always_inline bool heapam_index_fetch_tuple_impl(Relation re
 static pg_attribute_always_inline bool heapam_index_getnext_slot(IndexScanDesc scan,
 																 ScanDirection direction,
 																 TupleTableSlot *slot,
-																 bool index_only);
+																 bool index_only,
+																 bool amgetbatch);
 static pg_attribute_always_inline bool heapam_index_fetch_heap(IndexScanDesc scan,
 															   IndexFetchHeapData *hscan,
 															   TupleTableSlot *slot,
-															   bool *heap_continue);
+															   bool *heap_continue,
+															   bool amgetbatch);
+static pg_attribute_always_inline ItemPointer heapam_index_getnext_scanbatch_pos(IndexScanDesc scan,
+																				 IndexFetchHeapData *hscan,
+																				 ScanDirection direction,
+																				 bool *all_visible);
+static inline ItemPointer heapam_index_return_scanpos_tid(IndexScanDesc scan,
+														  IndexFetchHeapData *hscan,
+														  ScanDirection direction,
+														  IndexScanBatch scanBatch,
+														  BatchRingItemPos *scanPos,
+														  bool *all_visible);
+static void heapam_index_batch_pos_visibility(IndexScanDesc scan,
+											  ScanDirection direction,
+											  IndexScanBatch batch,
+											  HeapBatchData *hbatch,
+											  BatchRingItemPos *pos);
 
 /* ------------------------------------------------------------------------
  * Index Scan Callbacks for heap AM
@@ -48,14 +66,23 @@ heapam_index_fetch_begin(Relation rel, uint32 flags)
 {
 	IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);
 
+	hscan->xs_base.batch_opaque_size = MAXALIGN(sizeof(HeapBatchData));
+	hscan->xs_base.batch_per_item_size = sizeof(uint8); /* visInfo element size */
 	hscan->xs_base.flags = flags;
-	hscan->xs_cbuf = InvalidBuffer;
+
+	/* Current heap block state */
+	Assert(hscan->xs_cbuf == InvalidBuffer);
 	hscan->xs_blk = InvalidBlockNumber;
-	hscan->xs_vmbuffer = InvalidBuffer;
+
+	/* VM related state */
+	Assert(hscan->xs_vmbuffer == InvalidBuffer);
+	hscan->xs_vm_items = 1;
 
 	/*
 	 * Return opaque state, which we'll access through the scan's xs_heapfetch
-	 * field later on
+	 * field later on.
+	 *
+	 * Note: indexam.c will call batchscan_init for us.
 	 */
 	return &hscan->xs_base;
 }
@@ -65,8 +92,12 @@ heapam_index_fetch_reset(IndexScanDesc scan)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
 
-	/* Resets are a no-op */
-	(void) hscan;
+	/* Rescans should avoid an excessive number of VM lookups */
+	hscan->xs_vm_items = 1;
+
+	/* Reset batch ring buffer state */
+	if (scan->usebatchring)
+		tableam_util_batchscan_reset(scan, false);
 
 	/*
 	 * Deliberately avoid dropping pins now held in xs_cbuf and xs_vmbuffer.
@@ -75,6 +106,17 @@ heapam_index_fetch_reset(IndexScanDesc scan)
 	 */
 }
 
+void
+heapam_index_fetch_restrpos(IndexScanDesc scan)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+
+	(void) hscan;
+
+	/* Restore batch ring to previously saved mark */
+	tableam_util_batchscan_restore_pos(scan);
+}
+
 void
 heapam_index_fetch_end(IndexScanDesc scan)
 {
@@ -88,9 +130,52 @@ heapam_index_fetch_end(IndexScanDesc scan)
 	if (BufferIsValid(hscan->xs_vmbuffer))
 		ReleaseBuffer(hscan->xs_vmbuffer);
 
+	/* Free all batch related resources */
+	if (scan->usebatchring)
+		tableam_util_batchscan_end(scan);
+
 	pfree(hscan);
 }
 
+/*
+ * Initialize the heap table AM's per-batch opaque area (HeapBatchData).
+ *
+ * Called by indexam_util_batch_alloc for each new or recycled batch.
+ * Sets up the visInfo pointer for index-only scans, or NULL otherwise.
+ */
+void
+heapam_index_fetch_batch_init(IndexScanDesc scan, IndexScanBatch batch,
+							  bool new_alloc)
+{
+	HeapBatchData *hbatch = heap_batch_data(scan, batch);
+
+	if (scan->xs_want_itup)
+	{
+		if (new_alloc)
+		{
+			/*
+			 * The visInfo pointer is stored at the very start of the palloc'd
+			 * space, in the fixed-sized table AM opaque area.  visInfo points
+			 * to just past the end of the variable-sized items[maxitemsbatch]
+			 * array (to a space that is also sized according to whatever the
+			 * index AM set maxitemsbatch to).
+			 */
+			Size		itemsEnd;
+
+			itemsEnd = MAXALIGN(offsetof(IndexScanBatchData, items) +
+								sizeof(BatchMatchingItem) * scan->maxitemsbatch);
+			hbatch->visInfo = (uint8 *) ((char *) batch + itemsEnd);
+		}
+
+		/* Clear visibility flags (needed for both new and recycled batches) */
+		memset(hbatch->visInfo, 0, scan->maxitemsbatch);
+	}
+	else
+	{
+		hbatch->visInfo = NULL;
+	}
+}
+
 /*
  *	heap_hot_search_buffer	- search HOT chain for tuple satisfying snapshot
  *
@@ -253,16 +338,40 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	return false;
 }
 
+/* table_index_getnext_slot callback: amgetbatch, plain index scan */
+pg_attribute_hot bool
+heapam_index_plain_amgetbatch_next(IndexScanDesc scan,
+								   ScanDirection direction,
+								   TupleTableSlot *slot)
+{
+	Assert(!scan->xs_want_itup && scan->usebatchring);
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+
+	return heapam_index_getnext_slot(scan, direction, slot, false, true);
+}
+
+/* table_index_getnext_slot callback: amgetbatch, index-only scan */
+pg_attribute_hot bool
+heapam_index_only_amgetbatch_next(IndexScanDesc scan,
+								  ScanDirection direction,
+								  TupleTableSlot *slot)
+{
+	Assert(scan->xs_want_itup && scan->usebatchring);
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+
+	return heapam_index_getnext_slot(scan, direction, slot, true, true);
+}
+
 /* table_index_getnext_slot callback: amgettuple, plain index scan */
 pg_attribute_hot bool
 heapam_index_plain_amgettuple_next(IndexScanDesc scan,
 								   ScanDirection direction,
 								   TupleTableSlot *slot)
 {
-	Assert(!scan->xs_want_itup);
+	Assert(!scan->xs_want_itup && !scan->usebatchring);
 	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
 
-	return heapam_index_getnext_slot(scan, direction, slot, false);
+	return heapam_index_getnext_slot(scan, direction, slot, false, false);
 }
 
 /* table_index_getnext_slot callback: amgettuple, index-only scan */
@@ -271,10 +380,10 @@ heapam_index_only_amgettuple_next(IndexScanDesc scan,
 								  ScanDirection direction,
 								  TupleTableSlot *slot)
 {
-	Assert(scan->xs_want_itup);
+	Assert(scan->xs_want_itup && !scan->usebatchring);
 	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
 
-	return heapam_index_getnext_slot(scan, direction, slot, true);
+	return heapam_index_getnext_slot(scan, direction, slot, true, false);
 }
 
 /*
@@ -385,7 +494,7 @@ heapam_index_fetch_tuple_impl(Relation rel,
 }
 
 /*
- * Common implementation for both heapam_index_*_getnext_slot variants.
+ * Common implementation for all four heapam_index_*_next variants.
  *
  * The result is true if a tuple satisfying the scan keys and the snapshot was
  * found, false otherwise.  The tuple is stored in the specified slot.
@@ -394,12 +503,13 @@ heapam_index_fetch_tuple_impl(Relation rel,
  * dropped by a future call here (or by a later call to heapam_index_fetch_end
  * through index_endscan).
  *
- * The index_only parameter is a compile-time constant at each call site,
- * allowing the compiler to specialize the code for each variant.
+ * The index_only and amgetbatch parameters are compile-time constants at each
+ * call site, allowing the compiler to specialize the code for each variant.
  */
 static pg_attribute_always_inline bool
 heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-						  TupleTableSlot *slot, bool index_only)
+						  TupleTableSlot *slot, bool index_only,
+						  bool amgetbatch)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
 	bool	   *heap_continue = &scan->xs_heap_continue;
@@ -413,14 +523,20 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 		if (!*heap_continue)
 		{
 			/* Get the next TID from the index */
-			tid = index_getnext_tid(scan, direction);
+			if (amgetbatch)
+				tid = heapam_index_getnext_scanbatch_pos(scan, hscan,
+														 direction,
+														 index_only ?
+														 &all_visible : NULL);
+			else
+				tid = index_getnext_tid(scan, direction);
 
 			/* If we're out of index entries, we're done */
 			if (tid == NULL)
 				break;
 
-			/* For index-only scans, check the visibility map */
-			if (index_only)
+			/* For non-batch index-only scans, check the visibility map */
+			if (index_only && !amgetbatch)
 				all_visible = VM_ALL_VISIBLE(scan->heapRelation,
 											 ItemPointerGetBlockNumber(tid),
 											 &hscan->xs_vmbuffer);
@@ -445,7 +561,7 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 					scan->instrument->ntablefetches++;
 
 				if (!heapam_index_fetch_heap(scan, hscan, slot,
-											 heap_continue))
+											 heap_continue, amgetbatch))
 				{
 					/*
 					 * No visible tuple.  If caller set a visited-pages limit
@@ -477,7 +593,7 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 				 * want us to assume that just having one visible tuple in the
 				 * hot chain is always good enough.
 				 */
-				Assert(!(*heap_continue && IsMVCCSnapshot(scan->xs_snapshot)));
+				Assert(!(*heap_continue && scan->MVCCScan));
 			}
 			else
 			{
@@ -504,7 +620,8 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 			 * entry.  If we don't find anything, loop around and grab the
 			 * next TID from the index.
 			 */
-			if (heapam_index_fetch_heap(scan, hscan, slot, heap_continue))
+			if (heapam_index_fetch_heap(scan, hscan, slot, heap_continue,
+										amgetbatch))
 				return true;
 		}
 	}
@@ -526,7 +643,8 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
  */
 static pg_attribute_always_inline bool
 heapam_index_fetch_heap(IndexScanDesc scan, IndexFetchHeapData *hscan,
-						TupleTableSlot *slot, bool *heap_continue)
+						TupleTableSlot *slot, bool *heap_continue,
+						bool amgetbatch)
 {
 	bool		all_dead = false;
 	bool		found;
@@ -540,14 +658,312 @@ heapam_index_fetch_heap(IndexScanDesc scan, IndexFetchHeapData *hscan,
 		pgstat_count_heap_fetch(scan->indexRelation);
 
 	/*
-	 * If we scanned a whole HOT chain and found only dead tuples, tell index
-	 * AM to kill its entry for that TID (this will take effect in the next
-	 * amgettuple call, in index_getnext_tid).  We do not do this when in
-	 * recovery because it may violate MVCC to do so.  See comments in
-	 * RelationGetIndexScan().
+	 * If we scanned a whole HOT chain and found only dead tuples, remember it
+	 * for later.  We do not do this when in recovery because it may violate
+	 * MVCC to do so.  See comments in RelationGetIndexScan().
 	 */
 	if (!scan->xactStartedInRecovery)
-		scan->kill_prior_tuple = all_dead;
+	{
+		if (amgetbatch)
+		{
+			if (all_dead)
+				tableam_util_scanpos_killitem(scan);
+		}
+		else
+		{
+			/*
+			 * Tell amgettuple-based index AM to kill its entry for that TID
+			 * (this will take effect in the next call, in index_getnext_tid)
+			 */
+			scan->kill_prior_tuple = all_dead;
+		}
+	}
 
 	return found;
 }
+
+/*
+ * Get next TID from batch ring buffer, moving in the given scan direction.
+ * Also sets *all_visible for item when caller passes a non-NULL arg.
+ */
+static pg_attribute_always_inline ItemPointer
+heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan,
+								   ScanDirection direction, bool *all_visible)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &batchringbuf->scanPos;
+	IndexScanBatch scanBatch = NULL;
+	bool		hadExistingScanBatch;
+
+	Assert(!scanPos->valid || batchringbuf->headBatch == scanPos->batch);
+	Assert(scanPos->valid || index_scan_batch_count(scan) == 0);
+	Assert(all_visible == NULL || scan->xs_want_itup);
+
+	/*
+	 * Check if there's an existing loaded scanBatch for us to return the next
+	 * matching item's TID/index tuple from
+	 */
+	hadExistingScanBatch = scanPos->valid;
+	if (scanPos->valid)
+	{
+		/*
+		 * scanPos is valid, so scanBatch must already be loaded in batch ring
+		 * buffer.  We rely on that here.
+		 */
+		pg_assume(batchringbuf->headBatch == scanPos->batch);
+
+		scanBatch = index_scan_batch(scan, scanPos->batch);
+
+		if (index_scan_pos_advance(direction, scanBatch, scanPos))
+			return heapam_index_return_scanpos_tid(scan, hscan, direction,
+												   scanBatch, scanPos,
+												   all_visible);
+	}
+
+	/*
+	 * Either ran out of items from our existing scanBatch, or it hasn't been
+	 * loaded yet (because this is the first call here for the entire scan).
+	 * Try to advance scanBatch to the next batch (or get the first batch).
+	 */
+	scanBatch = tableam_util_fetch_next_batch(scan, direction,
+											  scanBatch, scanPos);
+
+	if (!scanBatch)
+	{
+		/*
+		 * We're done; no more batches in the current scan direction.
+		 *
+		 * Note: scanPos is generally still valid at this point.  The scan
+		 * might still back up in the other direction.
+		 */
+		return NULL;
+	}
+
+	/*
+	 * Advanced scanBatch.  Now position scanPos to the start of new
+	 * scanBatch.
+	 */
+	index_scan_pos_nextbatch(direction, scanBatch, scanPos);
+	Assert(index_scan_batch(scan, scanPos->batch) == scanBatch);
+
+	/*
+	 * Remove the head batch from the batch ring buffer (except when this new
+	 * scanBatch is our only one)
+	 */
+	if (hadExistingScanBatch)
+	{
+		IndexScanBatch headBatch = index_scan_batch(scan,
+													batchringbuf->headBatch);
+
+		Assert(headBatch != scanBatch);
+		Assert(batchringbuf->headBatch != scanPos->batch);
+
+		/* free obsolescent head batch (unless it is scan's markBatch) */
+		tableam_util_free_batch(scan, headBatch);
+
+		/* Remove the batch from the ring buffer (even if it's markBatch) */
+		batchringbuf->headBatch++;
+	}
+
+	/* In practice scanBatch will always be the ring buffer's headBatch */
+	Assert(batchringbuf->headBatch == scanPos->batch);
+
+	return heapam_index_return_scanpos_tid(scan, hscan, direction,
+										   scanBatch, scanPos, all_visible);
+}
+
+/*
+ * Save the current scanPos/scanBatch item's TID in scan's xs_heaptid, and
+ * return a pointer to that TID.  When all_visible isn't NULL (during an
+ * index-only scan), also sets item's visibility status in *all_visible.
+ *
+ * heapam_index_getnext_scanbatch_pos helper function.
+ */
+static inline ItemPointer
+heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
+								ScanDirection direction,
+								IndexScanBatch scanBatch,
+								BatchRingItemPos *scanPos,
+								bool *all_visible)
+{
+	HeapBatchData *hbatch;
+
+	pgstat_count_index_tuples(scan->indexRelation, 1);
+
+	/* Set xs_heaptid, which caller (and core executor) will need */
+	scan->xs_heaptid = scanBatch->items[scanPos->item].tableTid;
+
+	if (all_visible == NULL)
+	{
+		/*
+		 * Plain index scan.
+		 */
+		Assert(!scan->xs_want_itup);
+		return &scan->xs_heaptid;
+	}
+
+	/*
+	 * Index-only scan.
+	 *
+	 * Also set xs_itup, which caller needs too.
+	 */
+	Assert(scan->xs_want_itup);
+	scan->xs_itup = (IndexTuple) (scanBatch->currTuples +
+								  scanBatch->items[scanPos->item].tupleOffset);
+
+	/*
+	 * Set visibility info for the current scanPos item (plus possibly some
+	 * additional items in the current scan direction) as needed
+	 */
+	hbatch = heap_batch_data(scan, scanBatch);
+	if (!(hbatch->visInfo[scanPos->item] & HEAP_BATCH_VIS_CHECKED))
+		heapam_index_batch_pos_visibility(scan, direction, scanBatch, hbatch,
+										  scanPos);
+
+	/* Finally, set all_visible for caller */
+	*all_visible =
+		(hbatch->visInfo[scanPos->item] & HEAP_BATCH_VIS_ALL_VISIBLE) != 0;
+
+	return &scan->xs_heaptid;
+}
+
+/*
+ * Obtain visibility information for a TID from caller's batch.
+ *
+ * Called during amgetbatch index-only scans.  We always check the visibility
+ * of caller's item (an offset into caller's batch->items[] array).  We might
+ * also set visibility info for other items from caller's batch more
+ * proactively when that makes sense.
+ *
+ * We keep two competing considerations in balance when determining whether to
+ * check additional items: the need to keep the cost of visibility map access
+ * under control when most items will never be returned by the scan anyway
+ * (important for inner index scans of anti-joins and semi-joins), and the
+ * need to unguard batches promptly.
+ *
+ * Once we've resolved visibility for all items in a batch, we can safely
+ * unguard it by calling amunguardbatch.  This is safe with respect to
+ * concurrent VACUUM because the batch's guard (typically a buffer pin on the
+ * originating index page) blocks VACUUM from acquiring a conflicting cleanup
+ * lock on that page.  Copying the relevant visibility map data into our local
+ * cache suffices to prevent unsafe concurrent TID recycling: if any of these
+ * TIDs point to dead heap tuples, VACUUM cannot possibly return from
+ * ambulkdelete and mark the pointed-to heap pages as all-visible.  VACUUM
+ * _can_ do so once the batch is unguarded, but that's okay; we'll be working
+ * off of cached visibility info that indicates that the dead TIDs are NOT
+ * all-visible.
+ *
+ * What about the opposite case, where a page was all-visible when we cached
+ * the VM bits but tuples on it are deleted afterwards?  That is safe too: any
+ * tuple that was visible to all when we read the VM must also be visible to
+ * our MVCC snapshot, so it is correct to skip the heap fetch for those TIDs.
+ */
+static void
+heapam_index_batch_pos_visibility(IndexScanDesc scan, ScanDirection direction,
+								  IndexScanBatch batch, HeapBatchData *hbatch,
+								  BatchRingItemPos *pos)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+	int			posItem = pos->item;
+	bool		allbatchitemsvisible;
+	BlockNumber curvmheapblkno = InvalidBlockNumber;
+	uint8		curvmheapblkflags = 0;
+
+	Assert(hbatch == heap_batch_data(scan, batch));
+
+	/*
+	 * The batch must still be guarded (amunguardbatch has not been called
+	 * yet), so the TID recycling interlock is still in effect.
+	 */
+	Assert(!scan->batchImmediateUnguard);
+
+	/*
+	 * Set visibility info for a range of items, in scan order.
+	 *
+	 * Note: visibilitymap_get_status does not lock the visibility map buffer,
+	 * so the result could be slightly stale.  See the "Memory ordering
+	 * effects" discussion above visibilitymap_get_status for an explanation
+	 * of why this is okay.
+	 */
+	if (ScanDirectionIsForward(direction))
+	{
+		int			lastSetItem = Min(batch->lastItem,
+									  posItem + hscan->xs_vm_items - 1);
+
+		for (int setItem = posItem; setItem <= lastSetItem; setItem++)
+		{
+			ItemPointer tid = &batch->items[setItem].tableTid;
+			BlockNumber heapblkno = ItemPointerGetBlockNumber(tid);
+			uint8		flags;
+
+			if (heapblkno == curvmheapblkno)
+			{
+				hbatch->visInfo[setItem] = curvmheapblkflags;
+				continue;
+			}
+
+			flags = HEAP_BATCH_VIS_CHECKED;
+			if (VM_ALL_VISIBLE(scan->heapRelation, heapblkno, &hscan->xs_vmbuffer))
+				flags |= HEAP_BATCH_VIS_ALL_VISIBLE;
+
+			hbatch->visInfo[setItem] = curvmheapblkflags = flags;
+			curvmheapblkno = heapblkno;
+		}
+
+		allbatchitemsvisible = lastSetItem >= batch->lastItem &&
+			(posItem == batch->firstItem ||
+			 (hbatch->visInfo[batch->firstItem] & HEAP_BATCH_VIS_CHECKED));
+	}
+	else
+	{
+		int			lastSetItem = Max(batch->firstItem,
+									  posItem - hscan->xs_vm_items + 1);
+
+		for (int setItem = posItem; setItem >= lastSetItem; setItem--)
+		{
+			ItemPointer tid = &batch->items[setItem].tableTid;
+			BlockNumber heapblkno = ItemPointerGetBlockNumber(tid);
+			uint8		flags;
+
+			if (heapblkno == curvmheapblkno)
+			{
+				hbatch->visInfo[setItem] = curvmheapblkflags;
+				continue;
+			}
+
+			flags = HEAP_BATCH_VIS_CHECKED;
+			if (VM_ALL_VISIBLE(scan->heapRelation, heapblkno, &hscan->xs_vmbuffer))
+				flags |= HEAP_BATCH_VIS_ALL_VISIBLE;
+
+			hbatch->visInfo[setItem] = curvmheapblkflags = flags;
+			curvmheapblkno = heapblkno;
+		}
+
+		allbatchitemsvisible = lastSetItem <= batch->firstItem &&
+			(posItem == batch->lastItem ||
+			 (hbatch->visInfo[batch->lastItem] & HEAP_BATCH_VIS_CHECKED));
+	}
+
+	/*
+	 * It's safe to unguard the batch (via amunguardbatch) as soon as we've
+	 * resolved the visibility status of all of its items (unless this is a
+	 * non-MVCC scan)
+	 */
+	if (allbatchitemsvisible)
+	{
+		Assert(hbatch->visInfo[batch->firstItem] & HEAP_BATCH_VIS_CHECKED);
+		Assert(hbatch->visInfo[batch->lastItem] & HEAP_BATCH_VIS_CHECKED);
+
+		if (batch->isGuarded && scan->MVCCScan)
+			tableam_util_unguard_batch(scan, batch);
+	}
+
+	/*
+	 * Else check visibility for twice as many items next time, or all items.
+	 * We check all items in one go once we're past the scan's first batch.
+	 */
+	else if (hscan->xs_vm_items < (batch->lastItem - batch->firstItem))
+		hscan->xs_vm_items *= 2;
+	else
+		hscan->xs_vm_items = scan->maxitemsbatch;
+}
diff --git a/src/backend/access/index/Makefile b/src/backend/access/index/Makefile
index 6f2e3061a..e6d681b40 100644
--- a/src/backend/access/index/Makefile
+++ b/src/backend/access/index/Makefile
@@ -16,6 +16,7 @@ OBJS = \
 	amapi.o \
 	amvalidate.o \
 	genam.o \
-	indexam.o
+	indexam.o \
+	indexbatch.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/index/amapi.c b/src/backend/access/index/amapi.c
index efa007030..aba9e2b46 100644
--- a/src/backend/access/index/amapi.c
+++ b/src/backend/access/index/amapi.c
@@ -55,6 +55,11 @@ GetIndexAmRoutine(Oid amhandler)
 	Assert(routine->amrescan != NULL);
 	Assert(routine->amendscan != NULL);
 
+	/* Assert that AM doesn't have an invalid combination of callbacks */
+	Assert(routine->amkillitemsbatch == NULL || routine->amgetbatch != NULL);
+	Assert((routine->amgetbatch != NULL) == (routine->amunguardbatch != NULL));
+	Assert(routine->amgetbatch != NULL || routine->amposreset == NULL);
+
 	return routine;
 }
 
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index acc9f3e6a..17ea93b4d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -89,6 +89,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_snapshot = InvalidSnapshot;	/* caller must initialize this */
 	scan->numberOfKeys = nkeys;
 	scan->numberOfOrderBys = norderbys;
+	scan->usebatchring = false; /* set later for amgetbatch callers */
+	memset(&scan->batchcache, 0, sizeof(scan->batchcache));
 
 	/*
 	 * We allocate key workspace here, but it won't get filled until amrescan.
@@ -126,6 +128,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
 
+	scan->batch_index_opaque_size = 0;
+	scan->batch_tuples_workspace = 0;
+	scan->batch_table_offset = 0;
 	scan->xs_visited_pages_limit = 0;
 
 	return scan;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index f08bc96bd..dc4c08a72 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -13,7 +13,7 @@
  * INTERFACE ROUTINES
  *		index_open		- open an index relation by relation OID
  *		index_close		- close an index relation
- *		index_beginscan - start a scan of an index with amgettuple
+ *		index_beginscan - start a scan of an index with amgetbatch/amgettuple
  *		index_beginscan_bitmap - start a scan of an index with amgetbitmap
  *		index_rescan	- restart a scan of an index
  *		index_endscan	- end a scan
@@ -42,6 +42,7 @@
 #include "postgres.h"
 
 #include "access/amapi.h"
+#include "access/indexbatch.h"
 #include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
@@ -254,7 +255,7 @@ index_insert_cleanup(Relation indexRelation,
 }
 
 /*
- * index_beginscan - start a scan of an index with amgettuple
+ * index_beginscan - start a scan of an index with amgetbatch/amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
  */
@@ -307,7 +308,7 @@ index_beginscan_bitmap(Relation indexRelation,
  * index_beginscan_internal --- common code for index_beginscan variants
  *
  * When heapRelation is not NULL, also initializes heap-side scan state:
- * getnext_slot resolution and table fetch initialization.
+ * batch ring setup, getnext_slot resolution, and table fetch initialization.
  */
 static pg_attribute_always_inline IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
@@ -340,6 +341,7 @@ index_beginscan_internal(Relation indexRelation,
 	scan->xs_temp_snap = temp_snap;
 
 	scan->xs_snapshot = snapshot;
+	scan->MVCCScan = IsMVCCLikeSnapshot(snapshot);
 	scan->instrument = instrument;
 
 	/*
@@ -351,13 +353,19 @@ index_beginscan_internal(Relation indexRelation,
 		scan->heapRelation = heapRelation;
 		scan->xs_want_itup = index_only_scan;
 		scan->xs_heap_continue = false;
+		scan->batchImmediateUnguard = (scan->MVCCScan && !index_only_scan);
+
+		if (indexRelation->rd_indam->amgetbatch != NULL)
+			batchscan_init(scan);
 
 		/* Resolve which getnext_slot implementation to use for this scan */
 		if (index_only_scan)
-			scan->xs_getnext_slot =
+			scan->xs_getnext_slot = scan->usebatchring ?
+				heapRelation->rd_tableam->index_only_amgetbatch_next :
 				heapRelation->rd_tableam->index_only_amgettuple_next;
 		else
-			scan->xs_getnext_slot =
+			scan->xs_getnext_slot = scan->usebatchring ?
+				heapRelation->rd_tableam->index_plain_amgetbatch_next :
 				heapRelation->rd_tableam->index_plain_amgettuple_next;
 
 		/* prepare to fetch index matches from table */
@@ -411,6 +419,17 @@ index_endscan(IndexScanDesc scan)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amendscan);
 
+	/*
+	 * amgetbitmap scans of an index AM that supports amgetbatch make limited
+	 * use of the scan's batch cache.  Check for that.
+	 */
+	if (!scan->usebatchring && scan->batchcache[0] != NULL)
+	{
+		Assert(scan->heapRelation == NULL);
+		Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+		pfree(batch_alloc_base(scan, scan->batchcache[0]));
+	}
+
 	/* Release resources (like buffer pins) from table accesses */
 	if (scan->xs_heapfetch)
 	{
@@ -439,24 +458,24 @@ void
 index_markpos(IndexScanDesc scan)
 {
 	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(ammarkpos);
+	CHECK_SCAN_PROCEDURE(amgetbatch);
 
-	scan->indexRelation->rd_indam->ammarkpos(scan);
+	batchscan_mark_pos(scan);
 }
 
 /* ----------------
  *		index_restrpos	- restore a scan position
  *
- * NOTE: this only restores the internal scan state of the index AM.  See
+ * NOTE: this only restores the batch positional state of the table AM.  See
  * comments for ExecRestrPos().
  *
  * NOTE: For heap, in the presence of HOT chains, mark/restore only works
  * correctly if the scan's snapshot is MVCC-safe; that ensures that there's at
  * most one returnable tuple in each HOT chain, and so restoring the prior
- * state at the granularity of the index AM is sufficient.  Since the only
- * current user of mark/restore functionality is nodeMergejoin.c, this
- * effectively means that merge-join plans only work for MVCC snapshots.  This
- * could be fixed if necessary, but for now it seems unimportant.
+ * state at the scan item granularity is sufficient.  Since the only current
+ * user of mark/restore functionality is nodeMergejoin.c, this effectively
+ * means that merge-join plans only work for MVCC snapshots.  This could be
+ * fixed if necessary, but for now it seems unimportant.
  * ----------------
  */
 void
@@ -465,16 +484,12 @@ index_restrpos(IndexScanDesc scan)
 	Assert(IsMVCCLikeSnapshot(scan->xs_snapshot));
 
 	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(amrestrpos);
+	CHECK_SCAN_PROCEDURE(amgetbatch);
 
-	/* reset table AM state for restoring the marked position */
-	if (scan->xs_heapfetch)
-		table_index_fetch_reset(scan);
-
-	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
 
-	scan->indexRelation->rd_indam->amrestrpos(scan);
+	/* table AM restores the marked position for us */
+	table_index_fetch_restrpos(scan);
 }
 
 /*
@@ -648,6 +663,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amgettuple);
+	Assert(!scan->usebatchring);
 
 	/* XXX: we should assert that a snapshot is pushed or registered */
 	Assert(TransactionIdIsValid(RecentXmin));
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
new file mode 100644
index 000000000..46876344b
--- /dev/null
+++ b/src/backend/access/index/indexbatch.c
@@ -0,0 +1,771 @@
+/*-------------------------------------------------------------------------
+ *
+ * indexbatch.c
+ *	  Batch-based index scan infrastructure for the amgetbatch interface.
+ *
+ * This module provides the core infrastructure for batch-based index scans,
+ * which allow index AMs to return multiple matching TIDs per page in a single
+ * call.  The batch ring buffer is owned by the table AM, typically maintained
+ * alongside a read stream used for prefetching table blocks.
+ *
+ * The ring buffer loads batches in index key space/index scan order.  This
+ * allows the table AM to maintain an adequate prefetch distance: its read
+ * stream callback is thereby able to request table blocks referenced by index
+ * pages that are well ahead of the current scan position's index page.
+ *
+ * There are three types of functions in this module:
+ *
+ * 1. Core batch scan lifecycle (batchscan_*): Functions called by
+ *    indexam.c to manage batch scan state.  Currently just initialization
+ *    and the mark operation needed for merge joins.  (Restoring a mark is a
+ *    more complicated process which requires modifying table AM opaque state,
+ *    so the corresponding restore function is in category 2.)
+ *
+ * 2. Table AM utilities (tableam_util_*): Helper functions called by table
+ *    AMs during amgetbatch index scans.  These manage the scan's positional
+ *    state, and help with certain aspects of resource management.
+ *
+ * 3. Index AM utilities (indexam_util_*): Helper functions called by index
+ *    AMs that implement the amgetbatch interface.  These help index AMs
+ *    manage resources like memory, locks, and buffer pins.
+ *
+ * The table AM calls the table AM utility functions directly, and uses
+ * scanPos/scanBatch and prefetchPos/prefetchBatch in a standardized way (see
+ * heapam_indexscan.c for the reference implementation), while index AMs free
+ * and unlock batches as described in indexam.sgml.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/index/indexbatch.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/amapi.h"
+#include "access/indexbatch.h"
+#include "access/tableam.h"
+#include "common/int.h"
+#include "lib/qunique.h"
+
+static void batch_free(IndexScanDesc scan, IndexScanBatch batch,
+					   bool allow_cache);
+static inline bool batch_cache_store(IndexScanDesc scan, IndexScanBatch batch);
+static int	batch_compare_int(const void *va, const void *vb);
+
+/*
+ * Sets up the batch ring buffer structure for use by an index scan.
+ *
+ * Only call here when all of the index-related fields in 'scan' have already
+ * been initialized.
+ */
+void
+batchscan_init(IndexScanDesc scan)
+{
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+
+	scan->batchringbuf.scanPos.valid = false;
+	scan->batchringbuf.markPos.valid = false;
+
+	scan->batchringbuf.markBatch = NULL;
+	scan->batchringbuf.headBatch = 0;
+	scan->batchringbuf.nextBatch = 0;
+
+	scan->usebatchring = true;
+}
+
+/*
+ * Set a mark from scanPos position
+ *
+ * Saves the current scan position and associated batch so that the scan can
+ * be restored to this point later, via tableam_util_batchscan_restore_pos from
+ * the table AM.  The marked batch is retained and not freed until a new mark
+ * is set or the scan ends (or until the mark is restored).
+ */
+void
+batchscan_mark_pos(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
+	BatchRingItemPos *markPos = &batchringbuf->markPos;
+	IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch);
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+	bool		freeMarkBatch;
+
+	Assert(scan->MVCCScan);
+
+	/*
+	 * Free the previous mark batch (if any) -- but only if it isn't our
+	 * scanBatch (defensively make sure that markBatch isn't some later
+	 * still-needed batch, too)
+	 */
+	if (!markBatch || markBatch == scanBatch)
+	{
+		/* Definitely no markBatch that we should free now */
+		freeMarkBatch = false;
+	}
+	else if (likely(!index_scan_batch_loaded(scan, markPos->batch)))
+	{
+		/* Definitely have a no-longer-loaded markBatch to free */
+		freeMarkBatch = true;
+	}
+	else
+	{
+		/*
+		 * index_scan_batch_loaded indicates that markPos->batch is loaded,
+		 * but after uint8 overflow a stale batch offset can alias a
+		 * currently-loaded range (false positive).  Confirm by checking
+		 * whether the batch pointer in markPos->batch's slot still matches.
+		 */
+		freeMarkBatch = (index_scan_batch(scan, markPos->batch) != markBatch);
+	}
+
+	if (freeMarkBatch)
+	{
+		/* Free markBatch, since it isn't loaded/needed for batchringbuf */
+		batchringbuf->markBatch = NULL; /* else call won't free markBatch */
+		tableam_util_free_batch(scan, markBatch);
+	}
+
+	/* copy the scan's position */
+	batchringbuf->markPos = *scanPos;
+	batchringbuf->markBatch = scanBatch;
+}
+
+/* ----------------------------------------------------------------
+ *			utility functions called by table AMs
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Restore mark to scanPos position
+ *
+ * Restores the scan to a position saved by batchscan_mark_pos earlier.  The
+ * scan's markPos becomes its scanPos.  The marked batch is restored as the
+ * current scanBatch when needed.
+ *
+ * We just discard all batches (other than markBatch/restored scanBatch),
+ * except when markBatch is already the scan's current scanBatch.
+ */
+void
+tableam_util_batchscan_restore_pos(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
+	BatchRingItemPos *markPos = &batchringbuf->markPos;
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+	IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch);
+
+	Assert(scan->MVCCScan);
+	Assert(scan->xs_heapfetch);
+	Assert(markPos->valid);
+
+	if (scanBatch == markBatch)
+	{
+		/* markBatch is already scanBatch; needn't change batchringbuf */
+		Assert(scanPos->batch == markPos->batch);
+
+		scanPos->item = markPos->item;
+		return;
+	}
+
+	/*
+	 * markBatch is behind scanBatch, and so must not be saved in ring buffer
+	 * anymore.  We have to deal with restoring the mark the hard way: by
+	 * invalidating all other loaded batches.  This is similar to the case
+	 * where the scan direction changes and the scan actually crosses
+	 * batch/index page boundaries (see tableam_util_scanbatch_dirchange).
+	 *
+	 * First, free all batches that are still in the ring buffer.
+	 */
+	for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
+	{
+		IndexScanBatch batch = index_scan_batch(scan, i);
+
+		Assert(batch != markBatch);
+
+		tableam_util_free_batch(scan, batch);
+	}
+
+	/*
+	 * Next "append" standalone markBatch, which will become scanBatch
+	 * (scanBatch is always the ring buffer's headBatch)
+	 */
+	markPos->batch = 0;
+	batchringbuf->scanPos = *markPos;
+	batchringbuf->nextBatch = batchringbuf->headBatch = markPos->batch;
+	index_scan_batch_append(scan, markBatch);
+	Assert(index_scan_batch(scan, batchringbuf->scanPos.batch) == markBatch);
+
+	/*
+	 * Finally, call amposreset to let index AM know to invalidate any private
+	 * state that independently tracks the scan's progress
+	 */
+	if (scan->indexRelation->rd_indam->amposreset)
+		scan->indexRelation->rd_indam->amposreset(scan, markBatch);
+
+	/*
+	 * Note: markBatch.deadItems[] might already contain dead items, and might
+	 * yet have more dead items saved.  tableam_util_free_batch is prepared
+	 * for that.
+	 */
+}
+
+/*
+ * Reset state used for a batch index scan
+ */
+void
+tableam_util_batchscan_reset(IndexScanDesc scan, bool endscan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+	bool		markBatchFreed = false;
+
+	batchringbuf->scanPos.valid = false;
+	batchringbuf->markPos.valid = false;
+
+	/* Ensure batch_free won't skip the old markBatch in the loop below */
+	batchringbuf->markBatch = NULL;
+
+	for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
+	{
+		IndexScanBatch batch = index_scan_batch(scan, i);
+
+		if (batch == markBatch)
+			markBatchFreed = true;
+
+		batch_free(scan, batch, !endscan);
+	}
+
+	if (!markBatchFreed && unlikely(markBatch))
+		batch_free(scan, markBatch, !endscan);
+
+	batchringbuf->headBatch = 0;
+	batchringbuf->nextBatch = 0;
+}
+
+/*
+ * Free resources at end of batch index scan
+ *
+ * Called when an index scan is being ended, right before the owning scan
+ * descriptor goes away.  Cleans up all batch-related resources.
+ */
+void
+tableam_util_batchscan_end(IndexScanDesc scan)
+{
+	/* Free all remaining loaded batches (even markBatch), bypassing cache */
+	tableam_util_batchscan_reset(scan, true);
+
+	for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+	{
+		IndexScanBatch cached = scan->batchcache[i];
+
+		if (cached == NULL)
+			continue;
+
+		if (cached->deadItems)
+			pfree(cached->deadItems);
+		pfree(batch_alloc_base(scan, cached));
+	}
+}
+
+/*
+ * Handle cross-batch change in scan direction
+ *
+ * Called by table AM when its scan changes direction in a way that
+ * necessitates backing the scan up to an index page originally associated
+ * with a now-freed batch.
+ *
+ * When we return, batchringbuf will only contain one batch (the current
+ * headBatch/scanBatch) and will look as if the new scan direction had been
+ * used from the start.  Caller can then safely pass this batch to amgetbatch
+ * to determine which batch comes next in the new scan direction.  This
+ * approach isn't particularly efficient, but it works well enough for what
+ * ought to be a relatively rare occurrence.
+ */
+void
+tableam_util_scanbatch_dirchange(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	IndexScanBatch scanBatch;
+
+	/*
+	 * Release batches starting from the current "tail" batch, working
+	 * backwards until the current head batch (which is also the current
+	 * scanBatch) is the only batch that hasn't been freed
+	 */
+	while (index_scan_batch_count(scan) > 1)
+	{
+		uint8		tailidx = batchringbuf->nextBatch - 1;
+		IndexScanBatch tail = index_scan_batch(scan, tailidx);
+
+		Assert(tailidx != batchringbuf->scanPos.batch);
+
+		tableam_util_free_batch(scan, tail);
+		batchringbuf->nextBatch--;
+	}
+
+	/* scanBatch is now the only batch still loaded */
+	Assert(batchringbuf->headBatch == batchringbuf->scanPos.batch);
+	scanBatch = index_scan_batch(scan, batchringbuf->headBatch);
+
+	/*
+	 * Flip scanBatch's scan direction to reflect the reversal.  Also reset
+	 * any index AM state that independently tracks scan progress.
+	 */
+	scanBatch->dir = -scanBatch->dir;
+	if (scan->indexRelation->rd_indam->amposreset)
+		scan->indexRelation->rd_indam->amposreset(scan, scanBatch);
+}
+
+/*
+ * Record that scanPos item is dead
+ *
+ * Records an offset to the current scanBatch/scanPos item, saving it in
+ * scanBatch's deadItems array.  The items' index tuples will later be
+ * marked LP_DEAD when current scanBatch is freed.
+ */
+void
+tableam_util_scanpos_killitem(IndexScanDesc scan)
+{
+	BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
+	IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch);
+
+	if (scanBatch->deadItems == NULL)
+		scanBatch->deadItems = palloc_array(int, scan->maxitemsbatch);
+	if (scanBatch->numDead < scan->maxitemsbatch)
+		scanBatch->deadItems[scanBatch->numDead++] = scanPos->item;
+}
+
+/*
+ * Release resources associated with a batch
+ *
+ * Called by table AM's ordered index scan implementation when it is finished
+ * with a batch and wishes to release its resources.
+ *
+ * We call amunguardbatch to drop the TID recycling interlock (e.g. buffer
+ * pin) when it hasn't been dropped yet.  For plain MVCC scans (where
+ * batchImmediateUnguard is set), the interlock was already dropped eagerly
+ * in indexam_util_batch_unlock, so we skip the amunguardbatch call here.
+ * Index-only scans must delay dropping the interlock until visibility is
+ * resolved for all items in the batch, so amunguardbatch may still need to
+ * act here.  For non-MVCC snapshot scans, the interlock is always held
+ * until amunguardbatch drops it here -- this is the only place willing to
+ * unguard a non-MVCC scan's batch.
+ *
+ * When the batch has dead items (numDead > 0) and the index AM provides an
+ * amkillitemsbatch callback, we call it to set LP_DEAD bits in the index
+ * page.  We always recycle the batch memory via indexam_util_batch_release.
+ *
+ * Note: Calling here when 'batch' is also batchringbuf.markBatch is a no-op.
+ * Callers that don't want this should set batchringbuf.markBatch to NULL
+ * before calling us.  Note that markBatch has to be explicitly freed.
+ */
+void
+tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	/* Pass through to implementation function, with allow_cache=true */
+	batch_free(scan, batch, true);
+}
+
+/*
+ * Free a batch, optionally caching it for reuse.
+ *
+ * tableam_util_free_batch implementation function.  We split out the
+ * implementation like this because we don't want to give external table AM
+ * callers the option of passing allow_cache=false.
+ *
+ * When allow_cache is true, we try to store the batch in the scan's batch
+ * cache for later reuse.  When allow_cache is false (typically because the
+ * scan is shutting down), we pfree the caller's batch unconditionally.
+ */
+static void
+batch_free(IndexScanDesc scan, IndexScanBatch batch, bool allow_cache)
+{
+	Assert(!(scan->batchImmediateUnguard && batch->isGuarded));
+	Assert(batch->isGuarded || scan->MVCCScan);
+
+	/* don't free caller's batch if it is scan's current markBatch */
+	if (batch == scan->batchringbuf.markBatch)
+		return;
+
+	/* Drop TID recycling interlock via amunguardbatch as needed */
+	if (!scan->batchImmediateUnguard && batch->isGuarded)
+		tableam_util_unguard_batch(scan, batch);
+
+	/*
+	 * Let the index AM set LP_DEAD bits in the index page, if applicable.
+	 *
+	 * batch.deadItems[] is now in whatever order the scan returned items in.
+	 * We might have even saved the same item/TID twice.
+	 *
+	 * Sort and unique-ify deadItems[].  That way the index AM can safely
+	 * assume that items will always be in their original index page order.
+	 */
+	if (batch->numDead > 0 &&
+		scan->indexRelation->rd_indam->amkillitemsbatch != NULL)
+	{
+		if (batch->numDead > 1)
+		{
+			qsort(batch->deadItems, batch->numDead, sizeof(int),
+				  batch_compare_int);
+			batch->numDead = qunique(batch->deadItems, batch->numDead,
+									 sizeof(int), batch_compare_int);
+		}
+
+		scan->indexRelation->rd_indam->amkillitemsbatch(scan, batch);
+	}
+
+	/*
+	 * Try to store caller's batch in this amgetbatch scan's cache of
+	 * previously released batches first (when caller requests it)
+	 */
+	if (allow_cache && batch_cache_store(scan, batch))
+		return;
+
+	/* just pfree the caller's batch (plus batch's deadItems, if any) */
+	if (batch->deadItems)
+		pfree(batch->deadItems);
+	pfree(batch_alloc_base(scan, batch));
+}
+
+/*
+ * Drop the batch's TID recycling interlock via amunguardbatch
+ *
+ * Called by the table AM when it's safe to drop whatever interlock the index
+ * AM holds to prevent unsafe concurrent TID recycling by VACUUM (typically a
+ * buffer pin on the batch's index page in batch's opaque area).
+ */
+void
+tableam_util_unguard_batch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	/* Should be called exactly once iff !batchImmediateUnguard */
+	Assert(!scan->batchImmediateUnguard);
+	Assert(batch->isGuarded);
+
+	scan->indexRelation->rd_indam->amunguardbatch(scan, batch);
+
+	batch->isGuarded = false;
+}
+
+/* ----------------------------------------------------------------
+ *			utility functions called by amgetbatch index AMs
+ *
+ * These functions handle batch allocation, buffer lock/pin release, and
+ * batch resource recycling.
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Release batch's index page buffer lock
+ *
+ * Unlocks the given buffer in preparation for amgetbatch returning items
+ * saved in that batch.  Performs extra steps required by amgetbatch callers
+ * in passing.
+ *
+ * Only call here when a batch has one or more matching items to return using
+ * amgetbatch (or for amgetbitmap to load into its bitmap of matching TIDs).
+ * When an index page has no matches, it's always safe for index AMs to drop
+ * both the lock and the pin for themselves.
+ *
+ * Note: It is convenient for index AMs that implement both amgetbatch and
+ * amgetbitmap to consistently use the same batch management approach, since
+ * that avoids introducing special cases to lower-level code.  We drop both
+ * the lock and the pin on batch's page on behalf of amgetbitmap callers.
+ *
+ * For amgetbatch callers, when batchImmediateUnguard is set (plain MVCC
+ * scans), we also release the pin here (the TID recycling interlock), so
+ * that no later amunguardbatch callback will be needed.  Otherwise the table
+ * AM will call amunguardbatch later when it's safe to drop the interlock.
+ *
+ * Index AMs whose TID recycling interlock is not just a buffer pin, or whose
+ * amunguardbatch does not simply release a pin, are not obligated to use this
+ * function.  They can implement their own equivalent.  Such index AMs are also
+ * free to use the batch LSN field themselves; their amkillitemsbatch routine
+ * can use that LSN in the usual way, or in whatever way the AM deems necessary
+ * (core code will not use it for any other purpose).
+ */
+void
+indexam_util_batch_unlock(IndexScanDesc scan, IndexScanBatch batch, Buffer buf)
+{
+	/* batch must have one or more matching items returned by index AM */
+	Assert(batch->firstItem >= 0 && batch->firstItem <= batch->lastItem);
+
+	if (scan->usebatchring)
+	{
+		/* amgetbatch (not amgetbitmap) caller */
+		Assert(scan->heapRelation != NULL);
+
+		/*
+		 * Have to set batch->lsn so that amkillitemsbatch has a way to detect
+		 * when concurrent heap TID recycling by VACUUM might have taken
+		 * place.  It'll only be safe to set any index tuple LP_DEAD bits when
+		 * the page LSN hasn't advanced.
+		 *
+		 * Plain MVCC scans (batchImmediateUnguard) also release the pin now,
+		 * dropping the TID recycling interlock so that no amunguardbatch
+		 * callback will be needed later.  The index AM caller must clear its
+		 * own opaque buf field after we return.
+		 *
+		 * Non-immediate-unguard scans retain the pin; the table AM will call
+		 * amunguardbatch to drop the interlock when ready.
+		 */
+		batch->lsn = BufferGetLSNAtomic(buf);
+		if (scan->batchImmediateUnguard)
+		{
+			/* drop both the lock and the pin */
+			UnlockReleaseBuffer(buf);
+		}
+		else
+		{
+			/* just drop the lock (hold on to interlock pin) */
+			UnlockBuffer(buf);
+		}
+
+		/* If we released buffer pin, batch is now unguarded */
+		batch->isGuarded = !scan->batchImmediateUnguard;
+	}
+	else
+	{
+		/* amgetbitmap (not amgetbatch) caller */
+		Assert(scan->heapRelation == NULL);
+
+		/* drop both the lock and the pin */
+		UnlockReleaseBuffer(buf);
+	}
+}
+
+/*
+ * Allocate a new batch
+ *
+ * Used by index AMs that support amgetbatch interface (both during amgetbatch
+ * and amgetbitmap scans).
+ *
+ * Returns IndexScanBatch with space to fit scan->maxitemsbatch-many
+ * BatchMatchingItem entries.  This will either be a newly allocated batch, or
+ * a batch recycled from the cache managed by indexam_util_batch_release.  See
+ * comments above indexam_util_batch_release.
+ *
+ * Housekeeping fields (knownEndBackward/Forward, isGuarded, firstItem,
+ * lastItem, numDead, deadItems, currTuples) are initialized here.  The table
+ * AM's batch_init callback is invoked to initialize its opaque area.
+ * The index AM caller is responsible for filling in its per-batch opaque
+ * fields and the matching items[] array.
+ *
+ * Once the batch has the required matching items, caller should generally
+ * pass it to indexam_util_batch_unlock, ahead of it being returned through
+ * index AM's amgetbatch routine.  If it turns out that the batch won't need
+ * to be returned like this (e.g., due to the scan having no more matches),
+ * caller should pass its empty/unused batch to indexam_util_batch_release.
+ */
+IndexScanBatch
+indexam_util_batch_alloc(IndexScanDesc scan)
+{
+	IndexScanBatch batch = NULL;
+	bool		new_alloc = false;
+
+	/*
+	 * Lazily compute batch_table_offset on first allocation.  This combines
+	 * the table AM and index AM opaque sizes into a single offset that can be
+	 * used to find the table AM opaque area (and the true allocation base)
+	 * from the batch pointer.
+	 */
+	if (scan->batch_table_offset == 0 &&
+		(scan->batch_index_opaque_size > 0 ||
+		 (scan->xs_heapfetch && scan->xs_heapfetch->batch_opaque_size > 0)))
+	{
+		uint16		table_opaque = scan->xs_heapfetch ?
+			scan->xs_heapfetch->batch_opaque_size : 0;
+
+		scan->batch_table_offset = table_opaque +
+			scan->batch_index_opaque_size;
+	}
+
+	/* First look for an existing batch from the cache */
+	if (scan->usebatchring)
+	{
+		for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+		{
+			if (scan->batchcache[i] != NULL)
+			{
+				/* Return cached unreferenced batch */
+				batch = scan->batchcache[i];
+				scan->batchcache[i] = NULL;
+				break;
+			}
+		}
+	}
+	else if (scan->batchcache[0] != NULL)
+	{
+		/*
+		 * Reuse cached batch from prior amgetbitmap iteration.  This path is
+		 * hit on every amgetbitmap call here after the scan's first.
+		 */
+		batch = scan->batchcache[0];
+		scan->batchcache[0] = NULL;
+	}
+
+	if (!batch)
+	{
+		Size		prefix_sz;
+		Size		base_sz;
+		Size		trailing_sz;
+		Size		allocsz;
+		char	   *raw;
+
+		/* AM opaque areas before the batch pointer */
+		prefix_sz = scan->batch_table_offset;
+
+		/* IndexScanBatchData header + items[] */
+		base_sz = offsetof(IndexScanBatchData, items) +
+			sizeof(BatchMatchingItem) * scan->maxitemsbatch;
+
+		/*
+		 * Trailing data after items[]: per-item data (owned by table AM),
+		 * then currTuples workspace (owned by index AM, read by table AM)
+		 */
+		trailing_sz = 0;
+		if (scan->xs_want_itup)
+		{
+			if (scan->xs_heapfetch &&
+				scan->xs_heapfetch->batch_per_item_size > 0)
+				trailing_sz += MAXALIGN(scan->xs_heapfetch->batch_per_item_size *
+										scan->maxitemsbatch);
+			trailing_sz += scan->batch_tuples_workspace;
+		}
+
+		allocsz = prefix_sz + MAXALIGN(base_sz) + trailing_sz;
+		raw = palloc(allocsz);
+		batch = (IndexScanBatch) (raw + prefix_sz);
+
+		/* Set up currTuples pointer for index-only scans */
+		if (scan->xs_want_itup && scan->batch_tuples_workspace > 0)
+		{
+			Size		itemsEnd = MAXALIGN(base_sz);
+			Size		tableTrailing = 0;
+
+			if (scan->xs_heapfetch &&
+				scan->xs_heapfetch->batch_per_item_size > 0)
+				tableTrailing = MAXALIGN(scan->xs_heapfetch->batch_per_item_size *
+										 scan->maxitemsbatch);
+			batch->currTuples = (char *) batch + itemsEnd + tableTrailing;
+		}
+		else
+			batch->currTuples = NULL;
+
+		/*
+		 * Batches allocate deadItems lazily (though note that cached batches
+		 * keep their deadItems allocation when recycled)
+		 */
+		batch->deadItems = NULL;
+		new_alloc = true;
+	}
+
+	/* xs_want_itup scans must get a currTuples space */
+	Assert(!(scan->xs_want_itup && scan->batch_tuples_workspace > 0 &&
+			 batch->currTuples == NULL));
+
+	/* Let the table AM initialize its per-batch opaque area */
+	if (scan->xs_heapfetch)
+		table_index_fetch_batch_init(scan, batch, new_alloc);
+
+	/* shared initialization */
+	batch->knownEndBackward = false;
+	batch->knownEndForward = false;
+	batch->isGuarded = false;
+	batch->firstItem = -1;
+	batch->lastItem = -1;
+	batch->numDead = 0;
+
+	return batch;
+}
+
+/*
+ * Release allocated batch
+ *
+ * This function is called by index AMs to release a batch allocated by
+ * indexam_util_batch_alloc.  Batches are cached here for reuse to reduce
+ * palloc/pfree overhead.
+ *
+ * It's safe to release a batch immediately when it was used to read a page
+ * that returned no matches to the scan.  Batches actually returned by index
+ * AM's amgetbatch routine (i.e. batches for pages with one or more matches)
+ * must be released by tableam_util_free_batch, which caches or frees the
+ * batch itself after calling the index AM's amkillitemsbatch routine (if
+ * any).  Index AMs that use batches should call here to release a batch
+ * from their amgetbatch or amgetbitmap routines.
+ *
+ * The rules for batch ownership differ slightly for amgetbitmap scans; see
+ * the amgetbitmap documentation in doc/src/sgml/indexam.sgml for details.
+ */
+void
+indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch)
+{
+	if (!scan->usebatchring)
+	{
+		/*
+		 * amgetbitmap scan caller.
+		 *
+		 * amgetbitmap routines are required to allocate no more than one
+		 * batch at a time, so we'll always have a free slot.
+		 */
+		Assert(scan->batchcache[0] == NULL);
+		Assert(scan->heapRelation == NULL);
+		Assert(batch->deadItems == NULL);
+		Assert(batch->currTuples == NULL);
+
+		scan->batchcache[0] = batch;
+		return;
+	}
+
+	/* amgetbatch scan caller */
+	Assert(scan->heapRelation != NULL);
+
+	/*
+	 * Try to store caller's batch in this amgetbatch scan's cache of
+	 * previously released batches first
+	 */
+	if (batch_cache_store(scan, batch))
+		return;
+
+	/* Cache full; just free the caller's batch */
+	if (batch->deadItems)
+		pfree(batch->deadItems);
+	pfree(batch_alloc_base(scan, batch));
+}
+
+/*
+ * Try to store a batch in the scan's batch cache.
+ *
+ * Returns true if a free slot was found, false if the cache is full.
+ */
+static inline bool
+batch_cache_store(IndexScanDesc scan, IndexScanBatch batch)
+{
+	for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+	{
+		if (scan->batchcache[i] == NULL)
+		{
+			scan->batchcache[i] = batch;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+/*
+ * qsort comparison function for int arrays
+ */
+static int
+batch_compare_int(const void *va, const void *vb)
+{
+	int			a = *((const int *) va);
+	int			b = *((const int *) vb);
+
+	return pg_cmp_s32(a, b);
+}
diff --git a/src/backend/access/index/meson.build b/src/backend/access/index/meson.build
index da64cb595..83dfa3f2b 100644
--- a/src/backend/access/index/meson.build
+++ b/src/backend/access/index/meson.build
@@ -5,4 +5,5 @@ backend_sources += files(
   'amvalidate.c',
   'genam.c',
   'indexam.c',
+  'indexbatch.c',
 )
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index cb921ca2e..a37869b71 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -179,18 +179,15 @@ hold on to the pin (used when reading from the leaf page) until _after_
 they're done visiting the heap (for TIDs from pinned leaf page) prevents
 concurrent TID recycling.  VACUUM cannot get a conflicting cleanup lock
 until the index scan is totally finished processing its leaf page.
+This is required by any index AM that implements the amgetbatch
+interface.  (See also doc/src/sgml/indexam.sgml.)
 
-This approach is fairly coarse, so we avoid it whenever possible.  In
-practice most index scans won't hold onto their pin, and so won't block
-VACUUM.  These index scans must deal with TID recycling directly, which is
-more complicated and not always possible.  See later section on making
-concurrent TID recycling safe.
-
-Opportunistic index tuple deletion performs almost the same page-level
-modifications while only holding an exclusive lock.  This is safe because
-there is no question of TID recycling taking place later on -- only VACUUM
-can make TIDs recyclable.  See also simple deletion and bottom-up
-deletion, below.
+Opportunistic index tuple deletion performs the same page-level
+modifications as VACUUM, while only holding an exclusive lock.  This is
+safe because there is no question of TID recycling taking place -- only
+VACUUM can make TIDs recyclable.  In other words, VACUUM's cleanup lock
+serves to protect non-MVCC snapshot scans from concurrent TID recycling
+hazards; it doesn't protect the B-Tree structure itself.
 
 Because a pin is not always held, and a page can be split even while
 someone does hold a pin on it, it is possible that an indexscan will
@@ -440,54 +437,6 @@ whenever it is subsequently taken from the FSM for reuse.  The deleted
 page's contents will be overwritten by the split operation (it will become
 the new right sibling page).
 
-Making concurrent TID recycling safe
-------------------------------------
-
-As explained in the earlier section about deleting index tuples during
-VACUUM, we implement a locking protocol that allows individual index scans
-to avoid concurrent TID recycling.  Index scans opt-out (and so drop their
-leaf page pin when visiting the heap) whenever it's safe to do so, though.
-Dropping the pin early is useful because it avoids blocking progress by
-VACUUM.  This is particularly important with index scans used by cursors,
-since idle cursors sometimes stop for relatively long periods of time.  In
-extreme cases, a client application may hold on to an idle cursors for
-hours or even days.  Blocking VACUUM for that long could be disastrous.
-
-Index scans that don't hold on to a buffer pin are protected by holding an
-MVCC snapshot instead.  This more limited interlock prevents wrong answers
-to queries, but it does not prevent concurrent TID recycling itself (only
-holding onto the leaf page pin while accessing the heap ensures that).
-
-Index-only scans can never drop their buffer pin, since they are unable to
-tolerate having a referenced TID become recyclable.  Index-only scans
-typically just visit the visibility map (not the heap proper), and so will
-not reliably notice that any stale TID reference (for a TID that pointed
-to a dead-to-all heap item at first) was concurrently marked LP_UNUSED in
-the heap by VACUUM.  This could easily allow VACUUM to set the whole heap
-page to all-visible in the visibility map immediately afterwards.  An MVCC
-snapshot is only sufficient to avoid problems during plain index scans
-because they must access granular visibility information from the heap
-proper.  A plain index scan will even recognize LP_UNUSED items in the
-heap (items that could be recycled but haven't been just yet) as "not
-visible" -- even when the heap page is generally considered all-visible.
-
-LP_DEAD setting of index tuples by the kill_prior_tuple optimization
-(described in full in simple deletion, below) is also more complicated for
-index scans that drop their leaf page pins.  We must be careful to avoid
-LP_DEAD-marking any new index tuple that looks like a known-dead index
-tuple because it happens to share the same TID, following concurrent TID
-recycling.  It's just about possible that some other session inserted a
-new, unrelated index tuple, on the same leaf page, which has the same
-original TID.  It would be totally wrong to LP_DEAD-set this new,
-unrelated index tuple.
-
-We handle this kill_prior_tuple race condition by having affected index
-scans conservatively assume that any change to the leaf page at all
-implies that it was reached by btbulkdelete in the interim period when no
-buffer pin was held.  This is implemented by not setting any LP_DEAD bits
-on the leaf page at all when the page's LSN has changed.  (This is why we
-implement "fake" LSNs for unlogged index relations.)
-
 Fastpath For Index Insertion
 ----------------------------
 
@@ -734,7 +683,7 @@ of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
 During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
+and we never LP_DEAD-mark tuples. We do this because the oldest xmin
 on the standby server can be older than the oldest xmin on the primary
 server, which means tuples can be marked LP_DEAD even when they are
 still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
@@ -756,9 +705,8 @@ non-MVCC scans is not required on standby nodes. We still get a full
 cleanup lock when replaying VACUUM records during recovery, but recovery
 does not need to lock every leaf page (only those leaf pages that have
 items to delete) -- that's sufficient to avoid breaking index-only scans
-during recovery (see section above about making TID recycling safe). That
-leaves concern only for plain index scans. (XXX: Not actually clear why
-this is totally unnecessary during recovery.)
+during recovery. That leaves concern only for plain index scans.
+(XXX: Not actually clear why this is totally unnecessary during recovery.)
 
 MVCC snapshot plain index scans are always safe, for the same reasons that
 they're safe during original execution.  HeapTupleSatisfiesToast() doesn't
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 054703861..0046c84d1 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1060,6 +1060,9 @@ _bt_relbuf(Relation rel, Buffer buf)
  * Lock is acquired without acquiring another pin.  This is like a raw
  * LockBuffer() call, but performs extra steps needed by Valgrind.
  *
+ * Note: _bt_batch_unlock in nbtsearch.c (indexam_util_batch_unlock wrapper
+ * function) has matching Valgrind buffer lock instrumentation.
+ *
  * Note: Caller may need to call _bt_checkpage() with buf when pin on buf
  * wasn't originally acquired in _bt_getbuf() or _bt_relandgetbuf().
  */
@@ -1101,13 +1104,19 @@ _bt_unlockbuf(Relation rel, Buffer buf)
 	 * Buffer is pinned and locked, which means that it is expected to be
 	 * defined and addressable.  Check that proactively.
 	 */
-	VALGRIND_CHECK_MEM_IS_DEFINED(BufferGetPage(buf), BLCKSZ);
+#if defined(USE_VALGRIND)
+	Page		page = BufferGetPage(buf);
+
+	VALGRIND_CHECK_MEM_IS_DEFINED(page, BLCKSZ);
+#endif
 
 	/* LockBuffer() asserts that pin is held by this backend */
 	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
+#if defined(USE_VALGRIND)
 	if (!RelationUsesLocalBuffers(rel))
-		VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(buf), BLCKSZ);
+		VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ);
+#endif
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtreadpage.c b/src/backend/access/nbtree/nbtreadpage.c
index 2ba1ca660..39c661498 100644
--- a/src/backend/access/nbtree/nbtreadpage.c
+++ b/src/backend/access/nbtree/nbtreadpage.c
@@ -32,6 +32,7 @@ typedef struct BTReadPageState
 {
 	/* Input parameters, set by _bt_readpage for _bt_checkkeys */
 	ScanDirection dir;			/* current scan direction */
+	BlockNumber currpage;		/* current page being read */
 	OffsetNumber minoff;		/* Lowest non-pivot tuple's offset */
 	OffsetNumber maxoff;		/* Highest non-pivot tuple's offset */
 	IndexTuple	finaltup;		/* Needed by scans with array keys */
@@ -63,14 +64,13 @@ static bool _bt_scanbehind_checkkeys(IndexScanDesc scan, ScanDirection dir,
 									 IndexTuple finaltup);
 static bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
 								  IndexTuple finaltup);
-static void _bt_saveitem(BTScanOpaque so, int itemIndex,
-						 OffsetNumber offnum, IndexTuple itup);
-static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
-								  OffsetNumber offnum, const ItemPointerData *heapTid,
-								  IndexTuple itup);
-static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
-									   OffsetNumber offnum,
-									   ItemPointer heapTid, int tupleOffset);
+static void _bt_saveitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+						 IndexTuple itup, int *tupleOffset);
+static int	_bt_setuppostingitems(IndexScanBatch newbatch, int itemIndex,
+								  OffsetNumber offnum, const ItemPointerData *tableTid,
+								  IndexTuple itup, int *tupleOffset);
+static inline void _bt_savepostingitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+									   ItemPointer tableTid, int baseOffset);
 static bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
 						  IndexTuple tuple, int tupnatts);
 static bool _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
@@ -111,15 +111,15 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
 
 
 /*
- *	_bt_readpage() -- Load data from current index page into so->currPos
+ *	_bt_readpage() -- Load data from current index page into newbatch.
  *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here.  Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate.  All other fields of so->currPos are
+ * Caller must have pinned and read-locked newbatch.buf; the buffer's state is
+ * not changed here.  Also, newbatch's moreLeft and moreRight must be valid;
+ * they are updated as appropriate.  All other fields of newbatch are
  * initialized from scratch here.
  *
  * We scan the current page starting at offnum and moving in the indicated
- * direction.  All items matching the scan keys are loaded into currPos.items.
+ * direction.  All items matching the scan keys are saved in newbatch.items.
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction
  * (could just be for the current primitive index scan when scan has arrays).
@@ -131,11 +131,12 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
  * Returns true if any matching items found on the page, false if none.
  */
 bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
-			 bool firstpage)
+_bt_readpage(IndexScanDesc scan, IndexScanBatch newbatch, ScanDirection dir,
+			 OffsetNumber offnum, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btnewbatch = BTBatchGetData(scan, newbatch);
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber minoff;
@@ -144,23 +145,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	bool		arrayKeys,
 				ignore_killed_tuples = scan->ignore_killed_tuples;
 	int			itemIndex,
+				tupleOffset = 0,
 				indnatts;
 
 	/* save the page/buffer block number, along with its sibling links */
-	page = BufferGetPage(so->currPos.buf);
+	page = BufferGetPage(btnewbatch->buf);
 	opaque = BTPageGetOpaque(page);
-	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-	so->currPos.prevPage = opaque->btpo_prev;
-	so->currPos.nextPage = opaque->btpo_next;
-	/* delay setting so->currPos.lsn until _bt_drop_lock_and_maybe_pin */
-	pstate.dir = so->currPos.dir = dir;
-	so->currPos.nextTupleOffset = 0;
+	pstate.currpage = btnewbatch->currPage = BufferGetBlockNumber(btnewbatch->buf);
+	btnewbatch->prevPage = opaque->btpo_prev;
+	btnewbatch->nextPage = opaque->btpo_next;
+	pstate.dir = newbatch->dir = dir;
 
 	/* either moreRight or moreLeft should be set now (may be unset later) */
-	Assert(ScanDirectionIsForward(dir) ? so->currPos.moreRight :
-		   so->currPos.moreLeft);
+	Assert(ScanDirectionIsForward(dir) ? btnewbatch->moreRight : btnewbatch->moreLeft);
 	Assert(!P_IGNORE(opaque));
-	Assert(BTScanPosIsPinned(so->currPos));
 	Assert(!so->needPrimScan);
 
 	/* initialize local variables */
@@ -188,14 +186,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	{
 		/* allow next/prev page to be read by other worker without delay */
 		if (ScanDirectionIsForward(dir))
-			_bt_parallel_release(scan, so->currPos.nextPage,
-								 so->currPos.currPage);
+			_bt_parallel_release(scan, btnewbatch->nextPage,
+								 btnewbatch->currPage);
 		else
-			_bt_parallel_release(scan, so->currPos.prevPage,
-								 so->currPos.currPage);
+			_bt_parallel_release(scan, btnewbatch->prevPage,
+								 btnewbatch->currPage);
 	}
 
-	PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);
+	PredicateLockPage(rel, pstate.currpage, scan->xs_snapshot);
 
 	if (ScanDirectionIsForward(dir))
 	{
@@ -212,11 +210,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 					!_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup))
 				{
 					/* Schedule another primitive index scan after all */
-					so->currPos.moreRight = false;
+					btnewbatch->moreRight = false;
 					so->needPrimScan = true;
 					if (scan->parallel_scan)
 						_bt_parallel_primscan_schedule(scan,
-													   so->currPos.currPage);
+													   btnewbatch->currPage);
 					return false;
 				}
 			}
@@ -280,26 +278,26 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				if (!BTreeTupleIsPosting(itup))
 				{
 					/* Remember it */
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset);
 					itemIndex++;
 				}
 				else
 				{
-					int			tupleOffset;
+					int			baseOffset;
 
 					/* Set up posting list state (and remember first TID) */
-					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+					baseOffset =
+						_bt_setuppostingitems(newbatch, itemIndex, offnum,
 											  BTreeTupleGetPostingN(itup, 0),
-											  itup);
+											  itup, &tupleOffset);
 					itemIndex++;
 
 					/* Remember all later TIDs (must be at least one) */
 					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
 					{
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(newbatch, itemIndex, offnum,
 											BTreeTupleGetPostingN(itup, i),
-											tupleOffset);
+											baseOffset);
 						itemIndex++;
 					}
 				}
@@ -339,12 +337,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		}
 
 		if (!pstate.continuescan)
-			so->currPos.moreRight = false;
+			btnewbatch->moreRight = false;
 
 		Assert(itemIndex <= MaxTIDsPerBTreePage);
-		so->currPos.firstItem = 0;
-		so->currPos.lastItem = itemIndex - 1;
-		so->currPos.itemIndex = 0;
+		newbatch->firstItem = 0;
+		newbatch->lastItem = itemIndex - 1;
 	}
 	else
 	{
@@ -361,11 +358,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 					!_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup))
 				{
 					/* Schedule another primitive index scan after all */
-					so->currPos.moreLeft = false;
+					btnewbatch->moreLeft = false;
 					so->needPrimScan = true;
 					if (scan->parallel_scan)
 						_bt_parallel_primscan_schedule(scan,
-													   so->currPos.currPage);
+													   btnewbatch->currPage);
 					return false;
 				}
 			}
@@ -466,27 +463,27 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				{
 					/* Remember it */
 					itemIndex--;
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset);
 				}
 				else
 				{
 					uint16		nitems = BTreeTupleGetNPosting(itup);
-					int			tupleOffset;
+					int			baseOffset;
 
 					/* Set up posting list state (and remember last TID) */
 					itemIndex--;
-					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+					baseOffset =
+						_bt_setuppostingitems(newbatch, itemIndex, offnum,
 											  BTreeTupleGetPostingN(itup, nitems - 1),
-											  itup);
+											  itup, &tupleOffset);
 
 					/* Remember all prior TIDs (must be at least one) */
 					for (int i = nitems - 2; i >= 0; i--)
 					{
 						itemIndex--;
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(newbatch, itemIndex, offnum,
 											BTreeTupleGetPostingN(itup, i),
-											tupleOffset);
+											baseOffset);
 					}
 				}
 			}
@@ -502,12 +499,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		 * be found there
 		 */
 		if (!pstate.continuescan)
-			so->currPos.moreLeft = false;
+			btnewbatch->moreLeft = false;
 
 		Assert(itemIndex >= 0);
-		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
-		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+		newbatch->firstItem = itemIndex;
+		newbatch->lastItem = MaxTIDsPerBTreePage - 1;
 	}
 
 	/*
@@ -524,7 +520,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	 */
 	Assert(!pstate.forcenonrequired);
 
-	return (so->currPos.firstItem <= so->currPos.lastItem);
+	return (newbatch->firstItem <= newbatch->lastItem);
 }
 
 /*
@@ -1027,90 +1023,91 @@ _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
 	return true;
 }
 
-/* Save an index item into so->currPos.items[itemIndex] */
+/* Save an index item into newbatch->items[itemIndex] */
 static void
-_bt_saveitem(BTScanOpaque so, int itemIndex,
-			 OffsetNumber offnum, IndexTuple itup)
+_bt_saveitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+			 IndexTuple itup, int *tupleOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
-
 	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
 
-	currItem->heapTid = itup->t_tid;
-	currItem->indexOffset = offnum;
-	if (so->currTuples)
+	newbatch->items[itemIndex].tableTid = itup->t_tid;
+	newbatch->items[itemIndex].indexOffset = offnum;
+
+	if (newbatch->currTuples)
 	{
 		Size		itupsz = IndexTupleSize(itup);
 
-		currItem->tupleOffset = so->currPos.nextTupleOffset;
-		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
-		so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+		newbatch->items[itemIndex].tupleOffset = *tupleOffset;
+		memcpy(newbatch->currTuples + *tupleOffset, itup, itupsz);
+		*tupleOffset += MAXALIGN(itupsz);
 	}
 }
 
 /*
  * Setup state to save TIDs/items from a single posting list tuple.
  *
- * Saves an index item into so->currPos.items[itemIndex] for TID that is
- * returned to scan first.  Second or subsequent TIDs for posting list should
- * be saved by calling _bt_savepostingitem().
+ * Saves an index item into newbatch->items[itemIndex] for the TID that is
+ * returned to scan first.  Second or subsequent TIDs for posting list should
+ * be saved by calling _bt_savepostingitem().
  *
- * Returns an offset into tuple storage space that main tuple is stored at if
- * needed.
+ * Returns baseOffset, the offset into tuple storage space at which the base
+ * tuple is stored (or 0 when no tuple storage is in use).
  */
 static int
-_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
-					  const ItemPointerData *heapTid, IndexTuple itup)
+_bt_setuppostingitems(IndexScanBatch newbatch, int itemIndex,
+					  OffsetNumber offnum, const ItemPointerData *tableTid,
+					  IndexTuple itup, int *tupleOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *item = &newbatch->items[itemIndex];
 
 	Assert(BTreeTupleIsPosting(itup));
 
-	currItem->heapTid = *heapTid;
-	currItem->indexOffset = offnum;
-	if (so->currTuples)
+	item->tableTid = *tableTid;
+	item->indexOffset = offnum;
+
+	if (newbatch->currTuples)
 	{
 		/* Save base IndexTuple (truncate posting list) */
 		IndexTuple	base;
 		Size		itupsz = BTreeTupleGetPostingOffset(itup);
 
 		itupsz = MAXALIGN(itupsz);
-		currItem->tupleOffset = so->currPos.nextTupleOffset;
-		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		item->tupleOffset = *tupleOffset;
+		base = (IndexTuple) (newbatch->currTuples + *tupleOffset);
 		memcpy(base, itup, itupsz);
 		/* Defensively reduce work area index tuple header size */
 		base->t_info &= ~INDEX_SIZE_MASK;
 		base->t_info |= itupsz;
-		so->currPos.nextTupleOffset += itupsz;
+		*tupleOffset += itupsz;
 
-		return currItem->tupleOffset;
+		return item->tupleOffset;
 	}
 
 	return 0;
 }
 
 /*
- * Save an index item into so->currPos.items[itemIndex] for current posting
+ * Save an index item into newbatch->items[itemIndex] for current posting
  * tuple.
  *
  * Assumes that _bt_setuppostingitems() has already been called for current
- * posting list tuple.  Caller passes its return value as tupleOffset.
+ * posting list tuple.  Caller passes its return value as baseOffset.
  */
 static inline void
-_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
-					ItemPointer heapTid, int tupleOffset)
+_bt_savepostingitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+					ItemPointer tableTid, int baseOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *item = &newbatch->items[itemIndex];
 
-	currItem->heapTid = *heapTid;
-	currItem->indexOffset = offnum;
+	item->tableTid = *tableTid;
+	item->indexOffset = offnum;
 
 	/*
 	 * Have index-only scans return the same base IndexTuple for every TID
 	 * that originates from the same posting list
 	 */
-	if (so->currTuples)
-		currItem->tupleOffset = tupleOffset;
+	if (newbatch->currTuples)
+		item->tupleOffset = baseOffset;
 }
 
 #define LOOK_AHEAD_REQUIRED_RECHECKS 	3
@@ -2821,14 +2818,15 @@ new_prim_scan:
 	 *
 	 * Note: We make a soft assumption that the current scan direction will
 	 * also be used within _bt_next, when it is asked to step off this page.
-	 * It is up to _bt_next to cancel this scheduled primitive index scan
-	 * whenever it steps to a page in the direction opposite currPos.dir.
+	 * The scan direction might be reversed during the next amgetbatch call,
+	 * but not before a call to btposreset that resets the array keys to the
+	 * first positions/elements used when scanning in this other direction.
 	 */
 	pstate->continuescan = false;	/* Tell _bt_readpage we're done... */
 	so->needPrimScan = true;	/* ...but call _bt_first again */
 
 	if (scan->parallel_scan)
-		_bt_parallel_primscan_schedule(scan, so->currPos.currPage);
+		_bt_parallel_primscan_schedule(scan, pstate->currpage);
 
 	/* Caller's tuple doesn't match the new qual */
 	return false;
@@ -2841,9 +2839,8 @@ end_toplevel_scan:
 	 * This ends the entire top-level scan in the current scan direction.
 	 *
 	 * Note: The scan's arrays (including any non-required arrays) are now in
-	 * their final positions for the current scan direction.  If the scan
-	 * direction happens to change, then the arrays will already be in their
-	 * first positions for what will then be the current scan direction.
+	 * their final positions for the current scan direction.  This is just
+	 * defensive.
 	 */
 	pstate->continuescan = false;	/* Tell _bt_readpage we're done... */
 	so->needPrimScan = false;	/* ...and don't call _bt_first again */
@@ -2910,17 +2907,9 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
 	/*
 	 * The array keys are now exhausted.
 	 *
-	 * Restore the array keys to the state they were in immediately before we
-	 * were called.  This ensures that the arrays only ever ratchet in the
-	 * current scan direction.
-	 *
-	 * Without this, scans could overlook matching tuples when the scan
-	 * direction gets reversed just before btgettuple runs out of items to
-	 * return, but just after _bt_readpage prepares all the items from the
-	 * scan's final page in so->currPos.  When we're on the final page it is
-	 * typical for so->currPos to get invalidated once btgettuple finally
-	 * returns false, which'll effectively invalidate the scan's array keys.
-	 * That hasn't happened yet, though -- and in general it may never happen.
+	 * Defensively restore the array keys to the positions they were in
+	 * immediately before we were called (i.e. to their final positions for
+	 * the current scan direction).
 	 */
 	_bt_start_array_keys(scan, -dir);
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 6d870e4eb..77af09f4c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,11 +161,13 @@ bthandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = btadjustmembers,
 		.ambeginscan = btbeginscan,
 		.amrescan = btrescan,
-		.amgettuple = btgettuple,
+		.amgettuple = NULL,
+		.amgetbatch = btgetbatch,
+		.amkillitemsbatch = btkillitemsbatch,
+		.amunguardbatch = btunguardbatch,
 		.amgetbitmap = btgetbitmap,
 		.amendscan = btendscan,
-		.ammarkpos = btmarkpos,
-		.amrestrpos = btrestrpos,
+		.amposreset = btposreset,
 		.amestimateparallelscan = btestimateparallelscan,
 		.aminitparallelscan = btinitparallelscan,
 		.amparallelrescan = btparallelrescan,
@@ -224,13 +226,13 @@ btinsert(Relation rel, Datum *values, bool *isnull,
 }
 
 /*
- *	btgettuple() -- Get the next tuple in the scan.
+ *	btgetbatch() -- Get the first or next batch of tuples in the scan
  */
-bool
-btgettuple(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+btgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	bool		res;
+	IndexScanBatch batch = priorbatch;
 
 	Assert(scan->heapRelation != NULL);
 
@@ -243,45 +245,20 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * If we've already initialized this scan, we can just advance it in
 		 * the appropriate direction.  If we haven't done so yet, we call
-		 * _bt_first() to get the first item in the scan.
+		 * _bt_first() to get the first batch in the scan.
 		 */
-		if (!BTScanPosIsValid(so->currPos))
-			res = _bt_first(scan, dir);
+		if (batch == NULL)
+			batch = _bt_first(scan, dir);
 		else
-		{
-			/*
-			 * Check to see if we should kill the previously-fetched tuple.
-			 */
-			if (scan->kill_prior_tuple)
-			{
-				/*
-				 * Yes, remember it for later. (We'll deal with all such
-				 * tuples at once right before leaving the index page.)  The
-				 * test for numKilled overrun is not just paranoia: if the
-				 * caller reverses direction in the indexscan then the same
-				 * item might get entered multiple times. It's not worth
-				 * trying to optimize that, so we don't detect it, but instead
-				 * just forget any excess entries.
-				 */
-				if (so->killedItems == NULL)
-					so->killedItems = palloc_array(int, MaxTIDsPerBTreePage);
-				if (so->numKilled < MaxTIDsPerBTreePage)
-					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
-			}
+			batch = _bt_next(scan, dir, batch);
 
-			/*
-			 * Now continue the scan.
-			 */
-			res = _bt_next(scan, dir);
-		}
-
-		/* If we have a tuple, return it ... */
-		if (res)
+		/* If we have a batch, return it ... */
+		if (batch)
 			break;
 		/* ... otherwise see if we need another primitive index scan */
 	} while (so->numArrayKeys && _bt_start_prim_scan(scan));
 
-	return res;
+	return batch;
 }
 
 /*
@@ -291,38 +268,43 @@ int64
 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	IndexScanBatch batch;
 	int64		ntids = 0;
-	ItemPointer heapTid;
+	ItemPointer tableTid;
 
 	Assert(scan->heapRelation == NULL);
 
 	/* Each loop iteration performs another primitive index scan */
 	do
 	{
-		/* Fetch the first page & tuple */
-		if (_bt_first(scan, ForwardScanDirection))
+		/* Fetch the first batch */
+		if ((batch = _bt_first(scan, ForwardScanDirection)))
 		{
-			/* Save tuple ID, and continue scanning */
-			heapTid = &scan->xs_heaptid;
-			tbm_add_tuples(tbm, heapTid, 1, false);
+			int			itemIndex = 0;
+
+			/* Save first tuple's TID */
+			tableTid = &batch->items[itemIndex].tableTid;
+			tbm_add_tuples(tbm, tableTid, 1, false);
 			ntids++;
 
 			for (;;)
 			{
-				/*
-				 * Advance to next tuple within page.  This is the same as the
-				 * easy case in _bt_next().
-				 */
-				if (++so->currPos.itemIndex > so->currPos.lastItem)
+				/* Advance to next TID within page-sized batch */
+				if (++itemIndex > batch->lastItem)
 				{
-					/* let _bt_next do the heavy lifting */
-					if (!_bt_next(scan, ForwardScanDirection))
+					/*
+					 * _bt_next releases the prior batch for bitmap callers
+					 * before allocating the next one, so only one batch is
+					 * ever used at a time
+					 */
+					itemIndex = 0;
+					batch = _bt_next(scan, ForwardScanDirection, batch);
+					if (!batch)
 						break;
 				}
 
-				/* Save tuple ID, and continue scanning */
-				heapTid = &so->currPos.items[so->currPos.itemIndex].heapTid;
-				tbm_add_tuples(tbm, heapTid, 1, false);
+				tableTid = &batch->items[itemIndex].tableTid;
+				tbm_add_tuples(tbm, tableTid, 1, false);
 				ntids++;
 			}
 		}
@@ -349,8 +331,6 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 
 	/* allocate private workspace */
 	so = palloc_object(BTScanOpaqueData);
-	BTScanPosInvalidate(so->currPos);
-	BTScanPosInvalidate(so->markPos);
 	if (scan->numberOfKeys > 0)
 		so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
 	else
@@ -364,19 +344,11 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->orderProcs = NULL;
 	so->arrayContext = NULL;
 
-	so->killedItems = NULL;		/* until needed */
-	so->numKilled = 0;
-
-	/*
-	 * We don't know yet whether the scan will be index-only, so we do not
-	 * allocate the tuple workspace arrays until btrescan.  However, we set up
-	 * scan->xs_itupdesc whether we'll need it or not, since that's so cheap.
-	 */
-	so->currTuples = so->markTuples = NULL;
-
-	scan->xs_itupdesc = RelationGetDescr(rel);
-
 	scan->opaque = so;
+	scan->xs_itupdesc = RelationGetDescr(rel);
+	scan->maxitemsbatch = MaxTIDsPerBTreePage;
+	scan->batch_index_opaque_size = MAXALIGN(sizeof(BTBatchData));
+	scan->batch_tuples_workspace = BLCKSZ;
 
 	return scan;
 }
@@ -390,64 +362,185 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pins */
-	if (BTScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_bt_killitems(scan);
-		BTScanPosUnpinIfPinned(so->currPos);
-		BTScanPosInvalidate(so->currPos);
-	}
-
-	/*
-	 * We prefer to eagerly drop leaf page pins before btgettuple returns.
-	 * This avoids making VACUUM wait to acquire a cleanup lock on the page.
-	 *
-	 * We cannot safely drop leaf page pins during index-only scans due to a
-	 * race condition involving VACUUM setting pages all-visible in the VM.
-	 * It's also unsafe for plain index scans that use a non-MVCC snapshot.
-	 *
-	 * Also opt out of dropping leaf page pins eagerly during bitmap scans.
-	 * Pins cannot be held for more than an instant during bitmap scans either
-	 * way, so we might as well avoid wasting cycles on acquiring page LSNs.
-	 *
-	 * See nbtree/README section on making concurrent TID recycling safe.
-	 *
-	 * Note: so->dropPin should never change across rescans.
-	 */
-	so->dropPin = (!scan->xs_want_itup &&
-				   IsMVCCLikeSnapshot(scan->xs_snapshot) &&
-				   scan->heapRelation != NULL);
-
-	so->markItemIndex = -1;
-	so->needPrimScan = false;
-	so->scanBehind = false;
-	so->oppositeDirCheck = false;
-	BTScanPosUnpinIfPinned(so->markPos);
-	BTScanPosInvalidate(so->markPos);
-
-	/*
-	 * Allocate tuple workspace arrays, if needed for an index-only scan and
-	 * not already done in a previous rescan call.  To save on palloc
-	 * overhead, both workspaces are allocated as one palloc block; only this
-	 * function and btendscan know that.
-	 */
-	if (scan->xs_want_itup && so->currTuples == NULL)
-	{
-		so->currTuples = (char *) palloc(BLCKSZ * 2);
-		so->markTuples = so->currTuples + BLCKSZ;
-	}
-
 	/*
 	 * Reset the scan keys
 	 */
 	if (scankey && scan->numberOfKeys > 0)
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
+	so->needPrimScan = false;
+	so->scanBehind = false;
+	so->oppositeDirCheck = false;
 	so->numberOfKeys = 0;		/* until _bt_preprocess_keys sets it */
 	so->numArrayKeys = 0;		/* ditto */
 }
 
+/*
+ *	btkillitemsbatch() -- Mark dead items' index tuples LP_DEAD
+ */
+void
+btkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	Relation	rel = scan->indexRelation;
+	BTBatchData *btbatch = BTBatchGetData(scan, batch);
+	Page		page;
+	BTPageOpaque opaque;
+	OffsetNumber minoff;
+	OffsetNumber maxoff;
+	bool		killedsomething = false;
+	Buffer		buf;
+	XLogRecPtr	latestlsn;
+
+	/* Table AM should have already released batch page's pin by now */
+	Assert(batch->numDead > 0);
+
+	buf = _bt_getbuf(rel, btbatch->currPage, BT_READ);
+
+	latestlsn = BufferGetLSNAtomic(buf);
+	Assert(batch->lsn <= latestlsn);
+	if (batch->lsn != latestlsn)
+	{
+		/* Modified, give up on hinting */
+		_bt_relbuf(rel, buf);
+		return;
+	}
+
+	page = BufferGetPage(buf);
+	opaque = BTPageGetOpaque(page);
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Iterate through batch->deadItems[] in leaf page order */
+	for (int i = 0; i < batch->numDead; i++)
+	{
+		int			itemIndex = batch->deadItems[i];
+		BatchMatchingItem *kitem = &batch->items[itemIndex];
+		OffsetNumber offnum = kitem->indexOffset;
+
+		Assert(itemIndex >= batch->firstItem && itemIndex <= batch->lastItem);
+		Assert(i == 0 ||
+			   offnum >= batch->items[batch->deadItems[i - 1]].indexOffset);
+
+		if (offnum < minoff)
+			continue;			/* pure paranoia */
+		while (offnum <= maxoff)
+		{
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
+
+			if (BTreeTupleIsPosting(ituple))
+			{
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->tableTid))
+						break;	/* out of posting list loop */
+
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < batch->numDead)
+						kitem = &batch->items[batch->deadItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing dead items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->tableTid))
+				killtuple = true;
+
+			/*
+			 * Mark index item as dead, if it isn't already.  Since this
+			 * happens while holding a shared buffer lock, it's possible that
+			 * multiple processes attempt to do this simultaneously, leading
+			 * to multiple full-page images being sent to WAL (if
+			 * wal_log_hints or data checksums are enabled), which is
+			 * undesirable.
+			 */
+			if (killtuple && !ItemIdIsDead(iid))
+			{
+				if (!killedsomething)
+				{
+					/*
+					 * Use the hint bit infrastructure to check if we can
+					 * update the page while just holding a share lock. If we
+					 * are not allowed, there's no point continuing.
+					 */
+					if (!BufferBeginSetHintBits(buf))
+						goto unlock_page;
+				}
+
+				/* found the item/all posting list items */
+				ItemIdMarkDead(iid);
+				killedsomething = true;
+				break;			/* out of inner search loop */
+			}
+			offnum = OffsetNumberNext(offnum);
+		}
+	}
+
+	/*
+	 * Since this can be redone later if needed, mark as dirty hint.
+	 *
+	 * Whenever we mark anything LP_DEAD, we also set the page's
+	 * BTP_HAS_GARBAGE flag, which is likewise just a hint.  (Note that we
+	 * only rely on the page-level flag in !heapkeyspace indexes.)
+	 */
+	if (killedsomething)
+	{
+		opaque->btpo_flags |= BTP_HAS_GARBAGE;
+		BufferFinishSetHintBits(buf, true, true);
+	}
+
+unlock_page:
+	_bt_relbuf(rel, buf);
+}
+
+/*
+ *	btunguardbatch() -- Drop batch's TID recycling interlock (buffer pin)
+ *
+ * Called by the table AM when it's safe to drop the buffer pin held to
+ * prevent concurrent TID recycling by VACUUM.
+ */
+void
+btunguardbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	BTBatchData *btbatch = BTBatchGetData(scan, batch);
+
+	/* Should be called exactly once iff !batchImmediateUnguard */
+	Assert(!scan->batchImmediateUnguard);
+	Assert(batch->isGuarded);
+
+	ReleaseBuffer(btbatch->buf);
+}
+
 /*
  *	btendscan() -- close down a scan
  */
@@ -456,116 +549,63 @@ btendscan(IndexScanDesc scan)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pins */
-	if (BTScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_bt_killitems(scan);
-		BTScanPosUnpinIfPinned(so->currPos);
-	}
-
-	so->markItemIndex = -1;
-	BTScanPosUnpinIfPinned(so->markPos);
-
-	/* No need to invalidate positions, the RAM is about to be freed. */
-
 	/* Release storage */
 	if (so->keyData != NULL)
 		pfree(so->keyData);
 	/* so->arrayKeys and so->orderProcs are in arrayContext */
 	if (so->arrayContext != NULL)
 		MemoryContextDelete(so->arrayContext);
-	if (so->killedItems != NULL)
-		pfree(so->killedItems);
-	if (so->currTuples != NULL)
-		pfree(so->currTuples);
-	/* so->markTuples should not be pfree'd, see btrescan */
 	pfree(so);
 }
 
 /*
- *	btmarkpos() -- save current scan position
+ *	btposreset() -- reset array key state for scan position change
+ *
+ * Called by the core system when the scan's logical position is about to
+ * change in a way that invalidates our array key state.  This happens when
+ * restoring a marked position, or when the scan crosses a batch boundary
+ * while moving in the opposite direction to the one originally used.
+ *
+ * For direction changes, the core system will have already flipped the
+ * batch's dir field before calling here; we use this updated direction when
+ * resetting our array keys.  For mark restoration, the batch's dir will
+ * retain its original value (from when btgetbatch returned it).
  */
 void
-btmarkpos(IndexScanDesc scan)
+btposreset(IndexScanDesc scan, IndexScanBatch batch)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btbatch = BTBatchGetData(scan, batch);
 
-	/* There may be an old mark with a pin (but no lock). */
-	BTScanPosUnpinIfPinned(so->markPos);
+	if (!so->numArrayKeys)
+		return;
 
 	/*
-	 * Just record the current itemIndex.  If we later step to next page
-	 * before releasing the marked position, _bt_steppage makes a full copy of
-	 * the currPos struct in markPos.  If (as often happens) the mark is moved
-	 * before we leave the page, we don't have to do that work.
+	 * Reset array keys to initial state for the batch's scan direction.  Also
+	 * clear needPrimScan and related flags.  These were set based on the soft
+	 * assumption that the scan would always proceed in the same direction.
+	 *
+	 * These steps work around the soft assumption being violated: they force
+	 * the scan to step to the next/previous page, making the arrays recover.
+	 * When we go to read that page, _bt_readpage will reliably determine if a
+	 * primitive scan really is needed based on the page's tuples.  If there's
+	 * a primitive scan, it will reposition the scan using new array values
+	 * (based on the tuples from the neighboring page we'll step on to).
+	 *
+	 * We need to reset the array key state in the correct direction so that
+	 * we won't get confused.  When the array keys are behind the key space
+	 * for the page we're stepping on to (behind in terms of the scan dir),
+	 * they will catch up automatically.  But when they're ahead of that
+	 * page's key space, the scan could miss matching tuples.
 	 */
-	if (BTScanPosIsValid(so->currPos))
-		so->markItemIndex = so->currPos.itemIndex;
+	_bt_start_array_keys(scan, batch->dir);
+	if (ScanDirectionIsForward(batch->dir))
+		btbatch->moreRight = true;
 	else
-	{
-		BTScanPosInvalidate(so->markPos);
-		so->markItemIndex = -1;
-	}
-}
-
-/*
- *	btrestrpos() -- restore scan to last saved position
- */
-void
-btrestrpos(IndexScanDesc scan)
-{
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
-	if (so->markItemIndex >= 0)
-	{
-		/*
-		 * The scan has never moved to a new page since the last mark.  Just
-		 * restore the itemIndex.
-		 *
-		 * NB: In this case we can't count on anything in so->markPos to be
-		 * accurate.
-		 */
-		so->currPos.itemIndex = so->markItemIndex;
-	}
-	else
-	{
-		/*
-		 * The scan moved to a new page after last mark or restore, and we are
-		 * now restoring to the marked page.  We aren't holding any read
-		 * locks, but if we're still holding the pin for the current position,
-		 * we must drop it.
-		 */
-		if (BTScanPosIsValid(so->currPos))
-		{
-			/* Before leaving current page, deal with any killed items */
-			if (so->numKilled > 0)
-				_bt_killitems(scan);
-			BTScanPosUnpinIfPinned(so->currPos);
-		}
-
-		if (BTScanPosIsValid(so->markPos))
-		{
-			/* bump pin on mark buffer for assignment to current buffer */
-			if (BTScanPosIsPinned(so->markPos))
-				IncrBufferRefCount(so->markPos.buf);
-			memcpy(&so->currPos, &so->markPos,
-				   offsetof(BTScanPosData, items[1]) +
-				   so->markPos.lastItem * sizeof(BTScanPosItem));
-			if (so->currTuples)
-				memcpy(so->currTuples, so->markTuples,
-					   so->markPos.nextTupleOffset);
-			/* Reset the scan's array keys (see _bt_steppage for why) */
-			if (so->numArrayKeys)
-			{
-				_bt_start_array_keys(scan, so->currPos.dir);
-				so->needPrimScan = false;
-			}
-		}
-		else
-			BTScanPosInvalidate(so->currPos);
-	}
+		btbatch->moreLeft = true;
+	so->needPrimScan = false;
+	so->scanBehind = false;
+	so->oppositeDirCheck = false;
 }
 
 /*
@@ -881,15 +921,6 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
 	*next_scan_page = InvalidBlockNumber;
 	*last_curr_page = InvalidBlockNumber;
 
-	/*
-	 * Reset so->currPos, and initialize moreLeft/moreRight such that the next
-	 * call to _bt_readnextpage treats this backend similarly to a serial
-	 * backend that steps from *last_curr_page to *next_scan_page (unless this
-	 * backend's so->currPos is initialized by _bt_readfirstpage before then).
-	 */
-	BTScanPosInvalidate(so->currPos);
-	so->currPos.moreLeft = so->currPos.moreRight = true;
-
 	if (first)
 	{
 		/*
@@ -1039,8 +1070,6 @@ _bt_parallel_done(IndexScanDesc scan)
 	BTParallelScanDesc btscan;
 	bool		status_changed = false;
 
-	Assert(!BTScanPosIsValid(so->currPos));
-
 	/* Do nothing, for non-parallel scans */
 	if (parallel_scan == NULL)
 		return;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index aae6acb7f..c089ec38d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -23,53 +23,49 @@
 #include "pgstat.h"
 #include "storage/predicate.h"
 #include "utils/lsyscache.h"
+#include "utils/memdebug.h"
 #include "utils/rel.h"
 
 
-static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so);
+static inline void _bt_batch_unlock(IndexScanDesc scan, IndexScanBatch batch,
+									Buffer buf);
 static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
 							Buffer buf, bool forupdate, BTStack stack,
 							int access);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
-static inline void _bt_returnitem(IndexScanDesc scan, BTScanOpaque so);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum,
-							  ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
-							 BlockNumber lastcurrblkno, ScanDirection dir,
-							 bool seized);
+static IndexScanBatch _bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch,
+										OffsetNumber offnum, ScanDirection dir);
+static IndexScanBatch _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+									   BlockNumber lastcurrblkno,
+									   ScanDirection dir, bool firstpage);
 static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
 										 BlockNumber lastcurrblkno);
-static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
+static IndexScanBatch _bt_endpoint(IndexScanDesc scan, ScanDirection dir,
+								   IndexScanBatch firstbatch);
 
 
 /*
- *	_bt_drop_lock_and_maybe_pin()
+ * _bt_batch_unlock() -- nbtree wrapper for indexam_util_batch_unlock.
  *
- * Unlock so->currPos.buf.  If scan is so->dropPin, drop the pin, too.
- * Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
+ * Performs the same Valgrind instrumentation as _bt_unlockbuf.
  */
 static inline void
-_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
+_bt_batch_unlock(IndexScanDesc scan, IndexScanBatch batch, Buffer buf)
 {
-	if (!so->dropPin)
-	{
-		/* Just drop the lock (not the pin) */
-		_bt_unlockbuf(rel, so->currPos.buf);
-		return;
-	}
+#if defined(USE_VALGRIND)
+	Page		page = BufferGetPage(buf);
 
-	/*
-	 * Drop both the lock and the pin.
-	 *
-	 * Have to set so->currPos.lsn so that _bt_killitems has a way to detect
-	 * when concurrent heap TID recycling by VACUUM might have taken place.
-	 */
-	so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-	_bt_relbuf(rel, so->currPos.buf);
-	so->currPos.buf = InvalidBuffer;
+	VALGRIND_CHECK_MEM_IS_DEFINED(page, BLCKSZ);
+#endif
+
+	indexam_util_batch_unlock(scan, batch, buf);
+
+#if defined(USE_VALGRIND)
+	if (!RelationUsesLocalBuffers(scan->indexRelation))
+		VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ);
+#endif
 }
 
 /*
@@ -860,26 +856,25 @@ _bt_compare(Relation rel,
 }
 
 /*
- *	_bt_first() -- Find the first item in a scan.
+ *	_bt_first() -- Find the first batch in a scan.
  *
  *		We need to be clever about the direction of scan, the search
- *		conditions, and the tree ordering.  We find the first item (or,
- *		if backwards scan, the last item) in the tree that satisfies the
- *		qualifications in the scan key.  On success exit, data about the
- *		matching tuple(s) on the page has been loaded into so->currPos.  We'll
- *		drop all locks and hold onto a pin on page's buffer, except during
- *		so->dropPin scans, when we drop both the lock and the pin.
- *		_bt_returnitem sets the next item to return to scan on success exit.
+ *		conditions, and the tree ordering.  We find the first leaf page (or
+ *		the last leaf page, when scanning backwards) in the tree with at least
+ *		one tuple that satisfies the qualifications in the scan key.  On
+ *		success exit, we return a new batch with that page's matching items.
  *
- * If there are no matching items in the index, we return false, with no
- * pins or locks held.  so->currPos will remain invalid.
+ * If there are no matching items in the index (in the given scan direction),
+ * we just return NULL.  Note that returning NULL doesn't necessarily mean the
+ * end of the top-level scan; caller should check so->needPrimScan to
+ * determine if another primitive index scan is required.
  *
  * Note that scan->keyData[], and the so->keyData[] scankey built from it,
  * are both search-type scankeys (see nbtree/README for more about this).
  * Within this routine, we build a temporary insertion-type scankey to use
  * in locating the scan start position.
  */
-bool
+IndexScanBatch
 _bt_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
@@ -892,8 +887,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat_total = InvalidStrategy;
 	BlockNumber blkno = InvalidBlockNumber,
 				lastcurrblkno;
+	IndexScanBatch firstbatch;
+	BTBatchData *btfirstbatch;
 
-	Assert(!BTScanPosIsValid(so->currPos));
+	/* Allocate space for first batch */
+	firstbatch = indexam_util_batch_alloc(scan);
+	btfirstbatch = BTBatchGetData(scan, firstbatch);
 
 	/*
 	 * Examine the scan keys and eliminate any redundant keys; also mark the
@@ -909,7 +908,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		Assert(!so->needPrimScan);
 		_bt_parallel_done(scan);
-		return false;
+		indexam_util_batch_release(scan, firstbatch);
+		return NULL;
 	}
 
 	/*
@@ -918,7 +918,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 */
 	if (scan->parallel_scan != NULL &&
 		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, true))
-		return false;
+	{
+		indexam_util_batch_release(scan, firstbatch);
+		return NULL;			/* definitely done (so->needPrimScan is unset) */
+	}
 
 	/*
 	 * Initialize the scan's arrays (if any) for the current scan direction
@@ -938,11 +941,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		Assert(!so->needPrimScan);
 		Assert(blkno != P_NONE);
 
-		if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir, true))
-			return false;
+		indexam_util_batch_release(scan, firstbatch);
 
-		_bt_returnitem(scan, so);
-		return true;
+		return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, true);
 	}
 
 	/*
@@ -1242,7 +1243,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * Note: calls _bt_readfirstpage for us, which releases the parallel scan.
 	 */
 	if (keysz == 0)
-		return _bt_endpoint(scan, dir);
+		return _bt_endpoint(scan, dir, firstbatch);
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -1502,7 +1503,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		default:
 			/* can't get here, but keep compiler quiet */
 			elog(ERROR, "unrecognized strat_total: %d", (int) strat_total);
-			return false;
+			return NULL;
 	}
 
 	/*
@@ -1510,9 +1511,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * position ourselves on the target leaf page.
 	 */
 	Assert(ScanDirectionIsBackward(dir) == inskey.backward);
-	_bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ, false);
+	_bt_search(rel, NULL, &inskey, &btfirstbatch->buf, BT_READ, false);
 
-	if (!BufferIsValid(so->currPos.buf))
+	if (unlikely(!BufferIsValid(btfirstbatch->buf)))
 	{
 		Assert(!so->needPrimScan);
 
@@ -1528,22 +1529,23 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		if (IsolationIsSerializable())
 		{
 			PredicateLockRelation(rel, scan->xs_snapshot);
-			_bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ, false);
+			_bt_search(rel, NULL, &inskey, &btfirstbatch->buf, BT_READ, false);
 		}
 
-		if (!BufferIsValid(so->currPos.buf))
+		if (!BufferIsValid(btfirstbatch->buf))
 		{
 			_bt_parallel_done(scan);
-			return false;
+			indexam_util_batch_release(scan, firstbatch);
+			return NULL;
 		}
 	}
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, &inskey, so->currPos.buf);
+	offnum = _bt_binsrch(rel, &inskey, btfirstbatch->buf);
 
 	/*
 	 * Now load data from the first page of the scan (usually the page
-	 * currently in so->currPos.buf).
+	 * currently in btfirstbatch->buf).
 	 *
 	 * If inskey.nextkey = false and inskey.backward = false, offnum is
 	 * positioned at the first non-pivot tuple >= inskey.scankeys.
@@ -1561,164 +1563,72 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * for the page.  For example, when inskey is both < the leaf page's high
 	 * key and > all of its non-pivot tuples, offnum will be "maxoff + 1".
 	 */
-	if (!_bt_readfirstpage(scan, offnum, dir))
-		return false;
-
-	_bt_returnitem(scan, so);
-	return true;
+	return _bt_readfirstpage(scan, firstbatch, offnum, dir);
 }
 
 /*
- *	_bt_next() -- Get the next item in a scan.
+ *	_bt_next() -- Get the next batch in a scan.
  *
- *		On entry, so->currPos describes the current page, which may be pinned
- *		but is not locked, and so->currPos.itemIndex identifies which item was
- *		previously returned.
+ *		On entry, priorbatch describes the batch that was last returned by
+ *		btgetbatch.  We'll use the prior batch's positioning information to
+ *		decide which leaf page to read next.
  *
- *		On success exit, so->currPos is updated as needed, and _bt_returnitem
- *		sets the next item to return to the scan.  so->currPos remains valid.
- *
- *		On failure exit (no more tuples), we invalidate so->currPos.  It'll
- *		still be possible for the scan to return tuples by changing direction,
- *		though we'll need to call _bt_first anew in that other direction.
+ *		On success exit, returns the next batch.  There must be at least one
+ *		matching tuple on any returned batch (else we'd just return NULL).
+ *		Note that returning NULL doesn't necessarily mean the end of the
+ *		top-level scan; caller should check so->needPrimScan to determine
+ *		if another primitive index scan is required.
  */
-bool
-_bt_next(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+_bt_next(IndexScanDesc scan, ScanDirection dir, IndexScanBatch priorbatch)
 {
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
-	Assert(BTScanPosIsValid(so->currPos));
-
-	/*
-	 * Advance to next tuple on current page; or if there's no more, try to
-	 * step to the next page with data.
-	 */
-	if (ScanDirectionIsForward(dir))
-	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
-		{
-			if (!_bt_steppage(scan, dir))
-				return false;
-		}
-	}
-	else
-	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
-		{
-			if (!_bt_steppage(scan, dir))
-				return false;
-		}
-	}
-
-	_bt_returnitem(scan, so);
-	return true;
-}
-
-/*
- * Return the index item from so->currPos.items[so->currPos.itemIndex] to the
- * index scan by setting the relevant fields in caller's index scan descriptor
- */
-static inline void
-_bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
-{
-	BTScanPosItem *currItem = &so->currPos.items[so->currPos.itemIndex];
-
-	/* Most recent _bt_readpage must have succeeded */
-	Assert(BTScanPosIsValid(so->currPos));
-	Assert(so->currPos.itemIndex >= so->currPos.firstItem);
-	Assert(so->currPos.itemIndex <= so->currPos.lastItem);
-
-	/* Return next item, per amgettuple contract */
-	scan->xs_heaptid = currItem->heapTid;
-	if (so->currTuples)
-		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
-}
-
-/*
- *	_bt_steppage() -- Step to next page containing valid data for scan
- *
- * Wrapper on _bt_readnextpage that performs final steps for the current page.
- *
- * On entry, so->currPos must be valid.  Its buffer will be pinned, though
- * never locked. (Actually, when so->dropPin there won't even be a pin held,
- * though so->currPos.currPage must still be set to a valid block number.)
- */
-static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
-{
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btpriorbatch = BTBatchGetData(scan, priorbatch);
 	BlockNumber blkno,
 				lastcurrblkno;
-
-	Assert(BTScanPosIsValid(so->currPos));
-
-	/* Before leaving current page, deal with any killed items */
-	if (so->numKilled > 0)
-		_bt_killitems(scan);
+	bool		moreInDir;
 
 	/*
-	 * Before we modify currPos, make a copy of the page data if there was a
-	 * mark position that needs it.
+	 * The core code must deal with cross-batch scan direction changes for us.
+	 * A batch management routine that flips priorbatch's scan direction (and
+	 * calls btposreset to deal with the scan's array keys) is used for this.
 	 */
-	if (so->markItemIndex >= 0)
-	{
-		/* bump pin on current buffer for assignment to mark buffer */
-		if (BTScanPosIsPinned(so->currPos))
-			IncrBufferRefCount(so->currPos.buf);
-		memcpy(&so->markPos, &so->currPos,
-			   offsetof(BTScanPosData, items[1]) +
-			   so->currPos.lastItem * sizeof(BTScanPosItem));
-		if (so->markTuples)
-			memcpy(so->markTuples, so->currTuples,
-				   so->currPos.nextTupleOffset);
-		so->markPos.itemIndex = so->markItemIndex;
-		so->markItemIndex = -1;
-
-		/*
-		 * If we're just about to start the next primitive index scan
-		 * (possible with a scan that has arrays keys, and needs to skip to
-		 * continue in the current scan direction), moreLeft/moreRight only
-		 * indicate the end of the current primitive index scan.  They must
-		 * never be taken to indicate that the top-level index scan has ended
-		 * (that would be wrong).
-		 *
-		 * We could handle this case by treating the current array keys as
-		 * markPos state.  But depending on the current array state like this
-		 * would add complexity.  Instead, we just unset markPos's copy of
-		 * moreRight or moreLeft (whichever might be affected), while making
-		 * btrestrpos reset the scan's arrays to their initial scan positions.
-		 * In effect, btrestrpos leaves advancing the arrays up to the first
-		 * _bt_readpage call (that takes place after it has restored markPos).
-		 */
-		if (so->needPrimScan)
-		{
-			if (ScanDirectionIsForward(so->currPos.dir))
-				so->markPos.moreRight = true;
-			else
-				so->markPos.moreLeft = true;
-		}
-
-		/* mark/restore not supported by parallel scans */
-		Assert(!scan->parallel_scan);
-	}
-
-	BTScanPosUnpinIfPinned(so->currPos);
+	Assert(priorbatch->dir == dir);
 
 	/* Walk to the next page with data */
 	if (ScanDirectionIsForward(dir))
-		blkno = so->currPos.nextPage;
+		blkno = btpriorbatch->nextPage;
 	else
-		blkno = so->currPos.prevPage;
-	lastcurrblkno = so->currPos.currPage;
+		blkno = btpriorbatch->prevPage;
+	lastcurrblkno = btpriorbatch->currPage;
+	moreInDir = ScanDirectionIsForward(dir) ?
+		btpriorbatch->moreRight : btpriorbatch->moreLeft;
 
 	/*
-	 * Cancel primitive index scans that were scheduled when the call to
-	 * _bt_readpage for currPos happened to use the opposite direction to the
-	 * one that we're stepping in now.  (It's okay to leave the scan's array
-	 * keys as-is, since the next _bt_readpage will advance them.)
+	 * For bitmap scan callers, release the prior batch now so that
+	 * _bt_readnextpage can reuse its memory.  That way bitmap scans never
+	 * need more than one batch allocation.
 	 */
-	if (so->currPos.dir != dir)
-		so->needPrimScan = false;
+	if (!scan->usebatchring)
+		indexam_util_batch_release(scan, priorbatch);
+
+	if (blkno == P_NONE || !moreInDir)
+	{
+		/*
+		 * priorbatch's page is known to be the final leaf page with matches
+		 * in this scan direction (its _bt_readpage call figured that out).
+		 *
+		 * Note: if so->needPrimScan is set, then priorbatch's leaf page is
+		 * actually just the final page for the current primitive index scan
+		 * in this scan direction (the scan will continue in _bt_first).
+		 */
+		_bt_parallel_done(scan);
+		return NULL;
+	}
+
+	/* parallel scan must seize the scan to get next blkno */
+	if (scan->parallel_scan != NULL &&
+		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		return NULL;			/* done iff so->needPrimScan wasn't set */
 
 	return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
 }
@@ -1732,178 +1642,169 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  * to stop the scan on this page by calling _bt_checkkeys against the high
  * key.  See _bt_readpage for full details.
  *
- * On entry, so->currPos must be pinned and locked (so offnum stays valid).
+ * On entry, firstbatch's page must be pinned/locked (so offnum stays valid).
  * Parallel scan callers must have seized the scan before calling here.
  *
- * On exit, we'll have updated so->currPos and retained locks and pins
- * according to the same rules as those laid out for _bt_readnextpage exit.
- * Like _bt_readnextpage, our return value indicates if there are any matching
- * records in the given direction.
+ * On success exit, returns unlocked batch containing data from the next page
+ * that has at least one matching item.  If there are no matching items in the
+ * given scan direction, we just return NULL.  Note that returning NULL
+ * doesn't necessarily mean the end of the top-level scan; btgetbatch and
+ * btgetbitmap check so->needPrimScan to determine if another primitive index
+ * scan is required.
  *
  * We always release the scan for a parallel scan caller, regardless of
  * success or failure; we'll call _bt_parallel_release as soon as possible.
  */
-static bool
-_bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
+static IndexScanBatch
+_bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch,
+				  OffsetNumber offnum, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btfirstbatch = BTBatchGetData(scan, firstbatch);
+	BlockNumber blkno,
+				lastcurrblkno;
+	bool		moreInDir;
 
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
-
-	/* Initialize so->currPos for the first page (page in so->currPos.buf) */
+	/* Initialize firstbatch's position for the first page */
 	if (so->needPrimScan)
 	{
 		Assert(so->numArrayKeys);
 
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = true;
+		btfirstbatch->moreLeft = true;
+		btfirstbatch->moreRight = true;
 		so->needPrimScan = false;
 	}
 	else if (ScanDirectionIsForward(dir))
 	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
+		btfirstbatch->moreLeft = false;
+		btfirstbatch->moreRight = true;
 	}
 	else
 	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
+		btfirstbatch->moreLeft = true;
+		btfirstbatch->moreRight = false;
 	}
 
 	/*
 	 * Attempt to load matching tuples from the first page.
 	 *
-	 * Note that _bt_readpage will finish initializing the so->currPos fields.
+	 * Note that _bt_readpage will finish initializing the firstbatch fields.
 	 * _bt_readpage also releases parallel scan (even when it returns false).
 	 */
-	if (_bt_readpage(scan, dir, offnum, true))
+	if (_bt_readpage(scan, firstbatch, dir, offnum, true))
 	{
-		Relation	rel = scan->indexRelation;
-
-		/*
-		 * _bt_readpage succeeded.  Drop the lock (and maybe the pin) on
-		 * so->currPos.buf in preparation for btgettuple returning tuples.
-		 */
-		Assert(BTScanPosIsPinned(so->currPos));
-		_bt_drop_lock_and_maybe_pin(rel, so);
-		return true;
+		/* _bt_readpage saved one or more matches in firstbatch.items[] */
+		_bt_batch_unlock(scan, firstbatch, btfirstbatch->buf);
+		return firstbatch;
 	}
 
-	/* There's no actually-matching data on the page in so->currPos.buf */
-	_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+	/* There's no actually-matching data on the page returned by _bt_search */
+	_bt_relbuf(scan->indexRelation, btfirstbatch->buf);
 
-	/* Call _bt_readnextpage using its _bt_steppage wrapper function */
-	if (!_bt_steppage(scan, dir))
-		return false;
+	/* Walk to the next page with data */
+	if (ScanDirectionIsForward(dir))
+		blkno = btfirstbatch->nextPage;
+	else
+		blkno = btfirstbatch->prevPage;
+	lastcurrblkno = btfirstbatch->currPage;
+	moreInDir = ScanDirectionIsForward(dir) ?
+		btfirstbatch->moreRight : btfirstbatch->moreLeft;
 
-	/* _bt_readpage for a later page (now in so->currPos) succeeded */
-	return true;
+	/* Release firstbatch (will be recycled if we reach _bt_readnextpage) */
+	indexam_util_batch_release(scan, firstbatch);
+
+	if (blkno == P_NONE || !moreInDir)
+	{
+		/*
+		 * firstbatch's _bt_readpage call ended scan in this direction (though
+		 * if so->needPrimScan was set, the scan will continue in _bt_first)
+		 */
+		_bt_parallel_done(scan);
+		return NULL;
+	}
+
+	/* parallel scan must seize the scan to get next blkno */
+	if (scan->parallel_scan != NULL &&
+		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		return NULL;			/* done iff so->needPrimScan wasn't set */
+
+	return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
 }
 
 /*
  *	_bt_readnextpage() -- Read next page containing valid data for _bt_next
  *
- * Caller's blkno is the next interesting page's link, taken from either the
- * previously-saved right link or left link.  lastcurrblkno is the page that
- * was current at the point where the blkno link was saved, which we use to
- * reason about concurrent page splits/page deletions during backwards scans.
- * In the common case where seized=false, blkno is either so->currPos.nextPage
- * or so->currPos.prevPage, and lastcurrblkno is so->currPos.currPage.
+ * Caller's blkno is the prior batch's nextPage or prevPage (depending on the
+ * current scan direction), and lastcurrblkno is the prior batch's currPage.
+ * We use lastcurrblkno to reason about concurrent page splits/page deletions
+ * during backwards scans.
  *
- * On entry, so->currPos shouldn't be locked by caller.  so->currPos.buf must
- * be InvalidBuffer/unpinned as needed by caller (note that lastcurrblkno
- * won't need to be read again in almost all cases).  Parallel scan callers
- * that seized the scan before calling here should pass seized=true; such a
- * caller's blkno and lastcurrblkno arguments come from the seized scan.
- * seized=false callers just pass us the blkno/lastcurrblkno taken from their
- * so->currPos, which (along with so->currPos itself) can be used to end the
- * scan.  A seized=false caller's blkno can never be assumed to be the page
- * that must be read next during a parallel scan, though.  We must figure that
- * part out for ourselves by seizing the scan (the correct page to read might
- * already be beyond the seized=false caller's blkno during a parallel scan,
- * unless blkno/so->currPos.nextPage/so->currPos.prevPage is already P_NONE,
- * or unless so->currPos.moreRight/so->currPos.moreLeft is already unset).
+ * On entry, no page should be locked by caller.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page, and we return true.  We hold a pin on the buffer on
- * success exit (except during so->dropPin index scans, when we drop the pin
- * eagerly to avoid blocking VACUUM).
+ * On success exit, returns unlocked batch containing data from the next page
+ * that has at least one matching item.  If there are no more matching items
+ * in the given scan direction, we just return NULL.  Note that returning NULL
+ * doesn't necessarily mean the end of the top-level scan; btgetbatch and
+ * btgetbitmap check so->needPrimScan to determine if another primitive index
+ * scan is required.
  *
- * If there are no more matching records in the given direction, we invalidate
- * so->currPos (while ensuring it retains no locks or pins), and return false.
- *
- * We always release the scan for a parallel scan caller, regardless of
- * success or failure; we'll call _bt_parallel_release as soon as possible.
+ * Parallel scan callers must seize the scan before calling here.  blkno and
+ * lastcurrblkno should come from the seized scan.  We'll release the scan as
+ * soon as possible.
  */
-static bool
+static IndexScanBatch
 _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
-				 BlockNumber lastcurrblkno, ScanDirection dir, bool seized)
+				 BlockNumber lastcurrblkno, ScanDirection dir, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	IndexScanBatch newbatch;
+	BTBatchData *btnewbatch;
 
-	Assert(so->currPos.currPage == lastcurrblkno || seized);
-	Assert(!(blkno == P_NONE && seized));
-	Assert(!BTScanPosIsPinned(so->currPos));
+	/* Allocate space for new batch */
+	newbatch = indexam_util_batch_alloc(scan);
+	btnewbatch = BTBatchGetData(scan, newbatch);
 
 	/*
-	 * Remember that the scan already read lastcurrblkno, a page to the left
-	 * of blkno (or remember reading a page to the right, for backwards scans)
+	 * newbatch will be the batch for blkno, a page to the right of
+	 * lastcurrblkno (or to the left, when the scan is moving backwards).
+	 *
+	 * Note: caller's blkno is tentative.  newbatch actually stores matches
+	 * from the next leaf page in this scan direction that has at least one
+	 * matching item.  This is usually caller's blkno page, but might be some
+	 * other page to its right (or to its left) instead.
 	 */
-	if (ScanDirectionIsForward(dir))
-		so->currPos.moreLeft = true;
-	else
-		so->currPos.moreRight = true;
+	btnewbatch->moreLeft = true;	/* for lastcurrblkno (or tentative) */
+	btnewbatch->moreRight = true;	/* tentative (or for lastcurrblkno) */
 
 	for (;;)
 	{
 		Page		page;
 		BTPageOpaque opaque;
 
-		if (blkno == P_NONE ||
-			(ScanDirectionIsForward(dir) ?
-			 !so->currPos.moreRight : !so->currPos.moreLeft))
-		{
-			/* most recent _bt_readpage call (for lastcurrblkno) ended scan */
-			Assert(so->currPos.currPage == lastcurrblkno && !seized);
-			BTScanPosInvalidate(so->currPos);
-			_bt_parallel_done(scan);	/* iff !so->needPrimScan */
-			return false;
-		}
-
-		Assert(!so->needPrimScan);
-
-		/* parallel scan must never actually visit so->currPos blkno */
-		if (!seized && scan->parallel_scan != NULL &&
-			!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
-		{
-			/* whole scan is now done (or another primitive scan required) */
-			BTScanPosInvalidate(so->currPos);
-			return false;
-		}
+		Assert(!((BTScanOpaque) scan->opaque)->needPrimScan);
+		Assert(blkno != P_NONE && lastcurrblkno != P_NONE);
 
 		if (ScanDirectionIsForward(dir))
 		{
 			/* read blkno, but check for interrupts first */
 			CHECK_FOR_INTERRUPTS();
-			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			btnewbatch->buf = _bt_getbuf(rel, blkno, BT_READ);
 		}
 		else
 		{
 			/* read blkno, avoiding race (also checks for interrupts) */
-			so->currPos.buf = _bt_lock_and_validate_left(rel, &blkno,
+			btnewbatch->buf = _bt_lock_and_validate_left(rel, &blkno,
 														 lastcurrblkno);
-			if (so->currPos.buf == InvalidBuffer)
+			if (btnewbatch->buf == InvalidBuffer)
 			{
 				/* must have been a concurrent deletion of leftmost page */
-				BTScanPosInvalidate(so->currPos);
 				_bt_parallel_done(scan);
-				return false;
+				indexam_util_batch_release(scan, newbatch);
+				return NULL;
 			}
 		}
 
-		page = BufferGetPage(so->currPos.buf);
+		page = BufferGetPage(btnewbatch->buf);
 		opaque = BTPageGetOpaque(page);
 		lastcurrblkno = blkno;
 		if (likely(!P_IGNORE(opaque)))
@@ -1911,17 +1812,17 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 			/* see if there are any matches on this page */
 			if (ScanDirectionIsForward(dir))
 			{
-				/* note that this will clear moreRight if we can stop */
-				if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque), seized))
+				if (_bt_readpage(scan, newbatch, dir,
+								 P_FIRSTDATAKEY(opaque), firstpage))
 					break;
-				blkno = so->currPos.nextPage;
+				blkno = btnewbatch->nextPage;
 			}
 			else
 			{
-				/* note that this will clear moreLeft if we can stop */
-				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), seized))
+				if (_bt_readpage(scan, newbatch, dir,
+								 PageGetMaxOffsetNumber(page), firstpage))
 					break;
-				blkno = so->currPos.prevPage;
+				blkno = btnewbatch->prevPage;
 			}
 		}
 		else
@@ -1936,19 +1837,38 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 		}
 
 		/* no matching tuples on this page */
-		_bt_relbuf(rel, so->currPos.buf);
-		seized = false;			/* released by _bt_readpage (or by us) */
+		_bt_relbuf(rel, btnewbatch->buf);
+
+		/* Continue the scan in this direction? */
+		if (blkno == P_NONE ||
+			(ScanDirectionIsForward(dir) ?
+			 !btnewbatch->moreRight : !btnewbatch->moreLeft))
+		{
+			/*
+			 * lastcurrblkno's _bt_readpage call ended scan in this direction
+			 * (if so->needPrimScan was set, scan will continue in _bt_first)
+			 */
+			_bt_parallel_done(scan);
+			indexam_util_batch_release(scan, newbatch);
+			return NULL;
+		}
+
+		/* parallel scan must seize the scan to get next blkno */
+		if (scan->parallel_scan != NULL &&
+			!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		{
+			indexam_util_batch_release(scan, newbatch);
+			return NULL;		/* done iff so->needPrimScan wasn't set */
+		}
+
+		firstpage = false;		/* next page cannot be first */
 	}
 
-	/*
-	 * _bt_readpage succeeded.  Drop the lock (and maybe the pin) on
-	 * so->currPos.buf in preparation for btgettuple returning tuples.
-	 */
-	Assert(so->currPos.currPage == blkno);
-	Assert(BTScanPosIsPinned(so->currPos));
-	_bt_drop_lock_and_maybe_pin(rel, so);
+	/* _bt_readpage saved one or more matches in newbatch.items[] */
+	Assert(btnewbatch->currPage == blkno);
+	_bt_batch_unlock(scan, newbatch, btnewbatch->buf);
 
-	return true;
+	return newbatch;
 }
 
 /*
@@ -2174,25 +2094,24 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost)
  * Parallel scan callers must have seized the scan before calling here.
  * Exit conditions are the same as for _bt_first().
  */
-static bool
-_bt_endpoint(IndexScanDesc scan, ScanDirection dir)
+static IndexScanBatch
+_bt_endpoint(IndexScanDesc scan, ScanDirection dir, IndexScanBatch firstbatch)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btfirstbatch = BTBatchGetData(scan, firstbatch);
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber start;
 
-	Assert(!BTScanPosIsValid(so->currPos));
-	Assert(!so->needPrimScan);
+	Assert(!((BTScanOpaque) scan->opaque)->needPrimScan);
 
 	/*
 	 * Scan down to the leftmost or rightmost leaf page.  This is a simplified
 	 * version of _bt_search().
 	 */
-	so->currPos.buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
+	btfirstbatch->buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
 
-	if (!BufferIsValid(so->currPos.buf))
+	if (!BufferIsValid(btfirstbatch->buf))
 	{
 		/*
 		 * Empty index. Lock the whole relation, as nothing finer to lock
@@ -2200,10 +2119,11 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
 		_bt_parallel_done(scan);
-		return false;
+		indexam_util_batch_release(scan, firstbatch);
+		return NULL;
 	}
 
-	page = BufferGetPage(so->currPos.buf);
+	page = BufferGetPage(btfirstbatch->buf);
 	opaque = BTPageGetOpaque(page);
 	Assert(P_ISLEAF(opaque));
 
@@ -2229,9 +2149,5 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * Now load data from the first page of the scan.
 	 */
-	if (!_bt_readfirstpage(scan, start, dir))
-		return false;
-
-	_bt_returnitem(scan, so);
-	return true;
+	return _bt_readfirstpage(scan, firstbatch, start, dir);
 }
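
A note on the calling convention used throughout the nbtsearch.c changes
above: _bt_first, _bt_next, _bt_readfirstpage and _bt_readnextpage now return
a batch, and a NULL return only ends the top-level scan when
so->needPrimScan is unset.  The standalone toy below sketches how a caller
might loop on that contract.  It is only an illustration under stated
assumptions: ToyScan, ToyBatch, toy_first, toy_readnext, toy_getbatch and
toy_count_all are made-up names, not part of this patch or of PostgreSQL, and
the real btgetbatch may well be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in PostgreSQL */
typedef struct ToyBatch
{
	int			nitems;			/* number of matching items on the page */
} ToyBatch;

typedef struct ToyScan
{
	bool		needPrimScan;	/* models so->needPrimScan */
	int			primscan;		/* current primitive index scan */
	int			nextpage;		/* next page to read in this primitive scan */
} ToyScan;

#define NPRIMSCANS	3
#define NPAGES		2

/* matches per page, per primitive scan; 0 means "no matches on page" */
static ToyBatch pages[NPRIMSCANS][NPAGES] = {
	{{2}, {1}},
	{{0}, {0}},					/* primitive scan without any matches */
	{{3}, {0}},
};

/* Return next batch with at least one match, else NULL */
static ToyBatch *
toy_readnext(ToyScan *scan)
{
	while (scan->nextpage < NPAGES)
	{
		ToyBatch   *batch = &pages[scan->primscan][scan->nextpage++];

		if (batch->nitems > 0)
			return batch;
	}

	/* end of this primitive scan; schedule another one if any remain */
	scan->needPrimScan = (scan->primscan + 1 < NPRIMSCANS);
	return NULL;
}

/* Start (or restart) the scan, consuming any scheduled primitive scan */
static ToyBatch *
toy_first(ToyScan *scan)
{
	if (scan->needPrimScan)
	{
		scan->primscan++;
		scan->needPrimScan = false;
	}
	scan->nextpage = 0;
	return toy_readnext(scan);
}

/* Caller-side loop: NULL ends the scan only when needPrimScan is unset */
static ToyBatch *
toy_getbatch(ToyScan *scan, ToyBatch *priorbatch)
{
	ToyBatch   *batch;

	batch = (priorbatch == NULL) ? toy_first(scan) : toy_readnext(scan);
	while (batch == NULL && scan->needPrimScan)
		batch = toy_first(scan);

	/* any returned batch must have at least one matching item */
	assert(batch == NULL || batch->nitems > 0);
	return batch;
}

/* Drive the scan to completion, counting every returned item */
int
toy_count_all(void)
{
	ToyScan		scan = {false, 0, 0};
	ToyBatch   *batch = NULL;
	int			total = 0;

	while ((batch = toy_getbatch(&scan, batch)) != NULL)
		total += batch->nitems;
	return total;
}
```

With the table above, toy_count_all() walks three primitive scans (the middle
one has no matches, so its NULL return is followed by another toy_first call)
and sums 2 + 1 + 3 = 6 items.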
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 732bc750c..415e2a1c0 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -19,10 +19,7 @@
 
 #include "access/nbtree.h"
 #include "access/reloptions.h"
-#include "access/relscan.h"
 #include "commands/progress.h"
-#include "common/int.h"
-#include "lib/qunique.h"
 #include "miscadmin.h"
 #include "storage/lwlock.h"
 #include "utils/datum.h"
@@ -30,7 +27,6 @@
 #include "utils/rel.h"
 
 
-static int	_bt_compare_int(const void *va, const void *vb);
 static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
 						   IndexTuple firstright, BTScanInsert itup_key);
 
@@ -145,247 +141,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	return key;
 }
 
-/*
- * qsort comparison function for int arrays
- */
-static int
-_bt_compare_int(const void *va, const void *vb)
-{
-	int			a = *((const int *) va);
-	int			b = *((const int *) vb);
-
-	return pg_cmp_s32(a, b);
-}
-
-/*
- * _bt_killitems - set LP_DEAD state for items an indexscan caller has
- * told us were killed
- *
- * scan->opaque, referenced locally through so, contains information about the
- * current page and killed tuples thereon (generally, this should only be
- * called if so->numKilled > 0).
- *
- * Caller should not have a lock on the so->currPos page, but must hold a
- * buffer pin when !so->dropPin.  When we return, it still won't be locked.
- * It'll continue to hold whatever pins were held before calling here.
- *
- * We match items by heap TID before assuming they are the right ones to set
- * LP_DEAD.  If the scan is one that holds a buffer pin on the target page
- * continuously from initially reading the items until applying this function
- * (if it is a !so->dropPin scan), VACUUM cannot have deleted any items on the
- * page, so the page's TIDs can't have been recycled by now.  There's no risk
- * that we'll confuse a new index tuple that happens to use a recycled TID
- * with a now-removed tuple with the same TID (that used to be on this same
- * page).  We can't rely on that during scans that drop buffer pins eagerly
- * (so->dropPin scans), though, so we must condition setting LP_DEAD bits on
- * the page LSN having not changed since back when _bt_readpage saw the page.
- * We totally give up on setting LP_DEAD bits when the page LSN changed.
- *
- * We give up much less often during !so->dropPin scans, but it still happens.
- * We cope with cases where items have moved right due to insertions.  If an
- * item has moved off the current page due to a split, we'll fail to find it
- * and just give up on it.
- */
-void
-_bt_killitems(IndexScanDesc scan)
-{
-	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Page		page;
-	BTPageOpaque opaque;
-	OffsetNumber minoff;
-	OffsetNumber maxoff;
-	int			numKilled = so->numKilled;
-	bool		killedsomething = false;
-	Buffer		buf;
-
-	Assert(numKilled > 0);
-	Assert(BTScanPosIsValid(so->currPos));
-	Assert(scan->heapRelation != NULL); /* can't be a bitmap index scan */
-
-	/* Always invalidate so->killedItems[] before leaving so->currPos */
-	so->numKilled = 0;
-
-	/*
-	 * We need to iterate through so->killedItems[] in leaf page order; the
-	 * loop below expects this (when marking posting list tuples, at least).
-	 * so->killedItems[] is now in whatever order the scan returned items in.
-	 * Scrollable cursor scans might have even saved the same item/TID twice.
-	 *
-	 * Sort and unique-ify so->killedItems[] to deal with all this.
-	 */
-	if (numKilled > 1)
-	{
-		qsort(so->killedItems, numKilled, sizeof(int), _bt_compare_int);
-		numKilled = qunique(so->killedItems, numKilled, sizeof(int),
-							_bt_compare_int);
-	}
-
-	if (!so->dropPin)
-	{
-		/*
-		 * We have held the pin on this page since we read the index tuples,
-		 * so all we need to do is lock it.  The pin will have prevented
-		 * concurrent VACUUMs from recycling any of the TIDs on the page.
-		 */
-		Assert(BTScanPosIsPinned(so->currPos));
-		buf = so->currPos.buf;
-		_bt_lockbuf(rel, buf, BT_READ);
-	}
-	else
-	{
-		XLogRecPtr	latestlsn;
-
-		Assert(!BTScanPosIsPinned(so->currPos));
-		buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
-
-		latestlsn = BufferGetLSNAtomic(buf);
-		Assert(so->currPos.lsn <= latestlsn);
-		if (so->currPos.lsn != latestlsn)
-		{
-			/* Modified, give up on hinting */
-			_bt_relbuf(rel, buf);
-			return;
-		}
-
-		/* Unmodified, hinting is safe */
-	}
-
-	page = BufferGetPage(buf);
-	opaque = BTPageGetOpaque(page);
-	minoff = P_FIRSTDATAKEY(opaque);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	/* Iterate through so->killedItems[] in leaf page order */
-	for (int i = 0; i < numKilled; i++)
-	{
-		int			itemIndex = so->killedItems[i];
-		BTScanPosItem *kitem = &so->currPos.items[itemIndex];
-		OffsetNumber offnum = kitem->indexOffset;
-
-		Assert(itemIndex >= so->currPos.firstItem &&
-			   itemIndex <= so->currPos.lastItem);
-		Assert(i == 0 ||
-			   offnum >= so->currPos.items[so->killedItems[i - 1]].indexOffset);
-
-		if (offnum < minoff)
-			continue;			/* pure paranoia */
-		while (offnum <= maxoff)
-		{
-			ItemId		iid = PageGetItemId(page, offnum);
-			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
-			bool		killtuple = false;
-
-			if (BTreeTupleIsPosting(ituple))
-			{
-				int			pi = i + 1;
-				int			nposting = BTreeTupleGetNPosting(ituple);
-				int			j;
-
-				/*
-				 * Note that the page may have been modified in almost any way
-				 * since we first read it (in the !so->dropPin case), so it's
-				 * possible that this posting list tuple wasn't a posting list
-				 * tuple when we first encountered its heap TIDs.
-				 */
-				for (j = 0; j < nposting; j++)
-				{
-					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
-
-					if (!ItemPointerEquals(item, &kitem->heapTid))
-						break;	/* out of posting list loop */
-
-					/*
-					 * kitem must have matching offnum when heap TIDs match,
-					 * though only in the common case where the page can't
-					 * have been concurrently modified
-					 */
-					Assert(kitem->indexOffset == offnum || !so->dropPin);
-
-					/*
-					 * Read-ahead to later kitems here.
-					 *
-					 * We rely on the assumption that not advancing kitem here
-					 * will prevent us from considering the posting list tuple
-					 * fully dead by not matching its next heap TID in next
-					 * loop iteration.
-					 *
-					 * If, on the other hand, this is the final heap TID in
-					 * the posting list tuple, then tuple gets killed
-					 * regardless (i.e. we handle the case where the last
-					 * kitem is also the last heap TID in the last index tuple
-					 * correctly -- posting tuple still gets killed).
-					 */
-					if (pi < numKilled)
-						kitem = &so->currPos.items[so->killedItems[pi++]];
-				}
-
-				/*
-				 * Don't bother advancing the outermost loop's int iterator to
-				 * avoid processing killed items that relate to the same
-				 * offnum/posting list tuple.  This micro-optimization hardly
-				 * seems worth it.  (Further iterations of the outermost loop
-				 * will fail to match on this same posting list's first heap
-				 * TID instead, so we'll advance to the next offnum/index
-				 * tuple pretty quickly.)
-				 */
-				if (j == nposting)
-					killtuple = true;
-			}
-			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
-				killtuple = true;
-
-			/*
-			 * Mark index item as dead, if it isn't already.  Since this
-			 * happens while holding a buffer lock possibly in shared mode,
-			 * it's possible that multiple processes attempt to do this
-			 * simultaneously, leading to multiple full-page images being sent
-			 * to WAL (if wal_log_hints or data checksums are enabled), which
-			 * is undesirable.
-			 */
-			if (killtuple && !ItemIdIsDead(iid))
-			{
-				if (!killedsomething)
-				{
-					/*
-					 * Use the hint bit infrastructure to check if we can
-					 * update the page while just holding a share lock. If we
-					 * are not allowed, there's no point continuing.
-					 */
-					if (!BufferBeginSetHintBits(buf))
-						goto unlock_page;
-				}
-
-				/* found the item/all posting list items */
-				ItemIdMarkDead(iid);
-				killedsomething = true;
-				break;			/* out of inner search loop */
-			}
-			offnum = OffsetNumberNext(offnum);
-		}
-	}
-
-	/*
-	 * Since this can be redone later if needed, mark as dirty hint.
-	 *
-	 * Whenever we mark anything LP_DEAD, we also set the page's
-	 * BTP_HAS_GARBAGE flag, which is likewise just a hint.  (Note that we
-	 * only rely on the page-level flag in !heapkeyspace indexes.)
-	 */
-	if (killedsomething)
-	{
-		opaque->btpo_flags |= BTP_HAS_GARBAGE;
-		BufferFinishSetHintBits(buf, true, true);
-	}
-
-unlock_page:
-	if (!so->dropPin)
-		_bt_unlockbuf(rel, buf);
-	else
-		_bt_relbuf(rel, buf);
-}
-
-
 /*
  * The following routines manage a shared-memory area in which we track
  * assignment of "vacuum cycle IDs" to currently-active btree vacuuming
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dff7d286f..3bc5e5ccd 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -1095,15 +1095,15 @@ btree_mask(char *pagedata, BlockNumber blkno)
 		/*
 		 * In btree leaf pages, it is possible to modify the LP_FLAGS without
 		 * emitting any WAL record. Hence, mask the line pointer flags. See
-		 * _bt_killitems(), _bt_check_unique() for details.
+		 * btkillitemsbatch(), _bt_check_unique() for details.
 		 */
 		mask_lp_flags(page);
 	}
 
 	/*
 	 * BTP_HAS_GARBAGE is just an un-logged hint bit. So, mask it. See
-	 * _bt_delete_or_dedup_one_page(), _bt_killitems(), and _bt_check_unique()
-	 * for details.
+	 * _bt_delete_or_dedup_one_page(), btkillitemsbatch(), and
+	 * _bt_check_unique() for details.
 	 */
 	maskopaq->btpo_flags &= ~BTP_HAS_GARBAGE;
 
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index f2ee333f6..33ad43536 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -88,10 +88,12 @@ spghandler(PG_FUNCTION_ARGS)
 		.ambeginscan = spgbeginscan,
 		.amrescan = spgrescan,
 		.amgettuple = spggettuple,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amunguardbatch = NULL,
 		.amgetbitmap = spggetbitmap,
 		.amendscan = spgendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 97ce81eb5..94eca0181 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -53,9 +53,13 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->index_fetch_begin != NULL);
 	Assert(routine->index_fetch_reset != NULL);
 	Assert(routine->index_fetch_end != NULL);
+	Assert(routine->index_fetch_batch_init != NULL);
+	Assert(routine->index_plain_amgetbatch_next != NULL);
+	Assert(routine->index_only_amgetbatch_next != NULL);
 	Assert(routine->index_plain_amgettuple_next != NULL);
 	Assert(routine->index_only_amgettuple_next != NULL);
 	Assert(routine->fetch_tid != NULL);
+	Assert(routine->index_fetch_restrpos != NULL);
 
 	Assert(routine->tuple_fetch_row_version != NULL);
 	Assert(routine->tuple_tid_valid != NULL);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 373e82347..8422e65b0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -885,7 +885,7 @@ DefineIndex(ParseState *pstate,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("access method \"%s\" does not support multicolumn indexes",
 						accessMethodName)));
-	if (exclusion && amRoutine->amgettuple == NULL)
+	if (exclusion && amRoutine->amgettuple == NULL && amRoutine->amgetbatch == NULL)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("access method \"%s\" does not support exclusion constraints",
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 37fe03fdc..979a852fe 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -429,7 +429,7 @@ ExecSupportsMarkRestore(Path *pathnode)
 		case T_IndexOnlyScan:
 
 			/*
-			 * Not all index types support mark/restore.
+			 * Not all index types support restoring a mark.
 			 */
 			return castNode(IndexPath, pathnode)->indexinfo->amcanmarkpos;
 
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index f8421a74c..3ff781f2a 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -54,8 +54,8 @@
  *		the inner "5's". This requires repositioning the inner "cursor"
  *		to point at the first inner "5". This is done by "marking" the
  *		first inner 5 so we can restore the "cursor" to it before joining
- *		with the second outer 5. The access method interface provides
- *		routines to mark and restore to a tuple.
+ *		with the second outer 5. The indexbatch.h interface provides
+ *		routines to mark and restore to a tuple during index scans.
  *
  *
  *		Essential operation of the merge join algorithm is as follows:
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 67d9dc35f..edc7e4736 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -43,7 +43,7 @@
 /* Whether we are looking for plain indexscan, bitmap scan, or either */
 typedef enum
 {
-	ST_INDEXSCAN,				/* must support amgettuple */
+	ST_INDEXSCAN,				/* must support amgettuple or amgetbatch */
 	ST_BITMAPSCAN,				/* must support amgetbitmap */
 	ST_ANYSCAN,					/* either is okay */
 } ScanTypeControl;
@@ -747,7 +747,7 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexPath  *ipath = (IndexPath *) lfirst(lc);
 
-		if (index->amhasgettuple)
+		if (index->amcanplainscan)
 			add_path(rel, (Path *) ipath);
 
 		if (index->amhasgetbitmap &&
@@ -835,7 +835,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	switch (scantype)
 	{
 		case ST_INDEXSCAN:
-			if (!index->amhasgettuple)
+			if (!index->amcanplainscan)
 				return NIL;
 			break;
 		case ST_BITMAPSCAN:
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 7c4be1748..06a2e949d 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -310,11 +310,11 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 				info->amsearcharray = amroutine->amsearcharray;
 				info->amsearchnulls = amroutine->amsearchnulls;
 				info->amcanparallel = amroutine->amcanparallel;
-				info->amhasgettuple = (amroutine->amgettuple != NULL);
+				info->amcanplainscan = (amroutine->amgetbatch != NULL ||
+										amroutine->amgettuple != NULL);
 				info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
 					relation->rd_tableam->scan_bitmap_next_tuple != NULL;
-				info->amcanmarkpos = (amroutine->ammarkpos != NULL &&
-									  amroutine->amrestrpos != NULL);
+				info->amcanmarkpos = amroutine->amgetbatch != NULL;
 				info->amcostestimate = amroutine->amcostestimate;
 				Assert(info->amcostestimate != NULL);
 
@@ -411,7 +411,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 				info->amsearcharray = false;
 				info->amsearchnulls = false;
 				info->amcanparallel = false;
-				info->amhasgettuple = false;
+				info->amcanplainscan = false;
 				info->amhasgetbitmap = false;
 				info->amcanmarkpos = false;
 				info->amcostestimate = NULL;
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 0b1d80b5b..0d0bd468f 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -836,6 +836,7 @@ IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap)
 {
 	AttrNumber	keycol;
 	oidvector  *indclass;
+	const IndexAmRoutine *amroutine;
 
 	/* The index must not be a partial index */
 	if (!heap_attisnull(idxrel->rd_indextuple, Anum_pg_index_indpred, NULL))
@@ -887,10 +888,12 @@ IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap)
 		return false;
 
 	/*
-	 * The given index access method must implement "amgettuple", which will
-	 * be used later to fetch the tuples.  See RelationFindReplTupleByIndex().
+	 * The given index access method must implement "amgettuple" or
+	 * "amgetbatch", which will be used later to fetch the tuples.  See
+	 * RelationFindReplTupleByIndex().
 	 */
-	if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL)
+	amroutine = GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false);
+	if (amroutine->amgettuple == NULL && amroutine->amgetbatch == NULL)
 		return false;
 
 	return true;
diff --git a/src/backend/utils/adt/amutils.c b/src/backend/utils/adt/amutils.c
index c81fb61a0..ddfd1b55c 100644
--- a/src/backend/utils/adt/amutils.c
+++ b/src/backend/utils/adt/amutils.c
@@ -363,10 +363,11 @@ indexam_property(FunctionCallInfo fcinfo,
 				PG_RETURN_BOOL(routine->amclusterable);
 
 			case AMPROP_INDEX_SCAN:
-				PG_RETURN_BOOL(routine->amgettuple ? true : false);
+				PG_RETURN_BOOL(routine->amgettuple != NULL ||
+							   routine->amgetbatch != NULL);
 
 			case AMPROP_BITMAP_SCAN:
-				PG_RETURN_BOOL(routine->amgetbitmap ? true : false);
+				PG_RETURN_BOOL(routine->amgetbitmap != NULL);
 
 			case AMPROP_BACKWARD_SCAN:
 				PG_RETURN_BOOL(routine->amcanbackward);
@@ -392,7 +393,8 @@ indexam_property(FunctionCallInfo fcinfo,
 			PG_RETURN_BOOL(routine->amcanmulticol);
 
 		case AMPROP_CAN_EXCLUDE:
-			PG_RETURN_BOOL(routine->amgettuple ? true : false);
+			PG_RETURN_BOOL(routine->amgettuple != NULL ||
+						   routine->amgetbatch != NULL);
 
 		case AMPROP_CAN_INCLUDE:
 			PG_RETURN_BOOL(routine->amcaninclude);
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 5111cdc6d..476b64e8b 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -146,10 +146,12 @@ blhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = blbeginscan,
 		.amrescan = blrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amunguardbatch = NULL,
 		.amgetbitmap = blgetbitmap,
 		.amendscan = blendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index f48da3185..2b48728c5 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -167,10 +167,12 @@ typedef struct IndexAmRoutine
     ambeginscan_function ambeginscan;
     amrescan_function amrescan;
     amgettuple_function amgettuple;     /* can be NULL */
+    amgetbatch_function amgetbatch;     /* can be NULL */
+    amkillitemsbatch_function amkillitemsbatch; /* can be NULL */
+    amunguardbatch_function amunguardbatch; /* can be NULL */
     amgetbitmap_function amgetbitmap;   /* can be NULL */
     amendscan_function amendscan;
-    ammarkpos_function ammarkpos;       /* can be NULL */
-    amrestrpos_function amrestrpos;     /* can be NULL */
+    amposreset_function amposreset;     /* can be NULL */
 
     /* interface functions to support parallel index scans */
     amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
@@ -676,8 +678,38 @@ ambeginscan (Relation indexRelation,
    <emphasis>must</emphasis> create this struct by calling
    <function>RelationGetIndexScan()</function>.  In most cases
    <function>ambeginscan</function> does little beyond making that call and perhaps
-   acquiring locks;
+   acquiring locks and initializing standard <structname>IndexScanDesc</structname> fields;
    the interesting parts of index-scan startup are in <function>amrescan</function>.
+   Index access methods that use the <function>amgetbatch</function> interface
+   must also set the following fields in the scan descriptor:
+   <itemizedlist>
+    <listitem>
+     <para>
+      <literal>scan-&gt;maxitemsbatch</literal>: the maximum number of items
+      that can appear in a single batch (typically derived from the index page
+      size, e.g., <literal>MaxIndexTuplesPerPage</literal>).
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      <literal>scan-&gt;batch_index_opaque_size</literal>: the
+      <function>MAXALIGN</function>'d size of the index AM's per-batch opaque
+      area.  Each batch allocation reserves this much space immediately before
+      the <structname>IndexScanBatchData</structname> pointer, for use by the
+      index AM to store per-page navigation state (e.g., the buffer pin on the
+      batch's index page, and sibling page links).
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      <literal>scan-&gt;batch_tuples_workspace</literal>: the size in bytes
+      of the per-batch tuple storage workspace used for index-only scans
+      (typically <literal>BLCKSZ</literal>), or 0 if the index AM does not
+      support index-only scans.  The workspace is accessible via
+      <structfield>batch-&gt;currTuples</structfield>.
+     </para>
+    </listitem>
+   </itemizedlist>
   </para>
 
   <para>
@@ -749,6 +781,237 @@ amgettuple (IndexScanDesc scan,
    <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
    struct must be set to NULL.
   </para>
+  <note>
+   <para>
+    As of <productname>PostgreSQL</productname> version 19, position marking
+    and restoration of scans is no longer supported for the
+    <function>amgettuple</function> interface; only the
+    <function>amgetbatch</function> interface supports this feature.
+   </para>
+  </note>
+
+  <para>
+<programlisting>
+IndexScanBatch
+amgetbatch (IndexScanDesc scan,
+            IndexScanBatch priorbatch,
+            ScanDirection direction);
+</programlisting>
+   Return the next batch of index tuples in the given scan, moving in the
+   given direction (forward or backward in the index).  Returns an instance of
+   <type>IndexScanBatch</type> with index tuples loaded, or
+   <literal>NULL</literal> if there are no more index tuples in the given
+   scan direction.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> interface is an alternative to
+   <function>amgettuple</function> that returns matching index entries in batches
+   rather than one at a time.  By returning all matching index entries from a
+   single index page together, the table AM gains visibility into which table
+   blocks will be needed in the near future.
+  </para>
+
+  <para>
+   The table AM passes <literal>priorbatch</literal> to indicate where the
+   index AM should continue scanning from (or <literal>NULL</literal> on the
+   first call for the scan).  The index AM uses information from
+   <literal>priorbatch</literal> to determine which index page to read next.
+   Unlike <function>amgettuple</function>, where the index AM maintains its
+   own scan position, with <function>amgetbatch</function> it is the caller
+   that controls the progress of the scan through the index.  The caller
+   will typically pass the most recently returned batch, but this is not
+   guaranteed &mdash; for example, during mark/restore a previously
+   returned batch may be passed instead.
+  </para>
+
+  <para>
+   A batch returned by <function>amgetbatch</function> is associated with an
+   index page containing at least one matching item/tuple.  A buffer
+   pin can be held onto by the table AM as an interlock against concurrent TID
+   recycling by <command>VACUUM</command>.  The table AM drops this interlock
+   by calling <function>amunguardbatch</function> when it is safe to do so.
+   See <xref linkend="index-locking"/> for details on buffer pin management
+   during index scans.
+  </para>
+
+  <para>
+   An <type>IndexScanBatch</type> that is returned by
+   <function>amgetbatch</function> is no longer managed by the access method.
+   It is up to the table AM caller to decide when it should be freed (via
+   <function>tableam_util_free_batch</function>).  Note also that
+   <function>amgetbatch</function> functions must never modify the
+   <structfield>priorbatch</structfield> parameter.  The core
+   <filename>src/backend/access/nbtree/</filename> implementation provides a
+   reference example of the <function>amgetbatch</function> interface.
+  </para>
+
+  <para>
+   The same caveats described for <function>amgettuple</function> apply here
+   too: an entry in the returned batch means only that the index contains
+   an entry that matches the scan keys, not that the tuple necessarily still
+   exists in the heap or will pass the caller's snapshot test.
+  </para>
+
+  <para>
+   Index access methods using <function>amgetbatch</function> must set
+   <literal>scan-&gt;xs_recheck</literal> to indicate whether rechecking of
+   scan keys is required, in the same way as <function>amgettuple</function>
+   does. However, <literal>scan-&gt;xs_recheck</literal> must be set consistently
+   for an entire scan rather than varying on a per-tuple basis. This is a key
+   difference from <function>amgettuple</function>, which can set
+   <literal>scan-&gt;xs_recheck</literal> independently for each tuple it returns.
+   Index access methods that require granular control over
+   <literal>scan-&gt;xs_recheck</literal> must use the <function>amgettuple</function>
+   interface instead of <function>amgetbatch</function>.
+  </para>
+
+  <para>
+   Similarly, the <function>amgetbatch</function> interface does not currently
+   support index-only scans that return data in the form of a
+   <structname>HeapTuple</structname> pointer.  Index-only scans work by
+   copying <structname>IndexTuple</structname> records from index pages into a
+   local buffer associated with each batch.  <literal>xs_itupdesc</literal>
+   works in the same way as already described for <function>amgettuple</function>.
+   The index access method must not set the <literal>scan-&gt;xs_itup</literal>
+   field itself.
+   With <function>amgettuple</function>, the index AM sets
+   <literal>scan-&gt;xs_hitup</literal> to point to a reconstructed
+   <structname>HeapTuple</structname> whose lifetime extends until the next
+   <function>amgettuple</function> call &mdash; only one tuple is valid at a
+   time.  With <function>amgetbatch</function>, multiple batches are held open
+   simultaneously and items are consumed asynchronously by the table AM, so
+   there is no equivalent single-tuple lifetime for per-item
+   <structname>HeapTuple</structname> pointers.  The batch infrastructure
+   provides per-batch storage for <structname>IndexTuple</structname> copies,
+   but has no analogous mechanism for <structname>HeapTuple</structname> data
+   (used by index AMs such as <acronym>GiST</acronym> and
+   <acronym>SP-GiST</acronym> for reconstructed tuples that might not fit in
+   <structname>IndexTuple</structname> format).  This limitation could be
+   addressed in a future version of <productname>PostgreSQL</productname>.
+  </para>
+
+  <para>
+   The index access method must provide either <function>amgettuple</function>
+   or <function>amgetbatch</function>, but not both.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> function need only be provided if the
+   access method supports <quote>plain</quote> index scans.  If it doesn't,
+   the <function>amgetbatch</function> field in its
+   <structname>IndexAmRoutine</structname> struct must be set to NULL.
+  </para>
+
+  <para>
+<programlisting>
+void
+amkillitemsbatch (IndexScanDesc scan,
+                  IndexScanBatch batch);
+</programlisting>
+   Called by the table AM when it has finished processing a batch that
+   contains dead items, to set <literal>LP_DEAD</literal> bits in the batch's
+   index page.  The batch's index page will not be locked by the caller; the
+   index AM must acquire and release its own lock (and pin) on the index page.
+  </para>
+
+  <para>
+   While implementing <function>amkillitemsbatch</function> is optional,
+   doing so is recommended for performance, as it allows future scans to skip
+   known-dead index entries.  The core index access method that currently
+   supports <function>amgetbatch</function> (B-tree) implements
+   <literal>LP_DEAD</literal> marking, though third-party index access methods
+   are free to choose whether to implement this feature.
+   The table AM may call
+   <function>tableam_util_scanpos_killitem</function> to mark dead items as
+   the scan progresses.  If the batch contains any such dead items, the batch's
+   <structfield>deadItems</structfield> array will have been sorted and
+   deduplicated before <function>amkillitemsbatch</function> is called, with
+   item offsets appearing in ascending order (that is, in index page order,
+   which is also batch order) and no offset appearing more than once.  Index
+   access methods can rely on this ordering when processing dead items: the
+   <structfield>deadItems</structfield> array can be walked in lockstep with
+   the index page's item pointers, since both are in ascending offset order.
+   This also means the table AM need not call
+   <function>tableam_util_scanpos_killitem</function> in any particular order.
+   (Index access methods using <function>amgettuple</function> rely on the
+   <structfield>kill_prior_tuple</structfield> mechanism instead to mark dead
+   tuples; the <filename>src/backend/access/gist/</filename> implementation
+   provides a reference example.)
+  </para>
+
+  <para>
+   When implementing <function>amkillitemsbatch</function>, the index AM
+   should verify that the index page has not been modified since the batch was
+   originally read.  The batch's <structfield>lsn</structfield> field records
+   the page LSN at the time the index page lock was released by
+   <function>indexam_util_batch_unlock</function> (set automatically by the
+   core code, though index AMs whose TID recycling interlock is not just a
+   buffer pin are not obligated to use
+   <function>indexam_util_batch_unlock</function> &mdash; they can implement
+   their own equivalent, and are free to use the batch
+   <structfield>lsn</structfield> field in whatever way they deem
+   necessary).  The index AM should
+   re-read the page, compare the current page LSN against
+   <structfield>batch-&gt;lsn</structfield>, and give up on setting
+   <literal>LP_DEAD</literal> bits if the LSN has advanced.  An advanced LSN
+   indicates that the page was modified &mdash; possibly by
+   <command>VACUUM</command> recycling heap TIDs &mdash; so it would be unsafe
+   to assume that index entries still point to the same heap tuples.  Since
+   <literal>LP_DEAD</literal> marking is only an optimization hint, it is
+   always safe to skip it.  Note that this LSN comparison technique requires
+   the index AM to use fake (monotonically increasing) LSNs on its pages for
+   relations where WAL is not generated, since real LSNs are not available in
+   that case.  See the B-tree index implementation for a reference
+   example of this technique.  An index AM that does not implement fake LSNs
+   can still provide <function>amkillitemsbatch</function>, but should simply
+   do nothing when the relation does not generate WAL (i.e., when
+   <function>RelationNeedsWAL()</function> is false), since the LSN comparison
+   would be unreliable.
+  </para>
+
+  <para>
+   The <function>amkillitemsbatch</function> function is optional.  Index
+   access methods that want to mark dead index tuples with
+   <literal>LP_DEAD</literal> bits should provide it; those that don't can
+   leave it set to <literal>NULL</literal> even when they provide
+   <function>amgetbatch</function>.
+  </para>
+
+  <para>
+<programlisting>
+void
+amunguardbatch (IndexScanDesc scan,
+                IndexScanBatch batch);
+</programlisting>
+   Called by the table AM (via
+   <function>tableam_util_unguard_batch</function>) when it is safe to drop
+   the TID recycling interlock that the index AM holds on the batch's index
+   leaf page, which prevents concurrent TID recycling by
+   <command>VACUUM</command>.
+   Formally, an index AM may hold a different kind of interlock, or multiple
+   interlocks, in its per-batch opaque area, but in practice the built-in
+   index AM that supports <function>amgetbatch</function> &mdash; B-tree
+   &mdash; holds a single buffer pin.  See <xref linkend="index-locking"/>
+   for details on buffer pin management during index scans.  This function
+   will be called exactly once for each guarded batch.
+  </para>
+
+  <para>
+   The index AM may choose to retain its own buffer pins when this serves an
+   internal purpose (for example, maintaining a descent stack of pinned index
+   pages for reuse across <function>amgetbatch</function> calls).  However,
+   any scheme that retains buffer pins managed by the index AM must be sure to
+   free the pins at an opportune point (for example when <function>amrescan</function>
+   and/or <function>amendscan</function> are called).  It must also keep the
+   number of retained pins fixed and small, to avoid exhausting the backend's
+   buffer pin limit.
+  </para>
+
+  <para>
+   The <function>amunguardbatch</function> function is required for any index
+   access method that provides <function>amgetbatch</function>.
+  </para>
 
   <para>
 <programlisting>
@@ -768,8 +1031,8 @@ amgetbitmap (IndexScanDesc scan,
    itself, and therefore callers recheck both the scan conditions and the
    partial index predicate (if any) for recheckable tuples.  That might not
    always be true, however.
-   <function>amgetbitmap</function> and
-   <function>amgettuple</function> cannot be used in the same index scan; there
+   Only one of <function>amgetbitmap</function>, <function>amgettuple</function>,
+   or <function>amgetbatch</function> can be used in any given index scan; there
    are other restrictions too when using <function>amgetbitmap</function>, as explained
    in <xref linkend="index-scanning"/>.
   </para>
@@ -781,6 +1044,29 @@ amgetbitmap (IndexScanDesc scan,
    struct must be set to NULL.
   </para>
 
+  <para>
+   Index access methods that use the <function>amgetbatch</function> interface
+   will generally also want to use the batch allocation infrastructure
+   (<function>indexam_util_batch_alloc</function> and
+   <function>indexam_util_batch_release</function>) within their
+   <function>amgetbitmap</function> implementation.  The convention is that only
+   one batch is allocated at a time during <function>amgetbitmap</function>,
+   unlike <function>amgetbatch</function> where several batches may be
+   outstanding in the batch ring buffer concurrently.  To maintain this
+   one-batch-at-a-time invariant, the index AM itself releases its prior batch
+   via <function>indexam_util_batch_release</function> just as the scan leaves
+   that batch's index page and is about to generate the next batch &mdash; the
+   same point where it extracts navigation state (such as sibling-page links)
+   from <literal>priorbatch</literal>.  This early release is specific to
+   <function>amgetbitmap</function> scans; during <function>amgetbatch</function>
+   scans the <literal>priorbatch</literal> is strictly owned by the table AM
+   and core code, and the index AM must never release it.  See
+   <function>_bt_next</function> for a
+   reference example.  The released batch is cached internally and reused by
+   the next <function>indexam_util_batch_alloc</function> call, avoiding
+   repeated memory allocation during the bitmap scan.
+  </para>
+
   <para>
 <programlisting>
 void
@@ -795,32 +1081,44 @@ amendscan (IndexScanDesc scan);
   <para>
 <programlisting>
 void
-ammarkpos (IndexScanDesc scan);
+amposreset (IndexScanDesc scan,
+            IndexScanBatch batch);
 </programlisting>
-   Mark current scan position.  The access method need only support one
-   remembered scan position per scan.
+   Notify the index AM that the table AM is about to change the scan's
+   logical position in a way that requires the index AM to reset any state
+   that independently tracks the scan's progress.  For example, B-tree must
+   reset the array keys used by <literal>ScalarArrayOpExpr</literal> qual
+   evaluation when the scan position changes.  This callback is invoked when
+   the table AM is about to process a batch in a different direction than
+   was used when the batch was originally returned by
+   <function>amgetbatch</function>, and also when a marked scan position is
+   about to be restored.
   </para>
 
   <para>
-   The <function>ammarkpos</function> function need only be provided if the access
-   method supports ordered scans.  If it doesn't,
-   the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
-   struct may be set to NULL.
+   When <function>amposreset</function> is called due to a cross-batch
+   direction change, the core system will have already flipped the batch's
+   <structfield>dir</structfield> field to reflect the new scan direction
+   before making the call.  The index AM should use this updated direction
+   when resetting any state that depends on knowing which way the scan is
+   proceeding.  When called to restore a marked position, the batch's
+   <structfield>dir</structfield> is not modified; it retains the direction
+   from when the batch was originally returned.  In both cases, the batch
+   passed to <function>amposreset</function> is the batch that will be used
+   to continue the scan.
   </para>
 
   <para>
-<programlisting>
-void
-amrestrpos (IndexScanDesc scan);
-</programlisting>
-   Restore the scan to the most recently marked position.
-  </para>
-
-  <para>
-   The <function>amrestrpos</function> function need only be provided if the access
-   method supports ordered scans.  If it doesn't,
-   the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
-   struct may be set to NULL.
+   Index access methods that have private state which must be reset when the
+   scan position changes must provide an <function>amposreset</function>
+   implementation.  Index AMs with no such state may set
+   <structfield>amposreset</structfield> to NULL.
+   The <function>amposreset</function> function can only be provided when the
+   access method supports ordered scans through the <function>amgetbatch</function>
+   interface.  (Note that when <structfield>amcanbackward</structfield> is
+   false, the scan direction cannot change, so
+   <function>amposreset</function> will only be called due to mark/restore
+   in that case.)
   </para>
 
   <para>
@@ -975,6 +1273,8 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
        Access methods that always return entries in the natural ordering
        of their data (such as btree) should set
        <structfield>amcanorder</structfield> to true.
+       Both <function>amgettuple</function> and <function>amgetbatch</function>
+       scans support this capability.
        Currently, such access methods must use btree-compatible strategy
        numbers for their equality and ordering operators.
       </para>
@@ -994,34 +1294,42 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
   </para>
 
   <para>
-   The <function>amgettuple</function> function has a <literal>direction</literal> argument,
+   Note that <function>amgetbatch</function> scans do not currently support
+   ordering operators.  The core executor expects <function>amgettuple</function>
+   to set <structfield>xs_orderbyvals</structfield> for each returned tuple,
+   but there is currently no mechanism to associate per-item ordering values
+   with individual items within a batch.  This would require an additional
+   layer of indirection that does not yet exist, but could be added in a
+   future version of <productname>PostgreSQL</productname>.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> function has a <literal>direction</literal> argument,
    which can be either <literal>ForwardScanDirection</literal> (the normal case)
    or  <literal>BackwardScanDirection</literal>.  If the first call after
    <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
-   set of matching index entries is to be scanned back-to-front rather than in
-   the normal front-to-back direction, so <function>amgettuple</function> must return
-   the last matching tuple in the index, rather than the first one as it
-   normally would.  (This will only occur for access
-   methods that set <structfield>amcanorder</structfield> to true.)  After the
-   first call, <function>amgettuple</function> must be prepared to advance the scan in
+   returned batch must be the batch containing the last matching item(s),
+   rather than the batch containing the first matching item(s).
+   <function>amgetbatch</function> must be prepared to advance the scan in
    either direction from the most recently returned entry.  (But if
    <structfield>amcanbackward</structfield> is false, all subsequent
    calls will have the same direction as the first one.)
   </para>
 
   <para>
-   Access methods that support ordered scans must support <quote>marking</quote> a
-   position in a scan and later returning to the marked position.  The same
-   position might be restored multiple times.  However, only one position need
-   be remembered per scan; a new <function>ammarkpos</function> call overrides the
-   previously marked position.  An access method that does not support ordered
-   scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
-   functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
-   instead.
+   Scans using the <function>amgetbatch</function> interface support
+   <quote>marking</quote> a position in a scan and later returning to the
+   marked position.  The core executor manages the process of saving and
+   restoring batch positional state without explicitly coordinating with the
+   table AM.  However, it will call the index AM's <function>amposreset</function>
+   callback as needed when restoring a mark, to invalidate any index AM state
+   that independently tracks the progress of the scan (such as array key
+   state).  See the description of <function>amposreset</function> in
+   <xref linkend="index-functions"/> for details.
   </para>
 
   <para>
-   Both the scan position and the mark position (if any) must be maintained
+   The scan position (if any) must be maintained by the table AM and index AM
    consistently in the face of concurrent insertions or deletions in the
    index.  It is OK if a freshly-inserted entry is not returned by a scan that
    would have found the entry if it had existed when the scan started, or for
@@ -1044,12 +1352,14 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
   </para>
 
   <para>
-   Instead of using <function>amgettuple</function>, an index scan can be done with
-   <function>amgetbitmap</function> to fetch all tuples in one call.  This can be
-   noticeably more efficient than <function>amgettuple</function> because it allows
-   avoiding lock/unlock cycles within the access method.  In principle
-   <function>amgetbitmap</function> should have the same effects as repeated
-   <function>amgettuple</function> calls, but we impose several restrictions to
+   Instead of using <function>amgettuple</function> or
+   <function>amgetbatch</function>, an index scan can be done with
+   <function>amgetbitmap</function> to fetch all tuples in one call.  This can
+   be noticeably more efficient than an <quote>ordered</quote> scan
+   because it allows efficient sequential access to table AM pages containing
+   matches.  In principle <function>amgetbitmap</function> should have the
+   same effects as repeated <function>amgettuple</function> or
+   <function>amgetbatch</function> calls, but we impose several restrictions to
    simplify matters.  First of all, <function>amgetbitmap</function> returns all
    tuples at once and marking or restoring scan positions isn't
    supported. Secondly, the tuples are returned in a bitmap which doesn't
@@ -1059,15 +1369,15 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    Also, there is no provision for index-only scans with
    <function>amgetbitmap</function>, since there is no way to return the contents of
    index tuples.
-   Finally, <function>amgetbitmap</function>
-   does not guarantee any locking of the returned tuples, with implications
-   spelled out in <xref linkend="index-locking"/>.
+   Finally, <function>amgetbitmap</function> does not hold any index page pins
+   after it returns (similarly to <function>amgetbatch</function> scans with
+   an MVCC snapshot), as described in <xref linkend="index-locking"/>.
   </para>
 
   <para>
    Note that it is permitted for an access method to implement only
-   <function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa,
-   if its internal implementation is unsuited to one API or the other.
+   <function>amgetbitmap</function> and not <function>amgettuple</function>/<function>amgetbatch</function>,
+   or vice versa, if its internal implementation is unsuited to one API or the other.
   </para>
 
  </sect1>
@@ -1123,11 +1433,17 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
      </listitem>
      <listitem>
       <para>
-       An index scan must maintain a pin
-       on the index page holding the item last returned by
-       <function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete
-       entries from pages that are pinned by other backends.  The need
-       for this rule is explained below.
+       A pin must be held on any index page whose items might still need to
+       be followed, and <function>ambulkdelete</function> must acquire a
+       cleanup lock on each index page, which will block if any other
+       backend holds a pin on that page.
+       For <function>amgettuple</function> scans, the index access method
+       manages this pin directly.
+       For <function>amgetbatch</function> scans, the index AM holds a buffer
+       pin on each batch's index leaf page (in its per-batch opaque area),
+       while the table AM controls when the interlock is dropped via
+       <function>amunguardbatch</function>.
+       The need for this rule is explained below.
       </para>
      </listitem>
     </itemizedlist>
@@ -1138,39 +1454,91 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    <command>VACUUM</command>.
    This creates no serious problems if that item
    number is still unused when the reader reaches it, since an empty
-   item slot will be ignored by <function>heap_fetch()</function>.  But what if a
+   item slot will simply be treated as not-visible.  But what if a
    third backend has already re-used the item slot for something else?
    When using an MVCC-compliant snapshot, there is no problem because
    the new occupant of the slot is certain to be too new to pass the
    snapshot test.  However, with a non-MVCC-compliant snapshot (such as
    <literal>SnapshotAny</literal>), it would be possible to accept and return
-   a row that does not in fact match the scan keys.  We could defend
-   against this scenario by requiring the scan keys to be rechecked
-   against the heap row in all cases, but that is too expensive.  Instead,
-   we use a pin on an index page as a proxy to indicate that the reader
-   might still be <quote>in flight</quote> from the index entry to the matching
-   heap entry.  Making <function>ambulkdelete</function> block on such a pin ensures
-   that <command>VACUUM</command> cannot delete the heap entry before the reader
-   is done with it.  This solution costs little in run time, and adds blocking
-   overhead only in the rare cases where there actually is a conflict.
+   a wholly unrelated row (one that does not necessarily satisfy the scan
+   keys).  We can optionally use a pin on an index page as a proxy to indicate
+   that the reader might still be <quote>in flight</quote> from the index
+   entry to the matching heap entry.  Making <function>ambulkdelete</function>
+   block on such a pin ensures that <command>VACUUM</command> cannot delete
+   the heap entry before the reader is done with it.  This solution costs
+   little in run time, and adds blocking overhead only in the rare cases where
+   there actually is a conflict.  When the scan uses an MVCC-compliant
+   snapshot, holding the pin is unnecessary because the snapshot itself will
+   reject any recycled TID's new occupant (see below).
   </para>
 
   <para>
-   This solution requires that index scans be <quote>synchronous</quote>: we have
-   to fetch each heap tuple immediately after scanning the corresponding index
-   entry.  This is expensive for a number of reasons.  An
-   <quote>asynchronous</quote> scan in which we collect many TIDs from the index,
-   and only visit the heap tuples sometime later, requires much less index
-   locking overhead and can allow a more efficient heap access pattern.
-   Per the above analysis, we must use the synchronous approach for
-   non-MVCC-compliant snapshots, but an asynchronous scan is workable
-   for a query using an MVCC snapshot.
+   This solution requires that <function>amgettuple</function> index scans be
+   <quote>synchronous</quote>: the table AM must fetch each heap tuple
+   immediately after scanning the corresponding index entry.  This is
+   expensive for a number of reasons.  The
+   <function>amgetbatch</function> interface, by contrast, was designed to
+   allow scans to be <quote>asynchronous</quote>: by collecting batches of
+   TIDs from multiple index pages, the table AM can prefetch the corresponding
+   table blocks well ahead of the current scan position (using asynchronous
+   I/O when available), allowing a more efficient heap access pattern.  Not
+   all scans end up being asynchronous in practice, but the interface is
+   designed to allow it.  Per the above analysis, we must use the synchronous
+   approach for non-MVCC-compliant snapshots, but an asynchronous scan is
+   workable for a query using an MVCC snapshot.
   </para>
 
   <para>
-   In an <function>amgetbitmap</function> index scan, the access method does not
-   keep an index pin on any of the returned tuples.  Therefore
-   it is only safe to use such scans with MVCC-compliant snapshots.
+   Because the table AM reads multiple index leaf pages ahead via
+   <function>amgetbatch</function> to facilitate this prefetching, it cannot
+   practically hold pins on all those pages simultaneously.  Therefore,
+   I/O prefetching with
+   <function>amgetbatch</function> is only possible when an MVCC-compliant
+   snapshot is in use.
+  </para>
+
+  <para>
+   Whether a batch's TID recycling interlock (typically an index page buffer
+   pin) is dropped immediately or deferred is controlled by a generic,
+   scan-level policy that is determined when the scan is opened &mdash; it is
+   not under the control of either the index AM or the table AM.  The scan's
+   <structfield>batchImmediateUnguard</structfield> flag encodes this policy.
+   It is set based on two criteria that are known to the core scan machinery:
+   whether the scan uses an MVCC-compliant snapshot, and whether it is an
+   index-only scan.  Specifically,
+   <structfield>batchImmediateUnguard</structfield> is true when the scan uses
+   an MVCC snapshot and is <emphasis>not</emphasis> an index-only scan.  When
+   <structfield>batchImmediateUnguard</structfield> is true, the interlock is
+   dropped inside <function>indexam_util_batch_unlock</function> (before the
+   batch is even returned to the table AM), because a plain index scan with an
+   MVCC snapshot will always visit the heap page, where the MVCC visibility
+   check is authoritative &mdash; even if <command>VACUUM</command> recycles a
+   TID, the new occupant cannot pass the snapshot test.  When it is false, the
+   interlock is retained until the table AM explicitly calls
+   <function>amunguardbatch</function>, because the scan cannot rely on that
+   heap page MVCC backstop.  For non-MVCC scans, there is no MVCC snapshot to
+   reject a recycled TID's new occupant at all.  For index-only scans, even
+   with an MVCC snapshot, the scan typically avoids visiting the heap page
+   altogether (using the visibility map instead), so the MVCC check that would
+   catch a recycled TID usually never runs.  In both cases the interlock on
+   the index page is what prevents <command>VACUUM</command> from recycling
+   TIDs while the scan is still in flight.  In all cases, the table AM decides
+   <emphasis>when</emphasis> to call <function>amunguardbatch</function>; the
+   index AM decides <emphasis>what</emphasis> to release.
+  </para>
+
+  <para>
+   Similarly, an <function>amgetbitmap</function> index scan is inherently
+   asynchronous: all matching TIDs are collected into a bitmap before any heap
+   access begins.  Such scans therefore require an MVCC-compliant snapshot,
+   and there is no need for the access method to hold index page pins.
+  </para>
+
+  <para>
+   Index access methods that use <function>amgettuple</function> must manage
+   pin lifetime themselves, since there is no table AM intermediary (unlike
+   with <function>amgetbatch</function>).  The index AM must hold a pin on the
+   current index page until the scan moves to a different page or ends.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 80829b239..cf0ba5b0c 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1173,12 +1173,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      </para>
 
      <para>
-      The access method must support <literal>amgettuple</literal> (see <xref
-      linkend="indexam"/>); at present this means <acronym>GIN</acronym>
-      cannot be used.  Although it's allowed, there is little point in using
-      B-tree or hash indexes with an exclusion constraint, because this
-      does nothing that an ordinary unique constraint doesn't do better.
-      So in practice the access method will always be <acronym>GiST</acronym> or
+      The access method must support either <literal>amgettuple</literal>
+      or <literal>amgetbatch</literal> (see <xref linkend="indexam"/>); at
+      present this means <acronym>GIN</acronym> cannot be used.  Although
+      it's allowed, there is little point in using B-tree or hash indexes
+      with an exclusion constraint, because this does nothing that an
+      ordinary unique constraint doesn't do better.  So in practice the
+      access method will always be <acronym>GiST</acronym> or
       <acronym>SP-GiST</acronym>.
      </para>
 
diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c
index 31f8d2b81..e2b865fb4 100644
--- a/src/test/modules/dummy_index_am/dummy_index_am.c
+++ b/src/test/modules/dummy_index_am/dummy_index_am.c
@@ -334,10 +334,12 @@ dihandler(PG_FUNCTION_ARGS)
 		.ambeginscan = dibeginscan,
 		.amrescan = direscan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amkillitemsbatch = NULL,
+		.amunguardbatch = NULL,
 		.amgetbitmap = NULL,
 		.amendscan = diendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5bc517602..23e043cc0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -206,6 +206,7 @@ BOOL
 BOOLEAN
 BOX
 BTArrayKeyInfo
+BTBatchData
 BTBuildState
 BTCallbackState
 BTCycleId
@@ -233,8 +234,6 @@ BTScanInsertData
 BTScanKeyPreproc
 BTScanOpaque
 BTScanOpaqueData
-BTScanPosData
-BTScanPosItem
 BTShared
 BTSortArrayContext
 BTSpool
@@ -263,6 +262,9 @@ BaseBackupCmd
 BaseBackupTargetHandle
 BaseBackupTargetType
 BatchMVCCState
+BatchMatchingItem
+BatchRingBuffer
+BatchRingItemPos
 BeginDirectModify_function
 BeginForeignInsert_function
 BeginForeignModify_function
@@ -1238,6 +1240,7 @@ HbaLine
 HeadlineJsonState
 HeadlineParsedText
 HeadlineWordEntry
+HeapBatchData
 HeapCheckContext
 HeapCheckReadStreamData
 HeapPageFreeze
@@ -1317,6 +1320,8 @@ IndexOrderByDistance
 IndexPath
 IndexRuntimeKeyInfo
 IndexScan
+IndexScanBatch
+IndexScanBatchData
 IndexScanDesc
 IndexScanDescData
 IndexScanInstrumentation
@@ -3531,18 +3536,17 @@ amcanreturn_function
 amcostestimate_function
 amendscan_function
 amestimateparallelscan_function
+amgetbatch_function
 amgetbitmap_function
 amgettreeheight_function
 amgettuple_function
 aminitparallelscan_function
 aminsert_function
 aminsertcleanup_function
-ammarkpos_function
 amoptions_function
 amparallelrescan_function
 amproperty_function
 amrescan_function
-amrestrpos_function
 amtranslate_cmptype_function
 amtranslate_strategy_function
 amvacuumcleanup_function
-- 
2.53.0
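
Illustrative aside (not part of the patch): the TID recycling interlock policy described in the index-locking documentation changes above reduces to a single predicate over two facts known when the scan is opened. The following is a minimal sketch under that reading; the function and parameter names are hypothetical and are not taken from the patch, which only names the resulting batchImmediateUnguard flag.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (name and signature are assumptions, not the patch's
 * API).  Returns true when the interlock on a batch's index page may be
 * dropped immediately, per the documented rule: the scan uses an MVCC
 * snapshot AND is not an index-only scan.
 */
bool
toy_batch_immediate_unguard(bool mvcc_snapshot, bool index_only)
{
	/*
	 * A plain MVCC scan always visits the heap page, where the snapshot test
	 * rejects any new occupant of a recycled TID.  Non-MVCC scans have no
	 * such backstop, and index-only scans may skip the heap page entirely,
	 * so both must retain the interlock until it is explicitly released.
	 */
	return mvcc_snapshot && !index_only;
}
```

Of the four combinations, only a plain index scan with an MVCC snapshot yields true; the other three keep the interlock until the table AM calls amunguardbatch.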



  [application/octet-stream] v20-0007-heapam-Optimize-pin-transfers-during-index-scans.patch (6.6K, 15-v20-0007-heapam-Optimize-pin-transfers-during-index-scans.patch)
From 39eab4e298471bc22376549f1b2a638fd766d488 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sun, 22 Mar 2026 02:22:06 -0400
Subject: [PATCH v20 07/17] heapam: Optimize pin transfers during index scans.

Add an xs_lastinblock flag to IndexFetchHeapData that tracks whether the
current TID is the last one on its heap block within the current batch.
When it is, heapam_index_fetch_tuple can transfer its buffer pin to the
slot (via ExecStorePinnedBufferHeapTuple) instead of incrementing the
pin count, saving a pair of IncrBufferRefCount/ReleaseBuffer calls.
This optimization is not used for index-only scans because all-visible
items can be skipped, which would break block deduplication symmetry
between the scan and the read stream (besides, the performance of this
code path can only matter when many heap fetches are required, which is
hopefully rare during index-only scans).

Also add an explicit ExecClearTuple to the block-switch path in
heapam_index_fetch_tuple_impl to release the pin transferred to the slot
on the previous call (just before calling ReleaseBuffer as part of
moving on to the scan's next block).  This fixes a performance problem
where the code path in question triggers GetPrivateRefCountEntrySlow
calls more often than one would hope.  The underlying issue has been
tied to the pin in the slot being held, even if we decide to release the
buffer and move on: ExecStoreBufferHeapTuple will first fail to hit the
backend-local cache for the release of the old pin (because we just
pinned and locked the new buffer), causing a cache miss inside
IncrBufferRefCount.

Author: Peter Geoghegan <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wz=D4Lru9BkvqaRnFRPDaZbfTOdWcxw13zyG6GVFTtz_vw@mail.gmail.com
---
 src/include/access/heapam.h                |  3 +
 src/backend/access/heap/heapam_indexscan.c | 73 +++++++++++++++++++++-
 2 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1c5570ac0..c3bb89538 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -132,6 +132,9 @@ typedef struct IndexFetchHeapData
 	Buffer		xs_vmbuffer;	/* visibility map buffer */
 	int			xs_vm_items;	/* # items to resolve visibility info for */
 
+	/* Plain index scan xs_lastinblock optimization */
+	bool		xs_lastinblock; /* last TID on this block in current batch? */
+
 } IndexFetchHeapData;
 
 /*
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index 885c25c67..7a6b49ee5 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -78,6 +78,9 @@ heapam_index_fetch_begin(Relation rel, uint32 flags)
 	Assert(hscan->xs_vmbuffer == InvalidBuffer);
 	hscan->xs_vm_items = 1;
 
+	/* xs_lastinblock optimization state */
+	Assert(!hscan->xs_lastinblock);
+
 	/*
 	 * Return opaque state, which we'll access through the scan's xs_heapfetch
 	 * field later on.
@@ -445,6 +448,14 @@ heapam_index_fetch_tuple_impl(Relation rel,
 		/* Remember this buffer's block number for next time */
 		hscan->xs_blk = ItemPointerGetBlockNumber(tid);
 
+		/*
+		 * Drop the xs_blk pin independently held by the slot (if any) now,
+		 * before calling ReleaseBuffer.  This avoids expensive calls to
+		 * GetPrivateRefCountEntrySlow caused by ExecStoreBufferHeapTuple
+		 * failing to hit the backend's cache for the release of the old pin.
+		 */
+		ExecClearTuple(slot);
+
 		if (BufferIsValid(hscan->xs_cbuf))
 			ReleaseBuffer(hscan->xs_cbuf);
 
@@ -482,7 +493,33 @@ heapam_index_fetch_tuple_impl(Relation rel,
 		*heap_continue = !IsMVCCLikeSnapshot(snapshot);
 
 		slot->tts_tableOid = RelationGetRelid(rel);
-		ExecStoreBufferHeapTuple(&bslot->base.tupdata, slot, hscan->xs_cbuf);
+
+		/*
+		 * If this is the last TID on the current heap block within the batch,
+		 * transfer our buffer pin to the slot rather than having the slot
+		 * increment the pin count.  This saves a pair of IncrBufferRefCount
+		 * and ReleaseBuffer calls, since the caller would just release its
+		 * pin on xs_cbuf when switching to the next block anyway.
+		 *
+		 * We can only do this when heap_continue is false, since otherwise
+		 * the caller will need xs_cbuf to remain valid for the next call.
+		 */
+		if (hscan->xs_lastinblock && !*heap_continue)
+		{
+			ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata, slot,
+										   hscan->xs_cbuf);
+			hscan->xs_cbuf = InvalidBuffer;
+			hscan->xs_blk = InvalidBlockNumber;
+
+			/*
+			 * Note: the pin now owned by the slot is expected to be released
+			 * on the next call here, via an explicit ExecClearTuple.  This
+			 * avoids churn in the backend's private refcount cache.
+			 */
+		}
+		else
+			ExecStoreBufferHeapTuple(&bslot->base.tupdata, slot,
+									 hscan->xs_cbuf);
 	}
 	else
 	{
@@ -795,10 +832,42 @@ heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
 
 	if (all_visible == NULL)
 	{
+		int			nextItem;
+		bool		hasNext;
+
 		/*
 		 * Plain index scan.
+		 *
+		 * Determine if the next item in the current scan direction is on a
+		 * different heap block.  When it is, heapam_index_fetch_tuple_impl
+		 * can transfer its buffer pin to the slot instead of incrementing the
+		 * pin count, saving a pair of IncrBufferRefCount/ReleaseBuffer calls.
+		 *
+		 * Note: We cannot do this for index-only scans because all-visible
+		 * items are skipped by both the scan and the read stream callback. It
+		 * doesn't seem worth the trouble of reasoning about these issues,
+		 * since the optimization only helps when heap fetches are required.
+		 *
+		 * Note: We deliberately don't consider the batch after scanBatch,
+		 * because doing so would add complexity for little benefit.  It's
+		 * okay if xs_lastinblock is spuriously set to false.
 		 */
 		Assert(!scan->xs_want_itup);
+		if (ScanDirectionIsForward(direction))
+		{
+			nextItem = scanPos->item + 1;
+			hasNext = (nextItem <= scanBatch->lastItem);
+		}
+		else
+		{
+			nextItem = scanPos->item - 1;
+			hasNext = (nextItem >= scanBatch->firstItem);
+		}
+
+		hscan->xs_lastinblock = hasNext &&
+			ItemPointerGetBlockNumber(&scanBatch->items[nextItem].tableTid) !=
+			ItemPointerGetBlockNumber(&scan->xs_heaptid);
+
 		return &scan->xs_heaptid;
 	}
 
@@ -807,7 +876,7 @@ heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
 	 *
 	 * Also set xs_itup, which caller also needs.
 	 */
-	Assert(scan->xs_want_itup);
+	Assert(scan->xs_want_itup && !hscan->xs_lastinblock);
 	scan->xs_itup = (IndexTuple) (scanBatch->currTuples +
 								  scanBatch->items[scanPos->item].tupleOffset);
 
-- 
2.53.0
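
Illustrative aside (not part of the patch): the xs_lastinblock test in heapam_index_return_scanpos_tid above can be exercised in isolation. The following is a minimal sketch that assumes a simplified batch holding only the heap block number of each item; ToyBatch, BlockNum and toy_last_in_block are hypothetical names, not the patch's types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNum;		/* stand-in for BlockNumber */

/* Hypothetical simplified batch: heap block of each item, in index order */
typedef struct ToyBatch
{
	const BlockNum *blocks;		/* blocks[i] is the heap block of item i */
	int			firstItem;		/* first valid item offset */
	int			lastItem;		/* last valid item offset */
} ToyBatch;

/*
 * Returns true when the next item in the scan direction exists within this
 * batch and lives on a different heap block than the current item.  When the
 * current item is the last one in the batch for this direction, the result is
 * conservatively false, mirroring the patch's decision not to look at the
 * following batch.
 */
bool
toy_last_in_block(const ToyBatch *batch, int item, bool forward)
{
	int			next = forward ? item + 1 : item - 1;
	bool		has_next = forward ? (next <= batch->lastItem)
								   : (next >= batch->firstItem);

	assert(item >= batch->firstItem && item <= batch->lastItem);

	return has_next && batch->blocks[next] != batch->blocks[item];
}
```

A spuriously false result only costs the optimization (the pin is copied rather than transferred); it never affects correctness, which is why the batch boundary case can be handled so simply.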



  [application/octet-stream] v20-0004-heapam-Track-heap-block-in-IndexFetchHeapData.patch (4.6K, 16-v20-0004-heapam-Track-heap-block-in-IndexFetchHeapData.patch)
From e159d0a4bd3571911f7c6787cde51da993a4a7a7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 10 Mar 2026 14:40:35 -0400
Subject: [PATCH v20 04/17] heapam: Track heap block in IndexFetchHeapData.

Add an explicit BlockNumber field (xs_blk) to IndexFetchHeapData that
tracks which heap block is currently pinned in xs_cbuf.

heapam_index_fetch_tuple now uses xs_blk to determine when buffer
switching is needed, replacing the previous approach that compared
buffer identities via ReleaseAndReadBuffer on every non-HOT-chain call.

This is preparatory work for an upcoming commit that will add index
prefetching using a read stream.  Delegating the release of a currently
pinned buffer to ReleaseAndReadBuffer won't work anymore -- at least not
when the next buffer that the scan needs to pin is one returned by
read_stream_next_buffer (not a buffer returned by ReadBuffer).

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
---
 src/include/access/heapam.h                |  5 ++--
 src/backend/access/heap/heapam_indexscan.c | 35 ++++++++++++++--------
 2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index d76d3e7fd..a78fc0df2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -122,10 +122,11 @@ typedef struct IndexFetchHeapData
 	IndexFetchTableData xs_base;	/* AM independent part of the descriptor */
 
 	/*
-	 * Current heap buffer in scan, if any. NB: if xs_cbuf is not
-	 * InvalidBuffer, we hold a pin on that buffer.
+	 * Current heap buffer in scan (and its block number), if any.  NB: if
+	 * xs_blk is not InvalidBlockNumber, we hold a pin in xs_cbuf.
 	 */
 	Buffer		xs_cbuf;
+	BlockNumber xs_blk;
 
 	/* Current heap block's corresponding page in the visibility map */
 	Buffer		xs_vmbuffer;
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index 23635b6bc..459b69eee 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -50,6 +50,7 @@ heapam_index_fetch_begin(Relation rel, uint32 flags)
 
 	hscan->xs_base.flags = flags;
 	hscan->xs_cbuf = InvalidBuffer;
+	hscan->xs_blk = InvalidBlockNumber;
 	hscan->xs_vmbuffer = InvalidBuffer;
 
 	/*
@@ -68,6 +69,7 @@ heapam_index_fetch_reset(IndexScanDesc scan)
 	{
 		ReleaseBuffer(hscan->xs_cbuf);
 		hscan->xs_cbuf = InvalidBuffer;
+		hscan->xs_blk = InvalidBlockNumber;
 	}
 
 	if (BufferIsValid(hscan->xs_vmbuffer))
@@ -324,23 +326,30 @@ heapam_index_fetch_tuple_impl(Relation rel,
 
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
-	if (!*heap_continue)
+	/* We can skip the buffer-switching logic if we're on the same page. */
+	if (hscan->xs_blk != ItemPointerGetBlockNumber(tid))
 	{
-		/* Switch to correct buffer if we don't have it already */
-		Buffer		prev_buf = hscan->xs_cbuf;
+		Assert(!*heap_continue);
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf, rel,
-											  ItemPointerGetBlockNumber(tid));
+		/* Remember this buffer's block number for next time */
+		hscan->xs_blk = ItemPointerGetBlockNumber(tid);
+
+		if (BufferIsValid(hscan->xs_cbuf))
+			ReleaseBuffer(hscan->xs_cbuf);
+
+		hscan->xs_cbuf = ReadBuffer(rel, hscan->xs_blk);
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * Prune page when it is pinned for the first time
 		 */
-		if (prev_buf != hscan->xs_cbuf)
-			heap_page_prune_opt(rel, hscan->xs_cbuf, &hscan->xs_vmbuffer,
-								hscan->xs_base.flags & SO_HINT_REL_READ_ONLY);
+		heap_page_prune_opt(rel, hscan->xs_cbuf,
+							&hscan->xs_vmbuffer,
+							hscan->xs_base.flags & SO_HINT_REL_READ_ONLY);
 	}
 
+	Assert(BufferGetBlockNumber(hscan->xs_cbuf) == hscan->xs_blk);
+	Assert(hscan->xs_blk == ItemPointerGetBlockNumber(tid));
+
 	/* Obtain share-lock on the buffer so we can examine visibility */
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
 	got_heap_tuple = heap_hot_search_buffer(tid,
@@ -443,11 +452,11 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 					 */
 					if (unlikely(scan->xs_visited_pages_limit > 0))
 					{
-						BlockNumber blk = ItemPointerGetBlockNumber(tid);
+						Assert(hscan->xs_blk == ItemPointerGetBlockNumber(tid));
 
-						if (blk != last_visited_block)
+						if (hscan->xs_blk != last_visited_block)
 						{
-							last_visited_block = blk;
+							last_visited_block = hscan->xs_blk;
 							if (++n_visited_pages > scan->xs_visited_pages_limit)
 								return false;	/* give up */
 						}
-- 
2.53.0
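
Illustrative aside (not part of the patch): the xs_blk change above replaces "ask the buffer manager whether the buffer changed" with "compare block numbers before touching the buffer manager at all". The following is a minimal sketch of that control flow with the buffer manager calls replaced by a counter; ToyFetchState and the toy_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/* Hypothetical reduced form of the per-scan heap fetch state */
typedef struct ToyFetchState
{
	uint32_t	cur_blk;		/* block currently "pinned", if any */
	int			nswitches;		/* simulated release + read cycles */
} ToyFetchState;

void
toy_fetch_init(ToyFetchState *st)
{
	st->cur_blk = TOY_INVALID_BLOCK;
	st->nswitches = 0;
}

/* Process one TID whose heap block is blk */
void
toy_fetch_tid(ToyFetchState *st, uint32_t blk)
{
	if (st->cur_blk != blk)
	{
		/*
		 * The real code releases the old pin here (if any), reads the new
		 * block, and opportunistically prunes the newly pinned page.
		 */
		st->cur_blk = blk;
		st->nswitches++;
	}

	/* Same invariant the patch asserts before locking the buffer */
	assert(st->cur_blk == blk);
}
```

Consecutive TIDs on the same block take the cheap path; only a change of block number counts as a switch, including a return to a block that was visited earlier.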



  [application/octet-stream] v20-0003-Add-slot-based-table-AM-index-scan-interface.patch (75.4K, 17-v20-0003-Add-slot-based-table-AM-index-scan-interface.patch)
From 9faede5411de068587c64489f9ab95bc70335213 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sun, 22 Mar 2026 02:36:57 -0400
Subject: [PATCH v20 03/17] Add slot-based table AM index scan interface.

Add table_index_getnext_slot, a new table AM callback that wraps both
plain and index-only index scans that use amgettuple.  Two new
TableAmRoutine callbacks are introduced -- one for plain scans and one
for index-only scans -- which an upcoming commit that adds the
amgetbatch interface will expand to four.  The appropriate callback is
resolved once in index_beginscan, and called through a function pointer
(xs_getnext_slot) on the IndexScanDesc when the table_index_getnext_slot
shim function is called from executor nodes.
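
The resolve-once dispatch described above can be illustrated with a
small standalone sketch.  This is not code from the patch: DemoScan,
DemoTableAm and every demo_* function are hypothetical stand-ins for
IndexScanDescData, TableAmRoutine and the real callbacks, and the "slot"
is just an int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the PostgreSQL definitions. */
typedef struct DemoScan DemoScan;
typedef bool (*demo_getnext_fn) (DemoScan *scan, int direction, int *slot);

typedef struct DemoTableAm
{
	demo_getnext_fn plain_next;	/* cf. index_plain_amgettuple_next */
	demo_getnext_fn only_next;	/* cf. index_only_amgettuple_next */
} DemoTableAm;

struct DemoScan
{
	demo_getnext_fn getnext_slot;	/* cf. xs_getnext_slot */
	int			remaining;		/* toy scan state: tuples left to return */
};

static bool
demo_plain_next(DemoScan *scan, int direction, int *slot)
{
	(void) direction;
	if (scan->remaining <= 0)
		return false;
	*slot = scan->remaining--;	/* pretend this is a heap tuple */
	return true;
}

static bool
demo_only_next(DemoScan *scan, int direction, int *slot)
{
	(void) direction;
	if (scan->remaining <= 0)
		return false;
	*slot = -(scan->remaining--);	/* pretend this came from the index */
	return true;
}

static const DemoTableAm demo_am = {demo_plain_next, demo_only_next};

/* Resolve the callback once, when the scan begins. */
static void
demo_beginscan(DemoScan *scan, const DemoTableAm *am, bool index_only,
			   int ntuples)
{
	scan->getnext_slot = index_only ? am->only_next : am->plain_next;
	scan->remaining = ntuples;
}

/* Shim: one indirect call per tuple, no per-call test of the scan type. */
static inline bool
demo_getnext_slot(DemoScan *scan, int direction, int *slot)
{
	assert(scan->getnext_slot != NULL);
	return scan->getnext_slot(scan, direction, slot);
}
```

A caller only ever sees demo_beginscan and demo_getnext_slot; which
variant runs is decided by the index_only argument passed at scan start.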

This moves VM checks for index-only scans out of the executor and into
heapam, enabling batching of visibility map lookups (though for now we
continue to just perform retail lookups).  Using the new higher level
slot-based interface greatly simplifies nodeIndexonlyscan.c, which no
longer has to deal with the visibility map directly.  More importantly,
this is a significant architectural improvement: table AMs can now
implement index-only scans that are not tied to heapam's visibility map.

A small minority of callers (2 callers in total) fundamentally need to
pass a TID to the table AM (both perform constraint enforcement).  These
callers don't actually perform index scans (even if their TIDs are taken
from an index), and have no need for most of the index scan machinery.
Switch these callers over to the new fetch_tid interface (which replaces
the previous TID-based index_fetch_tuple interface).  All index scan
callers now use the new slot-based interface (table_index_getnext_slot).

Index-only scan callers pass table_index_getnext_slot a TupleTableSlot
(which the table AM needs internally for heap fetches), but continue to
read their results from IndexScanDescData fields such as xs_itup (rather
than from the slot itself).  All callers can continue to rely on the
scan descriptor's xs_heaptid field being set on each call.

The VISITED_PAGES_LIMIT mechanism used by get_actual_variable_range to
cap scan overhead during planning is reworked to go through a new scan
descriptor interface (xs_visited_pages_limit), rather than tracking the
costs directly and terminating the scan itself, in an ad-hoc way.  This
is necessary because callers that use the new slot-based interface no
longer have direct access to which heap blocks were fetched by the table
AM.  Similarly, nodeIndexonlyscan.c can no longer use InstrCountTuples2
to count heap fetches during an EXPLAIN ANALYZE.  Instead it relies on
heapam maintaining a new IndexScanInstrumentation.ntablefetches field.
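
As a rough standalone illustration of the visited-pages cap (not the
patch's code): the sketch below walks a list of heap block numbers, one
per index entry whose heap fetch found no visible tuple, and gives up
once the number of block changes exceeds the limit.  The names
demo_within_page_limit and DEMO_INVALID_BLOCK are hypothetical, and the
counter is deliberately wider than the limit so that it cannot wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/*
 * Returns false ("give up") once more than 'limit' block changes have been
 * seen while walking 'blocks'.  A limit of 0 means "no limit", matching the
 * documented default of xs_visited_pages_limit.
 */
static bool
demo_within_page_limit(const uint32_t *blocks, size_t nblocks, uint8_t limit)
{
	uint32_t	last = DEMO_INVALID_BLOCK;
	unsigned int visited = 0;

	assert(blocks != NULL || nblocks == 0);

	for (size_t i = 0; i < nblocks; i++)
	{
		/* Consecutive entries on the same block are counted only once */
		if (limit > 0 && blocks[i] != last)
		{
			last = blocks[i];
			if (++visited > limit)
				return false;	/* give up */
		}
	}

	return true;
}
```

Note that this counts changes of block number rather than truly distinct
blocks, so a block that is revisited after an intervening block is
counted again.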

Though independently useful, this commit is preparatory work for an
upcoming commit that will add an amgetbatch index AM interface, where
the table AM takes full responsibility for managing the progress of
index scans.  That will move most of the implementation of scrollable
cursors out of index AMs and into table AMs, making it essential that
executor nodes pass the current scan direction down to the table AM.

The heapam implementations make aggressive use of forced inlining to
ensure that plain and index-only code paths are fully specialized at
compile time despite sharing a common implementation.  Testing has shown
this is necessary to keep icache misses to a minimum, at least with the
two upcoming amgetbatch variants.
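
The specialization technique can be sketched in isolation as follows.
This is an assumption-laden toy, not the heapam code: demo_always_inline
is a local stand-in for pg_attribute_always_inline, and the spec_*
functions simply return how many "heap fetches" a single TID would cost.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for a forced-inline attribute (GCC/Clang syntax assumed) */
#if defined(__GNUC__) || defined(__clang__)
#define demo_always_inline inline __attribute__((always_inline))
#else
#define demo_always_inline inline
#endif

/*
 * Shared body.  index_only is a literal constant at each call site below, so
 * once this is inlined the compiler can drop the branch that does not apply.
 */
static demo_always_inline int
spec_getnext_common(int tid, bool all_visible, bool index_only)
{
	assert(tid >= 0);

	if (index_only)
	{
		/* index-only path: no heap fetch when the page is all-visible */
		return all_visible ? 0 : 1;
	}

	/* plain path: always one heap fetch */
	return 1;
}

/* Thin wrappers; each one gets its own specialized copy of the body */
static int
spec_plain_next(int tid, bool all_visible)
{
	return spec_getnext_common(tid, all_visible, false);
}

static int
spec_only_next(int tid, bool all_visible)
{
	return spec_getnext_common(tid, all_visible, true);
}
```

The wrappers are what would be installed as callbacks; the common
function never exists as a separate out-of-line copy when the attribute
is honored.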

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com
---
 src/include/access/genam.h                 |   5 +-
 src/include/access/heapam.h                |  16 +-
 src/include/access/relscan.h               |  31 ++-
 src/include/access/tableam.h               | 152 +++++++----
 src/include/executor/instrument_node.h     |   5 +
 src/include/nodes/execnodes.h              |   2 -
 src/backend/access/heap/heapam_handler.c   |  11 +-
 src/backend/access/heap/heapam_indexscan.c | 286 +++++++++++++++++++--
 src/backend/access/heap/visibilitymap.c    |  27 +-
 src/backend/access/index/genam.c           |  11 +-
 src/backend/access/index/indexam.c         | 213 +++++----------
 src/backend/access/nbtree/nbtinsert.c      |  10 +-
 src/backend/access/table/tableam.c         |  23 +-
 src/backend/access/table/tableamapi.c      |   4 +-
 src/backend/commands/constraint.c          |  28 +-
 src/backend/commands/explain.c             |  23 +-
 src/backend/executor/execIndexing.c        |   8 +-
 src/backend/executor/execReplication.c     |  12 +-
 src/backend/executor/nodeBitmapIndexscan.c |   1 +
 src/backend/executor/nodeIndexonlyscan.c   | 109 +-------
 src/backend/executor/nodeIndexscan.c       |  13 +-
 src/backend/utils/adt/ri_triggers.c        |   4 +-
 src/backend/utils/adt/selfuncs.c           |  61 +----
 23 files changed, 586 insertions(+), 469 deletions(-)

diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index b69320a7f..db62e0ca1 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -156,6 +156,7 @@ extern void index_insert_cleanup(Relation indexRelation,
 
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
+									 bool index_only_scan,
 									 Snapshot snapshot,
 									 IndexScanInstrumentation *instrument,
 									 int nkeys, int norderbys,
@@ -183,15 +184,13 @@ extern void index_parallelscan_initialize(Relation heapRelation,
 extern void index_parallelrescan(IndexScanDesc scan);
 extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel,
+											  bool index_only_scan,
 											  IndexScanInstrumentation *instrument,
 											  int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan,
 											  uint32 flags);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
-extern bool index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot);
-extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-							   TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index cc90c821b..d76d3e7fd 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -430,15 +430,19 @@ extern TransactionId heap_index_delete_tuples(Relation rel,
 
 /* in heap/heapam_indexscan.c */
 extern IndexFetchTableData *heapam_index_fetch_begin(Relation rel, uint32 flags);
-extern void heapam_index_fetch_reset(IndexFetchTableData *scan);
-extern void heapam_index_fetch_end(IndexFetchTableData *scan);
+extern void heapam_index_fetch_reset(IndexScanDesc scan);
+extern void heapam_index_fetch_end(IndexScanDesc scan);
 extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
 								   Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
 								   bool *all_dead, bool first_call);
-extern bool heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
-									 ItemPointer tid, Snapshot snapshot,
-									 TupleTableSlot *slot, bool *heap_continue,
-									 bool *all_dead);
+extern bool heapam_index_plain_amgettuple_next(IndexScanDesc scan,
+											   ScanDirection direction,
+											   TupleTableSlot *slot);
+extern bool heapam_index_only_amgettuple_next(IndexScanDesc scan,
+											  ScanDirection direction,
+											  TupleTableSlot *slot);
+extern bool heapam_fetch_tid(Relation rel, ItemPointer tid, Snapshot snapshot,
+							 TupleTableSlot *slot, bool *all_dead);
 
 /* in heap/pruneheap.c */
 extern void heap_page_prune_opt(Relation relation, Buffer buffer,
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 960abf6c2..0ff158d5d 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "access/sdir.h"
 #include "nodes/tidbitmap.h"
 #include "port/atomics.h"
 #include "storage/relfilelocator.h"
@@ -24,6 +25,7 @@
 
 
 struct ParallelTableScanDescData;
+struct TupleTableSlot;
 
 /*
  * Generic descriptor for table scans. This is the base-class for table scans,
@@ -117,12 +119,13 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 /*
  * Base class for fetches from a table via an index. This is the base-class
  * for such scans, which needs to be embedded in the respective struct for
- * individual AMs.
+ * individual table AMs.
+ *
+ * This is essentially the table AM specific portion of IndexScanDescData,
+ * accessed through its xs_heapfetch field.
  */
 typedef struct IndexFetchTableData
 {
-	Relation	rel;
-
 	/*
 	 * Bitmask of ScanOptions affecting the relation. No SO_INTERNAL_FLAGS are
 	 * permitted.
@@ -166,10 +169,10 @@ typedef struct IndexScanDescData
 	struct IndexScanInstrumentation *instrument;
 
 	/*
-	 * In an index-only scan, a successful amgettuple call must fill either
-	 * xs_itup (and xs_itupdesc) or xs_hitup (and xs_hitupdesc) to provide the
-	 * data returned by the scan.  It can fill both, in which case the heap
-	 * format will be used.
+	 * In an index-only scan, a successful table_index_getnext_slot call must
+	 * fill either xs_itup (and xs_itupdesc) or xs_hitup (and xs_hitupdesc) to
+	 * provide the data returned by the scan.  It can fill both, in which case
+	 * the heap format will be used.
 	 */
 	IndexTuple	xs_itup;		/* index tuple returned by AM */
 	struct TupleDescData *xs_itupdesc;	/* rowtype descriptor of xs_itup */
@@ -181,6 +184,11 @@ typedef struct IndexScanDescData
 									 * further results */
 	IndexFetchTableData *xs_heapfetch;
 
+	/* Resolved index_*_next implementation, set by index_beginscan */
+	bool		(*xs_getnext_slot) (struct IndexScanDescData *scan,
+									ScanDirection direction,
+									struct TupleTableSlot *slot);
+
 	bool		xs_recheck;		/* T means scan keys must be rechecked */
 
 	/*
@@ -194,6 +202,13 @@ typedef struct IndexScanDescData
 	bool	   *xs_orderbynulls;
 	bool		xs_recheckorderby;
 
+	/*
+	 * An approximate limit on the amount of work, measured in pages touched,
+	 * imposed on the index scan.  The default, 0, means no limit.  Used by
+	 * selfuncs.c to bound the cost of get_actual_variable_endpoint().
+	 */
+	uint8		xs_visited_pages_limit;
+
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
 } IndexScanDescData;
@@ -208,8 +223,6 @@ typedef struct ParallelIndexScanDescData
 	char		ps_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
 }			ParallelIndexScanDescData;
 
-struct TupleTableSlot;
-
 /* Struct for storage-or-index scans of system tables */
 typedef struct SysScanDescData
 {
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4647785fd..4875d70ad 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -17,6 +17,7 @@
 #ifndef TABLEAM_H
 #define TABLEAM_H
 
+#include "access/genam.h"
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
@@ -450,46 +451,59 @@ typedef struct TableAmRoutine
 	 * flags is a bitmask of ScanOptions affecting underlying table scan
 	 * behavior. See scan_begin() for more information on passing these.
 	 *
-	 * Tuples for an index scan can then be fetched via index_fetch_tuple.
+	 * Tuples for an index scan can then be fetched via one of the
+	 * slot-based callbacks called through table_index_getnext_slot.
 	 */
 	struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, uint32 flags);
 
 	/*
-	 * Reset index fetch. Typically this will release cross index fetch
-	 * resources held in IndexFetchTableData.
+	 * Reset index fetch for a rescan.  Releases cross-fetch resources held in
+	 * IndexFetchTableData.
 	 */
-	void		(*index_fetch_reset) (struct IndexFetchTableData *data);
+	void		(*index_fetch_reset) (IndexScanDesc scan);
 
 	/*
 	 * Release resources and deallocate index fetch.
 	 */
-	void		(*index_fetch_end) (struct IndexFetchTableData *data);
+	void		(*index_fetch_end) (IndexScanDesc scan);
+
+	/*
+	 * Fetch the next tuple from an index scan, scanning in the specified
+	 * direction, and return true if a tuple was found, false otherwise.
+	 *
+	 * Two variants cover {plain, index-only} index scans that use amgettuple.
+	 * index_beginscan stores the chosen variant in scan->xs_getnext_slot;
+	 * table_index_getnext_slot() calls through that pointer directly.
+	 */
+	bool		(*index_plain_amgettuple_next) (IndexScanDesc scan,
+												ScanDirection direction,
+												TupleTableSlot *slot);
+	bool		(*index_only_amgettuple_next) (IndexScanDesc scan,
+											   ScanDirection direction,
+											   TupleTableSlot *slot);
 
 	/*
 	 * Fetch tuple at `tid` into `slot`, after doing a visibility test
 	 * according to `snapshot`. If a tuple was found and passed the visibility
 	 * test, return true, false otherwise.
 	 *
+	 * This is a lower-level callback for single-shot TID lookups used by
+	 * constraint enforcement code (unique checks and similar).
+	 *
 	 * Note that AMs that do not necessarily update indexes when indexed
 	 * columns do not change, need to return the current/correct version of
 	 * the tuple that is visible to the snapshot, even if the tid points to an
 	 * older version of the tuple.
 	 *
-	 * *call_again is false on the first call to index_fetch_tuple for a tid.
-	 * If there potentially is another tuple matching the tid, *call_again
-	 * needs to be set to true by index_fetch_tuple, signaling to the caller
-	 * that index_fetch_tuple should be called again for the same tid.
-	 *
-	 * *all_dead, if all_dead is not NULL, should be set to true by
-	 * index_fetch_tuple iff it is guaranteed that no backend needs to see
-	 * that tuple. Index AMs can use that to avoid returning that tid in
-	 * future searches.
+	 * *all_dead, if all_dead is not NULL, should be set to true by fetch_tid
+	 * iff it is guaranteed that no backend needs to see that tuple. Index AMs
+	 * can use that to avoid returning that tid in future searches.
 	 */
-	bool		(*index_fetch_tuple) (struct IndexFetchTableData *scan,
-									  ItemPointer tid,
-									  Snapshot snapshot,
-									  TupleTableSlot *slot,
-									  bool *call_again, bool *all_dead);
+	bool		(*fetch_tid) (Relation rel,
+							  ItemPointer tid,
+							  Snapshot snapshot,
+							  TupleTableSlot *slot,
+							  bool *all_dead);
 
 
 	/* ------------------------------------------------------------------------
@@ -1235,7 +1249,7 @@ table_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
  *
  * flags is a bitmask of ScanOptions. No SO_INTERNAL_FLAGS are permitted.
  *
- * Tuples for an index scan can then be fetched via table_index_fetch_tuple().
+ * Tuples for an index scan can then be fetched via table_index_getnext_slot().
  */
 static inline IndexFetchTableData *
 table_index_fetch_begin(Relation rel, uint32 flags)
@@ -1255,39 +1269,63 @@ table_index_fetch_begin(Relation rel, uint32 flags)
 
 /*
  * Reset index fetch. Typically this will release cross index fetch resources
- * held in IndexFetchTableData.
+ * held in the scan's underlying IndexFetchTableData.
  */
 static inline void
-table_index_fetch_reset(struct IndexFetchTableData *scan)
+table_index_fetch_reset(IndexScanDesc scan)
 {
-	scan->rel->rd_tableam->index_fetch_reset(scan);
+	Assert(scan->xs_heapfetch);
+
+	scan->heapRelation->rd_tableam->index_fetch_reset(scan);
 }
 
 /*
- * Release resources and deallocate index fetch.
+ * Release resources and deallocate index fetch held in the scan's underlying
+ * IndexFetchTableData.
  */
 static inline void
-table_index_fetch_end(struct IndexFetchTableData *scan)
+table_index_fetch_end(IndexScanDesc scan)
 {
-	scan->rel->rd_tableam->index_fetch_end(scan);
+	Assert(scan->xs_heapfetch);
+
+	scan->heapRelation->rd_tableam->index_fetch_end(scan);
 }
 
 /*
- * Fetches, as part of an index scan, tuple at `tid` into `slot`, after doing
- * a visibility test according to `snapshot`. If a tuple was found and passed
- * the visibility test, returns true, false otherwise. Note that *tid may be
- * modified when we return true (see later remarks on multiple row versions
- * reachable via a single index entry).
+ * Fetch the next tuple from an index scan into `slot`, scanning in the
+ * specified direction.  Returns true if a tuple satisfying the scan keys and
+ * the snapshot was found, false otherwise.  The tuple is stored in the
+ * specified slot.
  *
- * *call_again needs to be false on the first call to table_index_fetch_tuple() for
- * a tid. If there potentially is another tuple matching the tid, *call_again
- * will be set to true, signaling that table_index_fetch_tuple() should be called
- * again for the same tid.
+ * Dispatches through scan->xs_getnext_slot, which is resolved once by
+ * index_beginscan.
  *
- * *all_dead, if all_dead is not NULL, will be set to true by
- * table_index_fetch_tuple() iff it is guaranteed that no backend needs to see
- * that tuple. Index AMs can use that to avoid returning that tid in future
- * searches.
+ * On success, resources (like buffer pins) are likely to be held, and will be
+ * released by a future table_index_getnext_slot or index_endscan call.
+ *
+ * Note: caller must check scan->xs_recheck, and perform rechecking of the
+ * scan keys if required.  We do not do that here because we don't have
+ * enough information to do it efficiently in the general case.
+ *
+ * For index-only scans, the callback also fills xs_itup/xs_itupdesc or
+ * xs_hitup/xs_hitupdesc (or both) so that index data can be returned without
+ * a heap fetch.
+ */
+static inline bool
+table_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
+						 TupleTableSlot *slot)
+{
+	Assert(scan->xs_heapfetch);
+
+	return scan->xs_getnext_slot(scan, direction, slot);
+}
+
+/*
+ * Fetch tuple at `tid` into `slot`, after doing a visibility test according
+ * to `snapshot`. If a tuple was found and passed the visibility test, returns
+ * true, false otherwise. Note that *tid may be modified when we return true
+ * (see later remarks on multiple row versions reachable via a single index
+ * entry).
  *
  * The difference between this function and table_tuple_fetch_row_version()
  * is that this function returns the currently visible version of a row if
@@ -1295,29 +1333,35 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
  * entry (like heap's HOT). Whereas table_tuple_fetch_row_version() only
  * evaluates the tuple exactly at `tid`. Outside of index entry ->table tuple
  * lookups, table_tuple_fetch_row_version() is what's usually needed.
+ *
+ * *all_dead, if all_dead is not NULL, will be set to true by
+ * table_fetch_tid() iff it is guaranteed that no backend needs to see that
+ * tuple. Index AMs can use that to avoid returning that tid in future
+ * searches.
+ *
+ * This is a lower-level interface for single-shot TID lookups used by
+ * constraint enforcement code.
  */
 static inline bool
-table_index_fetch_tuple(struct IndexFetchTableData *scan,
-						ItemPointer tid,
-						Snapshot snapshot,
-						TupleTableSlot *slot,
-						bool *call_again, bool *all_dead)
+table_fetch_tid(Relation rel,
+				ItemPointer tid,
+				Snapshot snapshot,
+				TupleTableSlot *slot,
+				bool *all_dead)
 {
-	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
-													slot, call_again,
-													all_dead);
+	return rel->rd_tableam->fetch_tid(rel, tid, snapshot, slot, all_dead);
 }
 
 /*
- * This is a convenience wrapper around table_index_fetch_tuple() which
+ * This is a convenience wrapper around table_fetch_tid() which
  * returns whether there are table tuple items corresponding to an index
  * entry.  This likely is only useful to verify if there's a conflict in a
  * unique index.
  */
-extern bool table_index_fetch_tuple_check(Relation rel,
-										  ItemPointer tid,
-										  Snapshot snapshot,
-										  bool *all_dead);
+extern bool table_fetch_tid_check(Relation rel,
+								  ItemPointer tid,
+								  Snapshot snapshot,
+								  bool *all_dead);
 
 
 /* ------------------------------------------------------------------------
@@ -1331,8 +1375,8 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  * `snapshot`. If a tuple was found and passed the visibility test, returns
  * true, false otherwise.
  *
- * See table_index_fetch_tuple's comment about what the difference between
- * these functions is. It is correct to use this function outside of index
+ * See table_fetch_tid's comment about what the difference between these
+ * functions is. It is correct to use this function outside of index
  * entry->table tuple lookups.
  */
 static inline bool
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 2a0ff377a..ccc07bcda 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -48,6 +48,11 @@ typedef struct IndexScanInstrumentation
 {
 	/* Index search count (incremented with pgstat_count_index_scan call) */
 	uint64		nsearches;
+
+	/*
+	 * table fetch count (incremented per heap fetch in index-only scans)
+	 */
+	uint64		ntablefetches;
 } IndexScanInstrumentation;
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 090cfccf6..0b18e74ca 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1785,7 +1785,6 @@ typedef struct IndexScanState
  *		Instrument		   local index scan instrumentation
  *		SharedInfo		   parallel worker instrumentation (no leader entry)
  *		TableSlot		   slot for holding tuples fetched from the table
- *		VMBuffer		   buffer in use for visibility map testing, if any
  *		PscanLen		   size of parallel index-only scan descriptor
  *		NameCStringAttNums attnums of name typed columns to pad to NAMEDATALEN
  *		NameCStringCount   number of elements in the NameCStringAttNums array
@@ -1808,7 +1807,6 @@ typedef struct IndexOnlyScanState
 	IndexScanInstrumentation *ioss_Instrument;
 	SharedIndexScanInstrumentation *ioss_SharedInfo;
 	TupleTableSlot *ioss_TableSlot;
-	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
 	AttrNumber *ioss_NameCStringAttNums;
 	int			ioss_NameCStringCount;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 07f07188d..f96b42709 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -657,8 +657,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0,
-									SO_NONE);
+		indexScan = index_beginscan(OldHeap, OldIndex, false, SnapshotAny,
+									NULL, 0, 0, SO_NONE);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
@@ -696,7 +696,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		if (indexScan != NULL)
 		{
-			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot))
+			if (!table_index_getnext_slot(indexScan, ForwardScanDirection,
+										  slot))
 				break;
 
 			/* Since we used no scan keys, should never need to recheck */
@@ -2556,7 +2557,9 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_begin = heapam_index_fetch_begin,
 	.index_fetch_reset = heapam_index_fetch_reset,
 	.index_fetch_end = heapam_index_fetch_end,
-	.index_fetch_tuple = heapam_index_fetch_tuple,
+	.index_plain_amgettuple_next = heapam_index_plain_amgettuple_next,
+	.index_only_amgettuple_next = heapam_index_only_amgettuple_next,
+	.fetch_tid = heapam_fetch_tid,
 
 	.tuple_insert = heapam_tuple_insert,
 	.tuple_insert_speculative = heapam_tuple_insert_speculative,
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index c36b804d1..23635b6bc 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -14,11 +14,30 @@
  */
 #include "postgres.h"
 
+#include "access/amapi.h"
 #include "access/heapam.h"
 #include "access/relscan.h"
+#include "access/visibilitymap.h"
 #include "storage/predicate.h"
+#include "utils/pgstat_internal.h"
 
 
+static pg_attribute_always_inline bool heapam_index_fetch_tuple_impl(Relation rel,
+																	 IndexFetchHeapData *hscan,
+																	 ItemPointer tid,
+																	 Snapshot snapshot,
+																	 TupleTableSlot *slot,
+																	 bool *heap_continue,
+																	 bool *all_dead);
+static pg_attribute_always_inline bool heapam_index_getnext_slot(IndexScanDesc scan,
+																 ScanDirection direction,
+																 TupleTableSlot *slot,
+																 bool index_only);
+static pg_attribute_always_inline bool heapam_index_fetch_heap(IndexScanDesc scan,
+															   IndexFetchHeapData *hscan,
+															   TupleTableSlot *slot,
+															   bool *heap_continue);
+
 /* ------------------------------------------------------------------------
  * Index Scan Callbacks for heap AM
  * ------------------------------------------------------------------------
@@ -29,18 +48,21 @@ heapam_index_fetch_begin(Relation rel, uint32 flags)
 {
 	IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);
 
-	hscan->xs_base.rel = rel;
 	hscan->xs_base.flags = flags;
 	hscan->xs_cbuf = InvalidBuffer;
 	hscan->xs_vmbuffer = InvalidBuffer;
 
+	/*
+	 * Return opaque state, which we'll access through the scan's xs_heapfetch
+	 * field later on
+	 */
 	return &hscan->xs_base;
 }
 
 void
-heapam_index_fetch_reset(IndexFetchTableData *scan)
+heapam_index_fetch_reset(IndexScanDesc scan)
 {
-	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
 
 	if (BufferIsValid(hscan->xs_cbuf))
 	{
@@ -56,9 +78,9 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
 }
 
 void
-heapam_index_fetch_end(IndexFetchTableData *scan)
+heapam_index_fetch_end(IndexScanDesc scan)
 {
-	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
 
 	heapam_index_fetch_reset(scan);
 
@@ -227,14 +249,76 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	return false;
 }
 
-bool
-heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
-						 ItemPointer tid,
-						 Snapshot snapshot,
-						 TupleTableSlot *slot,
-						 bool *heap_continue, bool *all_dead)
+/* table_index_getnext_slot callback: amgettuple, plain index scan */
+pg_attribute_hot bool
+heapam_index_plain_amgettuple_next(IndexScanDesc scan,
+								   ScanDirection direction,
+								   TupleTableSlot *slot)
+{
+	Assert(!scan->xs_want_itup);
+	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
+
+	return heapam_index_getnext_slot(scan, direction, slot, false);
+}
+
+/* table_index_getnext_slot callback: amgettuple, index-only scan */
+pg_attribute_hot bool
+heapam_index_only_amgettuple_next(IndexScanDesc scan,
+								  ScanDirection direction,
+								  TupleTableSlot *slot)
+{
+	Assert(scan->xs_want_itup);
+	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
+
+	return heapam_index_getnext_slot(scan, direction, slot, true);
+}
+
+/*
+ * Simple, single-shot TID lookup for constraint enforcement code (unique
+ * checks and similar).  This is essentially just a heap_hot_search_buffer
+ * wrapper.
+ *
+ * This doesn't actually perform index scans.  But this is just as good a
+ * place for it as any other.
+ */
+bool
+heapam_fetch_tid(Relation rel, ItemPointer tid, Snapshot snapshot,
+				 TupleTableSlot *slot, bool *all_dead)
+{
+	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	Buffer		buf;
+	bool		found;
+
+	Assert(TTS_IS_BUFFERTUPLE(slot));
+
+	buf = ReadBuffer(rel, ItemPointerGetBlockNumber(tid));
+
+	LockBuffer(buf, BUFFER_LOCK_SHARE);
+	found = heap_hot_search_buffer(tid, rel, buf, snapshot,
+								   &bslot->base.tupdata, all_dead, true);
+	bslot->base.tupdata.t_self = *tid;
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+	if (found)
+	{
+		slot->tts_tableOid = RelationGetRelid(rel);
+		ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata, slot,
+									   buf);
+	}
+	else
+		ReleaseBuffer(buf);
+
+	return found;
+}
+
+static pg_attribute_always_inline bool
+heapam_index_fetch_tuple_impl(Relation rel,
+							  IndexFetchHeapData *hscan,
+							  ItemPointer tid,
+							  Snapshot snapshot,
+							  TupleTableSlot *slot,
+							  bool *heap_continue, bool *all_dead)
 {
-	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	bool		got_heap_tuple;
 
@@ -246,23 +330,21 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		/* Switch to correct buffer if we don't have it already */
 		Buffer		prev_buf = hscan->xs_cbuf;
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
+		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf, rel,
 											  ItemPointerGetBlockNumber(tid));
 
 		/*
 		 * Prune page, but only if we weren't already on this page
 		 */
 		if (prev_buf != hscan->xs_cbuf)
-			heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf,
-								&hscan->xs_vmbuffer,
+			heap_page_prune_opt(rel, hscan->xs_cbuf, &hscan->xs_vmbuffer,
 								hscan->xs_base.flags & SO_HINT_REL_READ_ONLY);
 	}
 
 	/* Obtain share-lock on the buffer so we can examine visibility */
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
 	got_heap_tuple = heap_hot_search_buffer(tid,
-											hscan->xs_base.rel,
+											rel,
 											hscan->xs_cbuf,
 											snapshot,
 											&bslot->base.tupdata,
@@ -279,7 +361,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		 */
 		*heap_continue = !IsMVCCLikeSnapshot(snapshot);
 
-		slot->tts_tableOid = RelationGetRelid(scan->rel);
+		slot->tts_tableOid = RelationGetRelid(rel);
 		ExecStoreBufferHeapTuple(&bslot->base.tupdata, slot, hscan->xs_cbuf);
 	}
 	else
@@ -290,3 +372,171 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 	return got_heap_tuple;
 }
+
+/*
+ * Common implementation for both heapam_index_*_amgettuple_next variants.
+ *
+ * The result is true if a tuple satisfying the scan keys and the snapshot was
+ * found, false otherwise.  The tuple is stored in the specified slot.
+ *
+ * On success, resources (like buffer pins) are likely to be held, and will be
+ * dropped by a future call here (or by a later call to heapam_index_fetch_end
+ * through index_endscan).
+ *
+ * The index_only parameter is a compile-time constant at each call site,
+ * allowing the compiler to specialize the code for each variant.
+ */
+static pg_attribute_always_inline bool
+heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
+						  TupleTableSlot *slot, bool index_only)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
+	bool	   *heap_continue = &scan->xs_heap_continue;
+	bool		all_visible = false;
+	BlockNumber last_visited_block = InvalidBlockNumber;
+	uint8		n_visited_pages = 0;
+	ItemPointer tid = NULL;
+
+	for (;;)
+	{
+		if (!*heap_continue)
+		{
+			/* Get the next TID from the index */
+			tid = index_getnext_tid(scan, direction);
+
+			/* If we're out of index entries, we're done */
+			if (tid == NULL)
+				break;
+
+			/* For index-only scans, check the visibility map */
+			if (index_only)
+				all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+											 ItemPointerGetBlockNumber(tid),
+											 &hscan->xs_vmbuffer);
+		}
+
+		Assert(ItemPointerIsValid(&scan->xs_heaptid));
+
+		if (index_only)
+		{
+			/*
+			 * We can skip the heap fetch if the TID references a heap page on
+			 * which all tuples are known visible to everybody.  In any case,
+			 * we'll use the index tuple not the heap tuple as the data
+			 * source.
+			 */
+			if (!all_visible)
+			{
+				/*
+				 * Rats, we have to visit the heap to check visibility.
+				 */
+				if (scan->instrument)
+					scan->instrument->ntablefetches++;
+
+				if (!heapam_index_fetch_heap(scan, hscan, slot,
+											 heap_continue))
+				{
+					/*
+					 * No visible tuple.  If caller set a visited-pages limit
+					 * (only selfuncs.c does this), count distinct heap pages
+					 * and give up once we've visited too many.
+					 */
+					if (unlikely(scan->xs_visited_pages_limit > 0))
+					{
+						BlockNumber blk = ItemPointerGetBlockNumber(tid);
+
+						if (blk != last_visited_block)
+						{
+							last_visited_block = blk;
+							if (++n_visited_pages > scan->xs_visited_pages_limit)
+								return false;	/* give up */
+						}
+					}
+					continue;	/* no visible tuple, try next index entry */
+				}
+
+				/* We don't actually need the heap tuple for anything */
+				ExecClearTuple(slot);
+
+				/*
+				 * Only MVCC snapshots are supported with standard index-only
+				 * scans, so there should be no need to keep following the HOT
+				 * chain once a visible entry has been found.  Other callers
+				 * (currently only selfuncs.c) use SnapshotNonVacuumable, and
+				 * want us to assume that just having one visible tuple in the
+				 * HOT chain is always good enough.
+				 */
+				Assert(!(*heap_continue && IsMVCCSnapshot(scan->xs_snapshot)));
+			}
+			else
+			{
+				/*
+				 * We didn't access the heap, so we'll need to take a
+				 * predicate lock explicitly, as if we had.  For now we do
+				 * that at page level.
+				 */
+				PredicateLockPage(scan->heapRelation,
+								  ItemPointerGetBlockNumber(tid),
+								  scan->xs_snapshot);
+			}
+
+			/*
+			 * Return matching index tuple now set in scan->xs_itup (or return
+			 * matching heap tuple now set in scan->xs_hitup)
+			 */
+			return true;
+		}
+		else
+		{
+			/*
+			 * Fetch the next (or only) visible heap tuple for this index
+			 * entry.  If we don't find anything, loop around and grab the
+			 * next TID from the index.
+			 */
+			if (heapam_index_fetch_heap(scan, hscan, slot, heap_continue))
+				return true;
+		}
+	}
+
+	return false;
+}
+
+/*
+ * Get the scan's next heap tuple.
+ *
+ * Returns true if a visible heap tuple associated with the index TID most
+ * recently fetched by our caller (in scan->xs_heaptid) was stored in the
+ * slot, or false if no more matching tuples exist.  (There can be more than
+ * one matching tuple because of HOT chains, although when using an MVCC
+ * snapshot it should be impossible for more than one such tuple to exist.)
+ *
+ * On success, the buffer containing the heap tuple is pinned.  The pin must
+ * be dropped elsewhere.
+ */
+static pg_attribute_always_inline bool
+heapam_index_fetch_heap(IndexScanDesc scan, IndexFetchHeapData *hscan,
+						TupleTableSlot *slot, bool *heap_continue)
+{
+	bool		all_dead = false;
+	bool		found;
+
+	found = heapam_index_fetch_tuple_impl(scan->heapRelation, hscan,
+										  &scan->xs_heaptid,
+										  scan->xs_snapshot, slot,
+										  heap_continue, &all_dead);
+
+	if (found)
+		pgstat_count_heap_fetch(scan->indexRelation);
+
+	/*
+	 * If we scanned a whole HOT chain and found only dead tuples, tell index
+	 * AM to kill its entry for that TID (this will take effect in the next
+	 * amgettuple call, in index_getnext_tid).  We do not do this when in
+	 * recovery because it may violate MVCC to do so.  See comments in
+	 * RelationGetIndexScan().
+	 */
+	if (!scan->xactStartedInRecovery)
+		scan->kill_prior_tuple = all_dead;
+
+	return found;
+}
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4fd470702..4ba9f48e9 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -313,7 +313,32 @@ visibilitymap_set(BlockNumber heapBlk,
  * since we don't lock the visibility map page either, it's even possible that
  * someone else could have changed the bit just before we look at it, but yet
  * we might see the old value.  It is the caller's responsibility to deal with
- * all concurrency issues!
+ * all concurrency issues!  In practice it can't be stale enough to matter for
+ * the primary use case: index-only scans that check whether a heap fetch can
+ * be skipped.
+ *
+ * The argument for why that holds is as follows, considering inserts and
+ * deletes separately:
+ *
+ * Inserts: we need to detect that a VM bit was cleared by an insert right
+ * away, because the new tuple is present in the index but not yet visible.
+ * Reading the TID from the index page (under a shared lock on the index
+ * buffer) is serialized with the insertion of the TID into the index (under
+ * an exclusive lock on the same index buffer).  Because the VM bit is cleared
+ * before the index is updated, and locking/unlocking of the index page acts
+ * as a full memory barrier, we are sure to see the cleared bit whenever we
+ * see a recently-inserted TID.
+ *
+ * Deletes: the clearing of the VM bit by a delete is NOT serialized with the
+ * index page access, because deletes do not update the index page (only
+ * VACUUM removes the index TID).  So we may see a significantly stale value.
+ * However, we don't need to detect the delete right away, because the tuple
+ * remains visible until the deleting transaction commits or the statement
+ * ends (if it's our own transaction).  In either case, the lock on the VM
+ * buffer will have been released (acting as a write barrier) after clearing
+ * the bit.  And for us to have a snapshot that includes the deleting
+ * transaction (making the tuple invisible), we must have acquired
+ * ProcArrayLock after that time, acting as a read barrier.
  */
 uint8
 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 1408989c5..acc9f3e6a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
 
+	scan->xs_visited_pages_limit = 0;
+
 	return scan;
 }
 
@@ -454,7 +456,7 @@ systable_beginscan(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
-		sysscan->iscan = index_beginscan(heapRelation, irel,
+		sysscan->iscan = index_beginscan(heapRelation, irel, false,
 										 snapshot, NULL, nkeys, 0,
 										 SO_NONE);
 		index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
@@ -518,7 +520,8 @@ systable_getnext(SysScanDesc sysscan)
 
 	if (sysscan->irel)
 	{
-		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot))
+		if (table_index_getnext_slot(sysscan->iscan, ForwardScanDirection,
+									 sysscan->slot))
 		{
 			bool		shouldFree;
 
@@ -716,7 +719,7 @@ systable_beginscan_ordered(Relation heapRelation,
 	if (TransactionIdIsValid(CheckXidAlive))
 		bsysscan = true;
 
-	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
+	sysscan->iscan = index_beginscan(heapRelation, indexRelation, false,
 									 snapshot, NULL, nkeys, 0,
 									 SO_NONE);
 	index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
@@ -736,7 +739,7 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	HeapTuple	htup = NULL;
 
 	Assert(sysscan->irel);
-	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
+	if (table_index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
 		htup = ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
 
 	/* See notes in systable_getnext */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 44496ae09..5d5e6b6a9 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -24,9 +24,7 @@
  *		index_parallelscan_initialize - initialize parallel scan
  *		index_parallelrescan  - (re)start a parallel scan of an index
  *		index_beginscan_parallel - join parallel index scan
- *		index_getnext_tid	- get the next TID from a scan
- *		index_fetch_heap		- get the scan's next heap tuple
- *		index_getnext_slot	- get the next tuple from a scan
+ *		index_getnext_tid	- amgettuple table AM helper routine
  *		index_getbitmap - get all tuples from a scan
  *		index_bulk_delete	- bulk deletion of index tuples
  *		index_vacuum_cleanup	- post-deletion cleanup of an index
@@ -105,9 +103,16 @@ do { \
 			 CppAsString(pname), RelationGetRelationName(scan->indexRelation)); \
 } while(0)
 
-static IndexScanDesc index_beginscan_internal(Relation indexRelation,
-											  int nkeys, int norderbys, Snapshot snapshot,
-											  ParallelIndexScanDesc pscan, bool temp_snap);
+static pg_attribute_always_inline IndexScanDesc index_beginscan_internal(Relation indexRelation,
+																		 int nkeys,
+																		 int norderbys,
+																		 Snapshot snapshot,
+																		 ParallelIndexScanDesc pscan,
+																		 bool temp_snap,
+																		 Relation heapRelation,
+																		 bool index_only_scan,
+																		 IndexScanInstrumentation *instrument,
+																		 uint32 flags);
 static inline void validate_relation_as_index(Relation r);
 
 
@@ -256,13 +261,12 @@ index_insert_cleanup(Relation indexRelation,
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
+				bool index_only_scan,
 				Snapshot snapshot,
 				IndexScanInstrumentation *instrument,
 				int nkeys, int norderbys,
 				uint32 flags)
 {
-	IndexScanDesc scan;
-
 	Assert(snapshot != InvalidSnapshot);
 
 	/* Check that a historic snapshot is not used for non-catalog tables */
@@ -275,20 +279,9 @@ index_beginscan(Relation heapRelation,
 						RelationGetRelationName(heapRelation))));
 	}
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
-
-	/*
-	 * Save additional parameters into the scandesc.  Everything else was set
-	 * up by RelationGetIndexScan.
-	 */
-	scan->heapRelation = heapRelation;
-	scan->xs_snapshot = snapshot;
-	scan->instrument = instrument;
-
-	/* prepare to fetch index matches from table */
-	scan->xs_heapfetch = table_index_fetch_begin(heapRelation, flags);
-
-	return scan;
+	return index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot,
+									NULL, false, heapRelation,
+									index_only_scan, instrument, flags);
 }
 
 /*
@@ -303,29 +296,26 @@ index_beginscan_bitmap(Relation indexRelation,
 					   IndexScanInstrumentation *instrument,
 					   int nkeys)
 {
-	IndexScanDesc scan;
-
 	Assert(snapshot != InvalidSnapshot);
+	Assert(IsMVCCLikeSnapshot(snapshot));
 
-	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
-
-	/*
-	 * Save additional parameters into the scandesc.  Everything else was set
-	 * up by RelationGetIndexScan.
-	 */
-	scan->xs_snapshot = snapshot;
-	scan->instrument = instrument;
-
-	return scan;
+	return index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL,
+									false, NULL, false, instrument, SO_NONE);
 }
 
 /*
  * index_beginscan_internal --- common code for index_beginscan variants
+ *
+ * When heapRelation is not NULL, also initializes heap-side scan state:
+ * getnext_slot resolution and table fetch initialization.
  */
-static IndexScanDesc
+static pg_attribute_always_inline IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
 						 int nkeys, int norderbys, Snapshot snapshot,
-						 ParallelIndexScanDesc pscan, bool temp_snap)
+						 ParallelIndexScanDesc pscan, bool temp_snap,
+						 Relation heapRelation, bool index_only_scan,
+						 IndexScanInstrumentation *instrument,
+						 uint32 flags)
 {
 	IndexScanDesc scan;
 
@@ -349,6 +339,31 @@ index_beginscan_internal(Relation indexRelation,
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
 
+	scan->xs_snapshot = snapshot;
+	scan->instrument = instrument;
+
+	/*
+	 * Initialize heap-side scan state when a heap relation is provided.
+	 * Bitmap index scans don't provide one.
+	 */
+	if (heapRelation != NULL)
+	{
+		scan->heapRelation = heapRelation;
+		scan->xs_want_itup = index_only_scan;
+		scan->xs_heap_continue = false;
+
+		/* Resolve which getnext_slot implementation to use for this scan */
+		if (index_only_scan)
+			scan->xs_getnext_slot =
+				heapRelation->rd_tableam->index_only_amgettuple_next;
+		else
+			scan->xs_getnext_slot =
+				heapRelation->rd_tableam->index_plain_amgettuple_next;
+
+		/* prepare to fetch index matches from table */
+		scan->xs_heapfetch = table_index_fetch_begin(heapRelation, flags);
+	}
+
 	return scan;
 }
 
@@ -377,7 +392,7 @@ index_rescan(IndexScanDesc scan,
 
 	/* Release resources (like buffer pins) from table accesses */
 	if (scan->xs_heapfetch)
-		table_index_fetch_reset(scan->xs_heapfetch);
+		table_index_fetch_reset(scan);
 
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
@@ -399,7 +414,7 @@ index_endscan(IndexScanDesc scan)
 	/* Release resources (like buffer pins) from table accesses */
 	if (scan->xs_heapfetch)
 	{
-		table_index_fetch_end(scan->xs_heapfetch);
+		table_index_fetch_end(scan);
 		scan->xs_heapfetch = NULL;
 	}
 
@@ -454,7 +469,7 @@ index_restrpos(IndexScanDesc scan)
 
 	/* release resources (like buffer pins) from table accesses */
 	if (scan->xs_heapfetch)
-		table_index_fetch_reset(scan->xs_heapfetch);
+		table_index_fetch_reset(scan);
 
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
@@ -579,7 +594,7 @@ index_parallelrescan(IndexScanDesc scan)
 	SCAN_CHECKS;
 
 	if (scan->xs_heapfetch)
-		table_index_fetch_reset(scan->xs_heapfetch);
+		table_index_fetch_reset(scan);
 
 	/* amparallelrescan is optional; assume no-op if not provided by AM */
 	if (scan->indexRelation->rd_indam->amparallelrescan != NULL)
@@ -596,41 +611,34 @@ index_parallelrescan(IndexScanDesc scan)
  */
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel,
+						 bool index_only_scan,
 						 IndexScanInstrumentation *instrument,
 						 int nkeys, int norderbys,
 						 ParallelIndexScanDesc pscan,
 						 uint32 flags)
 {
 	Snapshot	snapshot;
-	IndexScanDesc scan;
 
 	Assert(RelFileLocatorEquals(heaprel->rd_locator, pscan->ps_locator));
 	Assert(RelFileLocatorEquals(indexrel->rd_locator, pscan->ps_indexlocator));
 
 	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
 	RegisterSnapshot(snapshot);
-	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
-									pscan, true);
 
-	/*
-	 * Save additional parameters into the scandesc.  Everything else was set
-	 * up by index_beginscan_internal.
-	 */
-	scan->heapRelation = heaprel;
-	scan->xs_snapshot = snapshot;
-	scan->instrument = instrument;
-
-	/* prepare to fetch index matches from table */
-	scan->xs_heapfetch = table_index_fetch_begin(heaprel, flags);
-
-	return scan;
+	return index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
+									pscan, true, heaprel, index_only_scan,
+									instrument, flags);
 }
 
 /* ----------------
- * index_getnext_tid - get the next TID from a scan
+ * index_getnext_tid - amgettuple interface
  *
  * The result is the next TID satisfying the scan keys,
  * or NULL if no more matching tuples exist.
+ *
+ * This should only be called by a table AM's index_getnext_slot
+ * implementation, and only given an index AM that supports the single-tuple
+ * amgettuple interface.
  * ----------------
  */
 ItemPointer
@@ -661,7 +669,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	{
 		/* release resources (like buffer pins) from table accesses */
 		if (scan->xs_heapfetch)
-			table_index_fetch_reset(scan->xs_heapfetch);
+			table_index_fetch_reset(scan);
 
 		return NULL;
 	}
@@ -673,97 +681,6 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	return &scan->xs_heaptid;
 }
 
-/* ----------------
- *		index_fetch_heap - get the scan's next heap tuple
- *
- * The result is a visible heap tuple associated with the index TID most
- * recently fetched by index_getnext_tid, or NULL if no more matching tuples
- * exist.  (There can be more than one matching tuple because of HOT chains,
- * although when using an MVCC snapshot it should be impossible for more than
- * one such tuple to exist.)
- *
- * On success, the buffer containing the heap tup is pinned (the pin will be
- * dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
- * call).
- *
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
- * scan keys if required.  We do not do that here because we don't have
- * enough information to do it efficiently in the general case.
- * ----------------
- */
-bool
-index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
-{
-	bool		all_dead = false;
-	bool		found;
-
-	found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
-									scan->xs_snapshot, slot,
-									&scan->xs_heap_continue, &all_dead);
-
-	if (found)
-		pgstat_count_heap_fetch(scan->indexRelation);
-
-	/*
-	 * If we scanned a whole HOT chain and found only dead tuples, tell index
-	 * AM to kill its entry for that TID (this will take effect in the next
-	 * amgettuple call, in index_getnext_tid).  We do not do this when in
-	 * recovery because it may violate MVCC to do so.  See comments in
-	 * RelationGetIndexScan().
-	 */
-	if (!scan->xactStartedInRecovery)
-		scan->kill_prior_tuple = all_dead;
-
-	return found;
-}
-
-/* ----------------
- *		index_getnext_slot - get the next tuple from a scan
- *
- * The result is true if a tuple satisfying the scan keys and the snapshot was
- * found, false otherwise.  The tuple is stored in the specified slot.
- *
- * On success, resources (like buffer pins) are likely to be held, and will be
- * dropped by a future index_getnext_tid, index_fetch_heap or index_endscan
- * call).
- *
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
- * scan keys if required.  We do not do that here because we don't have
- * enough information to do it efficiently in the general case.
- * ----------------
- */
-bool
-index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
-{
-	for (;;)
-	{
-		if (!scan->xs_heap_continue)
-		{
-			ItemPointer tid;
-
-			/* Time to fetch the next TID from the index */
-			tid = index_getnext_tid(scan, direction);
-
-			/* If we're out of index entries, we're done */
-			if (tid == NULL)
-				break;
-
-			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
-		}
-
-		/*
-		 * Fetch the next (or only) visible heap tuple for this index entry.
-		 * If we don't find anything, loop around and grab the next TID from
-		 * the index.
-		 */
-		Assert(ItemPointerIsValid(&scan->xs_heaptid));
-		if (index_fetch_heap(scan, slot))
-			return true;
-	}
-
-	return false;
-}
-
 /* ----------------
  *		index_getbitmap - get all tuples at once from an index scan
  *
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index c8af97dd2..f1b55fb20 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -560,9 +560,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 * with optimizations like heap's HOT, we have just a single
 				 * index entry for the entire chain.
 				 */
-				else if (table_index_fetch_tuple_check(heapRel, &htid,
-													   &SnapshotDirty,
-													   &all_dead))
+				else if (table_fetch_tid_check(heapRel, &htid,
+											   &SnapshotDirty,
+											   &all_dead))
 				{
 					TransactionId xwait;
 
@@ -618,8 +618,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * entry.
 					 */
 					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
-													  SnapshotSelf, NULL))
+					if (table_fetch_tid_check(heapRel, &htid,
+											  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
 					}
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 68ff0966f..acf1b0faa 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -228,32 +228,25 @@ table_beginscan_parallel_tidrange(Relation relation,
  */
 
 /*
- * To perform that check simply start an index scan, create the necessary
- * slot, do the heap lookup, and shut everything down again. This could be
- * optimized, but is unlikely to matter from a performance POV. If there
- * frequently are live index pointers also matching a unique index key, the
- * CPU overhead of this routine is unlikely to matter.
+ * Convenience wrapper around table_fetch_tid() for callers that only need to
+ * know whether a live table tuple exists for a given TID.  This is used to
+ * verify if there's a conflict in a unique index.
  *
  * Note that *tid may be modified when we return true if the AM supports
  * storing multiple row versions reachable via a single index entry (like
  * heap's HOT).
  */
 bool
-table_index_fetch_tuple_check(Relation rel,
-							  ItemPointer tid,
-							  Snapshot snapshot,
-							  bool *all_dead)
+table_fetch_tid_check(Relation rel,
+					  ItemPointer tid,
+					  Snapshot snapshot,
+					  bool *all_dead)
 {
-	IndexFetchTableData *scan;
 	TupleTableSlot *slot;
-	bool		call_again = false;
 	bool		found;
 
 	slot = table_slot_create(rel, NULL);
-	scan = table_index_fetch_begin(rel, SO_NONE);
-	found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again,
-									all_dead);
-	table_index_fetch_end(scan);
+	found = table_fetch_tid(rel, tid, snapshot, slot, all_dead);
 	ExecDropSingleTupleTableSlot(slot);
 
 	return found;
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 5450a27fa..97ce81eb5 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -53,7 +53,9 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->index_fetch_begin != NULL);
 	Assert(routine->index_fetch_reset != NULL);
 	Assert(routine->index_fetch_end != NULL);
-	Assert(routine->index_fetch_tuple != NULL);
+	Assert(routine->index_plain_amgettuple_next != NULL);
+	Assert(routine->index_only_amgettuple_next != NULL);
+	Assert(routine->fetch_tid != NULL);
 
 	Assert(routine->tuple_fetch_row_version != NULL);
 	Assert(routine->tuple_tid_valid != NULL);
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index 421d8c359..7aff48124 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -105,23 +105,14 @@ unique_key_recheck(PG_FUNCTION_ARGS)
 	 * removed.
 	 */
 	tmptid = checktid;
+	if (!table_fetch_tid(trigdata->tg_relation, &tmptid, SnapshotSelf,
+						 slot, NULL))
 	{
-		IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation,
-															SO_NONE);
-		bool		call_again = false;
-
-		if (!table_index_fetch_tuple(scan, &tmptid, SnapshotSelf, slot,
-									 &call_again, NULL))
-		{
-			/*
-			 * All rows referenced by the index entry are dead, so skip the
-			 * check.
-			 */
-			ExecDropSingleTupleTableSlot(slot);
-			table_index_fetch_end(scan);
-			return PointerGetDatum(NULL);
-		}
-		table_index_fetch_end(scan);
+		/*
+		 * All rows referenced by the index entry are dead, so skip the check.
+		 */
+		ExecDropSingleTupleTableSlot(slot);
+		return PointerGetDatum(NULL);
 	}
 
 	/*
@@ -168,9 +159,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
 		/*
 		 * Note: this is not a real insert; it is a check that the index entry
 		 * that has already been inserted is unique.  Passing the tuple's tid
-		 * (i.e. unmodified by table_index_fetch_tuple()) is correct even if
-		 * the row is now dead, because that is the TID the index will know
-		 * about.
+		 * (i.e. unmodified by table_fetch_tid()) is correct even if the row
+		 * is now dead, because that is the TID the index will know about.
 		 */
 		index_insert(indexRel, values, isnull, &checktid,
 					 trigdata->tg_relation, UNIQUE_CHECK_EXISTING,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e4b70166b..ca759b0ac 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -136,7 +136,7 @@ static void show_recursive_union_info(RecursiveUnionState *rstate,
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
-static void show_indexsearches_info(PlanState *planstate, ExplainState *es);
+static void show_indexscan_info(PlanState *planstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
@@ -1974,7 +1974,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_IndexOnlyScan:
 			show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1988,15 +1988,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
-			if (es->analyze)
-				ExplainPropertyFloat("Heap Fetches", NULL,
-									 planstate->instrument->ntuples2, 0, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_BitmapIndexScan:
 			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
 						   "Index Cond", planstate, ancestors, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_BitmapHeapScan:
 			show_scan_qual(((BitmapHeapScan *) plan)->bitmapqualorig,
@@ -3860,15 +3857,16 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 }
 
 /*
- * Show the total number of index searches for a
+ * Show index scan related executor instrumentation for an
  * IndexScan/IndexOnlyScan/BitmapIndexScan node
  */
 static void
-show_indexsearches_info(PlanState *planstate, ExplainState *es)
+show_indexscan_info(PlanState *planstate, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	SharedIndexScanInstrumentation *SharedInfo = NULL;
-	uint64		nsearches = 0;
+	uint64		nsearches = 0,
+				ntablefetches = 0;
 
 	if (!es->analyze)
 		return;
@@ -3889,6 +3887,7 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 				IndexOnlyScanState *indexstate = ((IndexOnlyScanState *) planstate);
 
 				nsearches = indexstate->ioss_Instrument->nsearches;
+				ntablefetches = indexstate->ioss_Instrument->ntablefetches;
 				SharedInfo = indexstate->ioss_SharedInfo;
 				break;
 			}
@@ -3912,9 +3911,13 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 			IndexScanInstrumentation *winstrument = &SharedInfo->winstrument[i];
 
 			nsearches += winstrument->nsearches;
+			ntablefetches += winstrument->ntablefetches;
 		}
 	}
 
+	if (nodeTag(plan) == T_IndexOnlyScan)
+		ExplainPropertyUInteger("Heap Fetches", NULL, ntablefetches, es);
+
 	ExplainPropertyUInteger("Index Searches", NULL, nsearches, es);
 }
 
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 4363e154c..cf792a8dd 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -815,12 +815,12 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index,
-								 &DirtySnapshot, NULL, indnkeyatts, 0,
-								 SO_NONE);
+	index_scan = index_beginscan(heap, index, false, &DirtySnapshot, NULL,
+								 indnkeyatts, 0, SO_NONE);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
-	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
+	while (table_index_getnext_slot(index_scan, ForwardScanDirection,
+									existing_slot))
 	{
 		TransactionId xwait;
 		XLTW_Oper	reason_wait;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index b2ca5cbf1..c873bcd82 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -205,8 +205,8 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
 	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel,
-						   &snap, NULL, skey_attoff, 0, SO_NONE);
+	scan = index_beginscan(rel, idxrel, false, &snap, NULL, skey_attoff, 0,
+						   SO_NONE);
 
 retry:
 	found = false;
@@ -214,7 +214,7 @@ retry:
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
 	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, outslot))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, outslot))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
@@ -669,13 +669,13 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid,
 	 * not yet committed or those just committed prior to the scan are
 	 * excluded in update_most_recent_deletion_info().
 	 */
-	scan = index_beginscan(rel, idxrel,
-						   SnapshotAny, NULL, skey_attoff, 0, SO_NONE);
+	scan = index_beginscan(rel, idxrel, false, SnapshotAny, NULL,
+						   skey_attoff, 0, SO_NONE);
 
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
 	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, scanslot))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 70c55ee6d..7045e58ef 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -204,6 +204,7 @@ ExecEndBitmapIndexScan(BitmapIndexScanState *node)
 		 * which will have a new BitmapIndexScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->biss_Instrument->nsearches;
+		Assert(node->biss_Instrument->ntablefetches == 0);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index de6154fd5..729a8a3e5 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -34,7 +34,6 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/tupdesc.h"
-#include "access/visibilitymap.h"
 #include "catalog/pg_type.h"
 #include "executor/executor.h"
 #include "executor/instrument.h"
@@ -42,7 +41,6 @@
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/predicate.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
@@ -66,7 +64,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
-	ItemPointer tid;
 
 	/*
 	 * extract necessary information from index scan node
@@ -91,7 +88,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->ioss_RelationDesc,
+								   node->ioss_RelationDesc, true,
 								   estate->es_snapshot,
 								   node->ioss_Instrument,
 								   node->ioss_NumScanKeys,
@@ -100,11 +97,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 								   SO_HINT_REL_READ_ONLY : SO_NONE);
 
 		node->ioss_ScanDesc = scandesc;
-
-
-		/* Set it up for index-only scan */
-		node->ioss_ScanDesc->xs_want_itup = true;
-		node->ioss_VMBuffer = InvalidBuffer;
+		Assert(node->ioss_ScanDesc->xs_want_itup);
 
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
@@ -121,78 +114,11 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while (table_index_getnext_slot(scandesc, direction,
+									node->ioss_TableSlot))
 	{
-		bool		tuple_from_heap = false;
-
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We can skip the heap fetch if the TID references a heap page on
-		 * which all tuples are known visible to everybody.  In any case,
-		 * we'll use the index tuple not the heap tuple as the data source.
-		 *
-		 * Note on Memory Ordering Effects: visibilitymap_get_status does not
-		 * lock the visibility map buffer, and therefore the result we read
-		 * here could be slightly stale.  However, it can't be stale enough to
-		 * matter.
-		 *
-		 * We need to detect clearing a VM bit due to an insert right away,
-		 * because the tuple is present in the index page but not visible. The
-		 * reading of the TID by this scan (using a shared lock on the index
-		 * buffer) is serialized with the insert of the TID into the index
-		 * (using an exclusive lock on the index buffer). Because the VM bit
-		 * is cleared before updating the index, and locking/unlocking of the
-		 * index page acts as a full memory barrier, we are sure to see the
-		 * cleared bit if we see a recently-inserted TID.
-		 *
-		 * Deletes do not update the index page (only VACUUM will clear out
-		 * the TID), so the clearing of the VM bit by a delete is not
-		 * serialized with this test below, and we may see a value that is
-		 * significantly stale. However, we don't care about the delete right
-		 * away, because the tuple is still visible until the deleting
-		 * transaction commits or the statement ends (if it's our
-		 * transaction). In either case, the lock on the VM buffer will have
-		 * been released (acting as a write barrier) after clearing the bit.
-		 * And for us to have a snapshot that includes the deleting
-		 * transaction (making the tuple invisible), we must have acquired
-		 * ProcArrayLock after that time, acting as a read barrier.
-		 *
-		 * It's worth going through this complexity to avoid needing to lock
-		 * the VM buffer, which could cause significant contention.
-		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
-							ItemPointerGetBlockNumber(tid),
-							&node->ioss_VMBuffer))
-		{
-			/*
-			 * Rats, we have to visit the heap to check visibility.
-			 */
-			InstrCountTuples2(node, 1);
-			if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
-				continue;		/* no visible tuple, try next index entry */
-
-			ExecClearTuple(node->ioss_TableSlot);
-
-			/*
-			 * Only MVCC snapshots are supported here, so there should be no
-			 * need to keep following the HOT chain once a visible entry has
-			 * been found.  If we did want to allow that, we'd need to keep
-			 * more state to remember not to call index_getnext_tid next time.
-			 */
-			if (scandesc->xs_heap_continue)
-				elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
-
-			/*
-			 * Note: at this point we are holding a pin on the heap page, as
-			 * recorded in scandesc->xs_cbuf.  We could release that pin now,
-			 * but it's not clear whether it's a win to do so.  The next index
-			 * entry might require a visit to the same heap page.
-			 */
-
-			tuple_from_heap = true;
-		}
-
 		/*
 		 * Fill the scan tuple slot with data from the index.  This might be
 		 * provided in either HeapTuple or IndexTuple format.  Conceivably an
@@ -241,16 +167,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("lossy distance functions are not supported in index-only scans")));
-
-		/*
-		 * If we didn't access the heap, then we'll need to take a predicate
-		 * lock explicitly, as if we had.  For now we do that at page level.
-		 */
-		if (!tuple_from_heap)
-			PredicateLockPage(scandesc->heapRelation,
-							  ItemPointerGetBlockNumber(tid),
-							  estate->es_snapshot);
-
 		return slot;
 	}
 
@@ -410,13 +326,6 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 	indexRelationDesc = node->ioss_RelationDesc;
 	indexScanDesc = node->ioss_ScanDesc;
 
-	/* Release VM buffer pin, if any. */
-	if (node->ioss_VMBuffer != InvalidBuffer)
-	{
-		ReleaseBuffer(node->ioss_VMBuffer);
-		node->ioss_VMBuffer = InvalidBuffer;
-	}
-
 	/*
 	 * When ending a parallel worker, copy the statistics gathered by the
 	 * worker back into shared memory so that it can be picked up by the main
@@ -436,6 +345,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->ioss_Instrument->nsearches;
+		winstrument->ntablefetches += node->ioss_Instrument->ntablefetches;
 	}
 
 	/*
@@ -792,15 +702,14 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->ioss_RelationDesc,
+								 node->ioss_RelationDesc, true,
 								 node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan,
 								 ScanRelIsReadOnly(&node->ss) ?
 								 SO_HINT_REL_READ_ONLY : SO_NONE);
-	node->ioss_ScanDesc->xs_want_itup = true;
-	node->ioss_VMBuffer = InvalidBuffer;
+	Assert(node->ioss_ScanDesc->xs_want_itup);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -860,14 +769,14 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->ioss_RelationDesc,
+								 node->ioss_RelationDesc, true,
 								 node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan,
 								 ScanRelIsReadOnly(&node->ss) ?
 								 SO_HINT_REL_READ_ONLY : SO_NONE);
-	node->ioss_ScanDesc->xs_want_itup = true;
+	Assert(node->ioss_ScanDesc->xs_want_itup);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 1620d1460..49ac9fd2c 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -109,7 +109,7 @@ IndexNext(IndexScanState *node)
 		 * serially executing an index scan that was planned to be parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->iss_RelationDesc,
+								   node->iss_RelationDesc, false,
 								   estate->es_snapshot,
 								   node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -132,7 +132,7 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while (table_index_getnext_slot(scandesc, direction, slot))
 	{
 		CHECK_FOR_INTERRUPTS();
 
@@ -207,7 +207,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * serially executing an index scan that was planned to be parallel.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->iss_RelationDesc,
+								   node->iss_RelationDesc, false,
 								   estate->es_snapshot,
 								   node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -266,7 +266,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * Fetch next tuple from the index.
 		 */
 next_indextuple:
-		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+		if (!table_index_getnext_slot(scandesc, ForwardScanDirection, slot))
 		{
 			/*
 			 * No more tuples from the index.  But we still need to drain any
@@ -818,6 +818,7 @@ ExecEndIndexScan(IndexScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->iss_Instrument->nsearches;
+		Assert(node->iss_Instrument->ntablefetches == 0);
 	}
 
 	/*
@@ -1730,7 +1731,7 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->iss_RelationDesc,
+								 node->iss_RelationDesc, false,
 								 node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
@@ -1796,7 +1797,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
-								 node->iss_RelationDesc,
+								 node->iss_RelationDesc, false,
 								 node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 2de08da65..366a9c1da 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -2723,7 +2723,7 @@ ri_FastPathCheck(const RI_ConstraintInfo *riinfo,
 	idx_rel = index_open(riinfo->conindid, AccessShareLock);
 
 	slot = table_slot_create(pk_rel, NULL);
-	scandesc = index_beginscan(pk_rel, idx_rel,
+	scandesc = index_beginscan(pk_rel, idx_rel, false,
 							   snapshot, NULL,
 							   riinfo->nkeys, 0,
 							   SO_NONE);
@@ -2779,7 +2779,7 @@ ri_FastPathProbeOne(Relation pk_rel, Relation idx_rel,
 
 	index_rescan(scandesc, skey, nkeys, NULL, 0);
 
-	if (index_getnext_slot(scandesc, ForwardScanDirection, slot))
+	if (table_index_getnext_slot(scandesc, ForwardScanDirection, slot))
 	{
 		bool		concurrently_updated;
 
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 4160d2d6e..6a1dfab51 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -102,7 +102,6 @@
 #include "access/gin.h"
 #include "access/table.h"
 #include "access/tableam.h"
-#include "access/visibilitymap.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_statistic.h"
@@ -7121,10 +7120,6 @@ get_actual_variable_endpoint(Relation heapRel,
 	bool		have_data = false;
 	SnapshotData SnapshotNonVacuumable;
 	IndexScanDesc index_scan;
-	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber last_heap_block = InvalidBlockNumber;
-	int			n_visited_heap_pages = 0;
-	ItemPointer tid;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 	MemoryContext oldcontext;
@@ -7172,62 +7167,26 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * a huge amount of time here, so we give up once we've read too many heap
 	 * pages.  When we fail for that reason, the caller will end up using
 	 * whatever extremal value is recorded in pg_statistic.
+	 *
+	 * We set xs_visited_pages_limit to tell the table AM to count distinct
+	 * heap pages visited for non-visible tuples and give up after the limit
+	 * is exceeded.
 	 */
+#define VISITED_PAGES_LIMIT 100
 	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
 							  GlobalVisTestFor(heapRel));
 
-	index_scan = index_beginscan(heapRel, indexRel,
+	index_scan = index_beginscan(heapRel, indexRel, true,
 								 &SnapshotNonVacuumable, NULL,
 								 1, 0,
 								 SO_NONE);
-	/* Set it up for index-only scan */
-	index_scan->xs_want_itup = true;
+	Assert(index_scan->xs_want_itup);
+	index_scan->xs_visited_pages_limit = VISITED_PAGES_LIMIT;
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
+	while (table_index_getnext_slot(index_scan, indexscandir, tableslot))
 	{
-		BlockNumber block = ItemPointerGetBlockNumber(tid);
-
-		if (!VM_ALL_VISIBLE(heapRel,
-							block,
-							&vmbuffer))
-		{
-			/* Rats, we have to visit the heap to check visibility */
-			if (!index_fetch_heap(index_scan, tableslot))
-			{
-				/*
-				 * No visible tuple for this index entry, so we need to
-				 * advance to the next entry.  Before doing so, count heap
-				 * page fetches and give up if we've done too many.
-				 *
-				 * We don't charge a page fetch if this is the same heap page
-				 * as the previous tuple.  This is on the conservative side,
-				 * since other recently-accessed pages are probably still in
-				 * buffers too; but it's good enough for this heuristic.
-				 */
-#define VISITED_PAGES_LIMIT 100
-
-				if (block != last_heap_block)
-				{
-					last_heap_block = block;
-					n_visited_heap_pages++;
-					if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
-						break;
-				}
-
-				continue;		/* no visible tuple, try next index entry */
-			}
-
-			/* We don't actually need the heap tuple for anything */
-			ExecClearTuple(tableslot);
-
-			/*
-			 * We don't care whether there's more than one visible tuple in
-			 * the HOT chain; if any are visible, that's good enough.
-			 */
-		}
-
 		/*
 		 * We expect that the index will return data in IndexTuple not
 		 * HeapTuple format.
@@ -7259,8 +7218,6 @@ get_actual_variable_endpoint(Relation heapRel,
 		break;
 	}
 
-	if (vmbuffer != InvalidBuffer)
-		ReleaseBuffer(vmbuffer);
 	index_endscan(index_scan);
 
 	return have_data;
-- 
2.53.0
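
A note on the get_actual_variable_endpoint change above: the open-coded
page counting that the patch removes from selfuncs.c charged a heap page
only when a fetch found no visible tuple and the block differed from the
previously charged one, and it gave up once the count exceeded
VISITED_PAGES_LIMIT.  The following is a minimal standalone sketch of
that removed heuristic, not the table AM code that now implements
xs_visited_pages_limit.  BlockNo, INVALID_BLOCK, VisitedPagesCounter and
both function names are hypothetical stand-ins invented for
illustration; none of them are PostgreSQL identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL's real types. */
typedef uint32_t BlockNo;
#define INVALID_BLOCK ((BlockNo) 0xFFFFFFFF)

typedef struct VisitedPagesCounter
{
	BlockNo		last_block;		/* heap block charged most recently */
	int			n_visited;		/* number of pages charged so far */
	int			limit;			/* give up once n_visited exceeds this */
} VisitedPagesCounter;

/* Start counting from scratch with the given page limit. */
static void
visited_pages_init(VisitedPagesCounter *c, int limit)
{
	assert(limit >= 0);
	c->last_block = INVALID_BLOCK;
	c->n_visited = 0;
	c->limit = limit;
}

/*
 * Record a heap fetch on 'block' that found no visible tuple.
 *
 * A page is charged only when it differs from the previously charged
 * page, mirroring the removed selfuncs.c logic.  Returns true when the
 * caller should give up on the scan.
 */
static bool
visited_pages_charge(VisitedPagesCounter *c, BlockNo block)
{
	if (block != c->last_block)
	{
		c->last_block = block;
		c->n_visited++;
	}
	return c->n_visited > c->limit;
}
```

A caller would invoke visited_pages_charge() once per index entry whose
heap fetch returned no visible tuple, and stop scanning as soon as it
returns true.  Comparing against only the previous block undercounts
nothing but can charge the same page twice if the scan returns to it
later; the removed comment described that as good enough for this
heuristic.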



  [application/octet-stream] v20-0002-Move-heapam_handler.c-index-scan-code-to-new-fil.patch (21.3K, 18-v20-0002-Move-heapam_handler.c-index-scan-code-to-new-fil.patch)
From e82059b0944c97e22eb6c5d884f224b2a807bed1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Wed, 25 Mar 2026 00:22:17 -0400
Subject: [PATCH v20 02/17] Move heapam_handler.c index scan code to new file.

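Move the heapam index fetch callbacks (index_fetch_begin,
index_fetch_reset, index_fetch_end, and index_fetch_tuple) out of
heapam_handler.c and into a new dedicated file, heapam_indexscan.c.
Also move heap_hot_search_buffer over from heapam.c.  The callbacks
are no longer static, so they are now declared in heapam.h.  This is a
purely mechanical move with no functional impact.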

Upcoming work to add a slot-based table AM interface for index scans
will substantially expand this code.  Keeping it in heapam_handler.c
would clutter a file whose primary role is to wire up the TableAmRoutine
callbacks.  Bitmap heap scans and sequential scans would benefit from
similar separation in the future.

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh
---
 src/include/access/heapam.h                |  15 +-
 src/backend/access/heap/Makefile           |   1 +
 src/backend/access/heap/heapam.c           | 161 ------------
 src/backend/access/heap/heapam_handler.c   | 111 --------
 src/backend/access/heap/heapam_indexscan.c | 292 +++++++++++++++++++++
 src/backend/access/heap/meson.build        |   1 +
 6 files changed, 306 insertions(+), 275 deletions(-)
 create mode 100644 src/backend/access/heap/heapam_indexscan.c

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 54067b828..cc90c821b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -366,9 +366,6 @@ extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
 									  TupleTableSlot *slot);
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 					   HeapTuple tuple, Buffer *userbuf, bool keep_buf);
-extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
-								   Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
-								   bool *all_dead, bool first_call);
 
 extern void heap_get_latest_tid(TableScanDesc sscan, ItemPointer tid);
 
@@ -431,6 +428,18 @@ extern void simple_heap_update(Relation relation, const ItemPointerData *otid,
 extern TransactionId heap_index_delete_tuples(Relation rel,
 											  TM_IndexDeleteOp *delstate);
 
+/* in heap/heapam_indexscan.c */
+extern IndexFetchTableData *heapam_index_fetch_begin(Relation rel, uint32 flags);
+extern void heapam_index_fetch_reset(IndexFetchTableData *scan);
+extern void heapam_index_fetch_end(IndexFetchTableData *scan);
+extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
+								   Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
+								   bool *all_dead, bool first_call);
+extern bool heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
+									 ItemPointer tid, Snapshot snapshot,
+									 TupleTableSlot *slot, bool *heap_continue,
+									 bool *all_dead);
+
 /* in heap/pruneheap.c */
 extern void heap_page_prune_opt(Relation relation, Buffer buffer,
 								Buffer *vmbuffer, bool rel_read_only);
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index 394534172..1d27ccb91 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	heapam.o \
 	heapam_handler.o \
+	heapam_indexscan.o \
 	heapam_visibility.o \
 	heapam_xlog.o \
 	heaptoast.o \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6bff0032d..e06ce2db2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1764,167 +1764,6 @@ heap_fetch(Relation relation,
 	return false;
 }
 
-/*
- *	heap_hot_search_buffer	- search HOT chain for tuple satisfying snapshot
- *
- * On entry, *tid is the TID of a tuple (either a simple tuple, or the root
- * of a HOT chain), and buffer is the buffer holding this tuple.  We search
- * for the first chain member satisfying the given snapshot.  If one is
- * found, we update *tid to reference that tuple's offset number, and
- * return true.  If no match, return false without modifying *tid.
- *
- * heapTuple is a caller-supplied buffer.  When a match is found, we return
- * the tuple here, in addition to updating *tid.  If no match is found, the
- * contents of this buffer on return are undefined.
- *
- * If all_dead is not NULL, we check non-visible tuples to see if they are
- * globally dead; *all_dead is set true if all members of the HOT chain
- * are vacuumable, false if not.
- *
- * Unlike heap_fetch, the caller must already have pin and (at least) share
- * lock on the buffer; it is still pinned/locked at exit.
- */
-bool
-heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
-					   Snapshot snapshot, HeapTuple heapTuple,
-					   bool *all_dead, bool first_call)
-{
-	Page		page = BufferGetPage(buffer);
-	TransactionId prev_xmax = InvalidTransactionId;
-	BlockNumber blkno;
-	OffsetNumber offnum;
-	bool		at_chain_start;
-	bool		valid;
-	bool		skip;
-	GlobalVisState *vistest = NULL;
-
-	/* If this is not the first call, previous call returned a (live!) tuple */
-	if (all_dead)
-		*all_dead = first_call;
-
-	blkno = ItemPointerGetBlockNumber(tid);
-	offnum = ItemPointerGetOffsetNumber(tid);
-	at_chain_start = first_call;
-	skip = !first_call;
-
-	/* XXX: we should assert that a snapshot is pushed or registered */
-	Assert(TransactionIdIsValid(RecentXmin));
-	Assert(BufferGetBlockNumber(buffer) == blkno);
-
-	/* Scan through possible multiple members of HOT-chain */
-	for (;;)
-	{
-		ItemId		lp;
-
-		/* check for bogus TID */
-		if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(page))
-			break;
-
-		lp = PageGetItemId(page, offnum);
-
-		/* check for unused, dead, or redirected items */
-		if (!ItemIdIsNormal(lp))
-		{
-			/* We should only see a redirect at start of chain */
-			if (ItemIdIsRedirected(lp) && at_chain_start)
-			{
-				/* Follow the redirect */
-				offnum = ItemIdGetRedirect(lp);
-				at_chain_start = false;
-				continue;
-			}
-			/* else must be end of chain */
-			break;
-		}
-
-		/*
-		 * Update heapTuple to point to the element of the HOT chain we're
-		 * currently investigating. Having t_self set correctly is important
-		 * because the SSI checks and the *Satisfies routine for historical
-		 * MVCC snapshots need the correct tid to decide about the visibility.
-		 */
-		heapTuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
-		heapTuple->t_len = ItemIdGetLength(lp);
-		heapTuple->t_tableOid = RelationGetRelid(relation);
-		ItemPointerSet(&heapTuple->t_self, blkno, offnum);
-
-		/*
-		 * Shouldn't see a HEAP_ONLY tuple at chain start.
-		 */
-		if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
-			break;
-
-		/*
-		 * The xmin should match the previous xmax value, else chain is
-		 * broken.
-		 */
-		if (TransactionIdIsValid(prev_xmax) &&
-			!TransactionIdEquals(prev_xmax,
-								 HeapTupleHeaderGetXmin(heapTuple->t_data)))
-			break;
-
-		/*
-		 * When first_call is true (and thus, skip is initially false) we'll
-		 * return the first tuple we find.  But on later passes, heapTuple
-		 * will initially be pointing to the tuple we returned last time.
-		 * Returning it again would be incorrect (and would loop forever), so
-		 * we skip it and return the next match we find.
-		 */
-		if (!skip)
-		{
-			/* If it's visible per the snapshot, we must return it */
-			valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
-			HeapCheckForSerializableConflictOut(valid, relation, heapTuple,
-												buffer, snapshot);
-
-			if (valid)
-			{
-				ItemPointerSetOffsetNumber(tid, offnum);
-				PredicateLockTID(relation, &heapTuple->t_self, snapshot,
-								 HeapTupleHeaderGetXmin(heapTuple->t_data));
-				if (all_dead)
-					*all_dead = false;
-				return true;
-			}
-		}
-		skip = false;
-
-		/*
-		 * If we can't see it, maybe no one else can either.  At caller
-		 * request, check whether all chain members are dead to all
-		 * transactions.
-		 *
-		 * Note: if you change the criterion here for what is "dead", fix the
-		 * planner's get_actual_variable_range() function to match.
-		 */
-		if (all_dead && *all_dead)
-		{
-			if (!vistest)
-				vistest = GlobalVisTestFor(relation);
-
-			if (!HeapTupleIsSurelyDead(heapTuple, vistest))
-				*all_dead = false;
-		}
-
-		/*
-		 * Check to see if HOT chain continues past this tuple; if so fetch
-		 * the next offnum and loop around.
-		 */
-		if (HeapTupleIsHotUpdated(heapTuple))
-		{
-			Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
-				   blkno);
-			offnum = ItemPointerGetOffsetNumber(&heapTuple->t_data->t_ctid);
-			at_chain_start = false;
-			prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple->t_data);
-		}
-		else
-			break;				/* end of chain */
-	}
-
-	return false;
-}
-
 /*
  *	heap_get_latest_tid -  get the latest tid of a specified tuple
  *
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dc7db5885..07f07188d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -75,117 +75,6 @@ heapam_slot_callbacks(Relation relation)
 }
 
 
-/* ------------------------------------------------------------------------
- * Index Scan Callbacks for heap AM
- * ------------------------------------------------------------------------
- */
-
-static IndexFetchTableData *
-heapam_index_fetch_begin(Relation rel, uint32 flags)
-{
-	IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);
-
-	hscan->xs_base.rel = rel;
-	hscan->xs_base.flags = flags;
-	hscan->xs_cbuf = InvalidBuffer;
-	hscan->xs_vmbuffer = InvalidBuffer;
-
-	return &hscan->xs_base;
-}
-
-static void
-heapam_index_fetch_reset(IndexFetchTableData *scan)
-{
-	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
-
-	if (BufferIsValid(hscan->xs_cbuf))
-	{
-		ReleaseBuffer(hscan->xs_cbuf);
-		hscan->xs_cbuf = InvalidBuffer;
-	}
-
-	if (BufferIsValid(hscan->xs_vmbuffer))
-	{
-		ReleaseBuffer(hscan->xs_vmbuffer);
-		hscan->xs_vmbuffer = InvalidBuffer;
-	}
-}
-
-static void
-heapam_index_fetch_end(IndexFetchTableData *scan)
-{
-	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
-
-	heapam_index_fetch_reset(scan);
-
-	pfree(hscan);
-}
-
-static bool
-heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
-						 ItemPointer tid,
-						 Snapshot snapshot,
-						 TupleTableSlot *slot,
-						 bool *heap_continue, bool *all_dead)
-{
-	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
-	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
-	bool		got_heap_tuple;
-
-	Assert(TTS_IS_BUFFERTUPLE(slot));
-
-	/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
-	if (!*heap_continue)
-	{
-		/* Switch to correct buffer if we don't have it already */
-		Buffer		prev_buf = hscan->xs_cbuf;
-
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
-
-		/*
-		 * Prune page, but only if we weren't already on this page
-		 */
-		if (prev_buf != hscan->xs_cbuf)
-			heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf,
-								&hscan->xs_vmbuffer,
-								hscan->xs_base.flags & SO_HINT_REL_READ_ONLY);
-	}
-
-	/* Obtain share-lock on the buffer so we can examine visibility */
-	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
-	got_heap_tuple = heap_hot_search_buffer(tid,
-											hscan->xs_base.rel,
-											hscan->xs_cbuf,
-											snapshot,
-											&bslot->base.tupdata,
-											all_dead,
-											!*heap_continue);
-	bslot->base.tupdata.t_self = *tid;
-	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_UNLOCK);
-
-	if (got_heap_tuple)
-	{
-		/*
-		 * Only in a non-MVCC snapshot can more than one member of the HOT
-		 * chain be visible.
-		 */
-		*heap_continue = !IsMVCCLikeSnapshot(snapshot);
-
-		slot->tts_tableOid = RelationGetRelid(scan->rel);
-		ExecStoreBufferHeapTuple(&bslot->base.tupdata, slot, hscan->xs_cbuf);
-	}
-	else
-	{
-		/* We've reached the end of the HOT chain. */
-		*heap_continue = false;
-	}
-
-	return got_heap_tuple;
-}
-
-
 /* ------------------------------------------------------------------------
  * Callbacks for non-modifying operations on individual tuples for heap AM
  * ------------------------------------------------------------------------
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
new file mode 100644
index 000000000..c36b804d1
--- /dev/null
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -0,0 +1,292 @@
+/*-------------------------------------------------------------------------
+ *
+ * heapam_indexscan.c
+ *	  heap table plain index scan and index-only scan code
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/heap/heapam_indexscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/relscan.h"
+#include "storage/predicate.h"
+
+
+/* ------------------------------------------------------------------------
+ * Index Scan Callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+IndexFetchTableData *
+heapam_index_fetch_begin(Relation rel, uint32 flags)
+{
+	IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);
+
+	hscan->xs_base.rel = rel;
+	hscan->xs_base.flags = flags;
+	hscan->xs_cbuf = InvalidBuffer;
+	hscan->xs_vmbuffer = InvalidBuffer;
+
+	return &hscan->xs_base;
+}
+
+void
+heapam_index_fetch_reset(IndexFetchTableData *scan)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
+
+	if (BufferIsValid(hscan->xs_cbuf))
+	{
+		ReleaseBuffer(hscan->xs_cbuf);
+		hscan->xs_cbuf = InvalidBuffer;
+	}
+
+	if (BufferIsValid(hscan->xs_vmbuffer))
+	{
+		ReleaseBuffer(hscan->xs_vmbuffer);
+		hscan->xs_vmbuffer = InvalidBuffer;
+	}
+}
+
+void
+heapam_index_fetch_end(IndexFetchTableData *scan)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
+
+	heapam_index_fetch_reset(scan);
+
+	pfree(hscan);
+}
+
+/*
+ *	heap_hot_search_buffer	- search HOT chain for tuple satisfying snapshot
+ *
+ * On entry, *tid is the TID of a tuple (either a simple tuple, or the root
+ * of a HOT chain), and buffer is the buffer holding this tuple.  We search
+ * for the first chain member satisfying the given snapshot.  If one is
+ * found, we update *tid to reference that tuple's offset number, and
+ * return true.  If no match, return false without modifying *tid.
+ *
+ * heapTuple is a caller-supplied buffer.  When a match is found, we return
+ * the tuple here, in addition to updating *tid.  If no match is found, the
+ * contents of this buffer on return are undefined.
+ *
+ * If all_dead is not NULL, we check non-visible tuples to see if they are
+ * globally dead; *all_dead is set true if all members of the HOT chain
+ * are vacuumable, false if not.
+ *
+ * Unlike heap_fetch, the caller must already have pin and (at least) share
+ * lock on the buffer; it is still pinned/locked at exit.
+ */
+bool
+heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
+					   Snapshot snapshot, HeapTuple heapTuple,
+					   bool *all_dead, bool first_call)
+{
+	Page		page = BufferGetPage(buffer);
+	TransactionId prev_xmax = InvalidTransactionId;
+	BlockNumber blkno;
+	OffsetNumber offnum;
+	bool		at_chain_start;
+	bool		valid;
+	bool		skip;
+	GlobalVisState *vistest = NULL;
+
+	/* If this is not the first call, previous call returned a (live!) tuple */
+	if (all_dead)
+		*all_dead = first_call;
+
+	blkno = ItemPointerGetBlockNumber(tid);
+	offnum = ItemPointerGetOffsetNumber(tid);
+	at_chain_start = first_call;
+	skip = !first_call;
+
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
+	Assert(BufferGetBlockNumber(buffer) == blkno);
+
+	/* Scan through possible multiple members of HOT-chain */
+	for (;;)
+	{
+		ItemId		lp;
+
+		/* check for bogus TID */
+		if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(page))
+			break;
+
+		lp = PageGetItemId(page, offnum);
+
+		/* check for unused, dead, or redirected items */
+		if (!ItemIdIsNormal(lp))
+		{
+			/* We should only see a redirect at start of chain */
+			if (ItemIdIsRedirected(lp) && at_chain_start)
+			{
+				/* Follow the redirect */
+				offnum = ItemIdGetRedirect(lp);
+				at_chain_start = false;
+				continue;
+			}
+			/* else must be end of chain */
+			break;
+		}
+
+		/*
+		 * Update heapTuple to point to the element of the HOT chain we're
+		 * currently investigating. Having t_self set correctly is important
+		 * because the SSI checks and the *Satisfies routine for historical
+		 * MVCC snapshots need the correct tid to decide about the visibility.
+		 */
+		heapTuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
+		heapTuple->t_len = ItemIdGetLength(lp);
+		heapTuple->t_tableOid = RelationGetRelid(relation);
+		ItemPointerSet(&heapTuple->t_self, blkno, offnum);
+
+		/*
+		 * Shouldn't see a HEAP_ONLY tuple at chain start.
+		 */
+		if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
+			break;
+
+		/*
+		 * The xmin should match the previous xmax value, else chain is
+		 * broken.
+		 */
+		if (TransactionIdIsValid(prev_xmax) &&
+			!TransactionIdEquals(prev_xmax,
+								 HeapTupleHeaderGetXmin(heapTuple->t_data)))
+			break;
+
+		/*
+		 * When first_call is true (and thus, skip is initially false) we'll
+		 * return the first tuple we find.  But on later passes, heapTuple
+		 * will initially be pointing to the tuple we returned last time.
+		 * Returning it again would be incorrect (and would loop forever), so
+		 * we skip it and return the next match we find.
+		 */
+		if (!skip)
+		{
+			/* If it's visible per the snapshot, we must return it */
+			valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
+			HeapCheckForSerializableConflictOut(valid, relation, heapTuple,
+												buffer, snapshot);
+
+			if (valid)
+			{
+				ItemPointerSetOffsetNumber(tid, offnum);
+				PredicateLockTID(relation, &heapTuple->t_self, snapshot,
+								 HeapTupleHeaderGetXmin(heapTuple->t_data));
+				if (all_dead)
+					*all_dead = false;
+				return true;
+			}
+		}
+		skip = false;
+
+		/*
+		 * If we can't see it, maybe no one else can either.  At caller
+		 * request, check whether all chain members are dead to all
+		 * transactions.
+		 *
+		 * Note: if you change the criterion here for what is "dead", fix the
+		 * planner's get_actual_variable_range() function to match.
+		 */
+		if (all_dead && *all_dead)
+		{
+			if (!vistest)
+				vistest = GlobalVisTestFor(relation);
+
+			if (!HeapTupleIsSurelyDead(heapTuple, vistest))
+				*all_dead = false;
+		}
+
+		/*
+		 * Check to see if HOT chain continues past this tuple; if so fetch
+		 * the next offnum and loop around.
+		 */
+		if (HeapTupleIsHotUpdated(heapTuple))
+		{
+			Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
+				   blkno);
+			offnum = ItemPointerGetOffsetNumber(&heapTuple->t_data->t_ctid);
+			at_chain_start = false;
+			prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple->t_data);
+		}
+		else
+			break;				/* end of chain */
+
+	}
+
+	return false;
+}
+
+bool
+heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
+						 ItemPointer tid,
+						 Snapshot snapshot,
+						 TupleTableSlot *slot,
+						 bool *heap_continue, bool *all_dead)
+{
+	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
+	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	bool		got_heap_tuple;
+
+	Assert(TTS_IS_BUFFERTUPLE(slot));
+
+	/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
+	if (!*heap_continue)
+	{
+		/* Switch to correct buffer if we don't have it already */
+		Buffer		prev_buf = hscan->xs_cbuf;
+
+		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
+											  hscan->xs_base.rel,
+											  ItemPointerGetBlockNumber(tid));
+
+		/*
+		 * Prune page, but only if we weren't already on this page
+		 */
+		if (prev_buf != hscan->xs_cbuf)
+			heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf,
+								&hscan->xs_vmbuffer,
+								hscan->xs_base.flags & SO_HINT_REL_READ_ONLY);
+	}
+
+	/* Obtain share-lock on the buffer so we can examine visibility */
+	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
+	got_heap_tuple = heap_hot_search_buffer(tid,
+											hscan->xs_base.rel,
+											hscan->xs_cbuf,
+											snapshot,
+											&bslot->base.tupdata,
+											all_dead,
+											!*heap_continue);
+	bslot->base.tupdata.t_self = *tid;
+	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_UNLOCK);
+
+	if (got_heap_tuple)
+	{
+		/*
+		 * Only in a non-MVCC snapshot can more than one member of the HOT
+		 * chain be visible.
+		 */
+		*heap_continue = !IsMVCCLikeSnapshot(snapshot);
+
+		slot->tts_tableOid = RelationGetRelid(scan->rel);
+		ExecStoreBufferHeapTuple(&bslot->base.tupdata, slot, hscan->xs_cbuf);
+	}
+	else
+	{
+		/* We've reached the end of the HOT chain. */
+		*heap_continue = false;
+	}
+
+	return got_heap_tuple;
+}
diff --git a/src/backend/access/heap/meson.build b/src/backend/access/heap/meson.build
index 92ab8be3d..00ec07d7f 100644
--- a/src/backend/access/heap/meson.build
+++ b/src/backend/access/heap/meson.build
@@ -3,6 +3,7 @@
 backend_sources += files(
   'heapam.c',
   'heapam_handler.c',
+  'heapam_indexscan.c',
   'heapam_visibility.c',
   'heapam_xlog.c',
   'heaptoast.c',
-- 
2.53.0


