Re: index prefetching - Peter Geoghegan

From: Peter Geoghegan <[email protected]>
To: Andres Freund <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Alexandre Felipe <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Konstantin Knizhnik <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Mon, 22 Jun 2026 16:48:10 -0400
Message-ID: <CAH2-WzkZTkDuyVFszLwPJesF9pS5E8m0UA+344bx-B-zfA5kaw@mail.gmail.com> (raw)
In-Reply-To: <CAH2-Wzn+C=mAMv9aW3Skfh80JPpHKT3yM=DYwkrrhYyG2f+_vg@mail.gmail.com>
References: <CAH2-Wz=D4Lru9BkvqaRnFRPDaZbfTOdWcxw13zyG6GVFTtz_vw@mail.gmail.com>
	<CAH2-Wz=Vxsgas35ZzOJJW1ceqp9TJ2DFhKmXULwUAcVpfD73xA@mail.gmail.com>
	<CAH2-Wz=kMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G=zrw@mail.gmail.com>
	<CAH2-WzkFRoTjD9T8ykYDzOMxzGiWFqcAkbK8B=HjfpoMdM4E8A@mail.gmail.com>
	<chsvntdxvsiyigxq4nng36gne4natvxwvsqnkvbjlpaw6bu7co@a6togdo4wbrj>
	<CAH2-Wz=nz-zr=gaXL1od_F7dcr=7d+3jEEveqY-bgcAKF6wZJQ@mail.gmail.com>
	<CAH2-Wz=t3G53xKGYEWqm_QV35ExRgT2k=qhw_VHe5oGjdFRwtA@mail.gmail.com>
	<CAH2-WzkiCK=wELiXPgriN4r7cJzGb3Xg48E9YHrFEyEPTkynOw@mail.gmail.com>
	<s25f6opuqwmz7oad467twvgxc36zmgzguhph43z4sbfkflsjnq@r6cuj7cl36lg>
	<CAH2-Wzm9r6sCDVkXHXLobQFE3WcEPf-pm=g6S93u4W9JLK2VDA@mail.gmail.com>
	<6sphk3ycctmbihlrykts7uj6mjakop6wrq2dhe3vnlmrnldz2f@uuwmkd6jjrxa>
	<CAH2-WzmAKApAYttMQfhh-usweXvRYpHcku-OKHYgDvb=wLSWHg@mail.gmail.com>
	<CAH2-Wzn+C=mAMv9aW3Skfh80JPpHKT3yM=DYwkrrhYyG2f+_vg@mail.gmail.com>

On Sun, Jun 14, 2026 at 2:54 PM Peter Geoghegan <[email protected]> wrote:
> Attached is v27, which cleans up some remaining loose ends.

v28 is attached.

Executive summary:

v28 includes better "start read stream" heuristics, adds SP-GiST
support for amgetbatch, improves the heuristics that bound the work
that get_actual_variable_range performs with a severely bloated index
during planning, and adds more extensive testing of tricky nbtree
edge-cases.

Full details:

* v28 replaces the "only start prefetching on second batch" heuristic
with a more principled approach based on the number of heap blocks
read so far. We now create a read stream the fourth time the scan
switches to another heap block. This cutoff was derived empirically,
by following a process driven by my microbenchmark suite.

This is the standout item in v28. Obviously, the old heuristic left
money on the table regarding prefetching benefits [1] -- that is the
main reason to replace it with something better. Certain
somewhat-selective queries now benefit much more from prefetching,
without it regressing anything else. But the new heuristic has 2
additional nonobvious advantages:

1. Typically, the new heuristic will start prefetching much earlier
than the old one. But sometimes the opposite occurs: prefetching takes
*longer* to begin. That also helps performance (because we weigh the
actual cost we care about, not a noisy proxy for it).

This helps with some of my adversarial test queries, which still had
regressions before v28. I'm referring to adversarial queries that
perform lateral joins with a LIMIT on the inner side of a nestloop
join (e.g., the A8 query that some of us discussed over IM). Now there
are no regressions at all in my main microbenchmark test suite, which
was an unexpected bonus.

We now seem to avoid all regressions related to LIMITs, lateral joins,
nestloop antijoins, and nestloop semijoins -- without needing to pass
information from the planner down to the scan to do so.

2. The rules determining when to create a read stream are now exactly
the same for index-only scans -- no special case logic is required
anymore. This is simpler and more elegant.

The old heuristic was actually: "Start prefetching on the second
batch, except when there have been exactly zero heap fetches so far
(possible only during an index-only scan)". The new heuristic only
considers heap fetches.
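
A minimal sketch of the block-switch rule described above. All names here
(ScanPrefetchState, scan_note_heap_fetch, the constant) are hypothetical and
not taken from the patch. It also assumes that the very first heap fetch of a
scan does not count as a "switch"; the patch's actual bookkeeping may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names and values; none of this is taken from the patch. */
#define INVALID_BLOCK UINT32_MAX
#define HEAP_BLOCK_SWITCHES_BEFORE_PREFETCH 4	/* "the fourth time" */

typedef struct ScanPrefetchState
{
	uint32_t	cur_heap_block;	/* heap block of the most recent fetch */
	int			block_switches;	/* times the scan moved to another block */
	bool		stream_started;	/* has the read stream been created? */
} ScanPrefetchState;

static void
scan_prefetch_init(ScanPrefetchState *st)
{
	st->cur_heap_block = INVALID_BLOCK;
	st->block_switches = 0;
	st->stream_started = false;
}

/*
 * Record one heap fetch.  Returns true exactly once: on the fetch where the
 * caller should create its read stream.
 */
static bool
scan_note_heap_fetch(ScanPrefetchState *st, uint32_t heap_block)
{
	assert(heap_block != INVALID_BLOCK);

	if (st->cur_heap_block == INVALID_BLOCK)
	{
		/* Assumption: the first fetch has nothing to switch away from. */
		st->cur_heap_block = heap_block;
		return false;
	}

	/* Repeated fetches from the same heap block cost no extra I/O. */
	if (heap_block == st->cur_heap_block)
		return false;

	st->cur_heap_block = heap_block;
	st->block_switches++;

	if (!st->stream_started &&
		st->block_switches >= HEAP_BLOCK_SWITCHES_BEFORE_PREFETCH)
	{
		st->stream_started = true;
		return true;
	}
	return false;
}
```

Because only heap fetches feed the counter, an index-only scan that never
visits the heap never reaches the threshold in this sketch, which is why no
special case is needed for it.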

* A new v28 patch adds amgetbatch support to SP-GiST, the last in-core
index AM that still used the amgettuple interface. Performance
improves to a degree very similar to other index AMs: range scan
queries can be 20x to 25x faster.

An additional restriction applies to index-only scans, which is a
separate issue peculiar to SP-GiST: index-only scans are now disabled
for "long values" opclasses such as the text radix opclass. Unlike the
similar IoS issue in GiST + SP-GiST (the ordered scans issue), this is
a legitimate limitation in the amgetbatch design itself (there is no
existing bug involved here). Don't confuse these two separate
IoS-related issues.

Here's why I just desupported index-only scans for this particular
subset of SP-GiST opclasses: using "long values" doesn't fit well with
our resource management strategy during amgetbatch scans. Our usual
approach of reconstructing a value runs into the problem that the
required metadata size is essentially unbounded. The prefix cannot
reliably fit into a fixed per-batch reconstruction workspace; we can't
use a generic conservative estimate known at the start of the scan.
Note that some prefix style opclasses are not affected at all -- those
that already promise to keep their prefix size under BLCKSZ (our usual
amgettransform approach can work there).

I think we should just accept this limitation to gain the benefit of
prefetching during SP-GiST scans. I'm loath to invent complicated new
heapam infrastructure just to deal with this problem. I'd rather just
not support SP-GiST at all.

The SP-GiST patch is still very much WIP. I changed how SP-GiST VACUUM
uses its own read stream (for prefetching on index pages) to avoid
obtaining self-conflicting cleanup locks, which is another aspect of
SP-GiST that presented a unique challenge. See the
SPGIST_VACUUM_DRAIN_INTERVAL mechanism.

* v28 includes a new patch that augments the existing
VISITED_PAGES_LIMIT mechanism (which get_actual_variable_range uses to
limit the amount of work its scan performs in pathological cases) with
a new INDEX_PAGES_LIMIT limit. This fixes a complaint from Mark
Callaghan [2] about planning time for range queries becoming excessive
with queue-like tables (because VISITED_PAGES_LIMIT alone was
ineffective).

Note that this is now independent work; it really has nothing to do
with amgetbatch support per se. I'm including it now, because the work
in this area is an offshoot of the new slot-based index scan design,
and was previously discussed on this thread.

Background: Earlier versions of this patch set included a roughly
comparable patch, which I abandoned towards the end of the PG 19
cycle. My earlier proposal wholly replaced VISITED_PAGES_LIMIT; this
new patch complements VISITED_PAGES_LIMIT, by specifically targeting
its one major weakness. That is, this new INDEX_PAGES_LIMIT mechanism
will only tally index leaf page reads that return zero matching items
to the table AM (i.e., those that don't return any batch). And so
INDEX_PAGES_LIMIT is complementary to VISITED_PAGES_LIMIT; it doesn't
replace it.

The results from one microbenchmark indicate that the pathological
case is fully fixed (times given are for planning time, in
milliseconds):

1. Baseline (no dead tuples):
  Metric           Master        Patch      Ratio
  ---------- ------------ ------------ ----------
  Avg               0.140        0.129      0.916x
  p95               0.213        0.188      0.882x
  p99               0.259        0.224      0.867x

3. Main (with bulk-deleted tuples, concurrent inserts/deletes):
  Metric           Master        Patch      Ratio
  ---------- ------------ ------------ ----------
  Avg               3.017        0.259      0.086x
  p95              10.055        0.230      0.023x
  p99              28.034        2.121      0.076x

It makes sense that planning time can double in extreme cases due to
concentrated index bloat (relative to the case with no bloat). It does
not make sense for it to increase by 20x or more. The important thing
is that the worst case is some fixed small-ish multiple of the
best/average case -- there's absolutely no sensible reason to expect
get_actual_variable_range to not at least achieve that.
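
A rough sketch of how the two limits described above could sit side by side.
The struct, the function names, and both limit values are illustrative
assumptions; this message does not state the value the patch uses for
INDEX_PAGES_LIMIT, and the real accounting in get_actual_variable_range may
be organized differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; not taken from the patch. */
#define VISITED_PAGES_LIMIT 100
#define INDEX_PAGES_LIMIT 10

typedef struct RangeProbeBudget
{
	int			visited_heap_pages;	/* tallied toward VISITED_PAGES_LIMIT */
	int			empty_index_pages;	/* leaf pages that returned no items */
} RangeProbeBudget;

/*
 * Account for one index leaf page read.  Only pages that return zero
 * matching items to the table AM are tallied.  Returns false once the
 * probe should give up.
 */
static bool
probe_note_leaf_page(RangeProbeBudget *b, int nmatching_items)
{
	assert(nmatching_items >= 0);

	if (nmatching_items == 0)
		b->empty_index_pages++;
	return b->empty_index_pages <= INDEX_PAGES_LIMIT;
}

/*
 * Account for one newly visited heap page, independently of the index-side
 * tally.  Returns false once the probe should give up.
 */
static bool
probe_note_heap_page(RangeProbeBudget *b)
{
	b->visited_heap_pages++;
	return b->visited_heap_pages <= VISITED_PAGES_LIMIT;
}
```

The point of keeping two counters is that a long run of leaf pages full of
dead items never touches the heap-side counter at all, so only the
index-side tally can stop the scan in that case.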

* v28 also adds a new "read stream throttling" mechanism to prevent
calls to the read stream callback during index-only scans that return
many batches with no prefetchable heap blocks (because every batch
consists of matching items that are all-visible). This is also used
during plain index scans, though that is much less critical.

Throttling works by pausing the read stream when we detect that
amgetbatch calls aren't producing any useful heap block numbers for
the read stream to consume. It makes no sense for a single call to the
read stream callback to invoke amgetbatch *several* times without
returning even one usable heap block number to the read stream. As a
bonus, this adds test coverage for the pausing mechanism.

I developed an adversarial microbenchmark for this. v27 of the patch
performed pretty poorly here:

  === ios / cached ===
  build / setting       buffers (h+r)  bufs vs master  heap_fetch  io_read(ms)  exec(ms)  exec x
  --------------------  -------------  --------------  ----------  -----------  --------  ------
  master (no prefetch)              9       identical           5            -     0.017   1.00x
  rc prefetch=off                   9       identical           5            -     0.013   0.76x
  rc prefetch=on                   72             +63           5            -     0.276  16.24x

v28 leaves the rc prefetch=on case ~1.5x slower than
master/prefetch=off (not shown in the table). This seems quite
acceptable for an unrealistic LIMIT N microbenchmark such as this
(prefetching begins exactly where there'll be no more heap fetches,
which is inherently impossible to handle without any new overhead).
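
A toy model of the throttling idea described above: one invocation of the
read stream callback fetches only a bounded number of batches that yield no
heap block, and then asks for the stream to be paused. Everything here
(ToyScan, toy_stream_callback, the CB_* results, the limit of 2) is a
hypothetical stand-in; it does not reproduce the patch's callback or the real
read stream API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit; the patch's actual policy is not shown in this mail. */
#define MAX_EMPTY_BATCHES_PER_CALL 2

typedef enum
{
	CB_BLOCK,					/* returned one heap block to the stream */
	CB_PAUSED,					/* too many empty batches; pause the stream */
	CB_END						/* no more batches */
} CallbackResult;

typedef struct ToyScan
{
	const int  *batch_nblocks;	/* prefetchable heap blocks per batch */
	size_t		nbatches;
	size_t		next_batch;
	int			pending_blocks;	/* blocks left in the current batch */
	int			batches_fetched;	/* stands in for amgetbatch calls */
} ToyScan;

static CallbackResult
toy_stream_callback(ToyScan *scan)
{
	int			empty_batches = 0;	/* reset on every callback invocation */

	assert(scan->pending_blocks >= 0);

	while (scan->pending_blocks == 0)
	{
		if (scan->next_batch == scan->nbatches)
			return CB_END;
		if (empty_batches == MAX_EMPTY_BATCHES_PER_CALL)
			return CB_PAUSED;

		/* "amgetbatch": a batch of all-visible items has zero heap blocks */
		scan->pending_blocks = scan->batch_nblocks[scan->next_batch++];
		scan->batches_fetched++;
		if (scan->pending_blocks == 0)
			empty_batches++;
	}

	scan->pending_blocks--;
	return CB_BLOCK;
}
```

With batches holding {0, 0, 0, 1} prefetchable blocks, the first call in this
model pauses after two empty batches, the second call returns the single
block, and the third reports the end of the scan.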

* v28 adds tests that cover nbtree edge cases involving array keys:
one test relies critically on a call to nbtree's btposreset for
mark/restore, another covers the case where btposreset is critical to
correctly backing up a cursor across a batch boundary for a scan with
array keys.

At one point Andres asked about adding such test coverage. It was
tricky to do it in a way that didn't require lots of data, but I found
a way to do it that keeps the added test cycles at an acceptable
level.

> There's also a new isolation test (dirty_snapshot.spec) which
> illustrates the role of dirty snapshots in constraint enforcement.

* I removed this in v28 because it didn't seem to be pulling its
weight: too many added test cycles for no actual new test coverage.

> > * There have been recent CI failures on "linux-autoconf", that I
> > haven't debugged.
> >
> > 001_aio.pl seems to reliably report "Dubious, test returned 4 (wstat
> > 1024, 0x400)" on this CI target. I fully expect this v26 to fail when
> > CFTester runs it through CI.

I suspect that this is just a bug in one of the tests added to
001_aio.pl. I can recreate the failure locally with a
-DRELCACHE_FORCE_RELEASE build.

For now I have disabled that one test case, without removing it, just
so CI passes.

[1] https://postgr.es/m/[email protected]
[2] https://postgr.es/m/CAH2-Wzkt1WkKp4VRJu3qHfmKXc8W+XYv1RXg5d2d3fSvAeO=rg@mail.gmail.com
--
Peter Geoghegan


Attachments:

  [application/octet-stream] v28-0011-WIP-aio-bufmgr-Fix-race-condition-leading-to-dea.patch (3.1K, 2-v28-0011-WIP-aio-bufmgr-Fix-race-condition-leading-to-dea.patch)
From 1e1f37a73c82893da73574e14ec46d016414a402 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Sun, 22 Mar 2026 15:19:08 -0400
Subject: [PATCH v28 11/11] WIP: aio: bufmgr: Fix race condition leading to
 deadlocks with io_uring

If backend A is in the process of starting IO for a buffer, there is a short
period in which the buffer is marked as IO_IN_PROGRESS without having an
associated AIO wait reference. If a backend B does WaitIO() on that buffer,
it'll wait for the buffer's IO condition variable to be set. Most of the time
that is OK: when the IO on the buffer finishes, the CV will be signalled.
However, with io_uring, it is possible that the issuer (A) of the IO never
gets around to doing so, e.g. because it is waiting for something done by B.

To fix that, we need to signal the CV when staging IO. That's annoying as CV
broadcasts are not cheap. So we at least avoid it for the common case of IO
being executed synchronously.

I hope that eventually we can get away from needing multiple systems for
signalling IO completion, but we are clearly not there yet.
---
 src/backend/storage/buffer/bufmgr.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4b9615313..e8e8a2122 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -8321,7 +8321,7 @@ MarkDirtyAllUnpinnedBuffers(int32 *buffers_dirtied,
  * replaced while IO is ongoing.
  */
 static pg_attribute_always_inline void
-buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp)
+buffer_stage_common(PgAioHandle *ioh, uint8 cb_data, bool is_write, bool is_temp)
 {
 	uint64	   *io_data;
 	uint8		handle_data_len;
@@ -8420,7 +8420,23 @@ buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp)
 		 * keeps track.
 		 */
 		if (!is_temp)
+		{
 			ResourceOwnerForgetBufferIO(CurrentResourceOwner, buffer);
+
+			/*
+			 * A backend might have started waiting for the IO using the
+			 * buffer's condition variable, but once the IO is submitted, it
+			 * should wait via the AIO subsystem, as a waiter might need to
+			 * complete the IO.
+			 *
+			 * However, doing broadcasts is not free, so we like to avoid it
+			 * when not necessary. If the IO is being executed synchronously,
+			 * this backend will always end up signalling the IOCV without
+			 * further waiting, therefore avoid doing so here.
+			 */
+			if (!(cb_data & READ_BUFFERS_SYNCHRONOUSLY))
+				ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf_hdr));
+		}
 	}
 }
 
@@ -8916,7 +8932,7 @@ buffer_readv_report(PgAioResult result, const PgAioTargetData *td,
 static void
 shared_buffer_readv_stage(PgAioHandle *ioh, uint8 cb_data)
 {
-	buffer_stage_common(ioh, false, false);
+	buffer_stage_common(ioh, cb_data, false, false);
 }
 
 static PgAioResult
@@ -8967,7 +8983,7 @@ shared_buffer_readv_complete_local(PgAioHandle *ioh, PgAioResult prior_result,
 static void
 local_buffer_readv_stage(PgAioHandle *ioh, uint8 cb_data)
 {
-	buffer_stage_common(ioh, false, true);
+	buffer_stage_common(ioh, cb_data, false, true);
 }
 
 static PgAioResult
-- 
2.53.0



  [application/octet-stream] v28-0009-Allow-read_stream_reset-to-not-wait-for-IO-compl.patch (20.7K, 3-v28-0009-Allow-read_stream_reset-to-not-wait-for-IO-compl.patch)
From c45e3a6bd89f42074b540c4d15d7c4c306fdb973 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Sun, 5 Apr 2026 00:43:54 -0400
Subject: [PATCH v28 09/11] Allow read_stream_reset() to not wait for IO
 completion

Not waiting for IO during read_stream_reset() can be important for performance
in cases where read streams are frequently reset before the end is
reached. Current users do not commonly do that, but the upcoming work to use a
read stream to prefetch table blocks as part of index scans can do so
frequently in some query patterns. E.g. if there is an index scan on the inner
side of a nested loop antijoin.

This takes a bit of care to do right. Just introducing support for abandoning
an AIO handle could lead to the IO's completion not being processed until the
backend exits. That's bad because it would lead to resources held onto for the
IO (e.g. buffer pins) not being released and the handle showing up in pg_aios.

To avoid that, the existing resowner cleanup is changed to wait for the IO's
completion, which guarantees that by the end of the statement the IO has
completed. We might eventually want to relax that for some operations (e.g.,
for background WAL writes or opportunistic prefetching).

Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
---
 src/include/storage/aio.h                   |   2 +
 src/include/storage/bufmgr.h                |   1 +
 src/backend/storage/aio/aio.c               | 102 +++++++++++++++---
 src/backend/storage/aio/read_stream.c       |  31 ++++--
 src/backend/storage/buffer/bufmgr.c         |  34 ++++++
 src/test/modules/test_aio/t/001_aio.pl      | 113 ++++++++++++++++++++
 src/test/modules/test_aio/test_aio--1.0.sql |   2 +-
 src/test/modules/test_aio/test_aio.c        |  38 +++++--
 8 files changed, 291 insertions(+), 32 deletions(-)

diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index ec543b784..ab7fad130 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -328,6 +328,8 @@ extern int	pgaio_wref_get_id(PgAioWaitRef *iow);
 extern void pgaio_wref_wait(PgAioWaitRef *iow);
 extern bool pgaio_wref_check_done(PgAioWaitRef *iow);
 
+extern void pgaio_wref_abandon(PgAioWaitRef *iow);
+
 
 
 /* --------------------------------------------------------------------------------
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6837b35fc..c02bfd685 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -252,6 +252,7 @@ extern bool StartReadBuffers(ReadBuffersOperation *operation,
 							 int *nblocks,
 							 int flags);
 extern bool WaitReadBuffers(ReadBuffersOperation *operation);
+extern void AbandonReadBuffers(ReadBuffersOperation *operation);
 
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 8f7e26607..7009fb7f6 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -152,11 +152,15 @@ const IoMethodOps *pgaio_method_ops;
  * operation succeeded and details about the first failure, if any. The error
  * can be raised / logged with pgaio_result_report().
  *
- * The lifetime of the memory pointed to be *ret needs to be at least as long
- * as the passed in resowner. If the resowner releases resources before the IO
- * completes (typically due to an error), the reference to *ret will be
- * cleared. In case of resowner cleanup *ret will not be updated with the
- * results of the IO operation.
+ * The lifetime of the memory pointed to by *ret needs to be at least as long
+ * as the passed in resowner.
+ *
+ * If the resowner releases resources before the IO completes (typically due
+ * to an error), the reference to *ret will be cleared. In case of resowner
+ * cleanup *ret will not be updated with the results of the IO operation.
+ *
+ * If the caller loses interest in the IO before completion
+ * pgaio_wref_abandon() can be used.
  */
 PgAioHandle *
 pgaio_io_acquire(struct ResourceOwnerData *resowner, PgAioReturn *ret)
@@ -278,6 +282,14 @@ pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
 	ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
 	ioh->resowner = NULL;
 
+	/*
+	 * Need to unregister the reporting of the IO's result, the memory it's
+	 * referencing likely has gone away. Do so before potentially waiting
+	 * below, as that could cause the undesired writes.
+	 */
+	if (ioh->report_return)
+		ioh->report_return = NULL;
+
 	switch ((PgAioHandleState) ioh->state)
 	{
 		case PGAIO_HS_IDLE:
@@ -300,22 +312,32 @@ pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
 			if (!on_error)
 				elog(WARNING, "AIO handle was not submitted");
 			pgaio_submit_staged();
-			break;
+
+			/* now that the IO is submitted, need to wait */
+			pg_fallthrough;
 		case PGAIO_HS_SUBMITTED:
 		case PGAIO_HS_COMPLETED_IO:
 		case PGAIO_HS_COMPLETED_SHARED:
 		case PGAIO_HS_COMPLETED_LOCAL:
-			/* this is expected to happen */
+
+			/*
+			 * This is expected to happen, e.g. after an error or after
+			 * pgaio_wref_abandon() was called.
+			 *
+			 * For now always wait for the IO's completion during resowner
+			 * cleanup. This provides a bound on how long after an error or
+			 * pgaio_wref_abandon() an IO handle will show up as used and how
+			 * long an uncompleted IO can cause resources to be retained.
+			 *
+			 * It is quite possible that we eventually want to support IO
+			 * operations that last longer, e.g. for WAL writes in the
+			 * background. If so we will either need to use a longer lived
+			 * resowner or add a flag controlling when this cleanup happens.
+			 */
+			pgaio_io_wait(ioh, ioh->generation);
 			break;
 	}
 
-	/*
-	 * Need to unregister the reporting of the IO's result, the memory it's
-	 * referencing likely has gone away.
-	 */
-	if (ioh->report_return)
-		ioh->report_return = NULL;
-
 	RESUME_INTERRUPTS();
 }
 
@@ -1050,6 +1072,58 @@ pgaio_wref_check_done(PgAioWaitRef *iow)
 	return false;
 }
 
+/*
+ * Declare that a wait reference to an IO, started by this backend, is not of
+ * interest to this backend anymore. Once called, the PgAioReturn *ret passed
+ * to pgaio_io_acquire[_nb]() will not be updated anymore and thus can be
+ * freed.
+ */
+void
+pgaio_wref_abandon(PgAioWaitRef *iow)
+{
+	uint64		ref_generation;
+	bool		am_owner;
+	PgAioHandle *ioh;
+	PgAioHandleState state;
+
+	ioh = pgaio_io_from_wref(iow, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+	/*
+	 * It is safe to perform this check before checking if the IO was recycled
+	 * (and before we hold interrupts) as the owner of an IO cannot change.
+	 */
+	if (!am_owner)
+		elog(ERROR, "only IOs owned by current backend can be abandoned");
+
+	/*
+	 * To ensure that the IO won't be recycled while we check (e.g. during the
+	 * emission of a debug message).
+	 */
+	HOLD_INTERRUPTS();
+
+	if (!pgaio_io_was_recycled(ioh, ref_generation, &state))
+	{
+		pgaio_debug_io(DEBUG3, ioh,
+					   "discarding result %p, resowner: %p",
+					   ioh->report_return, ioh->resowner);
+
+		if (state < PGAIO_HS_SUBMITTED)
+			elog(ERROR, "abandoning IO in wrong state: %d", state);
+
+		/*
+		 * All we need to do to abandon the IO is to clear its report_return
+		 * field. Without that we could end up writing to freed/reused memory
+		 * when the IO completes.
+		 */
+		if (ioh->report_return)
+			ioh->report_return = NULL;
+	}
+
+	RESUME_INTERRUPTS();
+}
+
 
 
 /* --------------------------------------------------------------------------------
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index a318539e5..125d8babf 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -110,7 +110,8 @@ struct ReadStream
 	 * for IO combining even in cases where the I/O subsystem can keep up at a
 	 * low read-ahead distance, as doing larger IOs is more efficient.
 	 *
-	 * Set to 0 when the end of the stream is reached.
+	 * Set to 0 when the end of the stream is reached and to -1 when the
+	 * stream is reset.
 	 */
 	int16		combine_distance;
 	int16		readahead_distance;
@@ -553,8 +554,8 @@ read_stream_start_pending_read(ReadStream *stream)
 static inline bool
 read_stream_should_look_ahead(ReadStream *stream)
 {
-	/* If the callback has signaled end-of-stream, we're done */
-	if (stream->readahead_distance == 0)
+	/* If we reached end-of-stream or a reset, we're done */
+	if (stream->readahead_distance <= 0)
 		return false;
 
 	/* never start more IOs than our cap */
@@ -632,7 +633,7 @@ read_stream_should_issue_now(ReadStream *stream)
 	 * If the callback has signaled end-of-stream, start the pending read
 	 * immediately. There is no further potential for IO combining.
 	 */
-	if (stream->readahead_distance == 0)
+	if (stream->readahead_distance <= 0)
 		return true;
 
 	/*
@@ -740,7 +741,7 @@ read_stream_look_ahead(ReadStream *stream)
 	 * stream.  In the worst case we can always make progress one buffer at a
 	 * time.
 	 */
-	Assert(stream->pinned_buffers > 0 || stream->readahead_distance == 0);
+	Assert(stream->pinned_buffers > 0 || stream->readahead_distance <= 0);
 
 	if (stream->batch_mode)
 		pgaio_exit_batchmode();
@@ -1146,8 +1147,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	{
 		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
 
-		/* End of stream reached?  */
-		if (stream->readahead_distance == 0)
+		/* End of stream / reset reached?  */
+		if (stream->readahead_distance <= 0)
 			return InvalidBuffer;
 
 		/*
@@ -1188,7 +1189,17 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		Assert(stream->ios[io_index].op.buffers ==
 			   &stream->buffers[oldest_buffer_index]);
 
-		needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
+		/*
+		 * If the stream has been reset, don't even wait for the IO, just
+		 * abandon it.
+		 */
+		if (stream->readahead_distance < 0)
+		{
+			AbandonReadBuffers(&stream->ios[io_index].op);
+			needed_wait = false;
+		}
+		else
+			needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
 
 		Assert(stream->ios_in_progress > 0);
 		stream->ios_in_progress--;
@@ -1420,8 +1431,8 @@ read_stream_reset(ReadStream *stream)
 	Buffer		buffer;
 
 	/* Stop looking ahead. */
-	stream->readahead_distance = 0;
-	stream->combine_distance = 0;
+	stream->readahead_distance = -1;
+	stream->combine_distance = -1;
 
 	/* Forget buffered block number and fast path state. */
 	stream->buffered_blocknum = InvalidBlockNumber;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d6c0cc1f6..4b9615313 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1613,6 +1613,10 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
  * buffers must remain valid until WaitReadBuffers() is called, and any
  * forwarded buffers must also be preserved for a continuing call unless
  * they are explicitly released.
+ *
+ * If true was returned, the memory underlying the ReadBuffersOperation needs
+ * to stay around until either WaitReadBuffers() or AbandonReadBuffers() is
+ * called (or an error is thrown).
  */
 bool
 StartReadBuffers(ReadBuffersOperation *operation,
@@ -2174,6 +2178,36 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 	return true;
 }
 
+/*
+ * Declare that this backend is not interested in the operation anymore. This
+ * needs to be called if StartReadBuffers() returned true and the
+ * ReadBuffersOperation is to be freed without calling WaitReadBuffers()
+ * (leaving errors aside).
+ *
+ * It is the caller's responsibility to release buffer pins (seems simpler
+ * that way, as that already is required if no IO had been necessary).
+ */
+void
+AbandonReadBuffers(ReadBuffersOperation *operation)
+{
+	PgAioWaitRef io_wref = operation->io_wref;
+
+	/* see equivalent WaitReadBuffers() check */
+	if (!pgaio_wref_valid(&io_wref) && io_method != IOMETHOD_SYNC)
+		elog(ERROR, "abandoning read operation that didn't read");
+
+	if (!pgaio_wref_valid(&io_wref))
+		return;
+
+	pgaio_wref_clear(&operation->io_wref);
+
+	/* can't abandon foreign IOs (nor do we need to) */
+	if (operation->foreign_io)
+		operation->foreign_io = false;
+	else
+		pgaio_wref_abandon(&io_wref);
+}
+
 /*
  * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
  *		buffer.  If no buffer exists already, selects a replacement victim and
diff --git a/src/test/modules/test_aio/t/001_aio.pl b/src/test/modules/test_aio/t/001_aio.pl
index 63cadd64c..9b364bbc3 100644
--- a/src/test/modules/test_aio/t/001_aio.pl
+++ b/src/test/modules/test_aio/t/001_aio.pl
@@ -15,6 +15,10 @@ use TestAio;
 my @methods = TestAio::supported_io_methods();
 my %nodes;
 
+# Putting it inline makes perltidy do ugly things
+my $count_my_aios_query =
+  'SELECT count(*) FROM pg_aios WHERE pid = pg_backend_pid()';
+
 
 ###
 # Create and configure one instance for each io_method
@@ -1616,6 +1620,61 @@ INSERT INTO tmp_ok SELECT generate_series(1, 5000);
 			qq|SELECT blockoff, blocknum, io_reqd and not foreign_io, nblocks FROM read_buffers('$table', 1, 3)|,
 			qr/^0\|1\|t\|2\n2\|3\|f\|1$/,
 			qr/^$/);
+
+
+		###
+		# Test that abandoning IO works and that it does not cause issues.
+		###
+
+		# Test that even after abandoning IO we do wait for the IOs at the end
+		# of the statement.
+		$psql_a->query_safe(qq|SET io_combine_limit=2|);
+		$psql_a->query_safe(qq|SELECT evict_rel('$table')|);
+		psql_like(
+			$io_method,
+			$psql_a,
+			"$persistency: read buffers & abandon it",
+			qq|SELECT abandoned, count(*) FROM read_buffers('$table', 0, 6, abandon_after => 2) GROUP BY 1 ORDER BY 1|,
+			qr/^f\|1\nt\|2$/,
+			qr/^$/);
+		# Due to the end-of-statement wait there should be no IOs anymore
+		is($psql_a->query_safe($count_my_aios_query), 0,
+			"$io_method: $persistency: abandoned IO completed by end of statement"
+		);
+
+		# Test that after abandoning IO buffer access still works.
+		#
+		# First test that by just issuing another read_buffers() in the same
+		# statement (so that the abandoned IOs aren't waited-for during
+		# end-of-statement resowner handling).
+		$psql_a->query_safe(qq|SELECT evict_rel('$table')|);
+		psql_like(
+			$io_method,
+			$psql_a,
+			"$persistency: read buffers & abandon it & read again",
+			qq|
+			   SELECT SUM(nblocks) FROM read_buffers('$table', 0, 4, abandon_after => 2)
+			   UNION ALL
+			   SELECT SUM(nblocks) FROM read_buffers('$table', 0, 4)
+			   |,
+			qr/^4\n4$/,
+			qr/^$/);
+
+		# Now test that a plain SELECT needing buffers affected by abandoned
+		# IO work
+		$psql_a->query_safe(qq|SELECT evict_rel('$table')|);
+		psql_like(
+			$io_method,
+			$psql_a,
+			"$persistency: read buffers & abandon it & read again via SELECT",
+			qq|
+			   SELECT SUM(nblocks) FROM read_buffers('$table', 0, 4, abandon_after => 2)
+			   UNION ALL
+			   SELECT SUM(data) FROM $table WHERE data < 1000
+			   |,
+			qr/^4\n499500$/,
+			qr/^$/);
+		$psql_a->query_safe(qq|RESET io_combine_limit|);
 	}
 
 	# The remaining tests don't make sense for temp tables, as they are
@@ -1837,6 +1896,60 @@ read_buffers('$table', 0, 4)|,
 	$psql_a->{stdout} = '';
 
 
+	###
+	# Test that abandoning IO actually avoids waiting for IO.
+	###
+
+	# Testing not waiting only works if the IO method doesn't execute IO
+	# synchronously, which is why we fundamentally can't test with
+	# io_method=sync. Furthermore we need to work around io_method=io_uring
+	# potentially executing the IO synchronously - we can do so by making the
+	# IOs big enough (c.f. pgaio_uring_should_use_async()).
+	#
+	# To make the test reliable we have to abandon all IOs, as waiting for
+	# some IOs could lead to also consuming the completion of the IO that will
+	# trigger a wait in the completion.
+	#
+	# XXX Temporarily disabled by pgeoghegan (remove the "0 &&" below when
+	# this is properly debugged).
+	#
+	# On -DRELCACHE_FORCE_RELEASE builds, this subtest shows
+	# "WARNING:  leaked AIO handle".  Needs to be debugged.
+	if (0 && $io_method ne 'sync')
+	{
+		$psql_a->query_safe(qq|SELECT evict_rel('$table')|);
+		$psql_a->query_safe(qq|SET io_combine_limit=5|);
+		$psql_b->query_safe(
+			qq/SELECT inj_io_completion_wait(
+		   relfilenode=>pg_relation_filenode('$table'),
+		   blockno=>4);/);
+		ok(1,
+			"$io_method: $persistency: configure wait in completion of block 4"
+		);
+
+		# Need to end the wait in the completion before the statement is over,
+		# otherwise we'll wait during resowner cleanup at the end of the
+		# statement.
+		psql_like(
+			$io_method,
+			$psql_a,
+			"$persistency: read buffers abandoning blocked IO avoids wait",
+			qq|
+			   SELECT count(*) > 0 FROM read_buffers('$table', 0, 10, abandon_after => 0) WHERE abandoned
+			   UNION ALL
+			   SELECT count(*) > 0 FROM pg_aios WHERE pid = pg_backend_pid()
+			   UNION ALL
+			   SELECT inj_io_completion_continue() IS NOT NULL
+			|,
+			qr/^t\nt\nt$/,
+			qr/^$/);
+		is($psql_a->query_safe($count_my_aios_query), 0,
+			"$io_method: $persistency: abandoned IO is completed at end of statement"
+		);
+
+		$psql_a->query_safe(qq|RESET io_combine_limit|);
+	}
+
 	$psql_a->quit();
 	$psql_b->quit();
 	$psql_c->quit();
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
index 762ac2951..ff3b200cf 100644
--- a/src/test/modules/test_aio/test_aio--1.0.sql
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -53,7 +53,7 @@ CREATE FUNCTION buffer_call_terminate_io(buffer int, for_input bool, succeed boo
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
-CREATE FUNCTION read_buffers(rel regclass, startblock int4, nblocks int4, OUT blockoff int4, OUT blocknum int4, OUT io_reqd bool, OUT foreign_io bool, OUT nblocks int4, OUT buf int4[])
+CREATE FUNCTION read_buffers(rel regclass, startblock int4, nblocks int4, abandon_after int4 DEFAULT -1, OUT blockoff int4, OUT blocknum int4, OUT io_reqd bool, OUT foreign_io bool, OUT abandoned bool, OUT nblocks int4, OUT buf int4[])
 RETURNS SETOF record STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
 
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
index 35efba1a5..64852c4e4 100644
--- a/src/test/modules/test_aio/test_aio.c
+++ b/src/test/modules/test_aio/test_aio.c
@@ -698,11 +698,13 @@ read_buffers(PG_FUNCTION_ARGS)
 	Oid			relid = PG_GETARG_OID(0);
 	BlockNumber startblock = PG_GETARG_UINT32(1);
 	int32		nblocks = PG_GETARG_INT32(2);
+	int32		abandon_after = PG_GETARG_INT32(3);
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	Relation	rel;
 	SMgrRelation smgr;
 	int			nblocks_done = 0;
 	int			nblocks_disp = 0;
+	int			nblocks_wait = 0;
 	int			nios = 0;
 	ReadBuffersOperation *operations;
 	Buffer	   *buffers;
@@ -724,6 +726,9 @@ read_buffers(PG_FUNCTION_ARGS)
 	rel = relation_open(relid, AccessShareLock);
 	smgr = RelationGetSmgr(rel);
 
+	if (abandon_after < 0)
+		abandon_after = nblocks;
+
 	/*
 	 * Do StartReadBuffers() until IO for all the required blocks has been
 	 * started (if required).
@@ -758,9 +763,17 @@ read_buffers(PG_FUNCTION_ARGS)
 	for (int nio = 0; nio < nios; nio++)
 	{
 		ReadBuffersOperation *operation = &operations[nio];
+		int			nblocks_this_io = nblocks_per_io[nio];
 
 		if (io_reqds[nio])
-			WaitReadBuffers(operation);
+		{
+			if (nblocks_wait < abandon_after)
+				WaitReadBuffers(operation);
+			else
+				AbandonReadBuffers(operation);
+		}
+
+		nblocks_wait += nblocks_this_io;
 	}
 
 	/*
@@ -770,8 +783,8 @@ read_buffers(PG_FUNCTION_ARGS)
 	{
 		ReadBuffersOperation *operation = &operations[nio];
 		int			nblocks_this_io = nblocks_per_io[nio];
-		Datum		values[6] = {0};
-		bool		nulls[6] = {0};
+		Datum		values[7] = {0};
+		bool		nulls[7] = {0};
 		ArrayType  *buffers_arr;
 
 		/* convert buffer array to datum array */
@@ -804,14 +817,18 @@ read_buffers(PG_FUNCTION_ARGS)
 		values[3] = BoolGetDatum(io_reqds[nio] ? operation->foreign_io : false);
 		nulls[3] = false;
 
-		/* nblocks */
-		values[4] = Int32GetDatum(nblocks_this_io);
+		/* abandoned */
+		values[4] = BoolGetDatum(nblocks_disp >= abandon_after);
 		nulls[4] = false;
 
-		/* array of buffers */
-		values[5] = PointerGetDatum(buffers_arr);
+		/* nblocks */
+		values[5] = Int32GetDatum(nblocks_this_io);
 		nulls[5] = false;
 
+		/* array of buffers */
+		values[6] = PointerGetDatum(buffers_arr);
+		nulls[6] = false;
+
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
 		nblocks_disp += nblocks_this_io;
@@ -1064,6 +1081,13 @@ inj_io_completion_wait_matches(PgAioHandle *ioh)
 		!(inj_blockno >= io_blockno && inj_blockno < (io_blockno + td->smgr.nblocks)))
 		return false;
 
+	ereport(LOG,
+			errmsg("wait injection point matches for IO %d, inj blockno %d, io blockno %d, io nblocks %d",
+				   pgaio_io_get_id(ioh),
+				   inj_blockno, io_blockno, td->smgr.nblocks
+				   ),
+			errhidestmt(true), errhidecontext(true));
+
 	return true;
 }
 
-- 
2.53.0
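
For readers following the read_buffers() change in test_aio.c above, the wait-or-abandon decision can be modelled without any PostgreSQL headers. This is only a sketch: SketchIO, SketchAction and sketch_decide() are hypothetical names invented for illustration, and the real WaitReadBuffers()/AbandonReadBuffers() calls are reduced to enum values. It assumes, as in the hunk, that a negative abandon_after means "never abandon" and that the running block count advances for every IO, whether or not IO was required.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one combined read; not a PostgreSQL type. */
typedef struct SketchIO
{
	int			nblocks;	/* blocks covered by this IO */
	bool		io_reqd;	/* false if the blocks were already cached */
} SketchIO;

typedef enum SketchAction
{
	SKETCH_NONE,			/* no IO was started, nothing to wait for */
	SKETCH_WAIT,			/* models WaitReadBuffers() */
	SKETCH_ABANDON			/* models AbandonReadBuffers() */
} SketchAction;

/*
 * Fill actions[] with the decision for each IO and return the number of
 * IOs that would be abandoned.  nblocks is the total number of blocks
 * requested, used as the "never abandon" threshold.
 */
static int
sketch_decide(const SketchIO *ios, int nios, int nblocks,
			  int abandon_after, SketchAction *actions)
{
	int			nblocks_wait = 0;
	int			nabandoned = 0;

	assert(nios >= 0);

	/* A negative value disables abandoning, like DEFAULT -1 in the SQL. */
	if (abandon_after < 0)
		abandon_after = nblocks;

	for (int i = 0; i < nios; i++)
	{
		actions[i] = SKETCH_NONE;

		if (ios[i].io_reqd)
		{
			if (nblocks_wait < abandon_after)
				actions[i] = SKETCH_WAIT;
			else
			{
				actions[i] = SKETCH_ABANDON;
				nabandoned++;
			}
		}

		/* Advance by this IO's size even if no IO was required. */
		nblocks_wait += ios[i].nblocks;
	}

	return nabandoned;
}
```

With two 2-block IOs and abandon_after = 2, the first IO is waited for and the second is abandoned; with abandon_after = 0 every required IO is abandoned, which is the configuration the disabled subtest relies on.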



  [application/octet-stream] v28-0010-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch (6.3K, 4-v28-0010-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch)
From 3f93123f6c43c9c1d5b2f7485827e9b5ab61d39d Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Sun, 22 Mar 2026 15:12:41 -0400
Subject: [PATCH v28 10/11] aio: Fix pgaio_io_wait() for staged IOs (B).

Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
PGAIO_HS_STAGED fell through to waiting for completion.  The owner only
promises to advance it to PGAIO_HS_SUBMITTED.  The waiter needs to be
prepared to call ->wait_one() itself once the IO is submitted in order
to guarantee progress and avoid deadlocks on IO methods that provide
->wait_one().

Introduce a new per-backend condition variable submit_cv, woken by
pgaio_submit_staged(), and use it to wait for the state to advance.  The
new broadcast doesn't seem to cause any measurable slowdown, so ideas
for optimizing the common no-waiters case were abandoned for now.

It may not be possible to reach any real deadlock with existing AIO
users, but that situation could change.  There's also no reason the
waiter shouldn't begin to wait via the IO method as soon as possible
even without a deadlock.

Picked up by testing a proposed IO method that has ->wait_one(), like
io_method=io_uring, and code review.

Backpatch-through: 18
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2BmZYrSdnhk-XrBYO18H829K77S9gMKUsykOiTJtqB43g%40mail.gmail.com
---
 src/include/storage/aio_internal.h            |  7 +++
 src/backend/storage/aio/aio.c                 | 50 ++++++++++++++++---
 src/backend/storage/aio/aio_init.c            |  1 +
 .../utils/activity/wait_event_names.txt       |  1 +
 4 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 9ca4087aa..c3afde4d5 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -216,6 +216,13 @@ typedef struct PgAioBackend
 	uint16		num_staged_ios;
 	PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
 
+	/*
+	 * Other backends sometimes need to wait for the owning backend to submit.
+	 * The per-IO CV would work for that purpose, but a per-backend CV allows
+	 * for just one broadcast per submitted batch.
+	 */
+	ConditionVariable submit_cv;
+
 	/*
 	 * List of in-flight IOs. Also contains IOs that aren't strictly speaking
 	 * in-flight anymore, but have been waited-for and completed by another
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 7009fb7f6..ce2f73bff 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -593,6 +593,16 @@ pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState
 	return ioh->generation != ref_generation;
 }
 
+/*
+ * Whether we need to wait via the IO method. Don't check via the IO method if
+ * the issuing backend is executing the IO synchronously.
+ */
+static bool
+pgaio_io_needs_wait_one(PgAioHandle *ioh)
+{
+	return pgaio_method_ops->wait_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS);
+}
+
 /*
  * Wait for IO to complete. External code should never use this, outside of
  * the AIO subsystem waits are only allowed via pgaio_wref_wait().
@@ -632,23 +642,38 @@ pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
 				elog(ERROR, "IO in wrong state: %d", state);
 				break;
 
-			case PGAIO_HS_SUBMITTED:
+			case PGAIO_HS_DEFINED:
+			case PGAIO_HS_STAGED:
 
 				/*
-				 * If we need to wait via the IO method, do so now. Don't
-				 * check via the IO method if the issuing backend is executing
-				 * the IO synchronously.
+				 * The owner hasn't submitted the IO yet. If we need to wait
+				 * via the IO method, wait for submission, giving this backend
+				 * the chance to call ->wait_one().
 				 */
-				if (pgaio_method_ops->wait_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS))
+				if (pgaio_io_needs_wait_one(ioh))
+				{
+					PgAioBackend *backend = &pgaio_ctl->backend_state[ioh->owner_procno];
+
+					ConditionVariablePrepareToSleep(&backend->submit_cv);
+					while (!pgaio_io_was_recycled(ioh, ref_generation, &state) &&
+						   (state == PGAIO_HS_DEFINED ||
+							state == PGAIO_HS_STAGED))
+						ConditionVariableSleep(&backend->submit_cv, WAIT_EVENT_AIO_IO_SUBMIT);
+					ConditionVariableCancelSleep();
+					continue;
+				}
+				pg_fallthrough;
+
+			case PGAIO_HS_SUBMITTED:
+
+				/* If we need to wait via the IO method, do so now. */
+				if (pgaio_io_needs_wait_one(ioh))
 				{
 					pgaio_method_ops->wait_one(ioh, ref_generation);
 					continue;
 				}
 				pg_fallthrough;
 
-				/* waiting for owner to submit */
-			case PGAIO_HS_DEFINED:
-			case PGAIO_HS_STAGED:
 				/* waiting for reaper to complete */
 				/* fallthrough */
 			case PGAIO_HS_COMPLETED_IO:
@@ -1226,6 +1251,15 @@ pgaio_submit_staged(void)
 
 	pgaio_my_backend->num_staged_ios = 0;
 
+	/*
+	 * Wake any backend that started waiting for any of these IOs before
+	 * submission, if it is necessary to call ->wait_one() to guarantee
+	 * progress with the configured IO method.  On its side, pgaio_io_wait()
+	 * only waits for submit_cv on IO methods needing that.
+	 */
+	if (pgaio_method_ops->wait_one)
+		ConditionVariableBroadcast(&pgaio_my_backend->submit_cv);
+
 	pgaio_debug(DEBUG4,
 				"aio: submitted %d IOs",
 				total_submitted);
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index da30d792a..6fc00a917 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -199,6 +199,7 @@ AioShmemInit(void *arg)
 
 		dclist_init(&bs->idle_ios);
 		memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+		ConditionVariableInit(&bs->submit_cv);
 		dclist_init(&bs->in_flight_ios);
 
 		/* initialize per-backend IOs */
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 560659f95..7c3326348 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -200,6 +200,7 @@ ABI_compatibility:
 Section: ClassName - WaitEventIO
 
 AIO_IO_COMPLETION	"Waiting for another process to complete IO."
+AIO_IO_SUBMIT	"Waiting for another process to submit IO."
 AIO_IO_URING_SUBMIT	"Waiting for IO submission via io_uring."
 AIO_IO_URING_EXECUTION	"Waiting for IO execution via io_uring."
 BASEBACKUP_READ	"Waiting for base backup to read from a file."
-- 
2.53.0
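
The reworked switch in pgaio_io_wait() above boils down to choosing one of three ways to wait, depending on the handle state and on whether the IO method provides ->wait_one(). The following is a rough, self-contained model of that choice, not the patch's code: the Sk* enums and sketch_* functions are made-up names, the handle states are cut down to the four that matter here, and the "per-IO CV" outcome assumes that the pre-existing completion wait is what the fallthrough cases end up in.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced, hypothetical version of the handle states discussed above. */
typedef enum SkState
{
	SK_DEFINED,
	SK_STAGED,
	SK_SUBMITTED,
	SK_COMPLETED_IO
} SkState;

typedef enum SkWait
{
	SK_WAIT_SUBMIT_CV,		/* sleep on the owner's per-backend submit_cv */
	SK_WAIT_ONE,			/* call the IO method's ->wait_one() */
	SK_WAIT_IO_CV			/* sleep on the per-IO CV for completion */
} SkWait;

/* Mirrors the condition in pgaio_io_needs_wait_one(). */
static bool
sketch_needs_wait_one(bool method_has_wait_one, bool io_is_synchronous)
{
	return method_has_wait_one && !io_is_synchronous;
}

/* Pick how a waiter should block for an IO in the given state. */
static SkWait
sketch_choose_wait(SkState state, bool method_has_wait_one,
				   bool io_is_synchronous)
{
	bool		needs = sketch_needs_wait_one(method_has_wait_one,
											  io_is_synchronous);

	switch (state)
	{
		case SK_DEFINED:
		case SK_STAGED:
			/* Not submitted yet: wait for submission, then re-check. */
			if (needs)
				return SK_WAIT_SUBMIT_CV;
			break;
		case SK_SUBMITTED:
			/* Submitted: the waiter itself can drive completion. */
			if (needs)
				return SK_WAIT_ONE;
			break;
		case SK_COMPLETED_IO:
			break;
	}

	/* Everything else waits for the owner or reaper to finish. */
	return SK_WAIT_IO_CV;
}
```

In the real function both the submit_cv and ->wait_one() branches end with continue, so the state is re-read after each wakeup. A staged IO therefore moves from the first outcome to the second once the owner submits it.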



  [application/octet-stream] v28-0001-Add-slot-based-table-AM-index-scan-interface.patch (121.4K, 5-v28-0001-Add-slot-based-table-AM-index-scan-interface.patch)
From 8c1ea649566137ec805bf79f1bf251c1b3f210e6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sun, 22 Mar 2026 02:36:57 -0400
Subject: [PATCH v28 01/11] Add slot-based table AM index scan interface.

Add table_index_getnext_slot, a new table AM callback that wraps both
plain and index-only index scans that use amgettuple.  Two new heapam
callbacks are introduced -- one for plain scans and one for index-only
scans -- which an upcoming commit that adds the amgetbatch interface
will expand to four.  The appropriate callback is resolved once in
index_scan_begin, and called through a function pointer on the
IndexScanDesc (xs_getnext_slot) when the table_index_getnext_slot shim
function is called from executor nodes.  That way table AMs can create
specialized variants to help the compiler produce more efficient code
(but they should always be able to provide exactly one generic callback,
if that makes sense).

This moves VM checks for index-only scans out of the executor and into
heapam, enabling batching of visibility map lookups (though for now we
continue to just perform retail lookups).  Using the new higher level
slot-based interface greatly simplifies nodeIndexonlyscan.c, which no
longer has to deal with the visibility map directly.  More importantly,
this is a significant architectural improvement: table AMs can now
implement index-only scans that are not tied to heapam's visibility map.

A small minority of callers (2 callers in total) fundamentally need to
pass a TID to the table AM (both perform constraint enforcement).  These
callers don't actually perform index scans (even if their TIDs are taken
from an index), and have no need for most of the index scan machinery.
Switch these callers over to the new fetch_tid interface (which replaces
the previous TID-based index_fetch_tuple interface).  All index scan
callers now use the new slot-based interface (table_index_getnext_slot).

The VISITED_PAGES_LIMIT mechanism used by get_actual_variable_range to
cap scan overhead during planning is reworked to go through a new scan
descriptor interface (xs_visited_pages_limit), rather than tracking the
costs directly and terminating the scan itself, in an ad-hoc way.  This
is necessary because callers that use the new slot-based interface no
longer have direct access to which heap blocks were fetched.  Similarly,
nodeIndexonlyscan.c can no longer use InstrCountTuples2 to count heap
fetches during an EXPLAIN ANALYZE.  EXPLAIN ANALYZE now obtains this
information from a new IndexScanInstrumentation field, which table AMs
are required to maintain.

Though independently useful, this commit is preparatory work for an
upcoming commit that will add an amgetbatch index AM interface, where
the table AM takes full responsibility for managing the progress of
index scans.  That will move most of the implementation of scrollable
cursors out of index AMs and into table AMs, making it essential that
executor nodes pass the current scan direction down to the table AM.

The heapam table_index_getnext_slot callbacks make aggressive use of
forced inlining to ensure that plain and index-only code paths are fully
specialized at compile time despite sharing a common implementation.
Testing has shown this is necessary to keep icache misses to a minimum,
at least with the two upcoming amgetbatch variants.

Author: Peter Geoghegan <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com
---
 src/include/access/genam.h                    |   5 +-
 src/include/access/heapam.h                   |  32 +-
 src/include/access/relscan.h                  |  67 ++-
 src/include/access/tableam.h                  | 220 +++++----
 src/include/executor/instrument_node.h        |   5 +-
 src/include/nodes/execnodes.h                 |   8 -
 src/backend/access/heap/heapam_handler.c      |  17 +-
 src/backend/access/heap/heapam_indexscan.c    | 463 ++++++++++++++++--
 src/backend/access/heap/visibilitymap.c       |  27 +-
 src/backend/access/index/genam.c              |  30 +-
 src/backend/access/index/indexam.c            | 256 ++++------
 src/backend/access/nbtree/nbtinsert.c         |  10 +-
 src/backend/access/table/tableam.c            |  26 +-
 src/backend/access/table/tableamapi.c         |   8 +-
 src/backend/commands/constraint.c             |  28 +-
 src/backend/commands/explain.c                |  23 +-
 src/backend/commands/repack.c                 |   7 +-
 src/backend/executor/execIndexing.c           |   8 +-
 src/backend/executor/execReplication.c        |  14 +-
 src/backend/executor/nodeBitmapIndexscan.c    |   1 +
 src/backend/executor/nodeIndexonlyscan.c      | 241 +--------
 src/backend/executor/nodeIndexscan.c          |  16 +-
 src/backend/utils/adt/ri_triggers.c           |  16 +-
 src/backend/utils/adt/selfuncs.c              |  97 +---
 src/test/modules/index/Makefile               |  10 +
 src/test/modules/index/expected/hot_chain.out |  56 +++
 src/test/modules/index/meson.build            |  26 +
 src/test/modules/index/sql/hot_chain.sql      |  37 ++
 .../modules/index/test_indexscan--1.0.sql     |  13 +
 src/test/modules/index/test_indexscan.c       | 146 ++++++
 src/test/modules/index/test_indexscan.control |   4 +
 src/test/regress/expected/stats.out           |  62 +++
 src/test/regress/sql/stats.sql                |  29 ++
 src/tools/pgindent/typedefs.list              |   3 +-
 34 files changed, 1287 insertions(+), 724 deletions(-)
 create mode 100644 src/test/modules/index/expected/hot_chain.out
 create mode 100644 src/test/modules/index/sql/hot_chain.sql
 create mode 100644 src/test/modules/index/test_indexscan--1.0.sql
 create mode 100644 src/test/modules/index/test_indexscan.c
 create mode 100644 src/test/modules/index/test_indexscan.control
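
The "resolve the callback once, then call through a function pointer" arrangement described in the commit message can be illustrated with a toy model. Everything here is hypothetical (SketchScan, sketch_scan_begin(), sketch_getnext() and the integer "tuples" are invented for the example); it only shows the shape of the idea: one shared implementation, two specialized variants that pass a constant flag, and a shim that dispatches through the pointer stored on the scan descriptor. Plain `static inline` stands in for the forced inlining the patch relies on.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct SketchScan SketchScan;
typedef bool (*sketch_getnext_fn) (SketchScan *scan, int *out);

/* Toy scan descriptor: parallel arrays play the index and the table. */
struct SketchScan
{
	const int  *index_vals;		/* data an index-only scan returns */
	const int  *table_vals;		/* data a plain scan fetches */
	int			nitems;
	int			pos;
	sketch_getnext_fn getnext;	/* resolved once at scan start */
};

/* Shared implementation; index_only is constant in each caller. */
static inline bool
sketch_getnext_impl(SketchScan *scan, int *out, bool index_only)
{
	if (scan->pos >= scan->nitems)
		return false;

	*out = index_only ? scan->index_vals[scan->pos]
		: scan->table_vals[scan->pos];
	scan->pos++;
	return true;
}

/* Specialized variants, one per scan type. */
static bool
sketch_getnext_plain(SketchScan *scan, int *out)
{
	return sketch_getnext_impl(scan, out, false);
}

static bool
sketch_getnext_index_only(SketchScan *scan, int *out)
{
	return sketch_getnext_impl(scan, out, true);
}

/* Resolve the variant once, when the scan begins. */
static void
sketch_scan_begin(SketchScan *scan, bool index_only)
{
	scan->pos = 0;
	scan->getnext = index_only ? sketch_getnext_index_only
		: sketch_getnext_plain;
}

/* Shim that executor-like callers would use. */
static inline bool
sketch_getnext(SketchScan *scan, int *out)
{
	return scan->getnext(scan, out);
}
```

Because the flag is a compile-time constant inside each variant, a compiler that inlines the shared function can drop the branch for the other scan type, which is the effect the commit message is after.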

diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 68bfe405d..05eec9204 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -156,6 +156,7 @@ extern void index_insert_cleanup(Relation indexRelation,
 
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
+									 bool index_only_scan,
 									 Snapshot snapshot,
 									 IndexScanInstrumentation *instrument,
 									 int nkeys, int norderbys,
@@ -178,15 +179,13 @@ extern void index_parallelscan_initialize(Relation heapRelation,
 extern void index_parallelrescan(IndexScanDesc scan);
 extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel,
+											  bool index_only_scan,
 											  IndexScanInstrumentation *instrument,
 											  int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan,
 											  uint32 flags);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
-extern bool index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot);
-extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-							   TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 5176478c2..4fea51761 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -115,12 +115,10 @@ typedef struct BitmapHeapScanDescData
 typedef struct BitmapHeapScanDescData *BitmapHeapScanDesc;
 
 /*
- * Descriptor for fetches from heap via an index.
+ * heapam-specific IndexScanDescData opaque state
  */
-typedef struct IndexFetchHeapData
+typedef struct IndexScanHeapData
 {
-	IndexFetchTableData xs_base;	/* AM independent part of the descriptor */
-
 	/*
 	 * Current heap buffer in scan (and its block number), if any.  NB: if
 	 * xs_blk is not InvalidBlockNumber, we hold a pin in xs_cbuf.
@@ -128,9 +126,16 @@ typedef struct IndexFetchHeapData
 	Buffer		xs_cbuf;
 	BlockNumber xs_blk;
 
-	/* Current heap block's corresponding page in the visibility map */
-	Buffer		xs_vmbuffer;
-} IndexFetchHeapData;
+	/* For visibility map checks (index-only scans and on-access pruning) */
+	Buffer		xs_vmbuffer;	/* visibility map buffer */
+
+	bool		xs_readonly;	/* scan is read-only? */
+
+	uint16		xs_blkswitch_count; /* number of heap blocks fetched */
+
+	/* Per-tuple context for padding "name" columns during index-only scans */
+	MemoryContext xs_itup_cxt;
+} IndexScanHeapData;
 
 /* Result codes for HeapTupleSatisfiesVacuum */
 typedef enum
@@ -430,16 +435,15 @@ extern TransactionId heap_index_delete_tuples(Relation rel,
 											  TM_IndexDeleteOp *delstate);
 
 /* in heap/heapam_indexscan.c */
-extern IndexFetchTableData *heapam_index_fetch_begin(Relation rel, uint32 flags);
-extern void heapam_index_fetch_reset(IndexFetchTableData *scan);
-extern void heapam_index_fetch_end(IndexFetchTableData *scan);
+extern bool heapam_fetch_tid(Relation rel, ItemPointer tid, Snapshot snapshot,
+							 TupleTableSlot *slot, bool *all_dead);
+extern void heapam_index_scan_begin(IndexScanDesc scan, uint32 flags);
+extern void heapam_index_scan_reset(IndexScanDesc scan);
+extern void heapam_index_scan_end(IndexScanDesc scan);
 extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
 								   Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
 								   bool *all_dead, bool first_call);
-extern bool heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
-									 ItemPointer tid, Snapshot snapshot,
-									 TupleTableSlot *slot, bool *heap_continue,
-									 bool *all_dead);
+extern void heap_fill_ios_slot(IndexScanDesc scan, TupleTableSlot *slot);
 
 /* in heap/pruneheap.c */
 extern void heap_page_prune_opt(Relation relation, Buffer buffer,
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 2ea06a67a..e2e2150da 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "access/sdir.h"
 #include "nodes/tidbitmap.h"
 #include "port/atomics.h"
 #include "storage/relfilelocator.h"
@@ -25,6 +26,7 @@
 
 struct ParallelTableScanDescData;
 struct TableScanInstrumentation;
+struct TupleTableSlot;
 
 /*
  * Generic descriptor for table scans. This is the base-class for table scans,
@@ -120,22 +122,6 @@ typedef struct ParallelBlockTableScanWorkerData
 } ParallelBlockTableScanWorkerData;
 typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 
-/*
- * Base class for fetches from a table via an index. This is the base-class
- * for such scans, which needs to be embedded in the respective struct for
- * individual AMs.
- */
-typedef struct IndexFetchTableData
-{
-	Relation	rel;
-
-	/*
-	 * Bitmask of ScanOptions affecting the relation. No SO_INTERNAL_FLAGS are
-	 * permitted.
-	 */
-	uint32		flags;
-} IndexFetchTableData;
-
 struct IndexScanInstrumentation;
 
 /*
@@ -172,10 +158,10 @@ typedef struct IndexScanDescData
 	struct IndexScanInstrumentation *instrument;
 
 	/*
-	 * In an index-only scan, a successful amgettuple call must fill either
-	 * xs_itup (and xs_itupdesc) or xs_hitup (and xs_hitupdesc) to provide the
-	 * data returned by the scan.  It can fill both, in which case the heap
-	 * format will be used.
+	 * In an index-only scan, the index AM fills either xs_itup or xs_hitup
+	 * with the data to be returned by the scan (it can fill both, in which
+	 * case the heap format is used).  The table AM consumes these to fill the
+	 * caller's slot during table_index_getnext_slot.
 	 */
 	IndexTuple	xs_itup;		/* index tuple returned by AM */
 	struct TupleDescData *xs_itupdesc;	/* rowtype descriptor of xs_itup */
@@ -185,9 +171,27 @@ typedef struct IndexScanDescData
 	ItemPointerData xs_heaptid; /* result */
 	bool		xs_heap_continue;	/* T if must keep walking, potential
 									 * further results */
-	IndexFetchTableData *xs_heapfetch;
 
-	bool		xs_recheck;		/* T means scan keys must be rechecked */
+	/*
+	 * xs_recheck is set by index AMs, and read by table AMs.
+	 *
+	 * Should not be checked by core executor nodes (they should use the
+	 * xs_getnext_slot callback's recheck argument instead).
+	 */
+	bool		xs_recheck;
+
+	/* Table access method's private state (not used during bitmap scans) */
+	void	   *xs_table_opaque;
+
+	/*
+	 * Resolved table_index_getnext_slot callback, which is set by
+	 * table_index_scan_begin at the start of amgettuple scans.  Reports via
+	 * *recheck whether the scan keys must be rechecked.
+	 */
+	bool		(*xs_getnext_slot) (struct IndexScanDescData *scan,
+									ScanDirection direction,
+									struct TupleTableSlot *slot,
+									bool *recheck);
 
 	/*
 	 * When fetching with an ordering operator, the values of the ORDER BY
@@ -195,11 +199,28 @@ typedef struct IndexScanDescData
 	 * xs_recheckorderby is true, these need to be rechecked just like the
 	 * scan keys, and the values returned here are a lower-bound on the actual
 	 * values.
+	 *
+	 * Note: unlike xs_recheck, these fields are read by core executor nodes.
 	 */
 	Datum	   *xs_orderbyvals;
 	bool	   *xs_orderbynulls;
 	bool		xs_recheckorderby;
 
+	/*
+	 * Index attributes holding "name" columns stored as cstrings, which the
+	 * table AM re-pads to NAMEDATALEN when filling a slot from xs_itup
+	 */
+	AttrNumber *xs_name_cstring_attnums;
+	int			xs_name_cstring_count;
+
+	/*
+	 * An approximate limit on the amount of work, measured in pages touched,
+	 * imposed on the index scan.  The default, 0, means no limit.  Only
+	 * honored during index-only scans.  Used by selfuncs.c to bound the cost
+	 * of get_actual_variable_endpoint().
+	 */
+	uint8		xs_visited_pages_limit;
+
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
 } IndexScanDescData;
@@ -213,8 +234,6 @@ typedef struct ParallelIndexScanDescData
 	char		ps_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
 }			ParallelIndexScanDescData;
 
-struct TupleTableSlot;
-
 /* Struct for storage-or-index scans of system tables */
 typedef struct SysScanDescData
 {
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f2c36696b..8f268e4d8 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -38,6 +38,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
 typedef struct IndexInfo IndexInfo;
 typedef struct SampleScanState SampleScanState;
 typedef struct ScanKeyData ScanKeyData;
+typedef struct IndexScanDescData *IndexScanDesc;
 typedef struct ValidateIndexState ValidateIndexState;
 typedef struct VacuumParams VacuumParams;
 
@@ -446,60 +447,70 @@ typedef struct TableAmRoutine
 	 */
 
 	/*
-	 * Prepare to fetch tuples from the relation, as needed when fetching
-	 * tuples for an index scan.  The callback has to return an
-	 * IndexFetchTableData, which the AM will typically embed in a larger
-	 * structure with additional information.
+	 * Prepare for an index scan of the table.  The callback stores its own
+	 * private scan state in the index scan descriptor's xs_table_opaque field
+	 * (an opaque pointer).
 	 *
 	 * flags is a bitmask of ScanOptions affecting underlying table scan
 	 * behavior. See scan_begin() for more information on passing these.
 	 *
-	 * Tuples for an index scan can then be fetched via index_fetch_tuple.
+	 * Callback is responsible for setting IndexScanDesc.xs_getnext_slot to
+	 * the appropriate slot-based callback.  Tuples are then returned through
+	 * the caller's slot, via table_index_getnext_slot().  No separate
+	 * slot-based callback exists in this struct!
+	 *
+	 * In principle a single general-purpose callback (stored here) would
+	 * suffice, but using specialized variants allows the table AM to provide
+	 * minimal code based on conditions that are fixed for the whole scan as
+	 * an optimization (e.g., variants for plain index scans and index-only
+	 * scans, each with fewer branches).
+	 *
+	 * The xs_getnext_slot callback may rely on the slot type that callers are
+	 * required to pass: plain index scans use a slot of the table AM's own
+	 * preferred type (see table_slot_callbacks), while index-only scans
+	 * always use a virtual slot, since the slot is filled from index data
+	 * rather than from the table.
+	 *
+	 * Note that AMs that do not necessarily update indexes when indexed
+	 * columns do not change, need to return the current/correct version of
+	 * the tuple that is visible to the snapshot, even if the tid points to an
+	 * older version of the tuple.
 	 */
-	struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, uint32 flags);
+	void		(*index_scan_begin) (IndexScanDesc scan, uint32 flags);
 
 	/*
-	 * Reset index fetch. Typically this will release cross index fetch
-	 * resources held in IndexFetchTableData.
+	 * Inform the table AM that there's to be either a rescan or a restore of
+	 * a marked position, or that the scan has run out of index entries.
 	 */
-	void		(*index_fetch_reset) (struct IndexFetchTableData *data);
+	void		(*index_scan_reset) (IndexScanDesc scan);
 
 	/*
-	 * Release resources and deallocate index fetch.
+	 * Release resources and deallocate index scan state.
+	 */
+	void		(*index_scan_end) (IndexScanDesc scan);
+
+	/* ------------------------------------------------------------------------
+	 * Callbacks for non-modifying operations on individual tuples
+	 * ------------------------------------------------------------------------
 	 */
-	void		(*index_fetch_end) (struct IndexFetchTableData *data);
 
 	/*
 	 * Fetch tuple at `tid` into `slot`, after doing a visibility test
 	 * according to `snapshot`. If a tuple was found and passed the visibility
 	 * test, return true, false otherwise.
 	 *
-	 * Note that AMs that do not necessarily update indexes when indexed
-	 * columns do not change, need to return the current/correct version of
-	 * the tuple that is visible to the snapshot, even if the tid points to an
-	 * older version of the tuple.
+	 * This is a lower-level callback for single-shot TID lookups used by
+	 * constraint enforcement code (unique checks and similar).
 	 *
-	 * *call_again is false on the first call to index_fetch_tuple for a tid.
-	 * If there potentially is another tuple matching the tid, *call_again
-	 * needs to be set to true by index_fetch_tuple, signaling to the caller
-	 * that index_fetch_tuple should be called again for the same tid.
-	 *
-	 * *all_dead, if all_dead is not NULL, should be set to true by
-	 * index_fetch_tuple iff it is guaranteed that no backend needs to see
-	 * that tuple. Index AMs can use that to avoid returning that tid in
-	 * future searches.
-	 */
-	bool		(*index_fetch_tuple) (struct IndexFetchTableData *scan,
-									  ItemPointer tid,
-									  Snapshot snapshot,
-									  TupleTableSlot *slot,
-									  bool *call_again, bool *all_dead);
-
-
-	/* ------------------------------------------------------------------------
-	 * Callbacks for non-modifying operations on individual tuples
-	 * ------------------------------------------------------------------------
+	 * *all_dead, if all_dead is not NULL, should be set to true by fetch_tid
+	 * iff it is guaranteed that no backend needs to see that tuple. Index AMs
+	 * can use that to avoid returning that tid in future searches.
 	 */
+	bool		(*fetch_tid) (Relation rel,
+							  ItemPointer tid,
+							  Snapshot snapshot,
+							  TupleTableSlot *slot,
+							  bool *all_dead);
 
 	/*
 	 * Fetch tuple at `tid` into `slot`, after doing a visibility test
@@ -1235,15 +1246,16 @@ table_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
  */
 
 /*
- * Prepare to fetch tuples from the relation, as needed when fetching tuples
- * for an index scan.
+ * Prepare for an index scan of the relation.  The callback stores its
+ * private scan state in the scan's xs_table_opaque field.
+ *
+ * Tuples for an index scan are then returned through the caller's slot, via
+ * table_index_getnext_slot().
  *
  * flags is a bitmask of ScanOptions. No SO_INTERNAL_FLAGS are permitted.
- *
- * Tuples for an index scan can then be fetched via table_index_fetch_tuple().
  */
-static inline IndexFetchTableData *
-table_index_fetch_begin(Relation rel, uint32 flags)
+static inline void
+table_index_scan_begin(IndexScanDesc scan, uint32 flags)
 {
 	Assert((flags & SO_INTERNAL_FLAGS) == 0);
 
@@ -1255,74 +1267,109 @@ table_index_fetch_begin(Relation rel, uint32 flags)
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "scan started during logical decoding");
 
-	return rel->rd_tableam->index_fetch_begin(rel, flags);
+	scan->heapRelation->rd_tableam->index_scan_begin(scan, flags);
 }
 
 /*
- * Reset index fetch. Typically this will release cross index fetch resources
- * held in IndexFetchTableData.
+ * Inform the table AM that there's to be either a rescan or a restore of a
+ * marked position, or that the scan has run out of index entries.
  */
 static inline void
-table_index_fetch_reset(struct IndexFetchTableData *scan)
+table_index_scan_reset(IndexScanDesc scan)
 {
-	scan->rel->rd_tableam->index_fetch_reset(scan);
+	Assert(scan->xs_table_opaque);
+
+	scan->heapRelation->rd_tableam->index_scan_reset(scan);
 }
 
 /*
- * Release resources and deallocate index fetch.
+ * Release resources and deallocate the table AM's private index scan state
+ * (the scan's xs_table_opaque).
  */
 static inline void
-table_index_fetch_end(struct IndexFetchTableData *scan)
+table_index_scan_end(IndexScanDesc scan)
 {
-	scan->rel->rd_tableam->index_fetch_end(scan);
+	Assert(scan->xs_table_opaque);
+
+	scan->heapRelation->rd_tableam->index_scan_end(scan);
 }
 
 /*
- * Fetches, as part of an index scan, tuple at `tid` into `slot`, after doing
- * a visibility test according to `snapshot`. If a tuple was found and passed
- * the visibility test, returns true, false otherwise. Note that *tid may be
- * modified when we return true (see later remarks on multiple row versions
- * reachable via a single index entry).
+ * Return the next tuple from an index scan through `slot`, scanning in the
+ * specified direction.  Returns true if a tuple satisfying the scan keys and
+ * the snapshot was found, false otherwise.
  *
- * *call_again needs to be false on the first call to table_index_fetch_tuple() for
- * a tid. If there potentially is another tuple matching the tid, *call_again
- * will be set to true, signaling that table_index_fetch_tuple() should be called
- * again for the same tid.
+ * For a plain index scan the slot holds the table tuple, and so must be of
+ * the table AM's preferred slot type (see table_slot_callbacks).  For an
+ * index-only scan the table AM instead fills the slot from the index data the
+ * index AM placed in scan->xs_itup/xs_hitup; the slot must be virtual, since
+ * its contents don't come from the table at all.  Caller must not read
+ * xs_itup/xs_hitup IndexScanDesc fields directly.
  *
- * *all_dead, if all_dead is not NULL, will be set to true by
- * table_index_fetch_tuple() iff it is guaranteed that no backend needs to see
- * that tuple. Index AMs can use that to avoid returning that tid in future
- * searches.
+ * Dispatches through scan->xs_getnext_slot, which is resolved once by
+ * the table AM's index_scan_begin callback.
  *
- * The difference between this function and table_tuple_fetch_row_version()
- * is that this function returns the currently visible version of a row if
- * the AM supports storing multiple row versions reachable via a single index
- * entry (like heap's HOT). Whereas table_tuple_fetch_row_version() only
- * evaluates the tuple exactly at `tid`. Outside of index entry ->table tuple
- * lookups, table_tuple_fetch_row_version() is what's usually needed.
+ * *recheck is set (on a true return) to indicate whether the scan keys must
+ * be rechecked against the returned tuple.
+ *
+ * On success, resources (like buffer pins) are likely to be held, and will be
+ * released by a future table_index_getnext_slot or table_index_scan_end call.
+ *
+ * Note: for ordered scans, the caller must check scan->xs_recheckorderby and
+ * recheck the ORDER BY expressions for itself.
  */
 static inline bool
-table_index_fetch_tuple(struct IndexFetchTableData *scan,
-						ItemPointer tid,
-						Snapshot snapshot,
-						TupleTableSlot *slot,
-						bool *call_again, bool *all_dead)
+table_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
+						 TupleTableSlot *slot, bool *recheck)
 {
-	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
-													slot, call_again,
-													all_dead);
+	Assert(TTS_IS_VIRTUAL(slot) || !scan->xs_want_itup);
+	Assert(scan->xs_table_opaque);
+
+	return scan->xs_getnext_slot(scan, direction, slot, recheck);
 }
 
 /*
- * This is a convenience wrapper around table_index_fetch_tuple() which
- * returns whether there are table tuple items corresponding to an index
- * entry.  This likely is only useful to verify if there's a conflict in a
- * unique index.
+ * Fetch tuple at `tid` into `slot`, after doing a visibility test according
+ * to `snapshot`. If a tuple was found and passed the visibility test, returns
+ * true, false otherwise.  This is a low-level interface designed for use by
+ * constraint enforcement code, where passing a TID can't be avoided.
+ *
+ * Note that *tid may be modified when we return true (e.g. due to following a
+ * HOT chain in a heapam table).  Caller should consider passing a pointer to
+ * a mutable copy of their original TID to avoid unwanted side-effects.
+ *
+ * If all_dead is not NULL, *all_dead will be set to true here iff it is
+ * guaranteed that no backend needs to see any tuple reachable through
+ * caller's TID.  This means that it is safe to mark an index tuple containing
+ * this TID as LP_DEAD.
+ *
+ * The main difference between table_tuple_fetch_row_version() and this
+ * function is that we return the currently visible version of a row, which
+ * matters with AMs that support storing multiple row versions reachable via a
+ * single TID (e.g., due to heapam's HOT chains).  To reliably evaluate
+ * exactly the tuple at `tid`, call table_tuple_fetch_row_version() instead.
  */
-extern bool table_index_fetch_tuple_check(Relation rel,
-										  ItemPointer tid,
-										  Snapshot snapshot,
-										  bool *all_dead);
+static inline bool
+table_fetch_tid(Relation rel,
+				ItemPointer tid,
+				Snapshot snapshot,
+				TupleTableSlot *slot,
+				bool *all_dead)
+{
+	return rel->rd_tableam->fetch_tid(rel, tid, snapshot, slot, all_dead);
+}
+
+/*
+ * Convenience wrapper around table_fetch_tid() for callers that just need to
+ * check if a tuple is visible.
+ *
+ * Caller should note the table_fetch_tid warning about *tid being modified
+ * when we return true in some cases.
+ */
+extern bool table_fetch_tid_check(Relation rel,
+								  ItemPointer tid,
+								  Snapshot snapshot,
+								  bool *all_dead);
 
 
 /* ------------------------------------------------------------------------
@@ -1336,9 +1383,8 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  * `snapshot`. If a tuple was found and passed the visibility test, returns
  * true, false otherwise.
  *
- * See table_index_fetch_tuple's comment about what the difference between
- * these functions is. It is correct to use this function outside of index
- * entry->table tuple lookups.
+ * See table_fetch_tid's comment about what the difference between these
+ * functions is.
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 407699040..41a7d33f1 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -98,13 +98,16 @@ AccumulateIOStats(IOStats *dst, IOStats *src)
 
 
 /* ---------------------
- *	Instrumentation information for indexscans (amgettuple and amgetbitmap)
+ *	Instrumentation information for index scans (used by all AM interfaces)
  * ---------------------
  */
 typedef struct IndexScanInstrumentation
 {
 	/* Index search count (incremented with pgstat_count_index_scan call) */
 	uint64		nsearches;
+
+	/* Table tuples fetched count (incremented during index-only scans) */
+	uint64		ntabletuplefetches;
 } IndexScanInstrumentation;
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 53c138310..9eccaf250 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1790,11 +1790,7 @@ typedef struct IndexScanState
  *		ScanDesc		   index scan descriptor
  *		Instrument		   local index scan instrumentation
  *		SharedInfo		   parallel worker instrumentation (no leader entry)
- *		TableSlot		   slot for holding tuples fetched from the table
- *		VMBuffer		   buffer in use for visibility map testing, if any
  *		PscanLen		   size of parallel index-only scan descriptor
- *		NameCStringAttNums attnums of name typed columns to pad to NAMEDATALEN
- *		NameCStringCount   number of elements in the NameCStringAttNums array
  * ----------------
  */
 typedef struct IndexOnlyScanState
@@ -1813,11 +1809,7 @@ typedef struct IndexOnlyScanState
 	struct IndexScanDescData *ioss_ScanDesc;
 	IndexScanInstrumentation *ioss_Instrument;
 	SharedIndexScanInstrumentation *ioss_SharedInfo;
-	TupleTableSlot *ioss_TableSlot;
-	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
-	AttrNumber *ioss_NameCStringAttNums;
-	int			ioss_NameCStringCount;
 } IndexOnlyScanState;
 
 /* ----------------
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2268cc277..f0e8d091a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -679,7 +679,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex,
+		indexScan = index_beginscan(OldHeap, OldIndex, false,
 									snapshot ? snapshot : SnapshotAny,
 									NULL, 0, 0,
 									SO_NONE);
@@ -722,11 +722,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		if (indexScan != NULL)
 		{
-			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot))
+			bool		recheck;
+
+			if (!table_index_getnext_slot(indexScan, ForwardScanDirection,
+										  slot, &recheck))
 				break;
 
 			/* Since we used no scan keys, should never need to recheck */
-			if (indexScan->xs_recheck)
+			if (recheck)
 				elog(ERROR, "CLUSTER does not support lossy index conditions");
 		}
 		else
@@ -2679,10 +2682,9 @@ static const TableAmRoutine heapam_methods = {
 	.parallelscan_initialize = table_block_parallelscan_initialize,
 	.parallelscan_reinitialize = table_block_parallelscan_reinitialize,
 
-	.index_fetch_begin = heapam_index_fetch_begin,
-	.index_fetch_reset = heapam_index_fetch_reset,
-	.index_fetch_end = heapam_index_fetch_end,
-	.index_fetch_tuple = heapam_index_fetch_tuple,
+	.index_scan_begin = heapam_index_scan_begin,
+	.index_scan_reset = heapam_index_scan_reset,
+	.index_scan_end = heapam_index_scan_end,
 
 	.tuple_insert = heapam_tuple_insert,
 	.tuple_insert_speculative = heapam_tuple_insert_speculative,
@@ -2692,6 +2694,7 @@ static const TableAmRoutine heapam_methods = {
 	.tuple_update = heapam_tuple_update,
 	.tuple_lock = heapam_tuple_lock,
 
+	.fetch_tid = heapam_fetch_tid,
 	.tuple_fetch_row_version = heapam_fetch_row_version,
 	.tuple_get_latest_tid = heap_get_latest_tid,
 	.tuple_tid_valid = heapam_tuple_tid_valid,
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index 33d14f1de..32d6aff1d 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -14,36 +14,119 @@
  */
 #include "postgres.h"
 
+#include "access/amapi.h"
 #include "access/heapam.h"
 #include "access/relscan.h"
+#include "access/visibilitymap.h"
 #include "storage/predicate.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/pgstat_internal.h"
 
 
+static bool heapam_index_plain_tuple_getnext_slot(IndexScanDesc scan,
+												  ScanDirection direction,
+												  TupleTableSlot *slot,
+												  bool *recheck);
+static bool heapam_index_only_tuple_getnext_slot(IndexScanDesc scan,
+												 ScanDirection direction,
+												 TupleTableSlot *slot,
+												 bool *recheck);
+static pg_attribute_always_inline bool heapam_index_getnext_slot(IndexScanDesc scan,
+																 ScanDirection direction,
+																 TupleTableSlot *slot,
+																 bool index_only,
+																 bool *recheck);
+static pg_attribute_always_inline bool heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan,
+															   TupleTableSlot *slot, bool index_only,
+															   bool *heap_continue, bool *all_dead);
+static pg_attribute_always_inline void heapam_index_kill_item(IndexScanDesc scan);
+
+/*
+ * Simple, single-shot TID lookup for constraint enforcement code (unique
+ * checks and similar).  This is essentially just a heap_hot_search_buffer
+ * wrapper.
+ *
+ * This isn't actually related to index scans, but keeping it near
+ * heap_hot_search_buffer can help the compiler generate better code.
+ */
+bool
+heapam_fetch_tid(Relation rel, ItemPointer tid, Snapshot snapshot,
+				 TupleTableSlot *slot, bool *all_dead)
+{
+	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	Buffer		buf;
+	bool		found;
+
+	Assert(TTS_IS_BUFFERTUPLE(slot));
+
+	buf = ReadBuffer(rel, ItemPointerGetBlockNumber(tid));
+
+	LockBuffer(buf, BUFFER_LOCK_SHARE);
+	found = heap_hot_search_buffer(tid, rel, buf, snapshot,
+								   &bslot->base.tupdata, all_dead, true);
+	bslot->base.tupdata.t_self = *tid;
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+	if (found)
+	{
+		slot->tts_tableOid = RelationGetRelid(rel);
+		ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata, slot,
+									   buf);
+	}
+	else
+		ReleaseBuffer(buf);
+
+	return found;
+}
+
 /* ------------------------------------------------------------------------
  * Index Scan Callbacks for heap AM
  * ------------------------------------------------------------------------
  */
 
-IndexFetchTableData *
-heapam_index_fetch_begin(Relation rel, uint32 flags)
+void
+heapam_index_scan_begin(IndexScanDesc scan, uint32 flags)
 {
-	IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData);
+	IndexScanHeapData *hscan = palloc0_object(IndexScanHeapData);
 
-	hscan->xs_base.rel = rel;
-	hscan->xs_base.flags = flags;
 	hscan->xs_cbuf = InvalidBuffer;
 	hscan->xs_blk = InvalidBlockNumber;
 	hscan->xs_vmbuffer = InvalidBuffer;
 
-	return &hscan->xs_base;
+	/* Remember if scan is read-only */
+	hscan->xs_readonly = (flags & SO_HINT_REL_READ_ONLY) != 0;
+
+	/* Resolve which xs_getnext_slot implementation to use for this scan */
+	if (scan->xs_want_itup)
+		scan->xs_getnext_slot = heapam_index_only_tuple_getnext_slot;
+	else
+		scan->xs_getnext_slot = heapam_index_plain_tuple_getnext_slot;
+
+	/*
+	 * Index-only scans that return "name" columns stored as cstrings need a
+	 * per-tuple context to re-pad them to NAMEDATALEN while filling the slot
+	 * (see heap_fill_ios_slot).  xs_name_cstring_count is set by
+	 * index_beginscan before we get here.
+	 */
+	if (scan->xs_want_itup && scan->xs_name_cstring_count > 0)
+		hscan->xs_itup_cxt = AllocSetContextCreate(CurrentMemoryContext,
+												   "index-only scan name columns",
+												   ALLOCSET_SMALL_SIZES);
+
+	/* Expose heapam's private scan state through the scan's opaque pointer */
+	scan->xs_table_opaque = hscan;
 }
 
 void
-heapam_index_fetch_reset(IndexFetchTableData *scan)
+heapam_index_scan_reset(IndexScanDesc scan)
 {
+	IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque;
+
+	/* Heap fetches from the last rescan don't count towards this limit */
+	hscan->xs_blkswitch_count = 0;
+
 	/*
-	 * Resets are a no-op.
-	 *
 	 * Deliberately avoid dropping pins now held in xs_cbuf and xs_vmbuffer.
 	 * This saves cycles during certain tight nested loop joins (it can avoid
 	 * repeated pinning and unpinning of the same buffer across rescans).
@@ -51,9 +134,9 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
 }
 
 void
-heapam_index_fetch_end(IndexFetchTableData *scan)
+heapam_index_scan_end(IndexScanDesc scan)
 {
-	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
+	IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque;
 
 	/* drop pin if there's a pinned heap page */
 	if (BufferIsValid(hscan->xs_cbuf))
@@ -63,6 +146,10 @@ heapam_index_fetch_end(IndexFetchTableData *scan)
 	if (BufferIsValid(hscan->xs_vmbuffer))
 		ReleaseBuffer(hscan->xs_vmbuffer);
 
+	/* Free the index-only scan name-column context, if any */
+	if (hscan->xs_itup_cxt)
+		MemoryContextDelete(hscan->xs_itup_cxt);
+
 	pfree(hscan);
 }
 
@@ -228,18 +315,286 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	return false;
 }
 
-bool
-heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
-						 ItemPointer tid,
-						 Snapshot snapshot,
-						 TupleTableSlot *slot,
-						 bool *heap_continue, bool *all_dead)
+/*
+ * Fill an index-only scan's result slot from the data the index AM returned.
+ *
+ * The data is provided in either HeapTuple (xs_hitup) or IndexTuple (xs_itup)
+ * format.  An index AM may fill both, in which case the heap format is used,
+ * since it's a bit cheaper to fill a slot from.  "name" columns stored as
+ * cstrings (e.g. btree name_ops) are re-padded to NAMEDATALEN allocations,
+ * which live in the heap AM's per-tuple xs_itup_cxt (reset here on each call).
+ *
+ * This reads the table AM's private scan-result fields (xs_itup, xs_hitup,
+ * etc.); doing so is the table AM's job, which is why the executor and planner
+ * receive a filled slot from table_index_getnext_slot instead.
+ *
+ * This is exported for use by other table AMs.
+ */
+void
+heap_fill_ios_slot(IndexScanDesc scan, TupleTableSlot *slot)
 {
-	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
-	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	if (scan->xs_hitup)
+	{
+		/*
+		 * We don't take the trouble to verify that the provided tuple has
+		 * exactly the slot's format, but it seems worth doing a quick check
+		 * on the number of fields.
+		 */
+		Assert(slot->tts_tupleDescriptor->natts ==
+			   scan->xs_hitupdesc->natts);
+		ExecForceStoreHeapTuple(scan->xs_hitup, slot, false);
+	}
+	else if (scan->xs_itup)
+	{
+		TupleDesc	itupdesc = scan->xs_itupdesc;
+
+		/*
+		 * Note: we must use the tupdesc supplied by the AM in
+		 * index_deform_tuple, not the slot's tupdesc, in case the latter has
+		 * different datatypes (this happens for btree name_ops in
+		 * particular). They'd better have the same number of columns though,
+		 * as well as being datatype-compatible which is something we can't so
+		 * easily check.
+		 */
+		Assert(slot->tts_tupleDescriptor->natts == itupdesc->natts);
+
+		ExecClearTuple(slot);
+		index_deform_tuple(scan->xs_itup, itupdesc,
+						   slot->tts_values, slot->tts_isnull);
+
+		/*
+		 * Copy all name columns stored as cstrings back into a NAMEDATALEN
+		 * byte sized allocation.  We mark this branch as unlikely as
+		 * generally "name" is used only for the system catalogs and this
+		 * would have to be a user query running on those or some other user
+		 * table with an index on a name column.
+		 */
+		if (unlikely(scan->xs_name_cstring_attnums != NULL))
+		{
+			IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque;
+
+			/* free the previous tuple's name allocations */
+			MemoryContextReset(hscan->xs_itup_cxt);
+
+			for (int idx = 0; idx < scan->xs_name_cstring_count; idx++)
+			{
+				int			attnum = scan->xs_name_cstring_attnums[idx];
+				Name		name;
+
+				/* skip null Datums */
+				if (slot->tts_isnull[attnum])
+					continue;
+
+				name = (Name) MemoryContextAlloc(hscan->xs_itup_cxt,
+												 NAMEDATALEN);
+
+				/* use namestrcpy to zero-pad all trailing bytes */
+				namestrcpy(name, DatumGetCString(slot->tts_values[attnum]));
+				slot->tts_values[attnum] = NameGetDatum(name);
+			}
+		}
+
+		ExecStoreVirtualTuple(slot);
+	}
+	else
+		elog(ERROR, "no data returned for index-only scan");
+}
+
+/* xs_getnext_slot callback: amgettuple, plain index scan */
+static pg_attribute_hot bool
+heapam_index_plain_tuple_getnext_slot(IndexScanDesc scan,
+									  ScanDirection direction,
+									  TupleTableSlot *slot,
+									  bool *recheck)
+{
+	Assert(!scan->xs_want_itup);
+	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
+
+	return heapam_index_getnext_slot(scan, direction, slot, false, recheck);
+}
+
+/* xs_getnext_slot callback: amgettuple, index-only scan */
+static pg_attribute_hot bool
+heapam_index_only_tuple_getnext_slot(IndexScanDesc scan,
+									 ScanDirection direction,
+									 TupleTableSlot *slot,
+									 bool *recheck)
+{
+	Assert(scan->xs_want_itup);
+	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
+
+	return heapam_index_getnext_slot(scan, direction, slot, true, recheck);
+}
+
+/*
+ * Common implementation for both heapam_index_*_getnext_slot variants.
+ *
+ * The result is true if a tuple satisfying the scan keys and the snapshot was
+ * found, false otherwise.  On success the slot is filled: for plain index
+ * scans with the heap tuple; for index-only scans with the index data (from
+ * xs_itup/xs_hitup, via heap_fill_ios_slot).  *recheck reports whether the
+ * scan keys must be rechecked (only meaningful on a true return).
+ *
+ * On success, resources (like buffer pins) are likely to be held, and will be
+ * dropped by a future call here (or by a later call to heapam_index_scan_end
+ * through index_endscan).
+ *
+ * The index_only parameter is a compile-time constant at each call site,
+ * allowing the compiler to specialize the code for each variant.
+ */
+static pg_attribute_always_inline bool
+heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
+						  TupleTableSlot *slot, bool index_only,
+						  bool *recheck)
+{
+	IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque;
+	bool	   *heap_continue = &scan->xs_heap_continue;
+	bool		all_visible = false;
+	ItemPointer tid = NULL;
+
+	Assert(TransactionIdIsValid(RecentXmin));
+	Assert(index_only || scan->xs_visited_pages_limit == 0);
+
+	for (;;)
+	{
+		if (!*heap_continue)
+		{
+			/* Get the next TID from the index */
+			tid = index_getnext_tid(scan, direction);
+
+			/* If we're out of index entries, we're done */
+			if (tid == NULL)
+				break;
+
+			/* For index-only scans, check the visibility map */
+			if (index_only)
+				all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+											 ItemPointerGetBlockNumber(tid),
+											 &hscan->xs_vmbuffer);
+		}
+
+		Assert(ItemPointerIsValid(&scan->xs_heaptid));
+
+		if (!index_only || !all_visible)
+		{
+			bool		all_dead;
+
+			/*
+			 * Plain index scan, or index-only scan that requires a heap fetch
+			 * to verify item's visibility
+			 */
+			if (index_only && scan->instrument)
+				scan->instrument->ntabletuplefetches++;
+
+			if (!heapam_index_heap_fetch(scan, hscan,
+										 index_only ? NULL : slot,
+										 index_only, heap_continue,
+										 &all_dead))
+			{
+				/* No visible tuple */
+				if (all_dead)
+					heapam_index_kill_item(scan);
+
+				/*
+				 * If caller set a visited-pages limit (only selfuncs.c's
+				 * index-only scans do this), give up once we've visited too
+				 * many distinct heap pages
+				 */
+				if (index_only && unlikely(scan->xs_visited_pages_limit > 0) &&
+					hscan->xs_blkswitch_count > scan->xs_visited_pages_limit)
+					return false;	/* give up */
+
+				continue;		/* try next index entry */
+			}
+
+			/* Found a visible tuple, so its HOT chain can't be all-dead */
+			Assert(!all_dead);
+
+			/*
+			 * Only MVCC snapshots are supported with standard index-only
+			 * scans, so there should be no need to keep following the HOT
+			 * chain once a visible entry has been found.  Other callers
+			 * (currently only selfuncs.c) use SnapshotNonVacuumable, and want
+			 * us to assume that just having one visible tuple in the HOT
+			 * chain is always good enough.
+			 */
+			Assert(!index_only ||
+				   !(*heap_continue && IsMVCCSnapshot(scan->xs_snapshot)));
+		}
+		else
+		{
+			/*
+			 * Index-only scan with all-visible item.
+			 *
+			 * We won't access the heap, so we'll need to take a predicate
+			 * lock explicitly, as if we had.  For now we do that at page
+			 * level.
+			 */
+			PredicateLockPage(scan->heapRelation,
+							  ItemPointerGetBlockNumber(tid),
+							  scan->xs_snapshot);
+		}
+
+		/*
+		 * Found a tuple to return.
+		 *
+		 * Index-only scans fill caller's slot from the index data the AM
+		 * returned (in scan->xs_itup or xs_hitup); plain index scans already
+		 * had heapam_index_heap_fetch store the heap tuple there.
+		 */
+		if (index_only)
+			heap_fill_ios_slot(scan, slot);
+
+		*recheck = scan->xs_recheck;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Get the scan's next heap tuple.
+ *
+ * Returns true if a visible heap tuple was found for the index TID most
+ * recently fetched by our caller in scan->xs_heaptid, false if no more
+ * matching tuples exist.  (There can be more than one matching tuple because
+ * of HOT chains, although when using an MVCC snapshot it should be impossible
+ * for more than one such tuple to exist.)
+ *
+ * On success, the buffer containing the heap tup is pinned.  The pin must be
+ * dropped elsewhere.
+ */
+static pg_attribute_always_inline bool
+heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan,
+						TupleTableSlot *slot, bool index_only,
+						bool *heap_continue, bool *all_dead)
+{
+	Relation	rel = scan->heapRelation;
+	ItemPointer tid = &scan->xs_heaptid;
+	Snapshot	snapshot = scan->xs_snapshot;
+	HeapTupleData idxtupdata;
+	HeapTuple	heapTuple;
 	bool		got_heap_tuple;
 
-	Assert(TTS_IS_BUFFERTUPLE(slot));
+	*all_dead = false;
+
+	if (!index_only)
+	{
+		/* Plain index scans have us store fetched tuple in their slot */
+		BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+
+		Assert(TTS_IS_BUFFERTUPLE(slot));
+		heapTuple = &bslot->base.tupdata;
+	}
+	else
+	{
+		/*
+		 * Index-only scans just need us to verify tuple visibility, so don't
+		 * pass us a slot
+		 */
+		pg_assume(slot == NULL);
+		heapTuple = &idxtupdata;
+	}
 
 	/* We can skip the buffer-switching logic if we're on the same page. */
 	if (hscan->xs_blk != ItemPointerGetBlockNumber(tid))
@@ -249,17 +604,21 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		/* Remember this buffer's block number for next time */
 		hscan->xs_blk = ItemPointerGetBlockNumber(tid);
 
+		/*
+		 * We're switching to a new heap block, so count it
+		 */
+		hscan->xs_blkswitch_count++;
+
 		if (BufferIsValid(hscan->xs_cbuf))
 			ReleaseBuffer(hscan->xs_cbuf);
 
-		hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
+		hscan->xs_cbuf = ReadBuffer(rel, hscan->xs_blk);
 
 		/*
 		 * Prune page when it is pinned for the first time
 		 */
-		heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf,
-							&hscan->xs_vmbuffer,
-							hscan->xs_base.flags & SO_HINT_REL_READ_ONLY);
+		heap_page_prune_opt(rel, hscan->xs_cbuf, &hscan->xs_vmbuffer,
+							hscan->xs_readonly);
 	}
 
 	Assert(BufferGetBlockNumber(hscan->xs_cbuf) == hscan->xs_blk);
@@ -268,25 +627,41 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 	/* Obtain share-lock on the buffer so we can examine visibility */
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
 	got_heap_tuple = heap_hot_search_buffer(tid,
-											hscan->xs_base.rel,
+											rel,
 											hscan->xs_cbuf,
 											snapshot,
-											&bslot->base.tupdata,
+											heapTuple,
 											all_dead,
 											!*heap_continue);
-	bslot->base.tupdata.t_self = *tid;
+	heapTuple->t_self = *tid;
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_UNLOCK);
 
 	if (got_heap_tuple)
 	{
-		/*
-		 * Only in a non-MVCC snapshot can more than one member of the HOT
-		 * chain be visible.
-		 */
-		*heap_continue = !IsMVCCLikeSnapshot(snapshot);
+		if (!index_only)
+		{
+			/*
+			 * Only in a non-MVCC snapshot plain scan can more than one member
+			 * of the HOT chain be visible
+			 */
+			*heap_continue = !IsMVCCLikeSnapshot(snapshot);
 
-		slot->tts_tableOid = RelationGetRelid(scan->rel);
-		ExecStoreBufferHeapTuple(&bslot->base.tupdata, slot, hscan->xs_cbuf);
+			slot->tts_tableOid = RelationGetRelid(rel);
+			ExecStoreBufferHeapTuple(heapTuple, slot, hscan->xs_cbuf);
+		}
+		else
+		{
+			/*
+			 * Index-only scans unconditionally assume that the first member
+			 * of a HOT chain should be considered visible.  This is the
+			 * normal MVCC snapshot behavior, and works well enough for
+			 * non-MVCC index-only scans (currently the only core code that
+			 * uses a non-MVCC index-only scan is selfuncs.c).
+			 */
+			*heap_continue = false;
+		}
+
+		pgstat_count_heap_fetch(scan->indexRelation);
 	}
 	else
 	{
@@ -296,3 +671,23 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 	return got_heap_tuple;
 }
+
+/*
+ * Called when we scanned a whole HOT chain and found only dead tuples:
+ * arrange for the index AM to kill its entry for that TID.  We do not do this
+ * when in recovery because it may violate MVCC to do so.  See comments in
+ * RelationGetIndexScan().
+ */
+static pg_attribute_always_inline void
+heapam_index_kill_item(IndexScanDesc scan)
+{
+	if (scan->xactStartedInRecovery)
+		return;
+
+	/*
+	 * Tell amgettuple-based index AM to kill its entry for that TID.  The
+	 * next index_getnext_tid call will pass that along to the index AM,
+	 * before unsetting the flag again.
+	 */
+	scan->kill_prior_tuple = true;
+}
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4fd470702..4ba9f48e9 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -313,7 +313,32 @@ visibilitymap_set(BlockNumber heapBlk,
  * since we don't lock the visibility map page either, it's even possible that
  * someone else could have changed the bit just before we look at it, but yet
  * we might see the old value.  It is the caller's responsibility to deal with
- * all concurrency issues!
+ * all concurrency issues!  In practice it can't be stale enough to matter for
+ * the primary use case: index-only scans that check whether a heap fetch can
+ * be skipped.
+ *
+ * The argument for why it can't be stale enough to matter for the primary use
+ * case is as follows:
+ *
+ * Inserts: we need to detect that a VM bit was cleared by an insert right
+ * away, because the new tuple is present in the index but not yet visible.
+ * Reading the TID from the index page (under a shared lock on the index
+ * buffer) is serialized with the insertion of the TID into the index (under
+ * an exclusive lock on the same index buffer).  Because the VM bit is cleared
+ * before the index is updated, and locking/unlocking of the index page acts
+ * as a full memory barrier, we are sure to see the cleared bit whenever we
+ * see a recently-inserted TID.
+ *
+ * Deletes: the clearing of the VM bit by a delete is NOT serialized with the
+ * index page access, because deletes do not update the index page (only
+ * VACUUM removes the index TID).  So we may see a significantly stale value.
+ * However, we don't need to detect the delete right away, because the tuple
+ * remains visible until the deleting transaction commits or the statement
+ * ends (if it's our own transaction).  In either case, the lock on the VM
+ * buffer will have been released (acting as a write barrier) after clearing
+ * the bit.  And for us to have a snapshot that includes the deleting
+ * transaction (making the tuple invisible), we must have acquired
+ * ProcArrayLock after that time, acting as a read barrier.
  */
 uint8
 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 1408989c5..1512438d6 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -84,7 +84,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan = palloc_object(IndexScanDescData);
 
 	scan->heapRelation = NULL;	/* may be set later */
-	scan->xs_heapfetch = NULL;
+	scan->xs_table_opaque = NULL;
 	scan->indexRelation = indexRelation;
 	scan->xs_snapshot = InvalidSnapshot;	/* caller must initialize this */
 	scan->numberOfKeys = nkeys;
@@ -126,6 +126,13 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
 
+	scan->xs_getnext_slot = NULL;
+
+	scan->xs_name_cstring_attnums = NULL;
+	scan->xs_name_cstring_count = 0;
+
+	scan->xs_visited_pages_limit = 0;
+
 	return scan;
 }
 
@@ -148,6 +155,8 @@ IndexScanEnd(IndexScanDesc scan)
 		pfree(scan->keyData);
 	if (scan->orderByData != NULL)
 		pfree(scan->orderByData);
+	if (scan->xs_name_cstring_attnums != NULL)
+		pfree(scan->xs_name_cstring_attnums);
 
 	pfree(scan);
 }
@@ -454,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
-		sysscan->iscan = index_beginscan(heapRelation, irel,
+		sysscan->iscan = index_beginscan(heapRelation, irel, false,
 										 snapshot, NULL, nkeys, 0,
 										 SO_NONE);
 		index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
@@ -518,7 +527,10 @@ systable_getnext(SysScanDesc sysscan)
 
 	if (sysscan->irel)
 	{
-		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot))
+		bool		recheck;
+
+		if (table_index_getnext_slot(sysscan->iscan, ForwardScanDirection,
+									 sysscan->slot, &recheck))
 		{
 			bool		shouldFree;
 
@@ -533,7 +545,7 @@ systable_getnext(SysScanDesc sysscan)
 			 * because we still wouldn't need to support indexes on
 			 * expressions.
 			 */
-			if (sysscan->iscan->xs_recheck)
+			if (recheck)
 				elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 		}
 	}
@@ -643,7 +655,7 @@ systable_endscan(SysScanDesc sysscan)
  * we could do a heapscan and sort, but the uses are in places that
  * probably don't need to still work with corrupted catalog indexes.)
  * For the moment, therefore, these functions are merely the thinest of
- * wrappers around index_beginscan/index_getnext_slot.  The main reason for
+ * wrappers around index_beginscan/table_index_getnext_slot.  The main reason for
  * their existence is to centralize possible future support of lossy operators
  * in catalog scans.
  */
@@ -716,7 +728,7 @@ systable_beginscan_ordered(Relation heapRelation,
 	if (TransactionIdIsValid(CheckXidAlive))
 		bsysscan = true;
 
-	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
+	sysscan->iscan = index_beginscan(heapRelation, indexRelation, false,
 									 snapshot, NULL, nkeys, 0,
 									 SO_NONE);
 	index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
@@ -734,13 +746,15 @@ HeapTuple
 systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 {
 	HeapTuple	htup = NULL;
+	bool		recheck;
 
 	Assert(sysscan->irel);
-	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
+	if (table_index_getnext_slot(sysscan->iscan, direction, sysscan->slot,
+								 &recheck))
 		htup = ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
 
 	/* See notes in systable_getnext */
-	if (htup && sysscan->iscan->xs_recheck)
+	if (htup && recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
 	/*
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 7967e9398..aa0d4b143 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -24,9 +24,7 @@
  *		index_parallelscan_initialize - initialize parallel scan
  *		index_parallelrescan  - (re)start a parallel scan of an index
  *		index_beginscan_parallel - join parallel index scan
- *		index_getnext_tid	- get the next TID from a scan
- *		index_fetch_heap		- get the scan's next heap tuple
- *		index_getnext_slot	- get the next tuple from a scan
+ *		index_getnext_tid	- amgettuple table AM helper routine
  *		index_getbitmap - get all tuples from a scan
  *		index_bulk_delete	- bulk deletion of index tuples
  *		index_vacuum_cleanup	- post-deletion cleanup of an index
@@ -105,9 +103,16 @@ do { \
 			 CppAsString(pname), RelationGetRelationName(scan->indexRelation)); \
 } while(0)
 
-static IndexScanDesc index_beginscan_internal(Relation indexRelation,
-											  int nkeys, int norderbys, Snapshot snapshot,
-											  ParallelIndexScanDesc pscan, bool temp_snap);
+static pg_attribute_always_inline IndexScanDesc index_beginscan_internal(Relation indexRelation,
+																		 Relation heapRelation,
+																		 int nkeys,
+																		 int norderbys,
+																		 Snapshot snapshot,
+																		 ParallelIndexScanDesc pscan,
+																		 IndexScanInstrumentation *instrument,
+																		 bool index_only_scan,
+																		 bool temp_snap,
+																		 uint32 flags);
 static inline void validate_relation_as_index(Relation r);
 
 
@@ -256,13 +261,12 @@ index_insert_cleanup(Relation indexRelation,
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
+				bool index_only_scan,
 				Snapshot snapshot,
 				IndexScanInstrumentation *instrument,
 				int nkeys, int norderbys,
 				uint32 flags)
 {
-	IndexScanDesc scan;
-
 	Assert(snapshot != InvalidSnapshot);
 
 	/* Check that a historic snapshot is not used for non-catalog tables */
@@ -275,20 +279,10 @@ index_beginscan(Relation heapRelation,
 						RelationGetRelationName(heapRelation))));
 	}
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
-
-	/*
-	 * Save additional parameters into the scandesc.  Everything else was set
-	 * up by RelationGetIndexScan.
-	 */
-	scan->heapRelation = heapRelation;
-	scan->xs_snapshot = snapshot;
-	scan->instrument = instrument;
-
-	/* prepare to fetch index matches from table */
-	scan->xs_heapfetch = table_index_fetch_begin(heapRelation, flags);
-
-	return scan;
+	return index_beginscan_internal(indexRelation, heapRelation,
+									nkeys, norderbys,
+									snapshot, NULL, instrument,
+									index_only_scan, false, flags);
 }
 
 /*
@@ -303,29 +297,24 @@ index_beginscan_bitmap(Relation indexRelation,
 					   IndexScanInstrumentation *instrument,
 					   int nkeys)
 {
-	IndexScanDesc scan;
-
 	Assert(snapshot != InvalidSnapshot);
+	Assert(IsMVCCLikeSnapshot(snapshot));
 
-	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
-
-	/*
-	 * Save additional parameters into the scandesc.  Everything else was set
-	 * up by RelationGetIndexScan.
-	 */
-	scan->xs_snapshot = snapshot;
-	scan->instrument = instrument;
-
-	return scan;
+	return index_beginscan_internal(indexRelation, NULL, nkeys, 0, snapshot,
+									NULL, instrument, false, false, SO_NONE);
 }
 
 /*
  * index_beginscan_internal --- common code for index_beginscan variants
+ *
+ * When heapRelation is not NULL, also initializes table AM index scan state.
  */
-static IndexScanDesc
-index_beginscan_internal(Relation indexRelation,
+static pg_attribute_always_inline IndexScanDesc
+index_beginscan_internal(Relation indexRelation, Relation heapRelation,
 						 int nkeys, int norderbys, Snapshot snapshot,
-						 ParallelIndexScanDesc pscan, bool temp_snap)
+						 ParallelIndexScanDesc pscan,
+						 IndexScanInstrumentation *instrument,
+						 bool index_only_scan, bool temp_snap, uint32 flags)
 {
 	IndexScanDesc scan;
 
@@ -349,6 +338,62 @@ index_beginscan_internal(Relation indexRelation,
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
 
+	scan->xs_snapshot = snapshot;
+	scan->instrument = instrument;
+
+	/*
+	 * Initialize heap-side scan state when a heap relation is provided.
+	 * Bitmap index scans don't provide one.
+	 */
+	if (heapRelation != NULL)
+	{
+		scan->heapRelation = heapRelation;
+		scan->xs_want_itup = index_only_scan;
+		scan->xs_heap_continue = false;
+
+		/*
+		 * For index-only scans, find any "name" columns stored as cstrings
+		 * (e.g. btree name_ops), which the table AM must re-pad to
+		 * NAMEDATALEN when filling a slot from the index tuple.  We detect
+		 * this generically by looking for index attributes whose stored type
+		 * is CSTRINGOID while their opclass input type is NAMEOID.  This is
+		 * done before table_index_scan_begin so the table AM can size
+		 * per-tuple workspace accordingly.
+		 */
+		if (index_only_scan)
+		{
+			int			indnkeyatts = indexRelation->rd_index->indnkeyatts;
+			int			namecount = 0;
+
+			for (int attnum = 0; attnum < indnkeyatts; attnum++)
+			{
+				if (TupleDescAttr(indexRelation->rd_att, attnum)->atttypid == CSTRINGOID &&
+					indexRelation->rd_opcintype[attnum] == NAMEOID)
+					namecount++;
+			}
+
+			if (namecount > 0)
+			{
+				int			idx = 0;
+
+				scan->xs_name_cstring_attnums = palloc_array(AttrNumber, namecount);
+				for (int attnum = 0; attnum < indnkeyatts; attnum++)
+				{
+					if (TupleDescAttr(indexRelation->rd_att, attnum)->atttypid == CSTRINGOID &&
+						indexRelation->rd_opcintype[attnum] == NAMEOID)
+						scan->xs_name_cstring_attnums[idx++] = (AttrNumber) attnum;
+				}
+				scan->xs_name_cstring_count = namecount;
+			}
+		}
+
+		/* set up table AM state for the index scan (sets xs_table_opaque) */
+		table_index_scan_begin(scan, flags);
+
+		/* table AM must set these for us */
+		Assert(scan->xs_getnext_slot != NULL && scan->xs_table_opaque != NULL);
+	}
+
 	return scan;
 }
 
@@ -376,8 +421,8 @@ index_rescan(IndexScanDesc scan,
 	Assert(norderbys == scan->numberOfOrderBys);
 
 	/* reset table AM state for rescan */
-	if (scan->xs_heapfetch)
-		table_index_fetch_reset(scan->xs_heapfetch);
+	if (scan->xs_table_opaque)
+		table_index_scan_reset(scan);
 
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
@@ -397,10 +442,10 @@ index_endscan(IndexScanDesc scan)
 	CHECK_SCAN_PROCEDURE(amendscan);
 
 	/* Release resources (like buffer pins) from table accesses */
-	if (scan->xs_heapfetch)
+	if (scan->xs_table_opaque)
 	{
-		table_index_fetch_end(scan->xs_heapfetch);
-		scan->xs_heapfetch = NULL;
+		table_index_scan_end(scan);
+		scan->xs_table_opaque = NULL;
 	}
 
 	/* End the AM's scan */
@@ -453,8 +498,8 @@ index_restrpos(IndexScanDesc scan)
 	CHECK_SCAN_PROCEDURE(amrestrpos);
 
 	/* reset table AM state for restoring the marked position */
-	if (scan->xs_heapfetch)
-		table_index_fetch_reset(scan->xs_heapfetch);
+	if (scan->xs_table_opaque)
+		table_index_scan_reset(scan);
 
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
@@ -540,8 +585,8 @@ index_parallelrescan(IndexScanDesc scan)
 	SCAN_CHECKS;
 
 	/* reset table AM state for rescan */
-	if (scan->xs_heapfetch)
-		table_index_fetch_reset(scan->xs_heapfetch);
+	if (scan->xs_table_opaque)
+		table_index_scan_reset(scan);
 
 	/* amparallelrescan is optional; assume no-op if not provided by AM */
 	if (scan->indexRelation->rd_indam->amparallelrescan != NULL)
@@ -558,41 +603,33 @@ index_parallelrescan(IndexScanDesc scan)
  */
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel,
+						 bool index_only_scan,
 						 IndexScanInstrumentation *instrument,
 						 int nkeys, int norderbys,
 						 ParallelIndexScanDesc pscan,
 						 uint32 flags)
 {
 	Snapshot	snapshot;
-	IndexScanDesc scan;
 
 	Assert(RelFileLocatorEquals(heaprel->rd_locator, pscan->ps_locator));
 	Assert(RelFileLocatorEquals(indexrel->rd_locator, pscan->ps_indexlocator));
 
 	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
 	RegisterSnapshot(snapshot);
-	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
-									pscan, true);
 
-	/*
-	 * Save additional parameters into the scandesc.  Everything else was set
-	 * up by index_beginscan_internal.
-	 */
-	scan->heapRelation = heaprel;
-	scan->xs_snapshot = snapshot;
-	scan->instrument = instrument;
-
-	/* prepare to fetch index matches from table */
-	scan->xs_heapfetch = table_index_fetch_begin(heaprel, flags);
-
-	return scan;
+	return index_beginscan_internal(indexrel, heaprel, nkeys, norderbys,
+									snapshot, pscan, instrument,
+									index_only_scan, true, flags);
 }
 
 /* ----------------
- * index_getnext_tid - get the next TID from a scan
+ * index_getnext_tid - amgettuple interface
  *
  * The result is the next TID satisfying the scan keys,
  * or NULL if no more matching tuples exist.
+ *
+ * This should only be called by table AM amgettuple-based index scan
+ * callbacks.
  * ----------------
  */
 ItemPointer
@@ -622,8 +659,8 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	if (!found)
 	{
 		/* reset table AM state */
-		if (scan->xs_heapfetch)
-			table_index_fetch_reset(scan->xs_heapfetch);
+		if (scan->xs_table_opaque)
+			table_index_scan_reset(scan);
 
 		return NULL;
 	}
@@ -635,97 +672,6 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	return &scan->xs_heaptid;
 }
 
-/* ----------------
- *		index_fetch_heap - get the scan's next heap tuple
- *
- * The result is a visible heap tuple associated with the index TID most
- * recently fetched by index_getnext_tid, or NULL if no more matching tuples
- * exist.  (There can be more than one matching tuple because of HOT chains,
- * although when using an MVCC snapshot it should be impossible for more than
- * one such tuple to exist.)
- *
- * On success, the buffer containing the heap tup is pinned (the pin will be
- * dropped in a future index_getnext_tid, index_fetch_heap or index_endscan
- * call).
- *
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
- * scan keys if required.  We do not do that here because we don't have
- * enough information to do it efficiently in the general case.
- * ----------------
- */
-bool
-index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
-{
-	bool		all_dead = false;
-	bool		found;
-
-	found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
-									scan->xs_snapshot, slot,
-									&scan->xs_heap_continue, &all_dead);
-
-	if (found)
-		pgstat_count_heap_fetch(scan->indexRelation);
-
-	/*
-	 * If we scanned a whole HOT chain and found only dead tuples, tell index
-	 * AM to kill its entry for that TID (this will take effect in the next
-	 * amgettuple call, in index_getnext_tid).  We do not do this when in
-	 * recovery because it may violate MVCC to do so.  See comments in
-	 * RelationGetIndexScan().
-	 */
-	if (!scan->xactStartedInRecovery)
-		scan->kill_prior_tuple = all_dead;
-
-	return found;
-}
-
-/* ----------------
- *		index_getnext_slot - get the next tuple from a scan
- *
- * The result is true if a tuple satisfying the scan keys and the snapshot was
- * found, false otherwise.  The tuple is stored in the specified slot.
- *
- * On success, resources (like buffer pins) are likely to be held, and will be
- * dropped by a future index_getnext_tid, index_fetch_heap or index_endscan
- * call).
- *
- * Note: caller must check scan->xs_recheck, and perform rechecking of the
- * scan keys if required.  We do not do that here because we don't have
- * enough information to do it efficiently in the general case.
- * ----------------
- */
-bool
-index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
-{
-	for (;;)
-	{
-		if (!scan->xs_heap_continue)
-		{
-			ItemPointer tid;
-
-			/* Time to fetch the next TID from the index */
-			tid = index_getnext_tid(scan, direction);
-
-			/* If we're out of index entries, we're done */
-			if (tid == NULL)
-				break;
-
-			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
-		}
-
-		/*
-		 * Fetch the next (or only) visible heap tuple for this index entry.
-		 * If we don't find anything, loop around and grab the next TID from
-		 * the index.
-		 */
-		Assert(ItemPointerIsValid(&scan->xs_heaptid));
-		if (index_fetch_heap(scan, slot))
-			return true;
-	}
-
-	return false;
-}
-
 /* ----------------
  *		index_getbitmap - get all tuples at once from an index scan
  *
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index c8af97dd2..f1b55fb20 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -560,9 +560,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 * with optimizations like heap's HOT, we have just a single
 				 * index entry for the entire chain.
 				 */
-				else if (table_index_fetch_tuple_check(heapRel, &htid,
-													   &SnapshotDirty,
-													   &all_dead))
+				else if (table_fetch_tid_check(heapRel, &htid,
+											   &SnapshotDirty,
+											   &all_dead))
 				{
 					TransactionId xwait;
 
@@ -618,8 +618,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * entry.
 					 */
 					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
-													  SnapshotSelf, NULL))
+					if (table_fetch_tid_check(heapRel, &htid,
+											  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
 					}
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 68ff0966f..0ac9f0143 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -228,32 +228,20 @@ table_beginscan_parallel_tidrange(Relation relation,
  */
 
 /*
- * To perform that check simply start an index scan, create the necessary
- * slot, do the heap lookup, and shut everything down again. This could be
- * optimized, but is unlikely to matter from a performance POV. If there
- * frequently are live index pointers also matching a unique index key, the
- * CPU overhead of this routine is unlikely to matter.
- *
- * Note that *tid may be modified when we return true if the AM supports
- * storing multiple row versions reachable via a single index entry (like
- * heap's HOT).
+ * Caller should note the table_fetch_tid warning about *tid being modified
+ * when we return true in some cases.
  */
 bool
-table_index_fetch_tuple_check(Relation rel,
-							  ItemPointer tid,
-							  Snapshot snapshot,
-							  bool *all_dead)
+table_fetch_tid_check(Relation rel,
+					  ItemPointer tid,
+					  Snapshot snapshot,
+					  bool *all_dead)
 {
-	IndexFetchTableData *scan;
 	TupleTableSlot *slot;
-	bool		call_again = false;
 	bool		found;
 
 	slot = table_slot_create(rel, NULL);
-	scan = table_index_fetch_begin(rel, SO_NONE);
-	found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again,
-									all_dead);
-	table_index_fetch_end(scan);
+	found = table_fetch_tid(rel, tid, snapshot, slot, all_dead);
 	ExecDropSingleTupleTableSlot(slot);
 
 	return found;
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 5450a27fa..72d2c662b 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -50,11 +50,11 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->parallelscan_initialize != NULL);
 	Assert(routine->parallelscan_reinitialize != NULL);
 
-	Assert(routine->index_fetch_begin != NULL);
-	Assert(routine->index_fetch_reset != NULL);
-	Assert(routine->index_fetch_end != NULL);
-	Assert(routine->index_fetch_tuple != NULL);
+	Assert(routine->index_scan_begin != NULL);
+	Assert(routine->index_scan_reset != NULL);
+	Assert(routine->index_scan_end != NULL);
 
+	Assert(routine->fetch_tid != NULL);
 	Assert(routine->tuple_fetch_row_version != NULL);
 	Assert(routine->tuple_tid_valid != NULL);
 	Assert(routine->tuple_get_latest_tid != NULL);
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index 421d8c359..7aff48124 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -105,23 +105,14 @@ unique_key_recheck(PG_FUNCTION_ARGS)
 	 * removed.
 	 */
 	tmptid = checktid;
+	if (!table_fetch_tid(trigdata->tg_relation, &tmptid, SnapshotSelf,
+						 slot, NULL))
 	{
-		IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation,
-															SO_NONE);
-		bool		call_again = false;
-
-		if (!table_index_fetch_tuple(scan, &tmptid, SnapshotSelf, slot,
-									 &call_again, NULL))
-		{
-			/*
-			 * All rows referenced by the index entry are dead, so skip the
-			 * check.
-			 */
-			ExecDropSingleTupleTableSlot(slot);
-			table_index_fetch_end(scan);
-			return PointerGetDatum(NULL);
-		}
-		table_index_fetch_end(scan);
+		/*
+		 * All rows referenced by the index entry are dead, so skip the check.
+		 */
+		ExecDropSingleTupleTableSlot(slot);
+		return PointerGetDatum(NULL);
 	}
 
 	/*
@@ -168,9 +159,8 @@ unique_key_recheck(PG_FUNCTION_ARGS)
 		/*
 		 * Note: this is not a real insert; it is a check that the index entry
 		 * that has already been inserted is unique.  Passing the tuple's tid
-		 * (i.e. unmodified by table_index_fetch_tuple()) is correct even if
-		 * the row is now dead, because that is the TID the index will know
-		 * about.
+		 * (i.e. unmodified by table_fetch_tid()) is correct even if the row
+		 * is now dead, because that is the TID the index will know about.
 		 */
 		index_insert(indexRel, values, isnull, &checktid,
 					 trigdata->tg_relation, UNIQUE_CHECK_EXISTING,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 112c17b0d..ebe852500 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -137,7 +137,7 @@ static void show_recursive_union_info(RecursiveUnionState *rstate,
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
-static void show_indexsearches_info(PlanState *planstate, ExplainState *es);
+static void show_indexscan_info(PlanState *planstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
 static void show_scan_io_usage(ScanState *planstate,
@@ -1978,7 +1978,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_IndexOnlyScan:
 			show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1992,15 +1992,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
-			if (es->analyze)
-				ExplainPropertyFloat("Heap Fetches", NULL,
-									 planstate->instrument->ntuples2, 0, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_BitmapIndexScan:
 			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
 						   "Index Cond", planstate, ancestors, es);
-			show_indexsearches_info(planstate, es);
+			show_indexscan_info(planstate, es);
 			break;
 		case T_BitmapHeapScan:
 			show_scan_qual(((BitmapHeapScan *) plan)->bitmapqualorig,
@@ -3867,15 +3864,16 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 }
 
 /*
- * Show the total number of index searches for a
+ * Show index scan related executor instrumentation for a
  * IndexScan/IndexOnlyScan/BitmapIndexScan node
  */
 static void
-show_indexsearches_info(PlanState *planstate, ExplainState *es)
+show_indexscan_info(PlanState *planstate, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	SharedIndexScanInstrumentation *SharedInfo = NULL;
-	uint64		nsearches = 0;
+	uint64		nsearches = 0,
+				ntabletuplefetches = 0;
 
 	if (!es->analyze)
 		return;
@@ -3896,6 +3894,7 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 				IndexOnlyScanState *indexstate = ((IndexOnlyScanState *) planstate);
 
 				nsearches = indexstate->ioss_Instrument->nsearches;
+				ntabletuplefetches = indexstate->ioss_Instrument->ntabletuplefetches;
 				SharedInfo = indexstate->ioss_SharedInfo;
 				break;
 			}
@@ -3919,9 +3918,13 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 			IndexScanInstrumentation *winstrument = &SharedInfo->winstrument[i];
 
 			nsearches += winstrument->nsearches;
+			ntabletuplefetches += winstrument->ntabletuplefetches;
 		}
 	}
 
+	if (nodeTag(plan) == T_IndexOnlyScan)
+		ExplainPropertyUInteger("Heap Fetches", NULL, ntabletuplefetches, es);
+
 	ExplainPropertyUInteger("Index Searches", NULL, nsearches, es);
 }
 
diff --git a/src/backend/commands/repack.c b/src/backend/commands/repack.c
index 4d177c868..fcf51e867 100644
--- a/src/backend/commands/repack.c
+++ b/src/backend/commands/repack.c
@@ -2862,6 +2862,7 @@ find_target_tuple(Relation rel, ChangeContext *chgcxt, TupleTableSlot *locator,
 	Form_pg_index idx = chgcxt->cc_ident_index->rd_index;
 	IndexScanDesc scan;
 	bool		retval = false;
+	bool		recheck;
 
 	/*
 	 * Scan key is passed by caller, so it does not have to be constructed
@@ -2880,13 +2881,13 @@ find_target_tuple(Relation rel, ChangeContext *chgcxt, TupleTableSlot *locator,
 	}
 
 	/* XXX no instrumentation for now */
-	scan = index_beginscan(rel, chgcxt->cc_ident_index, GetActiveSnapshot(),
+	scan = index_beginscan(rel, chgcxt->cc_ident_index, false, GetActiveSnapshot(),
 						   NULL, chgcxt->cc_ident_key_nentries, 0, 0);
 	index_rescan(scan, chgcxt->cc_ident_key, chgcxt->cc_ident_key_nentries, NULL, 0);
-	while (index_getnext_slot(scan, ForwardScanDirection, retrieved))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, retrieved, &recheck))
 	{
 		/* Be wary of temporal constraints */
-		if (scan->xs_recheck && !identity_key_equal(chgcxt, locator, retrieved))
+		if (recheck && !identity_key_equal(chgcxt, locator, retrieved))
 		{
 			CHECK_FOR_INTERRUPTS();
 			continue;
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index eb3838129..7791c4e5f 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -721,6 +721,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 	int			i;
 	bool		conflict;
 	bool		found_self;
+	bool		recheck;
 	ExprContext *econtext;
 	TupleTableSlot *existing_slot;
 	TupleTableSlot *save_scantuple;
@@ -823,12 +824,13 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index,
+	index_scan = index_beginscan(heap, index, false,
 								 &DirtySnapshot, NULL, indnkeyatts, 0,
 								 SO_NONE);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
-	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
+	while (table_index_getnext_slot(index_scan, ForwardScanDirection,
+									existing_slot, &recheck))
 	{
 		TransactionId xwait;
 		XLTW_Oper	reason_wait;
@@ -858,7 +860,7 @@ retry:
 					   existing_values, existing_isnull);
 
 		/* If lossy indexscan, must recheck the condition */
-		if (index_scan->xs_recheck)
+		if (recheck)
 		{
 			if (!index_recheck_constraint(index,
 										  constr_procs,
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index b2ca5cbf1..75db5f363 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -191,6 +191,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	TransactionId xwait;
 	Relation	idxrel;
 	bool		found;
+	bool		recheck;
 	TypeCacheEntry **eq = NULL;
 	bool		isIdxSafeToSkipDuplicates;
 
@@ -205,7 +206,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
 	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel,
+	scan = index_beginscan(rel, idxrel, false,
 						   &snap, NULL, skey_attoff, 0, SO_NONE);
 
 retry:
@@ -214,7 +215,7 @@ retry:
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
 	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, outslot))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, outslot, &recheck))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
@@ -228,6 +229,8 @@ retry:
 			if (!tuples_equal(outslot, searchslot, eq, NULL))
 				continue;
 		}
+		else
+			Assert(!recheck);
 
 		ExecMaterializeSlot(outslot);
 
@@ -645,6 +648,7 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid,
 	TupleTableSlot *scanslot;
 	TypeCacheEntry **eq = NULL;
 	bool		isIdxSafeToSkipDuplicates;
+	bool		recheck;
 	TupleDesc	desc PG_USED_FOR_ASSERTS_ONLY = RelationGetDescr(rel);
 
 	Assert(equalTupleDescs(desc, searchslot->tts_tupleDescriptor));
@@ -669,13 +673,13 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid,
 	 * not yet committed or those just committed prior to the scan are
 	 * excluded in update_most_recent_deletion_info().
 	 */
-	scan = index_beginscan(rel, idxrel,
+	scan = index_beginscan(rel, idxrel, false,
 						   SnapshotAny, NULL, skey_attoff, 0, SO_NONE);
 
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
 	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
+	while (table_index_getnext_slot(scan, ForwardScanDirection, scanslot, &recheck))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
@@ -689,6 +693,8 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid,
 			if (!tuples_equal(scanslot, searchslot, eq, NULL))
 				continue;
 		}
+		else
+			Assert(!recheck);
 
 		update_most_recent_deletion_info(scanslot, oldestxmin, delete_xid,
 										 delete_time, delete_origin);
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 7978514e1..90b010f9b 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -204,6 +204,7 @@ ExecEndBitmapIndexScan(BitmapIndexScanState *node)
 		 * which will have a new BitmapIndexScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->biss_Instrument->nsearches;
+		Assert(node->biss_Instrument->ntabletuplefetches == 0);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index d52012e8a..28a23db0b 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -34,7 +34,6 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/tupdesc.h"
-#include "access/visibilitymap.h"
 #include "catalog/pg_type.h"
 #include "executor/executor.h"
 #include "executor/instrument.h"
@@ -42,14 +41,11 @@
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/predicate.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
 
 static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
-static void StoreIndexTuple(IndexOnlyScanState *node, TupleTableSlot *slot,
-							IndexTuple itup, TupleDesc itupdesc);
 
 
 /* ----------------------------------------------------------------
@@ -66,7 +62,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
-	ItemPointer tid;
+	bool		recheck;
 
 	/*
 	 * extract necessary information from index scan node
@@ -92,6 +88,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->ioss_RelationDesc,
+								   true,
 								   estate->es_snapshot,
 								   node->ioss_Instrument,
 								   node->ioss_NumScanKeys,
@@ -100,11 +97,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 								   SO_HINT_REL_READ_ONLY : SO_NONE);
 
 		node->ioss_ScanDesc = scandesc;
-
-
-		/* Set it up for index-only scan */
-		node->ioss_ScanDesc->xs_want_itup = true;
-		node->ioss_VMBuffer = InvalidBuffer;
+		Assert(node->ioss_ScanDesc->xs_want_itup);
 
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
@@ -121,104 +114,14 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while (table_index_getnext_slot(scandesc, direction, slot, &recheck))
 	{
-		bool		tuple_from_heap = false;
-
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We can skip the heap fetch if the TID references a heap page on
-		 * which all tuples are known visible to everybody.  In any case,
-		 * we'll use the index tuple not the heap tuple as the data source.
-		 *
-		 * Note on Memory Ordering Effects: visibilitymap_get_status does not
-		 * lock the visibility map buffer, and therefore the result we read
-		 * here could be slightly stale.  However, it can't be stale enough to
-		 * matter.
-		 *
-		 * We need to detect clearing a VM bit due to an insert right away,
-		 * because the tuple is present in the index page but not visible. The
-		 * reading of the TID by this scan (using a shared lock on the index
-		 * buffer) is serialized with the insert of the TID into the index
-		 * (using an exclusive lock on the index buffer). Because the VM bit
-		 * is cleared before updating the index, and locking/unlocking of the
-		 * index page acts as a full memory barrier, we are sure to see the
-		 * cleared bit if we see a recently-inserted TID.
-		 *
-		 * Deletes do not update the index page (only VACUUM will clear out
-		 * the TID), so the clearing of the VM bit by a delete is not
-		 * serialized with this test below, and we may see a value that is
-		 * significantly stale. However, we don't care about the delete right
-		 * away, because the tuple is still visible until the deleting
-		 * transaction commits or the statement ends (if it's our
-		 * transaction). In either case, the lock on the VM buffer will have
-		 * been released (acting as a write barrier) after clearing the bit.
-		 * And for us to have a snapshot that includes the deleting
-		 * transaction (making the tuple invisible), we must have acquired
-		 * ProcArrayLock after that time, acting as a read barrier.
-		 *
-		 * It's worth going through this complexity to avoid needing to lock
-		 * the VM buffer, which could cause significant contention.
-		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
-							ItemPointerGetBlockNumber(tid),
-							&node->ioss_VMBuffer))
-		{
-			/*
-			 * Rats, we have to visit the heap to check visibility.
-			 */
-			InstrCountTuples2(node, 1);
-			if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
-				continue;		/* no visible tuple, try next index entry */
-
-			ExecClearTuple(node->ioss_TableSlot);
-
-			/*
-			 * Only MVCC snapshots are supported here, so there should be no
-			 * need to keep following the HOT chain once a visible entry has
-			 * been found.  If we did want to allow that, we'd need to keep
-			 * more state to remember not to call index_getnext_tid next time.
-			 */
-			if (scandesc->xs_heap_continue)
-				elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
-
-			/*
-			 * Note: at this point we are holding a pin on the heap page, as
-			 * recorded in scandesc->xs_cbuf.  We could release that pin now,
-			 * but it's not clear whether it's a win to do so.  The next index
-			 * entry might require a visit to the same heap page.
-			 */
-
-			tuple_from_heap = true;
-		}
-
-		/*
-		 * Fill the scan tuple slot with data from the index.  This might be
-		 * provided in either HeapTuple or IndexTuple format.  Conceivably an
-		 * index AM might fill both fields, in which case we prefer the heap
-		 * format, since it's probably a bit cheaper to fill a slot from.
-		 */
-		if (scandesc->xs_hitup)
-		{
-			/*
-			 * We don't take the trouble to verify that the provided tuple has
-			 * exactly the slot's format, but it seems worth doing a quick
-			 * check on the number of fields.
-			 */
-			Assert(slot->tts_tupleDescriptor->natts ==
-				   scandesc->xs_hitupdesc->natts);
-			ExecForceStoreHeapTuple(scandesc->xs_hitup, slot, false);
-		}
-		else if (scandesc->xs_itup)
-			StoreIndexTuple(node, slot, scandesc->xs_itup, scandesc->xs_itupdesc);
-		else
-			elog(ERROR, "no data returned for index-only scan");
-
 		/*
 		 * If the index was lossy, we have to recheck the index quals.
 		 */
-		if (scandesc->xs_recheck)
+		if (recheck)
 		{
 			econtext->ecxt_scantuple = slot;
 			if (!ExecQualAndReset(node->recheckqual, econtext))
@@ -241,16 +144,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("lossy distance functions are not supported in index-only scans")));
-
-		/*
-		 * If we didn't access the heap, then we'll need to take a predicate
-		 * lock explicitly, as if we had.  For now we do that at page level.
-		 */
-		if (!tuple_from_heap)
-			PredicateLockPage(scandesc->heapRelation,
-							  ItemPointerGetBlockNumber(tid),
-							  estate->es_snapshot);
-
 		return slot;
 	}
 
@@ -261,62 +154,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	return ExecClearTuple(slot);
 }
 
-/*
- * StoreIndexTuple
- *		Fill the slot with data from the index tuple.
- *
- * At some point this might be generally-useful functionality, but
- * right now we don't need it elsewhere.
- */
-static void
-StoreIndexTuple(IndexOnlyScanState *node, TupleTableSlot *slot,
-				IndexTuple itup, TupleDesc itupdesc)
-{
-	/*
-	 * Note: we must use the tupdesc supplied by the AM in index_deform_tuple,
-	 * not the slot's tupdesc, in case the latter has different datatypes
-	 * (this happens for btree name_ops in particular).  They'd better have
-	 * the same number of columns though, as well as being datatype-compatible
-	 * which is something we can't so easily check.
-	 */
-	Assert(slot->tts_tupleDescriptor->natts == itupdesc->natts);
-
-	ExecClearTuple(slot);
-	index_deform_tuple(itup, itupdesc, slot->tts_values, slot->tts_isnull);
-
-	/*
-	 * Copy all name columns stored as cstrings back into a NAMEDATALEN byte
-	 * sized allocation.  We mark this branch as unlikely as generally "name"
-	 * is used only for the system catalogs and this would have to be a user
-	 * query running on those or some other user table with an index on a name
-	 * column.
-	 */
-	if (unlikely(node->ioss_NameCStringAttNums != NULL))
-	{
-		int			attcount = node->ioss_NameCStringCount;
-
-		for (int idx = 0; idx < attcount; idx++)
-		{
-			int			attnum = node->ioss_NameCStringAttNums[idx];
-			Name		name;
-
-			/* skip null Datums */
-			if (slot->tts_isnull[attnum])
-				continue;
-
-			/* allocate the NAMEDATALEN and copy the datum into that memory */
-			name = (Name) MemoryContextAlloc(node->ss.ps.ps_ExprContext->ecxt_per_tuple_memory,
-											 NAMEDATALEN);
-
-			/* use namestrcpy to zero-pad all trailing bytes */
-			namestrcpy(name, DatumGetCString(slot->tts_values[attnum]));
-			slot->tts_values[attnum] = NameGetDatum(name);
-		}
-	}
-
-	ExecStoreVirtualTuple(slot);
-}
-
 /*
  * IndexOnlyRecheck -- access method routine to recheck a tuple in EvalPlanQual
  *
@@ -410,13 +247,6 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 	indexRelationDesc = node->ioss_RelationDesc;
 	indexScanDesc = node->ioss_ScanDesc;
 
-	/* Release VM buffer pin, if any. */
-	if (node->ioss_VMBuffer != InvalidBuffer)
-	{
-		ReleaseBuffer(node->ioss_VMBuffer);
-		node->ioss_VMBuffer = InvalidBuffer;
-	}
-
 	/*
 	 * When ending a parallel worker, copy the statistics gathered by the
 	 * worker back into shared memory so that it can be picked up by the main
@@ -436,6 +266,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->ioss_Instrument->nsearches;
+		winstrument->ntabletuplefetches += node->ioss_Instrument->ntabletuplefetches;
 	}
 
 	/*
@@ -535,8 +366,6 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	Relation	indexRelation;
 	LOCKMODE	lockmode;
 	TupleDesc	tupDesc;
-	int			indnkeyatts;
-	int			namecount;
 
 	/*
 	 * create state structure
@@ -573,15 +402,6 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 						  &TTSOpsVirtual,
 						  0);
 
-	/*
-	 * We need another slot, in a format that's suitable for the table AM, for
-	 * when we need to fetch a tuple from the table for rechecking visibility.
-	 */
-	indexstate->ioss_TableSlot =
-		ExecAllocTableSlot(&estate->es_tupleTable,
-						   RelationGetDescr(currentRelation),
-						   table_slot_callbacks(currentRelation), 0);
-
 	/*
 	 * Initialize result type and projection info.  The node's targetlist will
 	 * contain Vars with varno = INDEX_VAR, referencing the scan tuple.
@@ -671,48 +491,6 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 		indexstate->ioss_RuntimeContext = NULL;
 	}
 
-	indexstate->ioss_NameCStringAttNums = NULL;
-	indnkeyatts = indexRelation->rd_index->indnkeyatts;
-	namecount = 0;
-
-	/*
-	 * The "name" type for btree uses text_ops which results in storing
-	 * cstrings in the indexed keys rather than names.  Here we detect that in
-	 * a generic way in case other index AMs want to do the same optimization.
-	 * Check for opclasses with an opcintype of NAMEOID and an index tuple
-	 * descriptor with CSTRINGOID.  If any of these are found, create an array
-	 * marking the index attribute number of each of them.  StoreIndexTuple()
-	 * handles copying the name Datums into a NAMEDATALEN-byte allocation.
-	 */
-
-	/* First, count the number of such index keys */
-	for (int attnum = 0; attnum < indnkeyatts; attnum++)
-	{
-		if (TupleDescAttr(indexRelation->rd_att, attnum)->atttypid == CSTRINGOID &&
-			indexRelation->rd_opcintype[attnum] == NAMEOID)
-			namecount++;
-	}
-
-	if (namecount > 0)
-	{
-		int			idx = 0;
-
-		/*
-		 * Now create an array to mark the attribute numbers of the keys that
-		 * need to be converted from cstring to name.
-		 */
-		indexstate->ioss_NameCStringAttNums = palloc_array(AttrNumber, namecount);
-
-		for (int attnum = 0; attnum < indnkeyatts; attnum++)
-		{
-			if (TupleDescAttr(indexRelation->rd_att, attnum)->atttypid == CSTRINGOID &&
-				indexRelation->rd_opcintype[attnum] == NAMEOID)
-				indexstate->ioss_NameCStringAttNums[idx++] = (AttrNumber) attnum;
-		}
-	}
-
-	indexstate->ioss_NameCStringCount = namecount;
-
 	/*
 	 * all done.
 	 */
@@ -768,14 +546,14 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->ioss_RelationDesc,
+								 true,
 								 node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan,
 								 ScanRelIsReadOnly(&node->ss) ?
 								 SO_HINT_REL_READ_ONLY : SO_NONE);
-	node->ioss_ScanDesc->xs_want_itup = true;
-	node->ioss_VMBuffer = InvalidBuffer;
+	Assert(node->ioss_ScanDesc->xs_want_itup);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -818,13 +596,14 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 	node->ioss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->ioss_RelationDesc,
+								 true,
 								 node->ioss_Instrument,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan,
 								 ScanRelIsReadOnly(&node->ss) ?
 								 SO_HINT_REL_READ_ONLY : SO_NONE);
-	node->ioss_ScanDesc->xs_want_itup = true;
+	Assert(node->ioss_ScanDesc->xs_want_itup);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 39f6691ee..457fbdb07 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -86,6 +86,7 @@ IndexNext(IndexScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
+	bool		recheck;
 
 	/*
 	 * extract necessary information from index scan node
@@ -110,6 +111,7 @@ IndexNext(IndexScanState *node)
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->iss_RelationDesc,
+								   false,
 								   estate->es_snapshot,
 								   node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -132,7 +134,7 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while (table_index_getnext_slot(scandesc, direction, slot, &recheck))
 	{
 		CHECK_FOR_INTERRUPTS();
 
@@ -140,7 +142,7 @@ IndexNext(IndexScanState *node)
 		 * If the index was lossy, we have to recheck the index quals using
 		 * the fetched tuple.
 		 */
-		if (scandesc->xs_recheck)
+		if (recheck)
 		{
 			econtext->ecxt_scantuple = slot;
 			if (!ExecQualAndReset(node->indexqualorig, econtext))
@@ -178,6 +180,7 @@ IndexNextWithReorder(IndexScanState *node)
 	TupleTableSlot *slot;
 	ReorderTuple *topmost = NULL;
 	bool		was_exact;
+	bool		recheck;
 	Datum	   *lastfetched_vals;
 	bool	   *lastfetched_nulls;
 	int			cmp;
@@ -208,6 +211,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->iss_RelationDesc,
+								   false,
 								   estate->es_snapshot,
 								   node->iss_Instrument,
 								   node->iss_NumScanKeys,
@@ -266,7 +270,8 @@ IndexNextWithReorder(IndexScanState *node)
 		 * Fetch next tuple from the index.
 		 */
 next_indextuple:
-		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+		if (!table_index_getnext_slot(scandesc, ForwardScanDirection, slot,
+									  &recheck))
 		{
 			/*
 			 * No more tuples from the index.  But we still need to drain any
@@ -280,7 +285,7 @@ next_indextuple:
 		 * If the index was lossy, we have to recheck the index quals and
 		 * ORDER BY expressions using the fetched tuple.
 		 */
-		if (scandesc->xs_recheck)
+		if (recheck)
 		{
 			econtext->ecxt_scantuple = slot;
 			if (!ExecQualAndReset(node->indexqualorig, econtext))
@@ -818,6 +823,7 @@ ExecEndIndexScan(IndexScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->iss_Instrument->nsearches;
+		Assert(node->iss_Instrument->ntabletuplefetches == 0);
 	}
 
 	/*
@@ -1706,6 +1712,7 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->iss_RelationDesc,
+								 false,
 								 node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
@@ -1754,6 +1761,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 	node->iss_ScanDesc =
 		index_beginscan_parallel(node->ss.ss_currentRelation,
 								 node->iss_RelationDesc,
+								 false,
 								 node->iss_Instrument,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 44129a35c..761e2050c 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -2827,7 +2827,7 @@ ri_FastPathCheck(RI_ConstraintInfo *riinfo,
 	idx_rel = index_open(riinfo->conindid, AccessShareLock);
 
 	slot = table_slot_create(pk_rel, NULL);
-	scandesc = index_beginscan(pk_rel, idx_rel,
+	scandesc = index_beginscan(pk_rel, idx_rel, false,
 							   snapshot, NULL,
 							   riinfo->nkeys, 0,
 							   SO_NONE);
@@ -2964,7 +2964,7 @@ ri_FastPathBatchFlush(RI_FastPathEntry *fpentry, Relation fk_rel,
 	 */
 	oldcxt = MemoryContextSwitchTo(fpentry->flush_cxt);
 
-	scandesc = index_beginscan(pk_rel, idx_rel, snapshot, NULL,
+	scandesc = index_beginscan(pk_rel, idx_rel, false, snapshot, NULL,
 							   riinfo->nkeys, 0, SO_NONE);
 
 	GetUserIdAndSecContext(&saved_userid, &saved_sec_context);
@@ -3108,6 +3108,7 @@ ri_FastPathFlushArray(RI_FastPathEntry *fpentry, TupleTableSlot *fk_slot,
 	bool		elem_byval;
 	char		elem_align;
 	ArrayType  *arr;
+	bool		recheck;
 
 	Assert(fpmeta);
 
@@ -3174,13 +3175,16 @@ ri_FastPathFlushArray(RI_FastPathEntry *fpentry, TupleTableSlot *fk_slot,
 	 * Walk all matches.  The index AM returns them in index order.  For each
 	 * match, find which batch item(s) it satisfies.
 	 */
-	while (index_getnext_slot(scandesc, ForwardScanDirection, pk_slot))
+	while (table_index_getnext_slot(scandesc, ForwardScanDirection, pk_slot,
+									&recheck))
 	{
 		Datum		found_val;
 		bool		found_null;
 		bool		concurrently_updated;
 		ScanKeyData recheck_skey[1];
 
+		Assert(!recheck);
+
 		if (!ri_LockPKTuple(pk_rel, pk_slot, snapshot, &concurrently_updated))
 			continue;
 
@@ -3244,13 +3248,17 @@ ri_FastPathProbeOne(Relation pk_rel, Relation idx_rel,
 					ScanKeyData *skey, int nkeys)
 {
 	bool		found = false;
+	bool		recheck;
 
 	index_rescan(scandesc, skey, nkeys, NULL, 0);
 
-	if (index_getnext_slot(scandesc, ForwardScanDirection, slot))
+	if (table_index_getnext_slot(scandesc, ForwardScanDirection, slot,
+								 &recheck))
 	{
 		bool		concurrently_updated;
 
+		Assert(!recheck);
+
 		if (ri_LockPKTuple(pk_rel, slot, snapshot,
 						   &concurrently_updated))
 		{
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d6efd0707..fb978f0cf 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -102,7 +102,6 @@
 #include "access/gin.h"
 #include "access/table.h"
 #include "access/tableam.h"
-#include "access/visibilitymap.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_statistic.h"
@@ -122,7 +121,6 @@
 #include "parser/parsetree.h"
 #include "rewrite/rewriteManip.h"
 #include "statistics/statistics.h"
-#include "storage/bufmgr.h"
 #include "utils/acl.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
@@ -266,7 +264,7 @@ static bool get_actual_variable_endpoint(Relation heapRel,
 										 ScanKey scankeys,
 										 int16 typLen,
 										 bool typByVal,
-										 TupleTableSlot *tableslot,
+										 TupleTableSlot *slot,
 										 MemoryContext outercontext,
 										 Datum *endpointDatum);
 static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
@@ -7041,7 +7039,8 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 			indexRel = index_open(index->indexoid, NoLock);
 
 			/* build some stuff needed for indexscan execution */
-			slot = table_slot_create(heapRel, NULL);
+			slot = MakeSingleTupleTableSlot(RelationGetDescr(indexRel),
+											&TTSOpsVirtual);
 			get_typlenbyval(vardata->atttype, &typLen, &typByVal);
 
 			/* set up an IS NOT NULL scan key so that we ignore nulls */
@@ -7113,8 +7112,7 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
  *
  * scankeys is a 1-element scankey array set up to reject nulls.
  * typLen/typByVal describe the datatype of the index's first column.
- * tableslot is a slot suitable to hold table tuples, in case we need
- * to probe the heap.
+ * slot is a virtual slot to receive each index tuple's values.
  * (We could compute these values locally, but that would mean computing them
  * twice when get_actual_variable_range needs both the min and the max.)
  *
@@ -7128,19 +7126,16 @@ get_actual_variable_endpoint(Relation heapRel,
 							 ScanKey scankeys,
 							 int16 typLen,
 							 bool typByVal,
-							 TupleTableSlot *tableslot,
+							 TupleTableSlot *slot,
 							 MemoryContext outercontext,
 							 Datum *endpointDatum)
 {
 	bool		have_data = false;
 	SnapshotData SnapshotNonVacuumable;
 	IndexScanDesc index_scan;
-	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber last_heap_block = InvalidBlockNumber;
-	int			n_visited_heap_pages = 0;
-	ItemPointer tid;
-	Datum		values[INDEX_MAX_KEYS];
-	bool		isnull[INDEX_MAX_KEYS];
+	bool		recheck;
+	Datum		val;
+	bool		isnull;
 	MemoryContext oldcontext;
 
 	/*
@@ -7186,95 +7181,49 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * a huge amount of time here, so we give up once we've read too many heap
 	 * pages.  When we fail for that reason, the caller will end up using
 	 * whatever extremal value is recorded in pg_statistic.
+	 *
+	 * We set xs_visited_pages_limit to tell the table AM to count distinct
+	 * heap pages visited for non-visible tuples and give up after the limit
+	 * is exceeded.
 	 */
+#define VISITED_PAGES_LIMIT 100
 	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
 							  GlobalVisTestFor(heapRel));
 
-	index_scan = index_beginscan(heapRel, indexRel,
+	index_scan = index_beginscan(heapRel, indexRel, true,
 								 &SnapshotNonVacuumable, NULL,
 								 1, 0,
 								 SO_NONE);
-	/* Set it up for index-only scan */
-	index_scan->xs_want_itup = true;
+	Assert(index_scan->xs_want_itup);
+	index_scan->xs_visited_pages_limit = VISITED_PAGES_LIMIT;
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
+	while (table_index_getnext_slot(index_scan, indexscandir, slot, &recheck))
 	{
-		BlockNumber block = ItemPointerGetBlockNumber(tid);
-
-		if (!VM_ALL_VISIBLE(heapRel,
-							block,
-							&vmbuffer))
-		{
-			/* Rats, we have to visit the heap to check visibility */
-			if (!index_fetch_heap(index_scan, tableslot))
-			{
-				/*
-				 * No visible tuple for this index entry, so we need to
-				 * advance to the next entry.  Before doing so, count heap
-				 * page fetches and give up if we've done too many.
-				 *
-				 * We don't charge a page fetch if this is the same heap page
-				 * as the previous tuple.  This is on the conservative side,
-				 * since other recently-accessed pages are probably still in
-				 * buffers too; but it's good enough for this heuristic.
-				 */
-#define VISITED_PAGES_LIMIT 100
-
-				if (block != last_heap_block)
-				{
-					last_heap_block = block;
-					n_visited_heap_pages++;
-					if (n_visited_heap_pages > VISITED_PAGES_LIMIT)
-						break;
-				}
-
-				continue;		/* no visible tuple, try next index entry */
-			}
-
-			/* We don't actually need the heap tuple for anything */
-			ExecClearTuple(tableslot);
-
-			/*
-			 * We don't care whether there's more than one visible tuple in
-			 * the HOT chain; if any are visible, that's good enough.
-			 */
-		}
-
-		/*
-		 * We expect that the index will return data in IndexTuple not
-		 * HeapTuple format.
-		 */
-		if (!index_scan->xs_itup)
-			elog(ERROR, "no data returned for index-only scan");
-
 		/*
 		 * We do not yet support recheck here.
 		 */
-		if (index_scan->xs_recheck)
+		if (recheck)
 			break;
 
-		/* OK to deconstruct the index tuple */
-		index_deform_tuple(index_scan->xs_itup,
-						   index_scan->xs_itupdesc,
-						   values, isnull);
+		/* Read the index's first column value out of the slot */
+		val = slot_getattr(slot, 1, &isnull);
 
 		/* Shouldn't have got a null, but be careful */
-		if (isnull[0])
+		if (isnull)
 			elog(ERROR, "found unexpected null value in index \"%s\"",
 				 RelationGetRelationName(indexRel));
 
 		/* Copy the index column value out to caller's context */
 		oldcontext = MemoryContextSwitchTo(outercontext);
-		*endpointDatum = datumCopy(values[0], typByVal, typLen);
+		*endpointDatum = datumCopy(val, typByVal, typLen);
 		MemoryContextSwitchTo(oldcontext);
 		have_data = true;
 		break;
 	}
 
-	if (vmbuffer != InvalidBuffer)
-		ReleaseBuffer(vmbuffer);
+	ExecClearTuple(slot);
 	index_endscan(index_scan);
 
 	return have_data;
diff --git a/src/test/modules/index/Makefile b/src/test/modules/index/Makefile
index 29047044e..83dd09745 100644
--- a/src/test/modules/index/Makefile
+++ b/src/test/modules/index/Makefile
@@ -1,7 +1,17 @@
 # src/test/modules/index/Makefile
 
+MODULE_big = test_indexscan
+OBJS = \
+	$(WIN32RES) \
+	test_indexscan.o
+EXTENSION = test_indexscan
+DATA = test_indexscan--1.0.sql
+PGFILEDESC = "test_indexscan - test index scan internals"
+
 EXTRA_INSTALL = contrib/btree_gin contrib/btree_gist
 
+REGRESS = hot_chain
+
 ISOLATION = killtuples
 
 ifdef USE_PGXS
diff --git a/src/test/modules/index/expected/hot_chain.out b/src/test/modules/index/expected/hot_chain.out
new file mode 100644
index 000000000..0e7832a84
--- /dev/null
+++ b/src/test/modules/index/expected/hot_chain.out
@@ -0,0 +1,56 @@
+-- Non-MVCC index scans must return every visible member of a HOT chain.
+-- Verify that table_index_getnext_slot gets that right in a variety of cases.
+CREATE EXTENSION test_indexscan;
+-- Single-page table; all TIDs below are deterministic: a fresh table gets
+-- block 0, heap_insert/heap_update assign line pointers sequentially, and the
+-- rows are tiny to avoid any toasting.
+CREATE TABLE hot_chain_tab (id int, filler text) WITH (autovacuum_enabled = off);
+CREATE INDEX hot_chain_idx ON hot_chain_tab (id);
+INSERT INTO hot_chain_tab VALUES (1, 'r1v1'), (2, 'r2v1');  -- (0,1) (0,2)
+BEGIN;
+-- Create HOT chains.  These updates touch only the non-indexed column
+-- "filler", and every chain member stays visible to SnapshotAny because the
+-- deleting transaction (us) is still open.
+UPDATE hot_chain_tab SET filler = 'r1v2' WHERE id = 1;   -- (0,3)
+UPDATE hot_chain_tab SET filler = 'r1v3' WHERE id = 1;   -- (0,4)
+UPDATE hot_chain_tab SET filler = 'r2v2' WHERE id = 2;   -- (0,5)
+-- Verify that all three updates were HOT using backend-local xact counter:
+SELECT pg_stat_get_xact_tuples_hot_updated('hot_chain_tab'::regclass) AS hot_updated;
+ hot_updated 
+-------------
+           3
+(1 row)
+
+-- SnapshotAny: every chain member
+SELECT * FROM index_scan_tids('hot_chain_idx', 'any');
+ index_scan_tids 
+-----------------
+ (0,1)
+ (0,3)
+ (0,4)
+ (0,2)
+ (0,5)
+(5 rows)
+
+-- MVCC scan, for contrast: only the newest member of each chain
+SELECT * FROM index_scan_tids('hot_chain_idx', 'mvcc');
+ index_scan_tids 
+-----------------
+ (0,4)
+ (0,5)
+(2 rows)
+
+-- backward: index entries in reverse key order, chains still in ASC chain order
+SELECT * FROM index_scan_tids('hot_chain_idx', 'any', 'backward');
+ index_scan_tids 
+-----------------
+ (0,2)
+ (0,5)
+ (0,1)
+ (0,3)
+ (0,4)
+(5 rows)
+
+COMMIT;
+DROP TABLE hot_chain_tab;
+DROP EXTENSION test_indexscan;
diff --git a/src/test/modules/index/meson.build b/src/test/modules/index/meson.build
index 834ce081f..5f5630230 100644
--- a/src/test/modules/index/meson.build
+++ b/src/test/modules/index/meson.build
@@ -1,9 +1,35 @@
 # Copyright (c) 2025-2026, PostgreSQL Global Development Group
 
+test_indexscan_sources = files(
+  'test_indexscan.c',
+)
+
+if host_system == 'windows'
+  test_indexscan_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_indexscan',
+    '--FILEDESC', 'test_indexscan - test index scan internals',])
+endif
+
+test_indexscan = shared_module('test_indexscan',
+  test_indexscan_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_indexscan
+
+test_install_data += files(
+  'test_indexscan.control',
+  'test_indexscan--1.0.sql',
+)
+
 tests += {
   'name': 'index',
   'sd': meson.current_source_dir(),
   'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'hot_chain',
+    ],
+  },
   'isolation': {
     'specs': [
       'killtuples',
diff --git a/src/test/modules/index/sql/hot_chain.sql b/src/test/modules/index/sql/hot_chain.sql
new file mode 100644
index 000000000..dcbdb93b7
--- /dev/null
+++ b/src/test/modules/index/sql/hot_chain.sql
@@ -0,0 +1,37 @@
+-- Non-MVCC index scans must return every visible member of a HOT chain.
+-- Verify that table_index_getnext_slot gets that right in a variety of cases.
+
+CREATE EXTENSION test_indexscan;
+
+-- Single-page table; all TIDs below are deterministic: a fresh table gets
+-- block 0, heap_insert/heap_update assign line pointers sequentially, and the
+-- rows are tiny to avoid any toasting.
+CREATE TABLE hot_chain_tab (id int, filler text) WITH (autovacuum_enabled = off);
+CREATE INDEX hot_chain_idx ON hot_chain_tab (id);
+INSERT INTO hot_chain_tab VALUES (1, 'r1v1'), (2, 'r2v1');  -- (0,1) (0,2)
+
+BEGIN;
+
+-- Create HOT chains.  These updates touch only the non-indexed column
+-- "filler", and every chain member stays visible to SnapshotAny because the
+-- deleting transaction (us) is still open.
+UPDATE hot_chain_tab SET filler = 'r1v2' WHERE id = 1;   -- (0,3)
+UPDATE hot_chain_tab SET filler = 'r1v3' WHERE id = 1;   -- (0,4)
+UPDATE hot_chain_tab SET filler = 'r2v2' WHERE id = 2;   -- (0,5)
+
+-- Verify that all three updates were HOT using backend-local xact counter:
+SELECT pg_stat_get_xact_tuples_hot_updated('hot_chain_tab'::regclass) AS hot_updated;
+
+-- SnapshotAny: every chain member
+SELECT * FROM index_scan_tids('hot_chain_idx', 'any');
+
+-- MVCC scan, for contrast: only the newest member of each chain
+SELECT * FROM index_scan_tids('hot_chain_idx', 'mvcc');
+
+-- backward: index entries in reverse key order, chains still in ASC chain order
+SELECT * FROM index_scan_tids('hot_chain_idx', 'any', 'backward');
+
+COMMIT;
+
+DROP TABLE hot_chain_tab;
+DROP EXTENSION test_indexscan;
diff --git a/src/test/modules/index/test_indexscan--1.0.sql b/src/test/modules/index/test_indexscan--1.0.sql
new file mode 100644
index 000000000..b662d7054
--- /dev/null
+++ b/src/test/modules/index/test_indexscan--1.0.sql
@@ -0,0 +1,13 @@
+/* src/test/modules/index/test_indexscan--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_indexscan" to load this file. \quit
+
+-- Scan indexrel with zero scan keys and the given snapshot type ('mvcc',
+-- 'any', 'self', 'dirty', 'nonvacuumable') and direction ('forward',
+-- 'backward'), returning the heap TID of every tuple the scan yields, in
+-- scan order.
+CREATE FUNCTION index_scan_tids(indexrel regclass, snaptype text,
+                                dir text DEFAULT 'forward')
+RETURNS SETOF pg_catalog.tid
+STRICT AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/index/test_indexscan.c b/src/test/modules/index/test_indexscan.c
new file mode 100644
index 000000000..fabd4ae25
--- /dev/null
+++ b/src/test/modules/index/test_indexscan.c
@@ -0,0 +1,146 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_indexscan.c
+ *		Test helpers for low-level index scan behavior.
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/index/test_indexscan.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/relscan.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "executor/tuptable.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/itemptr.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+#include "utils/tuplestore.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * index_scan_tids(indexrel regclass, snaptype text, dir text) RETURNS SETOF tid
+ *
+ * Scan indexrel with zero scan keys, using a snapshot of the given type and
+ * the given scan direction, and return the heap TID of every tuple that the
+ * scan returns, in scan order.
+ */
+PG_FUNCTION_INFO_V1(index_scan_tids);
+Datum
+index_scan_tids(PG_FUNCTION_ARGS)
+{
+	Oid			indexoid = PG_GETARG_OID(0);
+	char	   *snaptype = text_to_cstring(PG_GETARG_TEXT_PP(1));
+	char	   *dirstr = text_to_cstring(PG_GETARG_TEXT_PP(2));
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Oid			heapoid;
+	Relation	heaprel;
+	Relation	indexrel;
+	ScanDirection dir;
+	SnapshotData snapdata;
+	Snapshot	snapshot;
+	IndexScanDesc scan;
+	TupleTableSlot *slot;
+	bool		recheck;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to use index scan test functions")));
+
+	InitMaterializedSRF(fcinfo, MAT_SRF_USE_EXPECTED_DESC);
+
+	if (strcmp(dirstr, "forward") == 0)
+		dir = ForwardScanDirection;
+	else if (strcmp(dirstr, "backward") == 0)
+		dir = BackwardScanDirection;
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid scan direction \"%s\"", dirstr)));
+
+	/*
+	 * Lock the heap before the index to avoid deadlock.  IndexGetRelation()
+	 * runs without a lock; with missing_ok set, it returns InvalidOid if the
+	 * OID isn't an index.  Defer the complaint to index_open() below, which
+	 * gives a better message.
+	 */
+	heapoid = IndexGetRelation(indexoid, true);
+	if (OidIsValid(heapoid))
+		heaprel = table_open(heapoid, AccessShareLock);
+	else
+		heaprel = NULL;
+
+	indexrel = index_open(indexoid, AccessShareLock);
+
+	/*
+	 * Since the IndexGetRelation() call above ran without a lock, recheck now
+	 * that both relations are locked: a concurrent drop and recreate could
+	 * have left us with the wrong heap.
+	 */
+	if (heaprel == NULL || heapoid != IndexGetRelation(indexoid, false))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_TABLE),
+				 errmsg("could not open parent table of index \"%s\"",
+						RelationGetRelationName(indexrel))));
+
+	if (strcmp(snaptype, "mvcc") == 0)
+		snapshot = GetActiveSnapshot();
+	else if (strcmp(snaptype, "any") == 0)
+		snapshot = SnapshotAny;
+	else if (strcmp(snaptype, "self") == 0)
+		snapshot = SnapshotSelf;
+	else if (strcmp(snaptype, "dirty") == 0)
+	{
+		InitDirtySnapshot(snapdata);
+		snapshot = &snapdata;
+	}
+	else if (strcmp(snaptype, "nonvacuumable") == 0)
+	{
+		InitNonVacuumableSnapshot(snapdata, GlobalVisTestFor(heaprel));
+		snapshot = &snapdata;
+	}
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid snapshot type \"%s\"", snaptype)));
+
+	slot = table_slot_create(heaprel, NULL);
+
+	scan = index_beginscan(heaprel, indexrel, false, snapshot, NULL,
+						   0, 0, SO_NONE);
+	index_rescan(scan, NULL, 0, NULL, 0);
+
+	while (table_index_getnext_slot(scan, dir, slot, &recheck))
+	{
+		ItemPointerData tid = slot->tts_tid;
+		Datum		values[1];
+		bool		nulls[1];
+
+		/* with zero scan keys, no AM should ever request a recheck */
+		if (recheck)
+			elog(ERROR, "unexpected recheck request from keyless index scan");
+
+		values[0] = ItemPointerGetDatum(&tid);
+		nulls[0] = false;
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+
+	index_endscan(scan);
+	ExecDropSingleTupleTableSlot(slot);
+	index_close(indexrel, AccessShareLock);
+	table_close(heaprel, AccessShareLock);
+
+	return (Datum) 0;
+}
diff --git a/src/test/modules/index/test_indexscan.control b/src/test/modules/index/test_indexscan.control
new file mode 100644
index 000000000..bfab27bf0
--- /dev/null
+++ b/src/test/modules/index/test_indexscan.control
@@ -0,0 +1,4 @@
+comment = 'helper function for low-level index scan behavior'
+default_version = '1.0'
+module_pathname = '$libdir/test_indexscan'
+relocatable = true
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index bbb1db3c4..fe6f51a6a 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -897,6 +897,68 @@ SELECT idx_scan, stats_reset IS NOT NULL AS has_stats_reset
         0 | t
 (1 row)
 
+-- Test pg_stat_all_indexes.idx_tup_read counter
+CREATE TEMPORARY TABLE test_idx_tup_read AS
+  SELECT g AS a FROM generate_series(1, 200) g;
+CREATE INDEX ON test_idx_tup_read(a);
+SET enable_seqscan = off;
+SET enable_indexonlyscan = off;
+-- plain index scan
+SET enable_bitmapscan = off;
+EXPLAIN (COSTS off) SELECT count(*) FROM test_idx_tup_read WHERE a BETWEEN 1 AND 200;
+                             QUERY PLAN                              
+---------------------------------------------------------------------
+ Aggregate
+   ->  Index Scan using test_idx_tup_read_a_idx on test_idx_tup_read
+         Index Cond: ((a >= 1) AND (a <= 200))
+(3 rows)
+
+SELECT count(*) FROM test_idx_tup_read WHERE a BETWEEN 1 AND 200;
+ count 
+-------
+   200
+(1 row)
+
+-- bitmap index scan
+SET enable_indexscan = off;
+SET enable_bitmapscan = on;
+EXPLAIN (COSTS off) SELECT count(*) FROM test_idx_tup_read WHERE a BETWEEN 1 AND 200;
+                        QUERY PLAN                        
+----------------------------------------------------------
+ Aggregate
+   ->  Bitmap Heap Scan on test_idx_tup_read
+         Recheck Cond: ((a >= 1) AND (a <= 200))
+         ->  Bitmap Index Scan on test_idx_tup_read_a_idx
+               Index Cond: ((a >= 1) AND (a <= 200))
+(5 rows)
+
+SELECT count(*) FROM test_idx_tup_read WHERE a BETWEEN 1 AND 200;
+ count 
+-------
+   200
+(1 row)
+
+RESET enable_seqscan;
+RESET enable_indexonlyscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+-- We expect a total of 400 tuples read (200 from plain index scan, 200 from
+-- bitmap index scan).  However, we only expect 200 tuple fetches, because
+-- bitmap index scans/heap scans don't affect the relevant counter.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT idx_tup_read, idx_tup_fetch FROM pg_stat_all_indexes
+  WHERE indexrelid = 'test_idx_tup_read_a_idx'::regclass;
+ idx_tup_read | idx_tup_fetch 
+--------------+---------------
+          400 |           200
+(1 row)
+
+DROP TABLE test_idx_tup_read;
 -----
 -- Test reset of some stats for shared table
 -----
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 610fd21fa..923d2cbc0 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -396,6 +396,35 @@ SELECT pg_stat_reset_single_table_counters('test_last_scan_pkey'::regclass);
 SELECT idx_scan, stats_reset IS NOT NULL AS has_stats_reset
   FROM pg_stat_all_indexes WHERE indexrelid = 'test_last_scan_pkey'::regclass;
 
+-- Test pg_stat_all_indexes.idx_tup_read counter
+CREATE TEMPORARY TABLE test_idx_tup_read AS
+  SELECT g AS a FROM generate_series(1, 200) g;
+CREATE INDEX ON test_idx_tup_read(a);
+
+SET enable_seqscan = off;
+SET enable_indexonlyscan = off;
+-- plain index scan
+SET enable_bitmapscan = off;
+EXPLAIN (COSTS off) SELECT count(*) FROM test_idx_tup_read WHERE a BETWEEN 1 AND 200;
+SELECT count(*) FROM test_idx_tup_read WHERE a BETWEEN 1 AND 200;
+-- bitmap index scan
+SET enable_indexscan = off;
+SET enable_bitmapscan = on;
+EXPLAIN (COSTS off) SELECT count(*) FROM test_idx_tup_read WHERE a BETWEEN 1 AND 200;
+SELECT count(*) FROM test_idx_tup_read WHERE a BETWEEN 1 AND 200;
+RESET enable_seqscan;
+RESET enable_indexonlyscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+
+-- We expect a total of 400 tuples read (200 from plain index scan, 200 from
+-- bitmap index scan).  However, we only expect 200 tuple fetches, because
+-- bitmap index scans/heap scans don't affect the relevant counter.
+SELECT pg_stat_force_next_flush();
+SELECT idx_tup_read, idx_tup_fetch FROM pg_stat_all_indexes
+  WHERE indexrelid = 'test_idx_tup_read_a_idx'::regclass;
+DROP TABLE test_idx_tup_read;
+
 -----
 -- Test reset of some stats for shared table
 -----
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c5db6ca67..6801894d7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1332,8 +1332,6 @@ IndexDeleteCounts
 IndexDeletePrefetchState
 IndexDoCheckCallback
 IndexElem
-IndexFetchHeapData
-IndexFetchTableData
 IndexInfo
 IndexList
 IndexOnlyScan
@@ -1345,6 +1343,7 @@ IndexRuntimeKeyInfo
 IndexScan
 IndexScanDesc
 IndexScanDescData
+IndexScanHeapData
 IndexScanInstrumentation
 IndexScanState
 IndexStateFlagsAction
-- 
2.53.0



  [application/octet-stream] v28-0008-heapam-Add-index-scan-I-O-prefetching.patch (55.1K, 6-v28-0008-heapam-Add-index-scan-I-O-prefetching.patch)
From 47bd703c89f60233763c9dd5ac179e45fe41b551 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Wed, 25 Mar 2026 16:58:09 -0400
Subject: [PATCH v28 08/11] heapam: Add index scan I/O prefetching.

This commit implements I/O prefetching for index scans (and index-only
scans that require heap fetches). This was made possible by the recent
addition of batching interfaces to both the table AM and index AM APIs.

The amgetbatch index AM interface provides batches of matching TIDs
(rather than one tuple at a time).  Each batch must be built from index
tuples that appear together on a single index page.  This allows
multiple batches to be held open simultaneously.  Giving the table AM an
explicit understanding of index AM concepts/index page boundaries allows
it to consider all of the relevant costs and benefits.

Prefetching is implemented using a prefetching position under the
control of the table AM.  This is closely related to the scan position
added by commit FIXME, which introduced the amgetbatch interface.  A
read stream callback advances the prefetching position as needed,
returning enough heap block numbers to maintain the read stream's
target prefetch distance.
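
As a rough illustration of the callback's job (not the patch's actual
code), here is a minimal standalone sketch.  ToyPrefetch, toy_next_block
and TOY_INVALID_BLOCK are hypothetical names; the sketch assumes a flat
array of heap block numbers in place of real batches, and assumes that
runs of TIDs pointing to the block most recently returned are skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/* Hypothetical toy state: heap block of each matching TID, in scan order */
typedef struct ToyPrefetch
{
	const uint32_t *tid_blocks;
	size_t		ntids;
	size_t		prefetch_pos;	/* next item the callback will look at */
	uint32_t	last_block;		/* last block handed to the stream */
} ToyPrefetch;

/*
 * Toy callback: advance prefetch_pos until a TID points to a block other
 * than the one most recently returned, then return that block.
 */
static uint32_t
toy_next_block(ToyPrefetch *p)
{
	assert(p != NULL);

	while (p->prefetch_pos < p->ntids)
	{
		uint32_t	blk = p->tid_blocks[p->prefetch_pos++];

		if (blk != p->last_block)
		{
			p->last_block = blk;
			return blk;
		}
	}

	return TOY_INVALID_BLOCK;	/* no more items: end of stream */
}
```

Each call hands the stream one more block number, so the stream can call
it as often as needed to reach its target distance.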

Testing has shown that index prefetching can make index scans much
faster.  Large range scans that return many tuples can be as much as 30x
faster with local SSDs when buffered I/O is used, and 50x faster or more
with higher-latency storage such as network-attached block devices,
where the benefit of hiding I/O latency through prefetching is even
greater.

An important goal of the amgetbatch design is to enable the table AM's
read stream callback to advance its prefetch position using TIDs that
appear on a leaf page that's ahead of the current scan position's leaf
page.  This is crucial with scans of indexes where each leaf page
happens to have relatively few distinct heap blocks among its matching
TIDs (as well as with scans with leaf pages that have relatively few
total matching items).  Index scans can have as many as 64 open batches,
which testing has shown to be about the maximum number that can ever be
useful.  Batches are maintained in scan order using a simple ring buffer
data structure.
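
The ring buffer idea can be sketched in isolation as follows.  This is a
toy under stated assumptions, not the patch's data structure: ToyRing and
the toy_* functions are hypothetical, and the sketch assumes 8-bit head
and next counters that are allowed to wrap, with slots addressed modulo
64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_BATCHES 64

typedef struct ToyRing
{
	uint8_t		head;			/* oldest open batch (scan position's batch) */
	uint8_t		next;			/* where the next batch will be appended */
	int			slots[TOY_MAX_BATCHES];
} ToyRing;

/* Number of open batches; correct even after the counters wrap */
static int
toy_ring_count(const ToyRing *r)
{
	return (uint8_t) (r->next - r->head);
}

static bool
toy_ring_full(const ToyRing *r)
{
	return toy_ring_count(r) == TOY_MAX_BATCHES;
}

/* Append a batch in scan order; fails when all slots are in use */
static bool
toy_ring_append(ToyRing *r, int batch_id)
{
	if (toy_ring_full(r))
		return false;
	r->slots[r->next % TOY_MAX_BATCHES] = batch_id;
	r->next++;
	return true;
}

/* Release the head batch, freeing up its slot */
static bool
toy_ring_release_head(ToyRing *r)
{
	if (toy_ring_count(r) == 0)
		return false;
	r->head++;
	return true;
}

/* Signed distance from batch b to batch a, tolerating counter wraparound */
static int
toy_batch_cmp(uint8_t a, uint8_t b)
{
	uint8_t		d = (uint8_t) (a - b);

	assert(d < TOY_MAX_BATCHES || d > 256 - TOY_MAX_BATCHES);
	return d < 128 ? (int) d : (int) d - 256;
}
```

Because at most 64 batches are ever open, the wrapped difference between
any two live batch numbers is unambiguous.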

In rare cases where the scan reaches this quasi-arbitrary limit of 64,
the read stream is temporarily paused using the read stream pausing
mechanism added by commit 38229cb9.  Prefetching (via the read stream)
is resumed only after the scan position advances beyond its current open
batch and then frees and removes the batch from the scan's batch ring
buffer.  Testing has shown that it isn't very common for scans to hold
open more than about 10 batches to get the desired I/O prefetch
distance.
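
The pause/resume handshake might be pictured like this.  It is only a
sketch: ToyState and both toy_* functions are hypothetical, and a plain
counter stands in for the ring buffer and for the read stream's own
pause/resume calls.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_BATCHES 64

typedef struct ToyState
{
	int			open_batches;	/* batches currently held open */
	bool		paused;			/* prefetching paused until a slot frees up? */
} ToyState;

/*
 * Prefetch side: try to load one more batch.  When every slot is in use,
 * remember that prefetching is paused and load nothing.
 */
static bool
toy_prefetch_load_batch(ToyState *s)
{
	if (s->open_batches == TOY_MAX_BATCHES)
	{
		s->paused = true;
		return false;
	}
	s->open_batches++;
	return true;
}

/*
 * Scan side: the scan position has moved past its current batch, so free
 * that batch.  Freeing a slot is what allows prefetching to resume.
 */
static void
toy_scan_finish_batch(ToyState *s)
{
	assert(s->open_batches > 0);
	s->open_batches--;
	s->paused = false;			/* real code would resume the read stream */
}
```

The point of the split is that only the scan side can free a slot, so
only the scan side can lift the pause.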

The heuristic used to decide when to begin prefetching delays
initialization of the scan's read stream until the scan must read a
fourth heap page.  Note that the rule is the same for index-only scans.
As a result, index-only scans won't create a read stream whenever they
require no (or only very few) heap fetches.

A new GUC (enable_indexscan_prefetch) controls the use of index
prefetching.  The default setting is 'on', so all amgetbatch index scans
that use an MVCC snapshot are eligible for prefetching.  Index-only scans
apply the usual "start prefetching on the fourth heap page" test to gate
prefetching, and so will never create a read stream in cases where all
(or almost all) relevant visibility map bits are set.

Author: Tomas Vondra <[email protected]>
Author: Peter Geoghegan <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/include/access/heapam.h                   |  16 +-
 src/include/access/indexbatch.h               | 228 +++++++++-
 src/include/access/relscan.h                  |   7 +
 src/include/optimizer/cost.h                  |   1 +
 src/backend/access/heap/heapam_indexscan.c    | 417 +++++++++++++++++-
 src/backend/access/index/indexbatch.c         |  35 +-
 src/backend/optimizer/path/costsize.c         |   1 +
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml                      |  16 +
 doc/src/sgml/indexam.sgml                     | 107 ++++-
 doc/src/sgml/tableam.sgml                     |   7 +
 src/test/regress/expected/sysviews.out        |   3 +-
 src/tools/pgindent/typedefs.list              |   1 +
 14 files changed, 825 insertions(+), 22 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 71b6420c9..986b5dbe9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -135,7 +135,21 @@ typedef struct IndexScanHeapData
 	/* Plain index scan xs_lastinblock optimization */
 	bool		xs_lastinblock; /* last TID on this block in current batch? */
 
-	uint16		xs_blkswitch_count; /* number of heap blocks fetched */
+	/*
+	 * Read stream state for prefetching (only used during amgetbatch scans).
+	 *
+	 * The read stream moves ahead of the scan's current position using its
+	 * own prefetching position (per the tableam_util_prefetchpos_*
+	 * conventions from indexbatch.h).  The read stream is allocated early in
+	 * the scan, and reset on rescan (and when the scan direction changes).
+	 */
+	bool		xs_paused;		/* paused until next batch is read? */
+	bool		xs_prefetching_safe;	/* prefetching is safe? */
+	uint16		xs_blkswitch_count; /* determines when to prefetch */
+
+	ScanDirection xs_read_stream_dir;	/* index scan direction */
+	BlockNumber xs_prefetch_block;	/* last block returned to xs_read_stream */
+	ReadStream *xs_read_stream; /* prefetching read stream */
 
 	/* Per-tuple context for padding "name" columns during index-only scans */
 	MemoryContext xs_itup_cxt;
diff --git a/src/include/access/indexbatch.h b/src/include/access/indexbatch.h
index 24b531705..d765059e9 100644
--- a/src/include/access/indexbatch.h
+++ b/src/include/access/indexbatch.h
@@ -195,6 +195,41 @@ index_scan_batch_index_opaque_dyn(IndexScanDesc scan, IndexScanBatch batch)
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Compare two batch ring positions in the given scan direction.
+ *
+ * Returns negative if pos1 is behind pos2, 0 if equal, positive if pos1 is
+ * ahead of pos2.
+ */
+static inline int
+index_scan_pos_cmp(BatchRingItemPos *pos1, BatchRingItemPos *pos2,
+				   ScanDirection direction)
+{
+	int8		batchdiff;
+
+	Assert(pos1->valid && pos2->valid);
+
+	batchdiff = (int8) (pos1->batch - pos2->batch);
+
+	Assert(batchdiff > -INDEX_SCAN_MAX_BATCHES &&
+		   batchdiff < INDEX_SCAN_MAX_BATCHES);
+
+	if (batchdiff != 0)
+	{
+		/* Resolve comparison using differing batch offsets */
+		return batchdiff;
+	}
+
+	/*
+	 * Resolve comparison using items[]-wise indexes from caller's positions,
+	 * since both positions point to the same ring buffer batch
+	 */
+	if (ScanDirectionIsForward(direction))
+		return pos1->item - pos2->item;
+	else
+		return pos2->item - pos1->item;
+}
+
 /*
  * Advance position to its next item in the batch.
  *
@@ -296,6 +331,7 @@ tableam_util_batchscan_init(IndexScanDesc scan)
 	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
 
 	scan->batchringbuf.scanPos.valid = false;
+	scan->batchringbuf.prefetchPos.valid = false;
 	scan->batchringbuf.markPos.valid = false;
 
 	scan->batchringbuf.markBatch = NULL;
@@ -345,7 +381,7 @@ tableam_util_scanpos_advance(IndexScanDesc scan, ScanDirection direction,
 
 	/*
 	 * scanPos is valid, so scanBatch must already be loaded in batch ring
-	 * buffer.  We rely on that here.
+	 * buffer.  We rely on that here (can't do this with prefetchBatch).
 	 */
 	pg_assume(batchringbuf->headBatch == scanPos->batch);
 
@@ -357,9 +393,9 @@ tableam_util_scanpos_advance(IndexScanDesc scan, ScanDirection direction,
 /*
  * Fetch the next batch of matching items for the scan (or the first).
  *
- * Called when caller's current batch (passed to us as priorBatch) has no more
- * matching items in the given scan direction.  Caller passes a NULL
- * priorBatch on the first call here for the scan.
+ * Called when caller's current scanBatch/prefetchBatch (passed to us as
+ * priorBatch) has no more matching items in the given scan direction.  Caller
+ * passes a NULL priorBatch on the first call here for the scan.
  *
  * Returns the next batch to be processed by caller in the given scan
  * direction, or NULL when there are no more matches in that direction.
@@ -368,7 +404,7 @@ tableam_util_scanpos_advance(IndexScanDesc scan, ScanDirection direction,
  *
  * We don't free any batches here; that is a separate step performed by
  * tableam_util_scanpos_nextbatch.  Caller also needs to advance their
- * position to the start of the returned batch.
+ * scanPos/prefetchPos position to the start of the returned batch.
  */
 static pg_attribute_always_inline IndexScanBatch
 tableam_util_fetch_next_batch(IndexScanDesc scan, ScanDirection direction,
@@ -482,13 +518,19 @@ tableam_util_fetch_next_batch(IndexScanDesc scan, ScanDirection direction,
  * now-obsolescent old scanBatch (the ring buffer's head batch), freeing up
  * its ring buffer slot.  (When newScanBatch is the scan's first batch, there
  * is no old scanBatch for us to release.)
+ *
+ * Return value indicates if a previously occupied ring buffer slot was freed.
+ * A table AM that paused its prefetch mechanism because the ring buffer was
+ * full (see tableam_util_prefetchpos_advance) can resume it when we return
+ * true (to indicate to caller that there's now space to store another batch).
  */
-static pg_attribute_always_inline void
+static pg_attribute_always_inline bool
 tableam_util_scanpos_nextbatch(IndexScanDesc scan, ScanDirection direction,
 							   IndexScanBatch newScanBatch)
 {
 	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
 	BatchRingItemPos *scanPos = &batchringbuf->scanPos;
+	BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos;
 	bool		releaseOldHeadBatch = scanPos->valid;
 	IndexScanBatch headBatch;
 
@@ -500,7 +542,7 @@ tableam_util_scanpos_nextbatch(IndexScanDesc scan, ScanDirection direction,
 	{
 		/* newScanBatch is the scan's first and only batch */
 		Assert(batchringbuf->headBatch == scanPos->batch);
-		return;
+		return false;
 	}
 
 	headBatch = index_scan_batch(scan, batchringbuf->headBatch);
@@ -511,12 +553,184 @@ tableam_util_scanpos_nextbatch(IndexScanDesc scan, ScanDirection direction,
 	/* free obsolescent head batch (unless it is scan's markBatch) */
 	tableam_util_release_batch(scan, headBatch);
 
+	/*
+	 * If we're about to release the batch that prefetchPos currently points
+	 * to, just invalidate prefetchPos.  This keeps prefetchPos from ever
+	 * falling behind scanPos at the batch granularity, which
+	 * tableam_util_prefetchpos_catchup relies on.
+	 */
+	if (prefetchPos->valid &&
+		prefetchPos->batch == batchringbuf->headBatch)
+		prefetchPos->valid = false;
+
 	/* Remove the batch from the ring buffer (even if it's markBatch) */
 	batchringbuf->headBatch++;
 
 	/* Postconditions for having freed up a ring buffer slot */
+	Assert(!prefetchPos->valid ||
+		   index_scan_batch_loaded(scan, prefetchPos->batch));
 	Assert(!index_scan_batch_full(scan));
 	Assert(batchringbuf->headBatch == scanPos->batch);
+
+	return true;
+}
+
+/*
+ * Handle initialization of the scan's prefetchPos, when prefetchPos isn't
+ * yet valid (also handles the prefetchPos < scanPos edge case).
+ *
+ * Called at the start of each table AM prefetch callback call.  Returns true
+ * after setting prefetchPos to the scan's current scanPos.  That's a special
+ * case: the prefetch callback should process the very item that the scan is
+ * on directly (e.g., by returning that item's table block to its read
+ * stream), rather than reading ahead of the scan.  Returns false when
+ * prefetchPos is ahead of (or equal to) scanPos, in which case the prefetch
+ * callback picks up from where its last call left off.
+ */
+static inline bool
+tableam_util_prefetchpos_catchup(IndexScanDesc scan, ScanDirection direction)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &batchringbuf->scanPos;
+	BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos;
+
+	/*
+	 * scanPos must always be valid when prefetching takes place.  There has
+	 * to be at least one batch, loaded as the scan's scanBatch.
+	 */
+	Assert(index_scan_batch_count(scan) > 0);
+	Assert(scanPos->valid && index_scan_batch_loaded(scan, scanPos->batch));
+
+	/*
+	 * prefetchPos can "fall behind" scanPos at the item granularity: the
+	 * prefetch callback only runs on demand, so scanPos can overtake
+	 * prefetchPos whenever the scan consumes items without the callback being
+	 * called (e.g., runs of adjacent matching items whose TIDs all point to
+	 * the same table block).  We handle that case using exactly the same
+	 * steps as initialization.
+	 *
+	 * prefetchPos can never fall behind scanPos at the batch granularity,
+	 * since tableam_util_scanpos_nextbatch invalidates prefetchPos before
+	 * releasing the batch that prefetchPos points to.  There is therefore no
+	 * danger of prefetchPos.batch falling so far behind scanPos.batch that it
+	 * wraps around (and appears to be ahead of scanPos instead of behind it).
+	 */
+	if (!prefetchPos->valid ||
+		index_scan_pos_cmp(scanPos, prefetchPos, direction) > 0)
+	{
+		*prefetchPos = *scanPos;
+		return true;
+	}
+
+	/* Picking up prefetching from where the last callback call left off */
+	Assert(index_scan_pos_cmp(scanPos, prefetchPos, direction) <= 0);
+	return false;
+}
+
+/*
+ * Result of a tableam_util_prefetchpos_advance call
+ */
+typedef enum BatchPosAdvanceResult
+{
+	BATCH_POS_ADVANCED,			/* advanced to next item in current batch */
+	BATCH_POS_BATCH_ADVANCED,	/* advanced to first item of new batch */
+	BATCH_POS_DONE,				/* no further matching items in direction */
+	BATCH_POS_RING_FULL,		/* couldn't advance; ring buffer full */
+} BatchPosAdvanceResult;
+
+/*
+ * Advance the scan's prefetchPos to the next item that the table AM's
+ * prefetch callback should consider reading ahead, moving in the given scan
+ * direction.
+ *
+ * On entry, *prefetchBatch must be the batch that prefetchPos points to.
+ * Advances prefetchPos to the next item within *prefetchBatch when possible
+ * (returns BATCH_POS_ADVANCED).  Otherwise tries to advance to the scan's
+ * next batch, setting *prefetchBatch to the new batch and positioning
+ * prefetchPos at its first item in the scan direction (returns
+ * BATCH_POS_BATCH_ADVANCED).  Callers must use the returned result (never
+ * compare *prefetchBatch against its earlier value) to detect this case;
+ * batch recycling can reuse the memory of a recently released batch.
+ *
+ * Returns BATCH_POS_DONE when there are no further matching items in the
+ * given scan direction (*prefetchBatch is set to NULL).
+ *
+ * Returns BATCH_POS_RING_FULL when the next batch couldn't be loaded because
+ * all available ring buffer batch slots are currently in use (prefetchPos
+ * and *prefetchBatch are left unchanged).  Caller responds by momentarily
+ * pausing its read-ahead mechanism; it can be resumed once
+ * tableam_util_scanpos_nextbatch reports that the scan freed up a slot
+ * (which'll happen only after scanPos has consumed all remaining items from
+ * the scan's current scanBatch).
+ *
+ * When caller passes throttle=true we likewise decline to advance to the next
+ * batch and return BATCH_POS_RING_FULL instead.  Caller uses this to cap how
+ * many batches a single read-ahead callback invocation can advance by.
+ * Advancing within the current batch (BATCH_POS_ADVANCED) ignores throttle,
+ * so throttling only takes effect at a batch boundary.
+ */
+static inline BatchPosAdvanceResult
+tableam_util_prefetchpos_advance(IndexScanDesc scan, ScanDirection direction,
+								 IndexScanBatch *prefetchBatch,
+								 BatchRingItemPos *prefetchPos,
+								 bool throttle)
+{
+	if (!index_scan_pos_advance(direction, *prefetchBatch, prefetchPos))
+	{
+		/*
+		 * Ran out of items from prefetchBatch.  Try to advance to the scan's
+		 * next batch.
+		 */
+		if (unlikely(index_scan_batch_full(scan)) || unlikely(throttle))
+		{
+			/*
+			 * Can't advance prefetchBatch because all available ring buffer
+			 * batch slots are currently in use (or because caller wants us to
+			 * throttle instead of returning another batch).  Undo the changes
+			 * we've already made to prefetchPos before returning, leaving it
+			 * in a state that's consistent with the work actually performed
+			 * (various positional state assertions expect this).
+			 */
+			if (ScanDirectionIsForward(direction))
+			{
+				Assert(prefetchPos->item == (*prefetchBatch)->lastItem + 1);
+				prefetchPos->item--;
+			}
+			else				/* ScanDirectionIsBackward */
+			{
+				Assert(prefetchPos->item == (*prefetchBatch)->firstItem - 1);
+				prefetchPos->item++;
+			}
+
+			return BATCH_POS_RING_FULL;
+		}
+
+		/* We have a free ring buffer slot to fit another batch */
+		*prefetchBatch = tableam_util_fetch_next_batch(scan, direction,
+													   *prefetchBatch,
+													   prefetchPos);
+		if (*prefetchBatch == NULL)
+		{
+			/*
+			 * Deliberately leave prefetchPos in "just-before-start" or
+			 * "just-after-end" position
+			 */
+			return BATCH_POS_DONE;
+		}
+
+		/*
+		 * Have a new prefetchBatch.
+		 *
+		 * tableam_util_fetch_next_batch already appended the new batch to the
+		 * ring buffer for us, but we must advance prefetchPos ourselves.
+		 * Position prefetchPos to the start of the new batch.
+		 */
+		index_scan_pos_startbatch(direction, *prefetchBatch, prefetchPos);
+
+		return BATCH_POS_BATCH_ADVANCED;
+	}
+
+	return BATCH_POS_ADVANCED;
 }
 
 /*
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 3a1e616d3..18c35a6f4 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -186,6 +186,10 @@ typedef struct IndexScanBatchData
 	 * This allows table AMs to avoid redundant amgetbatch calls with the same
 	 * priorbatch -- the index AM might need to read additional index pages to
 	 * determine there are no more matching items beyond caller's priorbatch.
+	 * In particular, during prefetching the read stream callback discovers
+	 * the end-of-scan via prefetchBatch.  tableam_util_fetch_next_batch()
+	 * checks these flags so that the scan side doesn't repeat the same
+	 * amgetbatch call when it later reaches that batch as scanBatch.
 	 */
 	bool		knownEndBackward;
 	bool		knownEndForward;
@@ -236,11 +240,14 @@ typedef struct IndexScanBatchData *IndexScanBatch;
  * current read position by _multiple_ batches/index pages.  The further out
  * the table AM reads ahead like this, the further it can see into the future.
  * That way the table AM is able to reorder work as aggressively as desired.
+ * Index scans sometimes need to read ahead by several dozen batches in order
+ * to maintain an optimal I/O prefetch distance (for reading table blocks).
  */
 typedef struct BatchRingBuffer
 {
 	/* current positions in IndexScanDescData.batchbuf[] for scan */
 	BatchRingItemPos scanPos;	/* scan's read position */
+	BatchRingItemPos prefetchPos;	/* prefetching position */
 	BatchRingItemPos markPos;	/* mark/restore position */
 
 	/* markPos's batch (not in ring buffer when markBatch != scanBatch) */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index f2fd5d315..419300a6b 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -52,6 +52,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
 extern PGDLLIMPORT bool enable_seqscan;
 extern PGDLLIMPORT bool enable_indexscan;
 extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexscan_prefetch;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index e9b1ea851..5ea3c0cca 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -19,11 +19,24 @@
 #include "access/indexbatch.h"
 #include "access/relscan.h"
 #include "access/visibilitymap.h"
+#include "optimizer/cost.h"
 #include "storage/predicate.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
 
+/*
+ * We avoid creating a read stream during very selective scans that require
+ * few heap fetches, where the overhead of creating a read stream is unlikely
+ * to pay for itself
+ */
+#define INDEX_PREFETCH_BLKSWITCH_THRESHOLD 4
+
+/*
+ * Maximum number of batches that a single heapam_index_prefetch_next_block
+ * call may advance prefetchBatch by without returning a heap block
+ */
+#define INDEX_PREFETCH_MAX_BATCH_ADVANCES 1
 
 /*
  * heapam's per-batch private opaque area (only used during index-only scans).
@@ -91,6 +104,14 @@ static void heapam_index_batch_pos_visibility(IndexScanDesc scan,
 											  IndexScanBatch batch,
 											  HeapBatchData *hbatch,
 											  BatchRingItemPos *pos);
+static pg_noinline void heapam_index_dirchange_reset(IndexScanDesc scan,
+													 IndexScanHeapData *hscan,
+													 ScanDirection direction);
+static pg_attribute_always_inline void heapam_index_consider_prefetching(IndexScanDesc scan,
+																		 IndexScanHeapData *hscan);
+static BlockNumber heapam_index_prefetch_next_block(ReadStream *stream,
+													void *callback_private_data,
+													void *per_buffer_data);
 
 /*
  * Simple, single-shot TID lookup for constraint enforcement code (unique
@@ -157,6 +178,10 @@ heapam_index_scan_begin(IndexScanDesc scan, uint32 flags)
 	/* xs_lastinblock optimization state */
 	Assert(!hscan->xs_lastinblock);
 
+	/* Read stream state (other fields initialized by callback) */
+	Assert(hscan->xs_read_stream_dir == NoMovementScanDirection);
+	Assert(hscan->xs_read_stream == NULL);
+
 	/* Resolve which xs_getnext_slot implementation to use for this scan */
 	if (scan->indexRelation->rd_indam->amgetbatch != NULL)
 	{
@@ -180,6 +205,16 @@ heapam_index_scan_begin(IndexScanDesc scan, uint32 flags)
 
 		/* Set up scan's batch ring buffer */
 		tableam_util_batchscan_init(scan);
+
+		/*
+		 * We can only safely prefetch during scans where we're able to
+		 * unguard (unpin) each batch's buffers right away (MVCC scans).  We
+		 * are not prepared to sensibly limit the total number of buffer pins
+		 * held.  The read stream handles all pin resource management for us,
+		 * and knows nothing about pins held on index pages/within batches.
+		 * (It's also convenient to gate on enable_indexscan_prefetch here.)
+		 */
+		hscan->xs_prefetching_safe = scan->MVCCScan && enable_indexscan_prefetch;
 	}
 	else
 	{
@@ -188,6 +223,9 @@ heapam_index_scan_begin(IndexScanDesc scan, uint32 flags)
 			scan->xs_getnext_slot = heapam_index_only_tuple_getnext_slot;
 		else
 			scan->xs_getnext_slot = heapam_index_plain_tuple_getnext_slot;
+
+		/* Prefetching isn't supported during amgettuple scans */
+		hscan->xs_prefetching_safe = false;
 	}
 
 	/*
@@ -239,6 +277,15 @@ heapam_index_scan_rescan(IndexScanDesc scan)
 	/* Heap fetches from the last rescan don't count towards this limit  */
 	hscan->xs_blkswitch_count = 0;
 
+	/* Defensively do an unconditional read stream direction reset */
+	hscan->xs_read_stream_dir = NoMovementScanDirection;
+
+	if (hscan->xs_read_stream)
+	{
+		hscan->xs_paused = false;
+		read_stream_reset(hscan->xs_read_stream);
+	}
+
 	/* Reset batch ring buffer state */
 	if (scan->usebatchring)
 		tableam_util_batchscan_reset(scan, false);
@@ -263,6 +310,9 @@ heapam_index_scan_end(IndexScanDesc scan)
 	if (BufferIsValid(hscan->xs_vmbuffer))
 		ReleaseBuffer(hscan->xs_vmbuffer);
 
+	if (hscan->xs_read_stream)
+		read_stream_end(hscan->xs_read_stream);
+
 	/* Free the index-only scan name-column context, if any */
 	if (hscan->xs_itup_cxt)
 		MemoryContextDelete(hscan->xs_itup_cxt);
@@ -292,9 +342,17 @@ heapam_index_scan_markpos(IndexScanDesc scan)
 void
 heapam_index_scan_restrpos(IndexScanDesc scan)
 {
+	IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque;
+
 	Assert(scan->usebatchring);
 	Assert(scan->indexRelation->rd_indam->amcanmarkpos);
 
+	if (hscan->xs_read_stream)
+	{
+		hscan->xs_paused = false;
+		read_stream_reset(hscan->xs_read_stream);
+	}
+
 	tableam_util_batchscan_restore_pos(scan);
 }
 
@@ -627,6 +685,15 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 	bool		all_visible = false;
 	ItemPointer tid = NULL;
 
+	/*
+	 * Changing the scan direction mid-scan requires an MVCC snapshot: with
+	 * any other snapshot type, more than one member of a HOT chain can be
+	 * visible, and resuming a partially-returned chain only works in the
+	 * forward direction.  All non-MVCC callers scan in one fixed direction.
+	 */
+	Assert(scan->MVCCScan || !amgetbatch ||
+		   hscan->xs_read_stream_dir == NoMovementScanDirection ||
+		   hscan->xs_read_stream_dir == direction);
 	Assert(TransactionIdIsValid(RecentXmin));
 	Assert(index_only || scan->xs_visited_pages_limit == 0);
 
@@ -787,9 +854,13 @@ heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan,
 		hscan->xs_blk = ItemPointerGetBlockNumber(tid);
 
 		/*
-		 * We're switching to a new heap block, so count it
+		 * We're switching to a new heap block, so count it; once enough
+		 * block switches have occurred, start prefetching (though only if
+		 * we haven't already)
 		 */
-		hscan->xs_blkswitch_count++;
+		if (hscan->xs_read_stream == NULL &&
+			++hscan->xs_blkswitch_count == INDEX_PREFETCH_BLKSWITCH_THRESHOLD)
+			heapam_index_consider_prefetching(scan, hscan);
 
 		/*
 		 * Drop the xs_blk pin independently held on by slot (if any) now,
@@ -803,7 +874,14 @@ heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan,
 		if (BufferIsValid(hscan->xs_cbuf))
 			ReleaseBuffer(hscan->xs_cbuf);
 
-		hscan->xs_cbuf = ReadBuffer(rel, hscan->xs_blk);
+		/*
+		 * When using a read stream, the stream will already know which block
+		 * number comes next (though an assertion will verify a match below)
+		 */
+		if (hscan->xs_read_stream)
+			hscan->xs_cbuf = read_stream_next_buffer(hscan->xs_read_stream, NULL);
+		else
+			hscan->xs_cbuf = ReadBuffer(rel, hscan->xs_blk);
 
 		/*
 		 * Prune page when it is pinned for the first time
@@ -930,6 +1008,12 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexScanHeapData *hscan,
 
 	Assert(all_visible == NULL || scan->xs_want_itup);
 
+	/* Handle resetting the read stream when scan direction changes */
+	if (hscan->xs_read_stream_dir == NoMovementScanDirection)
+		hscan->xs_read_stream_dir = direction;	/* first call */
+	else if (unlikely(hscan->xs_read_stream_dir != direction))
+		heapam_index_dirchange_reset(scan, hscan, direction);
+
 	/*
 	 * Attempt to increment the position of any existing loaded scanBatch
 	 * (always fails on first call here for the scan)
@@ -973,7 +1057,25 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexScanHeapData *hscan,
 	 * also remove the old head batch/scanBatch from the batch ring buffer,
 	 * and release the underlying batch storage.
 	 */
-	tableam_util_scanpos_nextbatch(scan, direction, scanBatch);
+	if (tableam_util_scanpos_nextbatch(scan, direction, scanBatch))
+	{
+		/* A previously occupied ring buffer slot was freed */
+		if (unlikely(hscan->xs_paused))
+		{
+			/*
+			 * heapam_index_prefetch_next_block paused the scan's read stream
+			 * due to our running out of batch slots (or it "throttled" the
+			 * read stream to avoid reading too far ahead in the index).
+			 *
+			 * Now that the scanBatch that was current when we paused has been
+			 * removed from the batch ring buffer, we must resume prefetching.
+			 */
+			read_stream_resume(hscan->xs_read_stream);
+			hscan->xs_paused = false;
+		}
+	}
+
+	Assert(!hscan->xs_paused);
 
 	/*
 	 * Set scanPos to first item for newly loaded scanBatch; return the new
@@ -1099,6 +1201,13 @@ heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexScanHeapData *hscan,
  * (important for inner index scans of anti-joins and semi-joins), and the
  * need to unguard batches promptly.
  *
+ * In no event will the scan be allowed to guard more than one batch at a
+ * time.  The primary reason for this restriction is to avoid unintended
+ * interactions with the read stream, which has its own strategy for keeping
+ * the number of pins held by the backend under control.  (Unguarding via
+ * the amunguardbatch callback often means releasing a buffer pin on an
+ * index page, which counts against the same shared pin limit.)
+ *
  * Once we've resolved visibility for all items in a batch, we can safely
  * unguard it by calling amunguardbatch.  This is safe with respect to
  * concurrent VACUUM because the batch's guard (typically a buffer pin on the
@@ -1247,3 +1356,303 @@ heapam_index_batch_pos_visibility(IndexScanDesc scan, ScanDirection direction,
 	else
 		hscan->xs_vm_items = scan->maxitemsbatch;
 }
+
+/*
+ * Handle a change in index scan direction (at the tuple granularity).
+ *
+ * Resets the read stream, since we can't rely on scanPos continuing to agree
+ * with the blocks that read stream already consumed using prefetchPos.
+ *
+ * Note: iff the scan _continues_ in this new direction, and actually steps
+ * off scanBatch to an earlier index page, tableam_util_fetch_next_batch will
+ * deal with it.  But that might never happen; the scan might yet change
+ * direction again (or just end before returning more items).
+ */
+static pg_noinline void
+heapam_index_dirchange_reset(IndexScanDesc scan, IndexScanHeapData *hscan,
+							 ScanDirection direction)
+{
+	/* Reset read stream state */
+	scan->batchringbuf.prefetchPos.valid = false;
+	hscan->xs_paused = false;
+	hscan->xs_read_stream_dir = direction;
+	hscan->xs_blkswitch_count = 0;
+
+	/* Reset read stream itself */
+	if (hscan->xs_read_stream)
+		read_stream_reset(hscan->xs_read_stream);
+}
+
+/*
+ * Consider starting a read stream for heap block prefetching (if it's safe)
+ */
+static pg_attribute_always_inline void
+heapam_index_consider_prefetching(IndexScanDesc scan,
+								  IndexScanHeapData *hscan)
+{
+	Assert(hscan->xs_blk != InvalidBlockNumber);
+	Assert(!hscan->xs_read_stream);
+	Assert(!scan->batchringbuf.prefetchPos.valid);
+
+	if (!hscan->xs_prefetching_safe)
+		return;
+
+	hscan->xs_read_stream =
+		read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
+								   scan->heapRelation, MAIN_FORKNUM,
+								   heapam_index_prefetch_next_block, scan, 0);
+}
+
+/*
+ * Return the next block to the read stream when performing index prefetching.
+ *
+ * The initial batch is always loaded by heapam_index_getnext_scanbatch_pos.
+ * We don't get called until the first read_stream_next_buffer call, when a
+ * heap block is requested from the scan's stream for the first time.
+ *
+ * The position of the read stream is stored in prefetchPos, which typically
+ * stays ahead of scanPos (the scan's read position).  When we return, we
+ * always leave scanPos <= prefetchPos.
+ */
+static BlockNumber
+heapam_index_prefetch_next_block(ReadStream *stream,
+								 void *callback_private_data,
+								 void *per_buffer_data)
+{
+	IndexScanDesc scan = (IndexScanDesc) callback_private_data;
+	IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque;
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos PG_USED_FOR_ASSERTS_ONLY = &batchringbuf->scanPos;
+	BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos;
+	ScanDirection direction = hscan->xs_read_stream_dir;
+	IndexScanBatch prefetchBatch;
+	HeapBatchData *hbatch = NULL;
+	int			nbatchadvances_this_call = 0;
+
+	Assert(!hscan->xs_paused && hscan->xs_prefetching_safe);
+	Assert(direction != NoMovementScanDirection);
+
+	/*
+	 * Handle initialization of prefetchPos: set it from the scan's current
+	 * scanPos when it isn't already (validly) ahead of scanPos.  This is
+	 * required during the first call here for the scan (and in certain edge
+	 * cases).  See tableam_util_prefetchpos_catchup for full details.
+	 */
+	if (tableam_util_prefetchpos_catchup(scan, direction))
+	{
+		BatchMatchingItem *item;
+
+		/* prefetchPos has been initialized from scanPos for us */
+		prefetchBatch = index_scan_batch(scan, prefetchPos->batch);
+
+		/*
+		 * We must avoid keeping any batch guarded for more than an instant,
+		 * to avoid undesirable interactions with the scan's read stream. See
+		 * comment and assertion at the top of the loop below.
+		 */
+		if (scan->xs_want_itup)
+		{
+			/*
+			 * Index-only scan batches aren't unguarded immediately.  Deal
+			 * with that.
+			 */
+			hbatch = index_scan_batch_table_area(scan, prefetchBatch);
+
+			/*
+			 * The requested item can't be all-visible according to its
+			 * batch's cached visibility information; if it were, we'd never
+			 * have been called in the first place
+			 */
+			Assert(HEAP_BATCH_VIS_CACHED(hbatch, prefetchPos->item) &&
+				   !hbatch->batchvis[prefetchPos->item]);
+
+			/*
+			 * Load any visibility info not already set through scanBatch, so
+			 * that scanBatch/prefetchBatch is unguarded right away
+			 */
+			hscan->xs_vm_items = scan->maxitemsbatch;	/* must unguard */
+			if (prefetchBatch->isGuarded)
+				heapam_index_batch_pos_visibility(scan, direction,
+												  prefetchBatch, hbatch,
+												  prefetchPos);
+
+			/*
+			 * Later calls to heapam_index_batch_pos_visibility will always
+			 * unguard batches right away, which we rely on in the loop below
+			 */
+		}
+
+		Assert(!prefetchBatch->isGuarded);
+
+		item = &prefetchBatch->items[prefetchPos->item];
+		hscan->xs_prefetch_block = ItemPointerGetBlockNumber(&item->tableTid);
+
+		/*
+		 * Special case: when we return, prefetchPos won't be ahead of scanPos
+		 * (it'll just be equal to scanPos).  We're merely fetching through a
+		 * read stream; true prefetching hasn't really started yet.
+		 */
+		Assert(index_scan_pos_cmp(scanPos, prefetchPos, direction) == 0);
+
+		return hscan->xs_prefetch_block;
+	}
+
+	/*
+	 * We're picking up prefetching from where the last call here left off
+	 */
+	Assert(index_scan_pos_cmp(scanPos, prefetchPos, direction) <= 0);
+	prefetchBatch = index_scan_batch(scan, prefetchPos->batch);
+	if (scan->xs_want_itup)
+		hbatch = index_scan_batch_table_area(scan, prefetchBatch);
+
+	/*
+	 * Assert in passing that xs_prefetch_block matches the last item we
+	 * returned.
+	 *
+	 * Note: we don't actually need an xs_prefetch_block field at all; we could
+	 * just take the last block we returned from prefetchPos directly instead.
+	 * But maintaining xs_prefetch_block explicitly is slightly more robust.
+	 * It gives us a way to make sure that the last call here left prefetchPos
+	 * in a consistent state (e.g., when the read stream had to be paused).
+	 */
+#ifdef USE_ASSERT_CHECKING
+	{
+		BatchMatchingItem *lastitem = &prefetchBatch->items[prefetchPos->item];
+		BlockNumber last_block = ItemPointerGetBlockNumber(&lastitem->tableTid);
+
+		/*
+		 * Note: when a previous call paused the read stream, prefetchPos
+		 * might point to an item whose TID doesn't match last_block.  This
+		 * can only happen when the item was never returned due to it being
+		 * all-visible.
+		 */
+		Assert(last_block == hscan->xs_prefetch_block ||
+			   (hbatch && HEAP_BATCH_VIS_CACHED(hbatch, prefetchPos->item) &&
+				hbatch->batchvis[prefetchPos->item]));
+	}
+#endif
+
+	for (;;)
+	{
+		BatchMatchingItem *item;
+		BlockNumber prefetch_block;
+		bool		throttle;
+
+		/*
+		 * We never call amgetbatch without immediately unguarding the batch
+		 * once prefetching begins.  That way index AMs won't hold onto any
+		 * "extra" index page pins needed as TID recycling interlock guards.
+		 *
+		 * This is defensive.  The read stream tries to be careful about not
+		 * pinning too many buffers, and that's harder to do reliably if there
+		 * are variable numbers of pins taken without such care.
+		 */
+		Assert(!prefetchBatch->isGuarded);
+
+		/*
+		 * Before advancing prefetchPos, consider if read stream's current
+		 * call here already advanced prefetchBatch.  This is possible during
+		 * index-only scans with long runs of batches containing only items
+		 * that are all-visible (it's also possible during plain index scans
+		 * with unusual batch layouts, though that's much less common).
+		 *
+		 * When we detect this condition, we forcibly throttle prefetching,
+		 * which pauses the read stream.  That'll give scanPos the opportunity
+		 * to return the next item to the scan.  We impose a ceiling on how
+		 * far prefetchBatch can get ahead of scanBatch without our producing
+		 * even one additional heap block for the read stream to prefetch.
+		 */
+		throttle = nbatchadvances_this_call >= INDEX_PREFETCH_MAX_BATCH_ADVANCES;
+
+		/* Increment prefetchPos to determine the next item to prefetch */
+		switch (tableam_util_prefetchpos_advance(scan, direction,
+												 &prefetchBatch, prefetchPos,
+												 throttle))
+		{
+			case BATCH_POS_ADVANCED:
+				/* Advanced to next item in current/previous prefetchBatch */
+				break;
+			case BATCH_POS_BATCH_ADVANCED:
+				/* Advanced to first item in new prefetchBatch */
+				nbatchadvances_this_call++;
+				if (hbatch)
+				{
+					/*
+					 * Extra heapam-specific step: bulk-load visibility info
+					 * up front to unguard batch immediately
+					 */
+					Assert(scan->xs_want_itup);
+
+					hbatch = index_scan_batch_table_area(scan, prefetchBatch);
+
+					Assert(hscan->xs_vm_items == scan->maxitemsbatch);
+					if (prefetchBatch->isGuarded)
+						heapam_index_batch_pos_visibility(scan, direction,
+														  prefetchBatch,
+														  hbatch, prefetchPos);
+				}
+				break;
+			case BATCH_POS_DONE:
+				/* No more batches in this scan direction */
+				return InvalidBlockNumber;
+			case BATCH_POS_RING_FULL:
+
+				/*
+				 * Edge case: Ran out of items from prefetchBatch, but can't
+				 * advance to the scan's next batch right now (all available
+				 * batchringbuf batch slots are currently in use).  This also
+				 * happens when we deliberately throttled prefetching.
+				 *
+				 * Deal with this by momentarily pausing the read stream.
+				 * heapam_index_getnext_scanbatch_pos will resume the read
+				 * stream later, though only after scanPos has consumed all
+				 * remaining items from scanBatch (at which point the current
+				 * head batch will be freed, making a slot available for
+				 * reuse).
+				 */
+				hscan->xs_paused = true;
+				return read_stream_pause(stream);
+		}
+
+		/*
+		 * prefetchPos now points to the next item whose TID's heap block
+		 * number might need to be prefetched.
+		 *
+		 * scanPos must be < prefetchPos when we return from this loop path.
+		 */
+		Assert(index_scan_pos_cmp(scanPos, prefetchPos, direction) < 0);
+
+		if (hbatch)
+		{
+			Assert(scan->xs_want_itup);
+			Assert(HEAP_BATCH_VIS_CACHED(hbatch, prefetchPos->item));
+
+			if (hbatch->batchvis[prefetchPos->item])
+			{
+				/* item is known to be all-visible -- don't prefetch */
+				continue;
+			}
+		}
+
+		item = &prefetchBatch->items[prefetchPos->item];
+		prefetch_block = ItemPointerGetBlockNumber(&item->tableTid);
+
+		if (prefetch_block == hscan->xs_prefetch_block)
+		{
+			/*
+			 * prefetch_block matches the last prefetchPos item's TID's heap
+			 * block number; we must not return the same prefetch_block twice
+			 * in succession
+			 */
+			continue;
+		}
+
+		/* We have a new heap block number to return to read stream */
+		hscan->xs_prefetch_block = prefetch_block;
+		return prefetch_block;
+	}
+
+	pg_unreachable();
+
+	return InvalidBlockNumber;
+}
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
index 2e2ccf6a9..dce2b2a55 100644
--- a/src/backend/access/index/indexbatch.c
+++ b/src/backend/access/index/indexbatch.c
@@ -5,15 +5,21 @@
  *
  * This module provides the core infrastructure for batch-based index scans,
  * which allow index AMs to return multiple matching TIDs per page in a single
- * call.  The batch ring buffer is owned by the table AM.
+ * call.  The batch ring buffer is owned by the table AM, typically maintained
+ * alongside a read stream used for prefetching table blocks.
  *
- * The ring buffer loads batches in index key space/index scan order.
+ * The ring buffer loads batches in index key space/index scan order.  This
+ * allows the table AM to maintain an adequate prefetch distance: prefetching
+ * is thereby able to request table blocks referenced by index pages that are
+ * well ahead of the current scan position's index page.
  *
  * Most functions here are table AM utilities (tableam_util_*), called by
  * table AMs during amgetbatch index scans.  These manage the batch ring
  * buffer's lifecycle and positional state, and help with certain aspects of
  * resource management.  The table AM uses scanPos to return items from
- * batches returned by amgetbatch.
+ * batches returned by amgetbatch.  Table AMs that support I/O prefetching of
+ * table blocks during index scans use prefetchPos to request table blocks
+ * well ahead of those that are of immediate interest to scanPos.
  *
  * There are also some index AM utilities (indexam_util_*), called by index
  * AMs that implement the amgetbatch interface, to help manage resources like
@@ -104,6 +110,7 @@ tableam_util_batchscan_reset(IndexScanDesc scan, bool endscan)
 	bool		markBatchFreed = false;
 
 	batchringbuf->scanPos.valid = false;
+	batchringbuf->prefetchPos.valid = false;
 	batchringbuf->markPos.valid = false;
 
 	for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
@@ -215,7 +222,12 @@ tableam_util_batchscan_mark_pos(IndexScanDesc scan)
  * the current scanBatch when needed.
  *
  * We just discard all batches (other than markBatch/restored scanBatch),
- * except when markBatch is already the scan's current scanBatch.
+ * except when markBatch is already the scan's current scanBatch.  We always
+ * invalidate prefetchPos.  The table AM's prefetching state (e.g., its read
+ * stream) is reset by the caller (which calls this function as it resets that
+ * state).  This approach keeps things simple for table AMs: most code that
+ * deals with batches is thereby able to assume that the common case where
+ * scan direction never changes is the only case.
  *
  * Note: This relies on the assumption that we already have a valid scanPos.
  * Table AMs should only call tableam_util_batchscan_reset from within their
@@ -242,6 +254,14 @@ tableam_util_batchscan_restore_pos(IndexScanDesc scan)
 	Assert(markPos->item >= markBatch->firstItem &&
 		   markPos->item <= markBatch->lastItem);
 
+	/*
+	 * Restoring a mark always requires stopping prefetching.  This is similar
+	 * to the handling table AMs implement to deal with a tuple-level change
+	 * in the scan's direction.  The read stream must have already been reset
+	 * by the table AM caller.
+	 */
+	batchringbuf->prefetchPos.valid = false;
+
 	if (scanBatch == markBatch)
 	{
 		/* markBatch is already scanBatch; needn't change batchringbuf */
@@ -312,6 +332,13 @@ tableam_util_batchscan_restore_pos(IndexScanDesc scan)
  * to determine which batch comes next in the new scan direction.  This
  * approach isn't particularly efficient, but it works well enough for what
  * ought to be a relatively rare occurrence.
+ *
+ * Caller must have reset the scan's read stream before calling here.  That
+ * needs to happen as soon as the scan requests a tuple in whatever scan
+ * direction is opposite-to-current.  We only deal with the case where the
+ * scan backs up by enough items to cross a batch boundary (when the scan
+ * resumes scanning in its original direction/ends before crossing a boundary,
+ * there isn't any need to call here).
  */
 void
 tableam_util_scanbatch_dirchange(IndexScanDesc scan)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1c575e56f..6fcb815f7 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -146,6 +146,7 @@ int			max_parallel_workers_per_gather = 2;
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
+bool		enable_indexscan_prefetch = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index afaa058b0..ace56f7a8 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -941,6 +941,13 @@
   boot_val => 'true',
 },
 
+{ name => 'enable_indexscan_prefetch', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
+  short_desc => 'Enables prefetching for index scans and index-only scans.',
+  flags => 'GUC_EXPLAIN',
+  variable => 'enable_indexscan_prefetch',
+  boot_val => 'true',
+},
+
 { name => 'enable_material', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
   short_desc => 'Enables the planner\'s use of materialization.',
   flags => 'GUC_EXPLAIN',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ac38cddaa..8705dd5f3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -431,6 +431,7 @@
 #enable_incremental_sort = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
+#enable_indexscan_prefetch = on
 #enable_material = on
 #enable_memoize = on
 #enable_mergejoin = on
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fa566c9e5..f7dc013a2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5938,6 +5938,22 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-indexscan-prefetch" xreflabel="enable_indexscan_prefetch">
+      <term><varname>enable_indexscan_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_indexscan_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables I/O prefetching during the execution of index
+        scans and index-only scans.  Prefetching can improve performance by
+        reading table AM pages ahead of when they are needed during these
+        scans.  The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-material" xreflabel="enable_material">
       <term><varname>enable_material</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 490431f70..7da80d16a 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -883,9 +883,11 @@ amgetbatch (IndexScanDesc scan,
    time (so they can drive a cursor), as opposed to a bitmap scan
    (<function>amgetbitmap</function>), which returns all matches at once.
    Where <function>amgettuple</function> returns one matching entry per call,
-   <function>amgetbatch</function> returns them in batches.  By returning all
-   matching index entries from a single index page together, the table AM gains
-   visibility into which table blocks will be needed in the near future.
+   <function>amgetbatch</function> returns them in batches.  This enables the
+   table access method to optimize table block access patterns and perform
+   I/O prefetching: by returning matching index entries in batches (typically
+   all matches from a single index page), the table AM can read ahead through
+   the index, identify which table blocks will be needed, and prefetch them.
   </para>
 
   <para>
@@ -1052,7 +1054,9 @@ amunguardbatch (IndexScanDesc scan,
     be sure to free the pins at an opportune point (at a minimum whenever
     <function>amendscan</function> is called, and typically when
     <function>amrescan</function> is called).  It must also keep the number of
-    retained pins fixed and small.
+    retained pins fixed and small, to avoid exhausting the backend's buffer
+    pin limit (which is shared with the table AM's read stream for index scan
+    prefetching).
    </para>
   </note>
 
@@ -1597,6 +1601,66 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    or vice versa, if its internal implementation is unsuited to one API or the other.
   </para>
 
+  <sect2 id="index-scanning-batches">
+   <title>Table AM Considerations for Batch Scanning</title>
+
+   <para>
+    This section is primarily relevant to <link linkend="tableam">table access
+     method</link> authors.
+   </para>
+
+   <para>
+    When an index scan uses the <function>amgetbatch</function> interface, the
+    table AM has sole control over the <structname>IndexScanDesc</structname>'s
+    <structfield>batchringbuf</structfield>, including creating, resetting,
+    and ending the batch ring buffer within the appropriate table AM
+    callbacks, and managing positional state and TID recycling interlocking
+    (that is, determining when to unguard each batch, which will typically
+    release an index page buffer pin associated with the batch).  Index access
+    methods should not access or manipulate these fields.
+    <filename>src/include/access/indexbatch.h</filename> provides the
+    <function>tableam_util_*</function> utility functions that table AMs use
+    to manage the ring buffer and its positional state.  See the
+    <filename>src/backend/access/heap/heapam_indexscan.c</filename>
+    implementation for a reference example.
+   </para>
+
+   <para>
+    The <structfield>scanPos</structfield> field within
+    <structfield>batchringbuf</structfield> tracks which batch and item within
+    that batch will be returned next to the executor.  The table AM must advance
+    <structfield>scanPos</structfield> as tuples are returned by
+    <function>table_index_getnext_slot</function> (using
+    <function>tableam_util_scanpos_advance</function>, plus
+    <function>tableam_util_scanpos_nextbatch</function> when crossing batch
+    boundaries), and must also modify this field when restoring a saved mark.
+   </para>
+
+   <para>
+    The <structfield>prefetchPos</structfield> field tracks the position used
+    for I/O prefetching.  It is managed within a read stream callback (using
+    <function>tableam_util_prefetchpos_catchup</function> and
+    <function>tableam_util_prefetchpos_advance</function>), allowing
+    the table AM to prefetch table blocks pointed to by items that are well
+    ahead of the current scan position.  Initially
+    <structfield>prefetchPos</structfield> starts at
+    <structfield>scanPos</structfield>, but as the read stream ramps up it can
+    get far ahead &mdash; spanning multiple index pages if necessary to
+    maintain an optimal I/O prefetch distance for table block reads.  A major
+    goal of the <function>amgetbatch</function> interface is to allow the
+    table AM to prefetch without being limited to items from the current
+    <structfield>scanPos</structfield> batch's index leaf page.
+   </para>
+
+   <para>
+    For details on the TID recycling interlock during batch scans, including
+    the <structfield>batchImmediateUnguard</structfield> policy and the
+    <function>amunguardbatch</function> callback, see
+    <xref linkend="index-locking"/>.
+   </para>
+
+  </sect2>
+
  </sect1>
 
  <sect1 id="index-locking">
@@ -1702,7 +1766,40 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    immediately after scanning the corresponding index entry.  This is
    expensive for a number of reasons.  The
    <function>amgetbatch</function> interface, by contrast, was designed to
-   allow scans to be <quote>asynchronous</quote>.
+   allow scans to be <quote>asynchronous</quote>: by collecting batches of
+   TIDs from multiple index pages, the table AM can prefetch the corresponding
+   table blocks well ahead of the current scan position (using asynchronous
+   I/O when available), allowing a more efficient heap access pattern.  Not
+   all scans end up being asynchronous in practice, but the interface is
+   designed to allow it.  Per the above analysis, we must use the synchronous
+   approach for non-MVCC-compliant snapshots (even when using the
+   <function>amgetbatch</function> interface), but an asynchronous scan is
+   workable for plain index scans that use an MVCC snapshot.
+  </para>
+
+  <para>
+   Because the table AM reads multiple index leaf pages ahead via
+   <function>amgetbatch</function> to facilitate this prefetching, a non-MVCC
+   scan would have to hold the TID recycling interlock across the entire
+   read-ahead window, since it has no heap-visibility backstop to fall back on.
+   That is impractical, so I/O prefetching with
+   <function>amgetbatch</function> is only possible when an MVCC-compliant
+   snapshot is in use.
+  </para>
+
+  <para>
+   With an MVCC snapshot, a plain index scan drops each batch's interlock
+   immediately, since it always visits the heap page, where the snapshot
+   rejects any recycled TID's new occupant.  An index-only scan may instead
+   skip the heap and consult the visibility map, so the table AM holds the
+   batch's interlock pin until it has copied that batch's visibility
+   information out of the visibility map, and then drops it.  Either way, the
+   scan never holds more than one such interlock pin at a time, whether or not
+   prefetching is active &mdash; so in terms of pins held, an index-only scan
+   behaves much like a plain index scan.  That single
+   extra pin is taken and released by the scan itself, outside the prefetching
+   read stream's own pin management; bounding it to one pin is what keeps it
+   from disturbing how the read stream budgets its buffer pins.
   </para>
 
   <para>
diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml
index 9ccf5b739..54b5ba2dc 100644
--- a/doc/src/sgml/tableam.sgml
+++ b/doc/src/sgml/tableam.sgml
@@ -129,6 +129,13 @@ my_tableam_handler(PG_FUNCTION_ARGS)
   optional), the block number needs to provide locality.
  </para>
 
+ <para>
+  Table access methods must support index scans that are driven by index
+  access methods implementing the <function>amgetbatch</function> interface.
+  See <xref linkend="index-scanning-batches"/> for details on consuming
+  <function>amgetbatch</function> batches and managing the scan's position.
+ </para>
+
  <para>
   For crash safety, an AM can use postgres' <link
   linkend="wal"><acronym>WAL</acronym></link>, or a custom implementation.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 132b56a58..32bc3dd3e 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -166,6 +166,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_incremental_sort        | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
+ enable_indexscan_prefetch      | on
  enable_material                | on
  enable_memoize                 | on
  enable_mergejoin               | on
@@ -180,7 +181,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(25 rows)
+(26 rows)
 
 -- There are always wait event descriptions for various types.  InjectionPoint
 -- may be present or absent, depending on history since last postmaster start.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cbfcde303..191ce1d7c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -266,6 +266,7 @@ BaseBackupTargetHandle
 BaseBackupTargetType
 BatchMVCCState
 BatchMatchingItem
+BatchPosAdvanceResult
 BatchRingBuffer
 BatchRingItemPos
 BeginDirectModify_function
-- 
2.53.0
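
For anyone reading along, the filtering done by the read stream callback in
the patch above (heapam_index_prefetch_next_block) can be sketched in
isolation.  This is only a rough standalone model, not the patch's code: the
SketchItem/SketchPrefetch types and the sketch_* functions are hypothetical
stand-ins, a flat array takes the place of the batch ring buffer, and the
pausing/throttling paths are left out entirely.  It shows just the two skip
rules: items cached as all-visible are never prefetched, and the same heap
block is never returned twice in succession.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the patch's real types */
typedef uint32_t SketchBlock;

#define SKETCH_INVALID_BLOCK ((SketchBlock) 0xFFFFFFFF)

typedef struct SketchItem
{
	SketchBlock block;			/* heap block that the item's TID points to */
	bool		all_visible;	/* cached visibility info (index-only scans) */
} SketchItem;

typedef struct SketchPrefetch
{
	const SketchItem *items;	/* flat array standing in for the batches */
	size_t		nitems;
	size_t		next;			/* plays the role of prefetchPos */
	SketchBlock last_block;		/* plays the role of xs_prefetch_block */
} SketchPrefetch;

static void
sketch_prefetch_init(SketchPrefetch *p, const SketchItem *items, size_t nitems)
{
	assert(p != NULL && (items != NULL || nitems == 0));

	p->items = items;
	p->nitems = nitems;
	p->next = 0;
	p->last_block = SKETCH_INVALID_BLOCK;
}

/*
 * Return the next heap block worth prefetching, or SKETCH_INVALID_BLOCK once
 * every item has been examined
 */
static SketchBlock
sketch_prefetch_next_block(SketchPrefetch *p)
{
	while (p->next < p->nitems)
	{
		const SketchItem *item = &p->items[p->next++];

		/* item is known to be all-visible, so no heap fetch will happen */
		if (item->all_visible)
			continue;

		/* must not return the same block twice in succession */
		if (item->block == p->last_block)
			continue;

		p->last_block = item->block;
		return item->block;
	}

	return SKETCH_INVALID_BLOCK;
}
```

Note that a skipped all-visible item leaves last_block unchanged in this
model, so two items on the same heap block that are separated only by
all-visible items still produce a single block request.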



  [application/octet-stream] v28-0005-WIP-Adopt-amgetbatch-interface-in-GiST-index-AM.patch (110.9K, 7-v28-0005-WIP-Adopt-amgetbatch-interface-in-GiST-index-AM.patch)
  download | inline diff:
From 9dbca43fca102e3684166067e901d2ce10d99d18 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Mon, 1 Jun 2026 19:35:47 -0400
Subject: [PATCH v28 05/11] WIP: Adopt amgetbatch interface in GiST index AM.

Replace gistgettuple with gistgetbatch, a function that implements the
new amgetbatch interface added by commit FIXME.  Plain index scans of
GiST indexes now return matching items in batches consisting of all of
the matches from a given leaf page.  This gives the table AM the ability
to perform optimizations like index prefetching during GiST index scans.

The amgetbatch interface requires that index AMs take the same
standardized approach to pin management for pins that are used to
prevent unsafe concurrent TID recycling by VACUUM (that way prefetching
can hold open multiple batches without it affecting the read stream).
For an ordinary GiST batch this interlock pin is the pin on its single
leaf page, held only for as long as the table AM still needs it as an
interlock (just like during nbtree and hash scans).

Nearest-neighbor (ordered) scans are handled quite differently, because
their matches don't naturally arrive one leaf page at a time.  Here
gistgetbatch instead drains the scan's distance-ordered pairing heap,
packing the matching leaf items into a single "virtual" batch in
distance order, typically spanning many leaf pages.  We're effectively
pretending that the matches we found were in useful order, together on
the same leaf page -- though that isn't really true.  Virtual batches
come with restrictions that make the pretense safe: an ordered scan is
never planned as an index-only scan, and gistkillitemsbatch does nothing
for a virtual batch.  A virtual batch therefore never holds a TID
recycling interlock pin at all; the pin on each underlying leaf page is
instead dropped right away, as the page is scanned into the queue.

The interlock pin also fixes a pre-existing bug in which GiST index-only
scans could return wrong answers [1].  An index-only scan trusts the
visibility map instead of fetching the heap tuple, so it must keep
VACUUM from recycling a heap TID between the moment it reads an index
entry and the moment it consults the visibility map; otherwise it can
report indexed values that belong to an unrelated, since-recycled heap
tuple.  The retained leaf-page buffer pin is that interlock -- but only
if VACUUM honors it.  gistvacuumpage therefore now acquires a cleanup
lock on each page (rather than a plain exclusive lock), so a concurrent
scan's pin holds VACUUM off from recycling that page's TIDs until the
scan has finished its visibility checks.

This same interlock requirement is why ordered scans cannot be
index-only: a virtual batch drops each leaf page's pin as soon as the
page is scanned, so it has no bounded pin to offer as the recycling
interlock that an index-only scan depends on.  Rather than work around
that (which seems prohibitively complicated), the planner never builds
an index-only scan that uses ordering operators; ordered scans must be
plain index scans, which fetch and recheck the heap tuple and so were
never subject to the bug.  This warrants an incompatibility item in the
Postgres 20 release notes (note that both GiST and SP-GiST are affected).

The gistgetbatch implementation makes use of new batch-related core
infrastructure.  GiST now registers an amgettransform callback, which
sets the scan descriptor's per-tuple recheck flags.  The callback also
sets order-by distances and reconstructs a heap tuple for index-only
scans.  It is called just before table_index_getnext_slot returns
another tuple.
Like nbtree, the scan uses a currTuples storage area to store IndexTuple
structs in their original on-disk representation.  Unlike nbtree, GiST
uses amgettransform to convert the representation of the tuples into a
heap tuple representation of the underlying indexed type.  This scheme
also relies on a new facility that allows index AMs to request their own
separate dynamically sized area for supplemental metadata (GiST
opclasses have the ability to represent that any tuple needs a recheck,
so we have to shuttle that information around with the batch).
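
As a rough illustration of how a dynamically sized area can sit in front
of the fixed-size area and the batch header, here is a standalone sketch.
It is not code from this patch: every Sketch*/sketch_* name is
hypothetical, 8-byte maximum alignment is assumed, and the table AM's
own area (which would come first in memory) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for MAXALIGN, assuming 8-byte maximum alignment */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Hypothetical batch header and fixed-size per-batch index AM data */
typedef struct SketchBatch { int firstItem; int lastItem; } SketchBatch;
typedef struct SketchStatic { int buf; uint32_t blkno; } SketchStatic;

/* Hypothetical scan state: area sizes are fixed when the scan begins */
typedef struct SketchScan
{
	size_t		opaque_static;	/* MAXALIGN'd size of the fixed area */
	size_t		opaque_dyn;		/* requested dynamic area size, 0 if unused */
} SketchScan;

/* Start of the allocation: [dyn area][static area][batch header] */
static char *
sketch_batch_base(const SketchScan *scan, SketchBatch *batch)
{
	return (char *) batch - scan->opaque_static -
		SKETCH_MAXALIGN(scan->opaque_dyn);
}

/* Allocate all three regions at once; return a pointer to the header */
static SketchBatch *
sketch_batch_alloc(const SketchScan *scan)
{
	size_t		prefix = SKETCH_MAXALIGN(scan->opaque_dyn) + scan->opaque_static;
	char	   *base = calloc(1, prefix + sizeof(SketchBatch));

	return base ? (SketchBatch *) (base + prefix) : NULL;
}

static void
sketch_batch_free(const SketchScan *scan, SketchBatch *batch)
{
	free(sketch_batch_base(scan, batch));
}

/* Fixed area: found by subtracting a size that never changes per scan */
static SketchStatic *
sketch_static_area(const SketchScan *scan, SketchBatch *batch)
{
	return (SketchStatic *) ((char *) batch - scan->opaque_static);
}

/* Dynamic area: found by also subtracting the size chosen at scan start */
static void *
sketch_dyn_area(const SketchScan *scan, SketchBatch *batch)
{
	assert(scan->opaque_dyn > 0);
	return sketch_batch_base(scan, batch);
}
```

In this sketch a non-ordered scan would request one bool per item as its
dynamic area and treat the sketch_dyn_area result as a bool array, while
an ordered scan would request room for larger per-item structs instead.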

[1] https://postgr.es/m/CAH2-Wz=jjiNL9FCh8C1L-GUH15f4WFTWub2x+_NucngcDDcHKw@mail.gmail.com

Author: Peter Geoghegan <[email protected]>
---
 src/include/access/amapi.h                    |   5 +
 src/include/access/gist_private.h             |  77 +-
 src/include/access/gistxlog.h                 |  13 +-
 src/include/access/indexbatch.h               |  25 +
 src/include/access/relscan.h                  |   5 +
 src/backend/access/brin/brin.c                |   1 +
 src/backend/access/gin/ginutil.c              |   1 +
 src/backend/access/gist/README                |  94 ++-
 src/backend/access/gist/gist.c                |   9 +-
 src/backend/access/gist/gistget.c             | 661 ++++++++++--------
 src/backend/access/gist/gistscan.c            |  45 +-
 src/backend/access/gist/gistutil.c            |   9 +-
 src/backend/access/gist/gistvacuum.c          |  17 +-
 src/backend/access/gist/gistxlog.c            |  37 +-
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam_indexscan.c    |  22 +-
 src/backend/access/index/amapi.c              |   1 +
 src/backend/access/index/genam.c              |   1 +
 src/backend/access/index/indexbatch.c         |   5 +-
 src/backend/access/nbtree/nbtree.c            |   1 +
 src/backend/access/rmgrdesc/gistdesc.c        |   4 +
 src/backend/access/spgist/spgutils.c          |   1 +
 src/backend/executor/nodeIndexonlyscan.c      |  12 -
 src/backend/optimizer/path/indxpath.c         |   5 +-
 contrib/bloom/blutils.c                       |   1 +
 contrib/btree_gist/expected/cash.out          |   6 +-
 contrib/btree_gist/expected/date.out          |   6 +-
 contrib/btree_gist/expected/float4.out        |   6 +-
 contrib/btree_gist/expected/float8.out        |   2 +-
 contrib/btree_gist/expected/int2.out          |   6 +-
 contrib/btree_gist/expected/int4.out          |   6 +-
 contrib/btree_gist/expected/int8.out          |   2 +-
 contrib/btree_gist/expected/interval.out      |   2 +-
 contrib/btree_gist/expected/time.out          |   2 +-
 contrib/btree_gist/expected/timestamp.out     |   2 +-
 contrib/btree_gist/expected/timestamptz.out   |   2 +-
 doc/src/sgml/indexam.sgml                     | 200 +++++-
 .../modules/dummy_index_am/dummy_index_am.c   |   1 +
 .../modules/index/expected/killtuples.out     |  79 ++-
 src/test/modules/index/specs/killtuples.spec  |  18 +-
 src/test/regress/expected/create_index.out    |  14 +-
 .../regress/expected/create_index_spgist.out  |  18 +-
 src/test/regress/expected/gist.out            |  52 +-
 src/test/regress/sql/gist.sql                 |   8 +-
 src/tools/pgindent/typedefs.list              |   2 +
 45 files changed, 1029 insertions(+), 458 deletions(-)

diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 02793a115..157c1a8df 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -212,6 +212,10 @@ typedef void (*amunguardbatch_function) (IndexScanDesc scan,
 typedef void (*amkillitemsbatch_function) (IndexScanDesc scan,
 										   IndexScanBatch batch);
 
+/* Set up the scan's xs_hitup output tuple for the given batch item */
+typedef void (*amgettransform_function) (IndexScanDesc scan,
+										 IndexScanBatch batch, int item);
+
 /* fetch all valid tuples */
 typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 									   TIDBitmap *tbm);
@@ -326,6 +330,7 @@ typedef struct IndexAmRoutine
 	amgetbatch_function amgetbatch; /* can be NULL */
 	amunguardbatch_function amunguardbatch; /* can be NULL */
 	amkillitemsbatch_function amkillitemsbatch; /* can be NULL */
+	amgettransform_function amgettransform; /* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
 	amposreset_function amposreset; /* can be NULL */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 44514f1cb..534e3b4ca 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/indexbatch.h"
 #include "access/itup.h"
 #include "lib/pairingheap.h"
 #include "storage/bufmgr.h"
@@ -120,10 +121,6 @@ typedef struct GISTSearchHeapItem
 	ItemPointerData heapPtr;
 	bool		recheck;		/* T if quals must be rechecked */
 	bool		recheckDistances;	/* T if distances must be rechecked */
-	HeapTuple	recontup;		/* data reconstructed from the index, used in
-								 * index-only scans */
-	OffsetNumber offnum;		/* track offset in page to mark tuple as
-								 * LP_DEAD */
 } GISTSearchHeapItem;
 
 /* Unvisited item, either index page or heap tuple */
@@ -148,6 +145,55 @@ typedef struct GISTSearchItem
 	(offsetof(GISTSearchItem, distances) + \
 	 sizeof(IndexOrderByDistance) * (n_distances))
 
+/* Per-batch data private to the GiST index AM */
+typedef struct GISTBatchData
+{
+	/* leaf page's buffer pin */
+	Buffer		buf;
+	/* leaf page's block number (InvalidBlockNumber means "virtual" batch) */
+	BlockNumber blkno;
+} GISTBatchData;
+
+/* Access the GiST-private per-batch data from an IndexScanBatch pointer */
+#define GISTBatchGetData(scan, batch) \
+	index_scan_batch_index_opaque_static(scan, batch, GISTBatchData)
+
+/*
+ * Per-item private GiST data.  We lay out the index AM's dynamic opaque area
+ * as an array of these, one per batch item, and subscript it via
+ * GISTBatchGetItem.
+ *
+ * GiST matching is potentially lossy, and the Consistent function's recheck
+ * flag varies from one item to the next, so every batch item records its own
+ * qual recheck flag; gistgettransform reports it as the item's xs_recheck.
+ *
+ * Note: Unordered scans only need a recheck flag, so their dynamic opaque
+ * area is just a bool array, subscripted via GISTBatchGetRecheck.
+ */
+typedef struct GISTBatchItem
+{
+	bool		recheck;		/* T if quals must be rechecked */
+	bool		recheckDistances;	/* T if distances are lossy lower bounds */
+	/* numberOfOrderBys entries */
+	IndexOrderByDistance distances[FLEXIBLE_ARRAY_MEMBER];
+} GISTBatchItem;
+
+#define SizeOfGISTBatchItem(n_distances) \
+	(offsetof(GISTBatchItem, distances) + \
+	 sizeof(IndexOrderByDistance) * (n_distances))
+
+/* Get an item from dynamic area during an ordered scan */
+#define GISTBatchGetItem(scan, batch, item) \
+	(AssertMacro((scan)->numberOfOrderBys > 0), \
+	 AssertMacro((item) >= 0 && (item) < MaxIndexTuplesPerPage), \
+	 (GISTBatchItem *) ((char *) index_scan_batch_index_opaque_dyn((scan), (batch)) + \
+						(Size) (item) * SizeOfGISTBatchItem((scan)->numberOfOrderBys)))
+
+/* Get the recheck flag array from dynamic area during a non-ordered scan */
+#define GISTBatchGetRecheck(scan, batch) \
+	(AssertMacro((scan)->numberOfOrderBys == 0), \
+	 (bool *) index_scan_batch_index_opaque_dyn((scan), (batch)))
+
 /*
  * GISTScanOpaqueData: private state for a scan of a GiST index
  */
@@ -159,23 +205,9 @@ typedef struct GISTScanOpaqueData
 	pairingheap *queue;			/* queue of unvisited items */
 	MemoryContext queueCxt;		/* context holding the queue */
 	bool		qual_ok;		/* false if qual can never be satisfied */
-	bool		firstCall;		/* true until first gistgettuple call */
 
 	/* pre-allocated workspace arrays */
 	IndexOrderByDistance *distances;	/* output area for gistindex_keytest */
-
-	/* info about killed items if any (killedItems is NULL if never used) */
-	OffsetNumber *killedItems;	/* offset numbers of killed items */
-	int			numKilled;		/* number of currently stored items */
-	BlockNumber curBlkno;		/* current number of block */
-	GistNSN		curPageLSN;		/* pos in the WAL stream when page was read */
-
-	/* In a non-ordered search, returnable heap items are stored here: */
-	GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
-	OffsetNumber nPageData;		/* number of valid items in array */
-	OffsetNumber curPageData;	/* next item to return */
-	MemoryContext pageDataCxt;	/* context holding the fetched tuples, for
-								 * index-only scans */
 } GISTScanOpaqueData;
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
@@ -448,6 +480,9 @@ extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 								 IndexTuple *itup, int ituplen,
 								 Buffer leftchildbuf);
 
+extern XLogRecPtr gistXLogVacuum(Buffer buffer,
+								 OffsetNumber *todelete, int ntodelete);
+
 extern XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
 								 int ntodelete, TransactionId snapshotConflictHorizon,
 								 Relation heaprel);
@@ -458,7 +493,11 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
 								Buffer leftchildbuf, bool markfollowright);
 
 /* gistget.c */
-extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
+extern void gistkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch);
+extern IndexScanBatch gistgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch,
+								   ScanDirection dir);
+extern void gistunguardbatch(IndexScanDesc scan, IndexScanBatch batch);
+extern void gistgettransform(IndexScanDesc scan, IndexScanBatch batch, int item);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern bool gistcanreturn(Relation index, int attno);
 
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1c2cf6e81..86e5e1f86 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -18,17 +18,24 @@
 #include "lib/stringinfo.h"
 
 #define XLOG_GIST_PAGE_UPDATE		0x00
-#define XLOG_GIST_DELETE			0x10	/* delete leaf index tuples for a
-											 * page */
+#define XLOG_GIST_DELETE			0x10	/* delete leaf index tuples marked
+											 * as LP_DEAD during normal index
+											 * tuple insertion */
 #define XLOG_GIST_PAGE_REUSE		0x20	/* old page is about to be reused
 											 * from FSM */
 #define XLOG_GIST_PAGE_SPLIT		0x30
- /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
+#define XLOG_GIST_PAGE_VACUUM		0x40	/* delete leaf index tuples during
+											 * VACUUM */
  /* #define XLOG_GIST_CREATE_INDEX		 0x50 */	/* not used anymore */
 #define XLOG_GIST_PAGE_DELETE		0x60
  /* #define XLOG_GIST_ASSIGN_LSN		 0x70 */	/* not used anymore */
 
 /*
+ * Used by both XLOG_GIST_PAGE_UPDATE and XLOG_GIST_PAGE_VACUUM.  VACUUM only
+ * ever deletes tuples (ntoinsert is 0, and there is no left child), but the
+ * page-level changes are otherwise the same; the records differ only in that
+ * replaying a VACUUM record takes a cleanup lock on the target page.
+ *
  * Backup Blk 0: updated page.
  * Backup Blk 1: If this operation completes a page split, by inserting a
  *				 downlink for the split page, the left half of the split
diff --git a/src/include/access/indexbatch.h b/src/include/access/indexbatch.h
index 9471a9db5..24b531705 100644
--- a/src/include/access/indexbatch.h
+++ b/src/include/access/indexbatch.h
@@ -101,6 +101,8 @@ index_scan_batch_append(IndexScanDesc scan, IndexScanBatch batch)
  *
  *   [table AM opaque area]    <- table AM area (batch_table_opaque_size),
  *                                optionally requested by table AM
+ *   [index AM dyn opaque]     <- index AM area (batch_index_opaque_dyn),
+ *                                optionally requested by index AM
  *   [index AM static opaque]  <- index AM area (batch_index_opaque_static),
  *                                mandatory fixed-size index AM area
  *   [IndexScanBatchData]      <- batch pointer, returned by amgetbatch
@@ -129,6 +131,17 @@ index_scan_batch_append(IndexScanDesc scan, IndexScanBatch batch)
  * area.  Access to the area is cheap (a compile-time-constant subtraction),
  * but its size cannot vary from scan to scan.  Index AMs typically use this
  * area to store things like index page sibling link block numbers.
+ *
+ * Index AMs can use a second, optional dynamically-sized private area
+ * (batch_index_opaque_dyn) that sits just before the static area.  Its size
+ * is chosen at scan start rather than at compile time.  It is accessed via
+ * index_scan_batch_index_opaque_dyn.  This second area is generally only used
+ * during scans where large amounts of supplemental metadata are required,
+ * that cannot reasonably be allocated for every scan.  Typically, this is
+ * granular information about the batch's items for use by the index AM's
+ * amgettransform routine (the tuples themselves are stored separately, in
+ * on-disk format, in the currTuples workspace; amgettransform converts each
+ * one into the scan's returnable tuple).
  * ----------------------------------------------------------------------------
  */
 
@@ -165,6 +178,18 @@ index_scan_batch_table_area(IndexScanDesc scan, IndexScanBatch batch)
 	(AssertMacro((scan)->batch_index_opaque_static == MAXALIGN(sizeof(type))), \
 	 ((type *) ((char *) (batch) - MAXALIGN(sizeof(type)))))
 
+/*
+ * Return a pointer to the index AM's dynamic opaque area
+ */
+static inline void *
+index_scan_batch_index_opaque_dyn(IndexScanDesc scan, IndexScanBatch batch)
+{
+	Assert(scan->batch_index_opaque_dyn > 0);
+
+	return (char *) batch - scan->batch_index_opaque_static -
+		MAXALIGN(scan->batch_index_opaque_dyn);
+}
+
 /* ----------------------------------------------------------------------------
  * Elementary batch position operations
  * ----------------------------------------------------------------------------
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index f2f66e367..3a1e616d3 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -377,6 +377,11 @@ typedef struct IndexScanDescData
 	 */
 	uint32		batch_table_opaque_size;	/* table AM opaque area size */
 
+	/*
+	 * Optional dynamic opaque size, also set by index AM in ambeginscan
+	 */
+	uint32		batch_index_opaque_dyn;
+
 	/*
 	 * Offset used by index_scan_batch_base (set on first batch alloc).  See
 	 * access/indexbatch.h.
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2d9d04aa3..4799a40b7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -302,6 +302,7 @@ brinhandler(PG_FUNCTION_ARGS)
 		.amgetbatch = NULL,
 		.amunguardbatch = NULL,
 		.amkillitemsbatch = NULL,
+		.amgettransform = NULL,
 		.amgetbitmap = bringetbitmap,
 		.amendscan = brinendscan,
 		.amposreset = NULL,
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0e8b6a549..ceb9cb447 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -87,6 +87,7 @@ ginhandler(PG_FUNCTION_ARGS)
 		.amgetbatch = NULL,
 		.amunguardbatch = NULL,
 		.amkillitemsbatch = NULL,
+		.amgettransform = NULL,
 		.amgetbitmap = gingetbitmap,
 		.amendscan = ginendscan,
 		.amposreset = NULL,
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 75445b074..8864a3faf 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -48,7 +48,7 @@ The original algorithms were modified in several ways:
 
 * They had to be adapted to PostgreSQL conventions. For example, the SEARCH
   algorithm was considerably changed, because in PostgreSQL the search function
-  should return one tuple (next), not all tuples at once. Also, it should
+  returns matching tuples incrementally, not all at once. Also, it should
   release page locks between calls.
 * Since we added support for variable length keys, it's not possible to
   guarantee enough free space for all keys on pages after splitting. User
@@ -71,20 +71,24 @@ was not touched in the paper.
 Search Algorithm
 ----------------
 
-The search code maintains a queue of unvisited items, where an "item" is
-either a heap tuple known to satisfy the search conditions, or an index
-page that is consistent with the search conditions according to inspection
-of its parent page's downlink item.  Initially the root page is searched
-to find unvisited items in it.  Then we pull items from the queue.  A
-heap tuple pointer is just returned immediately; an index page entry
-causes that page to be searched, generating more queue entries.
+The search code maintains a queue of unvisited items.  For a plain index
+scan an "item" is always an index page that is consistent with the search
+conditions according to inspection of its parent page's downlink item;
+matching heap tuples are not queued, but are gathered into a batch as each
+leaf page is scanned (see "Returning matches in batches", below).  For a
+nearest-neighbor (ordered) scan the queue additionally holds heap tuples
+known to satisfy the search conditions, so that heap tuples and index
+pages can be interleaved in distance order.  Initially the root page is
+added to the queue.  Then we pull items from the queue: an index page
+entry causes that page to be scanned, generating more queue entries, while
+a heap tuple entry (ordered scans only) is a match to be returned.
 
-The queue is kept ordered with heap tuple items at the front, then
-index page entries, with any newly-added index page entry inserted
-before existing index page entries.  This ensures depth-first traversal
-of the index, and in particular causes the first few heap tuples to be
-returned as soon as possible.  That is helpful in case there is a LIMIT
-that requires only a few tuples to be produced.
+The queue is kept ordered so that we perform a depth-first traversal of
+the index: any newly-added index page entry is inserted before existing
+index page entries, and (for ordered scans) heap tuple items are kept at
+the front.  This causes the first few matching heap tuples to be returned
+as soon as possible, which is helpful in case there is a LIMIT that
+requires only a few tuples to be produced.
 
 To implement nearest-neighbor search, the queue entries are augmented
 with distance data: heap tuple entries are labeled with exact distance
@@ -94,17 +98,18 @@ queue entries are retrieved in smallest-distance-first order, with
 entries having identical distances managed as stated in the previous
 paragraph.
 
-The search algorithm keeps an index page locked only long enough to scan
-its entries and queue those that satisfy the search conditions.  Since
-insertions can occur concurrently with searches, it is possible for an
-index child page to be split between the time we make a queue entry for it
-(while visiting its parent page) and the time we actually reach and scan
-the child page.  To avoid missing the entries that were moved to the right
-sibling, we detect whether a split has occurred by comparing the child
-page's NSN (node sequence number, a special-purpose LSN) to the LSN that
-the parent had when visited.  If it did, the sibling page is immediately
-added to the front of the queue, ensuring that its items will be scanned
-in the same order as if they were still on the original child page.
+The search algorithm keeps an index page locked only long enough to scan its
+entries -- queueing the child pages that satisfy the search conditions, and
+gathering any matching heap tuples (into a batch, or onto the queue for an
+ordered scan).  Since insertions can occur concurrently with searches, it is
+possible for an index child page to be split between the time we make a queue
+entry for it (while visiting its parent page) and the time we actually reach
+and scan the child page.  To avoid missing the entries that were moved to the
+right sibling, we detect whether a split has occurred by comparing the child
+page's NSN (node sequence number, a special-purpose LSN) to the LSN that the
+parent had when visited.  If it did, the sibling page is immediately added to
+the front of the queue, ensuring that its items will be scanned in the same
+order as if they were still on the original child page.
 
 As is usual in Postgres, the search algorithm only guarantees to find index
 entries that existed before the scan started; index entries added during
@@ -116,6 +121,36 @@ Any such enlargement would be to add child items that we aren't interested
 in returning anyway.
 
 
+Returning matches in batches
+----------------------------
+
+GiST implements the amgetbatch index AM interface, whose contract is
+documented in doc/src/sgml/indexam.sgml (see also
+src/backend/access/nbtree/README).  Each call hands the table AM a batch of
+matching TIDs rather than a single TID.  GiST forms two kinds of batch:
+
+* A plain (non-ordered) scan returns one "conventional" batch per leaf
+  page, holding all of that page's matching TIDs in physical order.  As in
+  nbtree and hash, the batch retains the leaf page's buffer pin (though not
+  its content lock) as the interlock against concurrent TID recycling by
+  VACUUM.
+
+* A nearest-neighbor (ordered) scan returns a single "virtual" batch.  Its
+  matches don't arrive one leaf page at a time, so instead we drain the
+  distance-ordered queue, copying matching TIDs into the batch in distance
+  order -- typically spanning many leaf pages.  A virtual batch retains no
+  buffer pin; each leaf page's pin is dropped as soon as the page is scanned.
+
+VACUUM honors a batch's pin by taking a cleanup lock on the leaf page (see
+"Bulk delete algorithm (VACUUM)", below), just as nbtree does.  Because a
+virtual batch holds no such pin, ordered scans come with two restrictions,
+both also seen in bitmap (amgetbitmap) scans and both explained in
+doc/src/sgml/indexam.sgml: they never set LP_DEAD bits (gistkillitemsbatch
+does nothing for a virtual batch), and they are never planned as index-only
+scans (a virtual batch has no pin to offer as the TID-recycling interlock
+that index-only scans depend on).
+
+
 Insert Algorithm
 ----------------
 
@@ -452,6 +487,15 @@ B-tree VACUUM uses, but because we already have NSNs on pages, to detect page
 splits during searches, we don't need a "vacuum cycle ID" concept for that
 like B-tree does.
 
+We take a full cleanup lock on every leaf page as we scan it, even leaf
+pages with no deletable tuples.  As in nbtree, this is the interlock that
+protects concurrent scans against TID recycling; see "Returning matches in
+batches", above.  Replay of the resulting XLOG_GIST_PAGE_VACUUM records
+takes the same cleanup lock, so that the interlock also protects index-only
+scans running on a hot standby.  Recovery only needs the cleanup lock on
+pages that actually have items to delete (the only pages that generate a
+record), not on every leaf page.
+
 While we scan all the pages, we also make note of any completely empty leaf
 pages. We will try to unlink them from the tree after the scan. We also record
 the block numbers of all internal pages; they are needed to locate parents of
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 67b16053a..88b8a4ddf 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -103,10 +103,11 @@ gisthandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = gistadjustmembers,
 		.ambeginscan = gistbeginscan,
 		.amrescan = gistrescan,
-		.amgettuple = gistgettuple,
-		.amgetbatch = NULL,
-		.amunguardbatch = NULL,
-		.amkillitemsbatch = NULL,
+		.amgettuple = NULL,
+		.amgetbatch = gistgetbatch,
+		.amunguardbatch = gistunguardbatch,
+		.amkillitemsbatch = gistkillitemsbatch,
+		.amgettransform = gistgettransform,
 		.amgetbitmap = gistgetbitmap,
 		.amendscan = gistendscan,
 		.amposreset = NULL,
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 4d7c100d7..d6c268084 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -27,84 +27,84 @@
 #include "utils/rel.h"
 
 /*
- * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
- * told us were killed.
- *
- * We re-read page here, so it's important to check page LSN. If the page
- * has been modified since the last read (as determined by LSN), we cannot
- * flag any entries because it is possible that the old entry was vacuumed
- * away and the TID was re-used by a completely different heap tuple.
+ * gistkillitemsbatch() -- Mark dead items' index tuples LP_DEAD
  */
-static void
-gistkillitems(IndexScanDesc scan)
+void
+gistkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch)
 {
-	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
-	Buffer		buffer;
+	GISTBatchData *gbatch = GISTBatchGetData(scan, batch);
+	Relation	rel = scan->indexRelation;
+	Buffer		buf;
 	Page		page;
-	OffsetNumber offnum;
-	ItemId		iid;
-	int			i;
 	bool		killedsomething = false;
+	XLogRecPtr	latestlsn;
 
-	Assert(so->curBlkno != InvalidBlockNumber);
-	Assert(XLogRecPtrIsValid(so->curPageLSN));
-	Assert(so->killedItems != NULL);
+	Assert(batch->numDead > 0);
 
-	buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
-	if (!BufferIsValid(buffer))
+	/*
+	 * Skip virtual (ordered-scan) batches, since there's no practical way to
+	 * visit all of the index pages that these tuples really came from
+	 */
+	if (gbatch->blkno == InvalidBlockNumber)
 		return;
 
-	LockBuffer(buffer, GIST_SHARE);
-	gistcheckpage(scan->indexRelation, buffer);
-	page = BufferGetPage(buffer);
+	buf = ReadBuffer(rel, gbatch->blkno);
+	LockBuffer(buf, GIST_SHARE);
+	gistcheckpage(rel, buf);
+	page = BufferGetPage(buf);
 
-	/*
-	 * If page LSN differs it means that the page was modified since the last
-	 * read. killedItems could be not valid so LP_DEAD hints applying is not
-	 * safe.
-	 */
-	if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
-		goto unlock;
-
-	Assert(GistPageIsLeaf(page));
-
-	/*
-	 * Mark all killedItems as dead. We need no additional recheck, because,
-	 * if page was modified, curPageLSN must have changed.
-	 */
-	for (i = 0; i < so->numKilled; i++)
+	latestlsn = BufferGetLSNAtomic(buf);
+	Assert(batch->lsn <= latestlsn);
+	if (batch->lsn != latestlsn)
 	{
-		if (!killedsomething)
-		{
-			/*
-			 * Use the hint bit infrastructure to check if we can update the
-			 * page while just holding a share lock. If we are not allowed,
-			 * there's no point continuing.
-			 */
-			if (!BufferBeginSetHintBits(buffer))
-				goto unlock;
-		}
+		/* Modified, give up on hinting */
+		UnlockReleaseBuffer(buf);
+		return;
+	}
 
-		offnum = so->killedItems[i];
-		iid = PageGetItemId(page, offnum);
-		ItemIdMarkDead(iid);
-		killedsomething = true;
+	/* Iterate through batch->deadItems[] in index page order */
+	for (int i = 0; i < batch->numDead; i++)
+	{
+		int			itemIndex = batch->deadItems[i];
+		OffsetNumber offnum = batch->items[itemIndex].indexOffset;
+		ItemId		iid = PageGetItemId(page, offnum);
+
+		Assert(itemIndex >= batch->firstItem && itemIndex <= batch->lastItem);
+		Assert(i == 0 ||
+			   offnum > batch->items[batch->deadItems[i - 1]].indexOffset);
+		Assert(offnum <= PageGetMaxOffsetNumber(page));
+		Assert(ItemPointerEquals(&((IndexTuple) PageGetItem(page, iid))->t_tid,
+								 &batch->items[itemIndex].tableTid));
+
+		/* Mark index item as dead, if it isn't already */
+		if (!ItemIdIsDead(iid))
+		{
+			if (!killedsomething)
+			{
+				/*
+				 * Use the hint bit infrastructure to check if we can update
+				 * the page while just holding a share lock. If we are not
+				 * allowed, there's no point continuing.
+				 */
+				if (!BufferBeginSetHintBits(buf))
+				{
+					UnlockReleaseBuffer(buf);
+					return;
+				}
+			}
+
+			ItemIdMarkDead(iid);
+			killedsomething = true;
+		}
 	}
 
 	if (killedsomething)
 	{
 		GistMarkPageHasGarbage(page);
-		BufferFinishSetHintBits(buffer, true, true);
+		BufferFinishSetHintBits(buf, true, true);
 	}
 
-unlock:
-	UnlockReleaseBuffer(buffer);
-
-	/*
-	 * Always reset the scan state, so we don't look for same items on other
-	 * pages.
-	 */
-	so->numKilled = 0;
+	UnlockReleaseBuffer(buf);
 }
 
 /*
@@ -318,16 +318,25 @@ gistindex_keytest(IndexScanDesc scan,
  * scan: index scan we are executing
  * pageItem: search queue item identifying an index page to scan
  * myDistances: distances array associated with pageItem, or NULL at the root
- * tbm: if not NULL, gistgetbitmap's output bitmap
- * ntids: if not NULL, gistgetbitmap's output tuple counter
+ * newbatch: caller's batch to fill, for a non-ordered scan; NULL when ordered
  *
- * If tbm/ntids aren't NULL, we are doing an amgetbitmap scan, and heap
- * tuples should be reported directly into the bitmap.  If they are NULL,
- * we're doing a plain or ordered indexscan.  For a plain indexscan, heap
- * tuple TIDs are returned into so->pageData[].  For an ordered indexscan,
- * heap tuple TIDs are pushed into individual search queue items.  In an
- * index-only scan, reconstructed index tuples are returned along with the
- * TIDs.
+ * For a non-ordered scan (newbatch isn't NULL, which is the case for both
+ * unordered gistgetbatch and gistgetbitmap), matching item TIDs from a leaf
+ * page are stored into caller's newbatch to return via gistgetbatch.  If we
+ * don't save any items in newbatch, caller needs to find the next leaf page
+ * that has matches and save its items in newbatch instead (if there is none
+ * then caller should release newbatch).
+ *
+ * For an ordered (nearest-neighbor) scan (newbatch is NULL), matching leaf heap
+ * tuples are pushed onto the search queue as GISTSearchItems carrying their
+ * distances, so the queue can later be drained in distance order.  The page's
+ * buffer pin is dropped before returning.  This can only happen during
+ * batchImmediateUnguard scans, which is what makes it safe.  Groups of enqueued
+ * items will eventually be returned (in the expected order) as "virtual
+ * batches", but we don't do that here.
+ *
+ * In all cases, lower index pages are pushed onto the search queue to be
+ * visited later.
  *
  * If we detect that the index page has split since we saw its downlink
  * in the parent, we push its new right sibling onto the queue so the
@@ -335,10 +344,9 @@ gistindex_keytest(IndexScanDesc scan,
  */
 static void
 gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
-			 IndexOrderByDistance *myDistances, TIDBitmap *tbm, int64 *ntids)
+			 IndexOrderByDistance *myDistances, IndexScanBatch newbatch)
 {
 	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
-	GISTSTATE  *giststate = so->giststate;
 	Relation	r = scan->indexRelation;
 	Buffer		buffer;
 	Page		page;
@@ -347,7 +355,12 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
 	OffsetNumber i;
 	MemoryContext oldcxt;
 
+	/* state used when saving matching items into caller's newbatch */
+	int			itemIndex = 0;
+	int			tupleOffset = 0;
+
 	Assert(!GISTSearchItemIsHeap(*pageItem));
+	Assert((scan->numberOfOrderBys == 0) == (newbatch != NULL));
 
 	buffer = ReadBuffer(scan->indexRelation, pageItem->blkno);
 	LockBuffer(buffer, GIST_SHARE);
@@ -399,22 +412,11 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
 	 */
 	if (GistPageIsDeleted(page))
 	{
+		Assert(!newbatch || newbatch->firstItem > newbatch->lastItem);
 		UnlockReleaseBuffer(buffer);
 		return;
 	}
 
-	so->nPageData = so->curPageData = 0;
-	scan->xs_hitup = NULL;		/* might point into pageDataCxt */
-	if (so->pageDataCxt)
-		MemoryContextReset(so->pageDataCxt);
-
-	/*
-	 * We save the LSN of the page as we read it, so that we know whether it
-	 * is safe to apply LP_DEAD hints to the page later. This allows us to
-	 * drop the pin for MVCC scans, which allows vacuum to avoid blocking.
-	 */
-	so->curPageLSN = BufferGetLSNAtomic(buffer);
-
 	/*
 	 * check all tuples on page
 	 */
@@ -452,36 +454,28 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
 		if (!match)
 			continue;
 
-		if (tbm && GistPageIsLeaf(page))
+		if (scan->numberOfOrderBys == 0 && GistPageIsLeaf(page))
 		{
 			/*
-			 * getbitmap scan, so just push heap tuple TIDs into the bitmap
-			 * without worrying about ordering
+			 * Non-ordered scan (unordered amgetbatch or bitmap), so just
+			 * store another matching item in caller's batch without worrying
+			 * about ordering
 			 */
-			tbm_add_tuples(tbm, &it->t_tid, 1, recheck);
-			(*ntids)++;
-		}
-		else if (scan->numberOfOrderBys == 0 && GistPageIsLeaf(page))
-		{
-			/*
-			 * Non-ordered scan, so report tuples in so->pageData[]
-			 */
-			so->pageData[so->nPageData].heapPtr = it->t_tid;
-			so->pageData[so->nPageData].recheck = recheck;
-			so->pageData[so->nPageData].offnum = i;
+			newbatch->items[itemIndex].tableTid = it->t_tid;
+			newbatch->items[itemIndex].indexOffset = i;
+			newbatch->items[itemIndex].tupleOffset = 0;
+			GISTBatchGetRecheck(scan, newbatch)[itemIndex] = recheck;
 
-			/*
-			 * In an index-only scan, also fetch the data from the tuple.  The
-			 * reconstructed tuples are stored in pageDataCxt.
-			 */
 			if (scan->xs_want_itup)
 			{
-				oldcxt = MemoryContextSwitchTo(so->pageDataCxt);
-				so->pageData[so->nPageData].recontup =
-					gistFetchTuple(giststate, r, it);
-				MemoryContextSwitchTo(oldcxt);
+				/* Copy on-disk format index tuple into currTuples */
+				Size		itupsz = IndexTupleSize(it);
+
+				newbatch->items[itemIndex].tupleOffset = tupleOffset;
+				memcpy(newbatch->currTuples + tupleOffset, it, itupsz);
+				tupleOffset += MAXALIGN(itupsz);
 			}
-			so->nPageData++;
+			itemIndex++;
 		}
 		else
 		{
@@ -500,17 +494,15 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
 
 			if (GistPageIsLeaf(page))
 			{
-				/* Creating heap-tuple GISTSearchItem */
+				/* Creating heap-tuple GISTSearchItem for ordered search */
+				Assert(scan->numberOfOrderBys > 0);
+				Assert(newbatch == NULL);
+				Assert(scan->batchImmediateUnguard);
+
 				item->blkno = InvalidBlockNumber;
 				item->data.heap.heapPtr = it->t_tid;
 				item->data.heap.recheck = recheck;
 				item->data.heap.recheckDistances = recheck_distances;
-
-				/*
-				 * In an index-only scan, also fetch the data from the tuple.
-				 */
-				if (scan->xs_want_itup)
-					item->data.heap.recontup = gistFetchTuple(giststate, r, it);
 			}
 			else
 			{
@@ -535,6 +527,30 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
 		}
 	}
 
+	if (newbatch)
+	{
+		/* Finalize result batch during a non-ordered scan */
+		Assert(scan->numberOfOrderBys == 0);
+
+		newbatch->firstItem = 0;
+		newbatch->lastItem = itemIndex - 1;
+
+		if (itemIndex > 0)
+		{
+			GISTBatchData *gnewbatch;
+
+			Assert(GistPageIsLeaf(page));
+
+			gnewbatch = GISTBatchGetData(scan, newbatch);
+			gnewbatch->buf = buffer;
+			gnewbatch->blkno = BufferGetBlockNumber(buffer);
+
+			indexam_util_unlock_batch(scan, newbatch, buffer);
+			return;
+		}
+		/* else caller needs to find another page to fill newbatch */
+	}
+
 	UnlockReleaseBuffer(buffer);
 }
 
@@ -563,22 +579,111 @@ getNextGISTSearchItem(GISTScanOpaque so)
 }
 
 /*
- * Fetch next heap tuple in an ordered search
+ * gistScanStart() -- begin a scan by queueing its root page
+ *
+ * Called on the first amgetbatch/amgetbitmap call of a scan (the caller having
+ * already checked that the qual is satisfiable).  Counts the scan for stats and
+ * queues the root page as the first work item, so the scan drivers are
+ * otherwise pure queue drainers.  The root carries a zeroed parentlsn (it has
+ * no parent, so gistScanPage's split-detection is a no-op for it) and zeroed
+ * distances (so it sorts first in an ordered scan).
+ *
+ * Starting the scan here, rather than in gistrescan, follows the convention
+ * that amrescan only sets up scan keys while the scan proper (counting it,
+ * reading index pages) begins on the first fetch.
  */
-static bool
-getNextNearest(IndexScanDesc scan)
+static void
+gistScanStart(IndexScanDesc scan)
 {
 	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
-	bool		res = false;
+	GISTSearchItem *root;
+	MemoryContext oldcxt;
 
-	if (scan->xs_hitup)
+	pgstat_count_index_scan(scan->indexRelation);
+	if (scan->instrument)
+		scan->instrument->nsearches++;
+
+	oldcxt = MemoryContextSwitchTo(so->queueCxt);
+	root = palloc(SizeOfGISTSearchItem(scan->numberOfOrderBys));
+	root->blkno = GIST_ROOT_BLKNO;
+	memset(&root->data.parentlsn, 0, sizeof(GistNSN));
+	memset(root->distances, 0,
+		   sizeof(root->distances[0]) * scan->numberOfOrderBys);
+	pairingheap_add(so->queue, &root->phNode);
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * getNextBatch() -- read the next leaf page with matches into a fresh batch
+ *
+ * gistgetbatch's non-ordered walker, also driven by gistgetbitmap.  Allocates a
+ * batch and drains the queue, scanning each queued index page until one
+ * produces matching leaf items, then returns that batch.  When the queue is
+ * exhausted without a match, releases the batch and returns NULL.
+ */
+static IndexScanBatch
+getNextBatch(IndexScanDesc scan)
+{
+	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+	IndexScanBatch newbatch = indexam_util_alloc_batch(scan);
+
+	/* GiST only ever scans forward; set the batch's direction up front */
+	newbatch->dir = ForwardScanDirection;
+
+	for (;;)
 	{
-		/* free previously returned tuple */
-		pfree(scan->xs_hitup);
-		scan->xs_hitup = NULL;
+		GISTSearchItem *item = getNextGISTSearchItem(so);
+
+		if (item == NULL)
+		{
+			/* No more index pages to scan; the scan is exhausted */
+			indexam_util_release_batch(scan, newbatch);
+			return NULL;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Scan this queued index page; matching leaf items go into the batch */
+		gistScanPage(scan, item, item->distances, newbatch);
+		pfree(item);
+
+		/* If this leaf page produced matching items, return the batch */
+		if (newbatch->firstItem <= newbatch->lastItem)
+			return newbatch;
 	}
 
-	do
+	pg_unreachable();
+
+	return NULL;
+}
+
+/*
+ * getNextNearestBatch() -- drain the queue into a fresh batch in distance order
+ *
+ * gistgetbatch's ordered (nearest-neighbor) walker.  The pairing-heap queue
+ * (so->queue) holds both unvisited index pages and matching leaf heap tuples,
+ * ordered by (lower-bound) distance.  We pop items in that order, dispatching
+ * on the item type.  A popped heap tuple is appended to the batch.  We stop
+ * once the batch is full (maxitemsbatch items) or the queue is exhausted,
+ * leaving any remaining items queued for the next call.
+ *
+ * Because the queue is drained in nondecreasing distance order across the whole
+ * scan (a downlink's distance is a lower bound on its subtree, so items pushed
+ * while scanning a page never sort ahead of items already popped), the
+ * batches we emit are globally distance-ordered.
+ */
+static IndexScanBatch
+getNextNearestBatch(IndexScanDesc scan)
+{
+	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+	IndexScanBatch newbatch = indexam_util_alloc_batch(scan);
+	GISTBatchData *gnewbatch;
+	int			nitems = 0;
+
+	/* GiST only ever scans forward; set the batch's direction up front */
+	newbatch->dir = ForwardScanDirection;
+
+	for (;;)
 	{
 		GISTSearchItem *item = getNextGISTSearchItem(so);
 
@@ -588,37 +693,67 @@ getNextNearest(IndexScanDesc scan)
 		if (GISTSearchItemIsHeap(*item))
 		{
 			/* found a heap item at currently minimal distance */
-			scan->xs_heaptid = item->data.heap.heapPtr;
-			scan->xs_recheck = item->data.heap.recheck;
+			GISTBatchItem *bitem = GISTBatchGetItem(scan, newbatch, nitems);
 
-			index_store_float8_orderby_distances(scan, so->orderByTypes,
-												 item->distances,
-												 item->data.heap.recheckDistances);
+			newbatch->items[nitems].tableTid = item->data.heap.heapPtr;
+			newbatch->items[nitems].indexOffset = -1;	/* meaningless here */
+			newbatch->items[nitems].tupleOffset = 0;
 
-			/* in an index-only scan, also return the reconstructed tuple. */
-			if (scan->xs_want_itup)
-				scan->xs_hitup = item->data.heap.recontup;
-			res = true;
+			bitem->recheck = item->data.heap.recheck;
+			bitem->recheckDistances = item->data.heap.recheckDistances;
+			memcpy(bitem->distances, item->distances,
+				   sizeof(item->distances[0]) * scan->numberOfOrderBys);
+
+			nitems++;
+			pfree(item);
+
+			if (nitems == scan->maxitemsbatch)
+				break;			/* batch full; remaining items stay queued */
 		}
 		else
 		{
 			/* visit an index page, extract its items into queue */
 			CHECK_FOR_INTERRUPTS();
 
-			gistScanPage(scan, item, item->distances, NULL, NULL);
+			gistScanPage(scan, item, item->distances, NULL);
+			pfree(item);
 		}
+	}
 
-		pfree(item);
-	} while (!res);
+	if (nitems == 0)
+	{
+		/* No matching items remain: the scan is exhausted */
+		indexam_util_release_batch(scan, newbatch);
+		return NULL;
+	}
 
-	return res;
+	/*
+	 * An ordered batch is "virtual": its items come from many leaf pages,
+	 * whose pins gistScanPage already dropped, so it holds no TID recycling
+	 * interlock.  It has no single originating page, and we don't track those
+	 * index pages in any case (gistkillitemsbatch will just skip the batch).
+	 */
+	Assert(!newbatch->isGuarded);
+
+	newbatch->firstItem = 0;
+	newbatch->lastItem = nitems - 1;
+
+	gnewbatch = GISTBatchGetData(scan, newbatch);
+	gnewbatch->buf = InvalidBuffer;
+	gnewbatch->blkno = InvalidBlockNumber;
+
+	return newbatch;
 }
 
 /*
- * gistgettuple() -- Get the next tuple in the scan
+ * gistgetbatch() -- Get the first or next batch of items in a scan
+ *
+ * Dispatches to the ordered or non-ordered walker.  Persistent traversal state
+ * lives in so->queue, so priorbatch is unused except to recognize the scan's
+ * first call, when we queue the root page (gistScanStart).
  */
-bool
-gistgettuple(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+gistgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir)
 {
 	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
 
@@ -626,124 +761,111 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
 		elog(ERROR, "GiST only supports forward scan direction");
 
 	if (!so->qual_ok)
-		return false;
+		return NULL;
 
-	if (so->firstCall)
-	{
-		/* Begin the scan by processing the root page */
-		GISTSearchItem fakeItem;
-
-		pgstat_count_index_scan(scan->indexRelation);
-		if (scan->instrument)
-			scan->instrument->nsearches++;
-
-		so->firstCall = false;
-		so->curPageData = so->nPageData = 0;
-		scan->xs_hitup = NULL;
-		if (so->pageDataCxt)
-			MemoryContextReset(so->pageDataCxt);
-
-		fakeItem.blkno = GIST_ROOT_BLKNO;
-		memset(&fakeItem.data.parentlsn, 0, sizeof(GistNSN));
-		gistScanPage(scan, &fakeItem, NULL, NULL, NULL);
-	}
+	if (priorbatch == NULL)
+		gistScanStart(scan);
 
+	if (scan->numberOfOrderBys > 0)
+		return getNextNearestBatch(scan);
+
+	return getNextBatch(scan);
+}
+
+/*
+ * gistunguardbatch() -- Drop a batch's TID recycling interlock (buffer pin)
+ *
+ * Called by the table AM when it's safe to drop the buffer pin held to
+ * prevent concurrent TID recycling by VACUUM.
+ */
+void
+gistunguardbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	GISTBatchData *gbatch = GISTBatchGetData(scan, batch);
+
+	/* Should be called exactly once iff !batchImmediateUnguard */
+	Assert(!scan->batchImmediateUnguard);
+	Assert(batch->isGuarded);
+
+	ReleaseBuffer(gbatch->buf);
+}
+
+/*
+ * gistgettransform() -- Set up the scan's per-tuple output for one batch item
+ *
+ * Implements the amgettransform interface.  The table AM calls this as it
+ * returns each item of a GiST scan, to set the scan descriptor's per-tuple
+ * output from the item's per-item data.
+ *
+ *   - We always apply the item's qual recheck flag to scan->xs_recheck.
+ *   - For ordered scans, we report the item's own ORDER BY distances (stored in
+ *     the per-item index AM area by getNextNearestBatch) as xs_orderbyvals.
+ *     They are flagged for recheck only when the distance function was lossy
+ *     for that item; an exact distance is reported as final, while a lossy
+ *     lower bound is rechecked by the executor's reorder queue to recompute
+ *     the true order.
+ *   - For index-only scans, we reconstruct the originally indexed values from
+ *     the stored on-disk index tuple into a heap tuple, exposed as xs_hitup.
+ *
+ * The reconstructed tuple lives in the scan's memory context and only needs to
+ * outlive a single table_index_getnext_slot call (the executor copies it into
+ * the scan slot).  We free the previously returned tuple before building the
+ * next one.
+ */
+void
+gistgettransform(IndexScanDesc scan, IndexScanBatch batch, int item)
+{
+	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+
+	Assert(item >= batch->firstItem && item <= batch->lastItem);
+
+	/* Ordered scan (must be a plain index scan) */
 	if (scan->numberOfOrderBys > 0)
 	{
-		/* Must fetch tuples in strict distance order */
-		return getNextNearest(scan);
+		GISTBatchItem *bitem = GISTBatchGetItem(scan, batch, item);
+
+		Assert(!scan->xs_want_itup);
+
+		/* Apply this item's qual recheck flag */
+		scan->xs_recheck = bitem->recheck;
+
+		/*
+		 * Note: This is a "virtual" batch.  getNextNearestBatch stored the
+		 * items in caller's batch in distance order, right before
+		 * gistgetbatch returned it.
+		 */
+		Assert(GISTBatchGetData(scan, batch)->blkno == InvalidBlockNumber);
+		index_store_float8_orderby_distances(scan, so->orderByTypes,
+											 bitem->distances,
+											 bitem->recheckDistances);
+		return;
 	}
-	else
+
+	/*
+	 * Unordered scan.
+	 *
+	 * Always uses a simple bool array for item recheck flags.
+	 */
+	scan->xs_recheck = GISTBatchGetRecheck(scan, batch)[item];
+
+	/* Index-only scan */
+	if (scan->xs_want_itup)
 	{
-		/* Fetch tuples index-page-at-a-time */
-		for (;;)
+		/* Reconstruct a returnable heap tuple from stashed index tuple */
+		IndexTuple	itup = (IndexTuple) (batch->currTuples +
+										 batch->items[item].tupleOffset);
+		MemoryContext oldcxt;
+
+		if (scan->xs_hitup)
 		{
-			if (so->curPageData < so->nPageData)
-			{
-				if (scan->kill_prior_tuple && so->curPageData > 0)
-				{
-
-					if (so->killedItems == NULL)
-					{
-						MemoryContext oldCxt =
-							MemoryContextSwitchTo(so->giststate->scanCxt);
-
-						so->killedItems =
-							(OffsetNumber *) palloc(MaxIndexTuplesPerPage
-													* sizeof(OffsetNumber));
-
-						MemoryContextSwitchTo(oldCxt);
-					}
-					if (so->numKilled < MaxIndexTuplesPerPage)
-						so->killedItems[so->numKilled++] =
-							so->pageData[so->curPageData - 1].offnum;
-				}
-				/* continuing to return tuples from a leaf page */
-				scan->xs_heaptid = so->pageData[so->curPageData].heapPtr;
-				scan->xs_recheck = so->pageData[so->curPageData].recheck;
-
-				/* in an index-only scan, also return the reconstructed tuple */
-				if (scan->xs_want_itup)
-					scan->xs_hitup = so->pageData[so->curPageData].recontup;
-
-				so->curPageData++;
-
-				return true;
-			}
-
-			/*
-			 * Check the last returned tuple and add it to killedItems if
-			 * necessary
-			 */
-			if (scan->kill_prior_tuple
-				&& so->curPageData > 0
-				&& so->curPageData == so->nPageData)
-			{
-
-				if (so->killedItems == NULL)
-				{
-					MemoryContext oldCxt =
-						MemoryContextSwitchTo(so->giststate->scanCxt);
-
-					so->killedItems =
-						(OffsetNumber *) palloc(MaxIndexTuplesPerPage
-												* sizeof(OffsetNumber));
-
-					MemoryContextSwitchTo(oldCxt);
-				}
-				if (so->numKilled < MaxIndexTuplesPerPage)
-					so->killedItems[so->numKilled++] =
-						so->pageData[so->curPageData - 1].offnum;
-			}
-			/* find and process the next index page */
-			do
-			{
-				GISTSearchItem *item;
-
-				if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
-					gistkillitems(scan);
-
-				item = getNextGISTSearchItem(so);
-
-				if (!item)
-					return false;
-
-				CHECK_FOR_INTERRUPTS();
-
-				/* save current item BlockNumber for next gistkillitems() call */
-				so->curBlkno = item->blkno;
-
-				/*
-				 * While scanning a leaf page, ItemPointers of matching heap
-				 * tuples are stored in so->pageData.  If there are any on
-				 * this page, we fall out of the inner "do" and loop around to
-				 * return them.
-				 */
-				gistScanPage(scan, item, item->distances, NULL, NULL);
-
-				pfree(item);
-			} while (so->nPageData == 0);
+			pfree(scan->xs_hitup);
+			scan->xs_hitup = NULL;
 		}
+
+		/* reconstruct the originally indexed values as a heap tuple */
+		oldcxt = MemoryContextSwitchTo(so->giststate->scanCxt);
+		scan->xs_hitup = gistFetchTuple(so->giststate, scan->indexRelation, itup);
+		MemoryContextSwitchTo(oldcxt);
 	}
 }
 
@@ -755,41 +877,34 @@ gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
 	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
 	int64		ntids = 0;
-	GISTSearchItem fakeItem;
+	IndexScanBatch batch;
 
 	if (!so->qual_ok)
 		return 0;
 
-	pgstat_count_index_scan(scan->indexRelation);
-	if (scan->instrument)
-		scan->instrument->nsearches++;
-
-	/* Begin the scan by processing the root page */
-	so->curPageData = so->nPageData = 0;
-	scan->xs_hitup = NULL;
-	if (so->pageDataCxt)
-		MemoryContextReset(so->pageDataCxt);
-
-	fakeItem.blkno = GIST_ROOT_BLKNO;
-	memset(&fakeItem.data.parentlsn, 0, sizeof(GistNSN));
-	gistScanPage(scan, &fakeItem, NULL, tbm, &ntids);
+	/* Begin the scan by queueing the root page */
+	gistScanStart(scan);
 
 	/*
-	 * While scanning a leaf page, ItemPointers of matching heap tuples will
-	 * be stored directly into tbm, so we don't need to deal with them here.
+	 * Drive the same non-ordered walker as gistgetbatch, one leaf page at a
+	 * time, draining each batch into the bitmap and releasing it before
+	 * fetching the next, so only one batch is ever live (cf. spggetbitmap).
 	 */
-	for (;;)
+	while ((batch = getNextBatch(scan)) != NULL)
 	{
-		GISTSearchItem *item = getNextGISTSearchItem(so);
+		bool	   *recheck = GISTBatchGetRecheck(scan, batch);
 
-		if (!item)
-			break;
+		for (int i = batch->firstItem; i <= batch->lastItem; i++)
+		{
+			tbm_add_tuples(tbm, &batch->items[i].tableTid, 1, recheck[i]);
+			ntids++;
+		}
 
-		CHECK_FOR_INTERRUPTS();
-
-		gistScanPage(scan, item, item->distances, tbm, &ntids);
-
-		pfree(item);
+		/*
+		 * Return the batch to the single-slot bitmap cache, to be reused by
+		 * the next getNextBatch call
+		 */
+		indexam_util_release_batch(scan, batch);
 	}
 
 	return ntids;
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index c65f93abd..3ec405379 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -104,12 +104,34 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
 		scan->xs_orderbyvals = palloc0_array(Datum, scan->numberOfOrderBys);
 		scan->xs_orderbynulls = palloc_array(bool, scan->numberOfOrderBys);
 		memset(scan->xs_orderbynulls, true, sizeof(bool) * scan->numberOfOrderBys);
-	}
 
-	so->killedItems = NULL;		/* until needed */
-	so->numKilled = 0;
-	so->curBlkno = InvalidBlockNumber;
-	so->curPageLSN = InvalidXLogRecPtr;
+		/*
+		 * Ordered scans fill a "virtual" batch by draining the
+		 * distance-ordered queue, so the batch size is a tuning knob with no
+		 * natural value. Testing has shown that a very small size will
+		 * increase per-batch overhead (and likely instruction-cache misses),
+		 * while a large size (such as MaxIndexTuplesPerPage) risks producing
+		 * many tuples that a LIMIT node never consumes.  This maxitemsbatch
+		 * is a compromise.
+		 */
+		scan->maxitemsbatch = MaxIndexTuplesPerPage / 32;
+	}
+	else
+		scan->maxitemsbatch = MaxIndexTuplesPerPage;
+
+	scan->batch_index_opaque_static = MAXALIGN(sizeof(GISTBatchData));
+
+	/*
+	 * Use second opaque area for our per-item data: a GISTBatchItem array
+	 * (with room for each item's ORDER BY distances) for ordered scans, or
+	 * just an array of qual recheck flags for unordered scans
+	 */
+	if (scan->numberOfOrderBys > 0)
+		scan->batch_index_opaque_dyn =
+			SizeOfGISTBatchItem(scan->numberOfOrderBys) * scan->maxitemsbatch;
+	else
+		scan->batch_index_opaque_dyn = sizeof(bool) * scan->maxitemsbatch;
+	scan->batch_tuples_workspace = BLCKSZ;
 
 	scan->opaque = so;
 
@@ -168,8 +190,7 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
 
 	/*
 	 * If we're doing an index-only scan, on the first call, also initialize a
-	 * tuple descriptor to represent the returned index tuples and create a
-	 * memory context to hold them during the scan.
+	 * tuple descriptor to represent the returned index tuples.
 	 */
 	if (scan->xs_want_itup && !scan->xs_hitupdesc)
 	{
@@ -203,11 +224,6 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
 		}
 		TupleDescFinalize(so->giststate->fetchTupdesc);
 		scan->xs_hitupdesc = so->giststate->fetchTupdesc;
-
-		/* Also create a memory context that will hold the returned tuples */
-		so->pageDataCxt = AllocSetContextCreate(so->giststate->scanCxt,
-												"GiST page data context",
-												ALLOCSET_DEFAULT_SIZES);
 	}
 
 	/* create new, empty pairing heap for search queue */
@@ -215,8 +231,6 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
 	so->queue = pairingheap_allocate(pairingheap_GISTSearchItem_cmp, scan);
 	MemoryContextSwitchTo(oldCxt);
 
-	so->firstCall = true;
-
 	/* Update scan key, if a new one is given */
 	if (key && scan->numberOfKeys > 0)
 	{
@@ -340,7 +354,8 @@ gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
 			pfree(fn_extras);
 	}
 
-	/* any previous xs_hitup will have been pfree'd in context resets above */
+	if (scan->xs_hitup)
+		pfree(scan->xs_hitup);
 	scan->xs_hitup = NULL;
 }
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 0f58f6187..a687718e7 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "utils/float.h"
 #include "utils/fmgrprotos.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
@@ -670,6 +671,7 @@ gistFetchTuple(GISTSTATE *giststate, Relation r, IndexTuple tuple)
 	Datum		fetchatt[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 	int			i;
+	HeapTuple	htup;
 
 	for (i = 0; i < IndexRelationGetNumberOfKeyAttributes(r); i++)
 	{
@@ -717,7 +719,12 @@ gistFetchTuple(GISTSTATE *giststate, Relation r, IndexTuple tuple)
 	}
 	MemoryContextSwitchTo(oldcxt);
 
-	return heap_form_tuple(giststate->fetchTupdesc, fetchatt, isnull);
+	htup = heap_form_tuple(giststate->fetchTupdesc, fetchatt, isnull);
+
+	/* cleanup */
+	MemoryContextReset(giststate->tempCxt);
+
+	return htup;
 }
 
 float
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 686a04180..6b8dc2178 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -326,10 +326,17 @@ restart:
 	recurse_to = InvalidBlockNumber;
 
 	/*
-	 * We are not going to stay here for a long time, aggressively grab an
-	 * exclusive lock.
+	 * Get a full cleanup lock on this page.  We must get such a lock on every
+	 * leaf page over the course of the vacuum scan, whether or not it
+	 * actually contains any deletable tuples.
+	 *
+	 * Note: we could avoid this for internal pages, but not for the root
+	 * page.  The root page can start out as a leaf page, but subsequently
+	 * become an internal page, even while a scan holds an interlock pin on
+	 * that page (this isn't possible in nbtree because root splits always
+	 * create a new root page, stored within a separate block number).
 	 */
-	LockBuffer(buffer, GIST_EXCLUSIVE);
+	LockBufferForCleanup(buffer);
 	page = BufferGetPage(buffer);
 
 	if (gistPageRecyclable(page))
@@ -407,9 +414,7 @@ restart:
 			{
 				XLogRecPtr	recptr;
 
-				recptr = gistXLogUpdate(buffer,
-										todelete, ntodelete,
-										NULL, 0, InvalidBuffer);
+				recptr = gistXLogVacuum(buffer, todelete, ntodelete);
 				PageSetLSN(page, recptr);
 			}
 			else
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index ae538dc81..f9f651261 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -67,14 +67,15 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
  * redo any page update (except page split)
  */
 static void
-gistRedoPageUpdateRecord(XLogReaderState *record)
+gistRedoPageUpdateRecord(XLogReaderState *record, bool get_cleanup_lock)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	gistxlogPageUpdate *xldata = (gistxlogPageUpdate *) XLogRecGetData(record);
 	Buffer		buffer;
 	Page		page;
 
-	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, get_cleanup_lock,
+									  &buffer) == BLK_NEEDS_REDO)
 	{
 		char	   *begin;
 		char	   *data;
@@ -407,7 +408,10 @@ gist_redo(XLogReaderState *record)
 	switch (info)
 	{
 		case XLOG_GIST_PAGE_UPDATE:
-			gistRedoPageUpdateRecord(record);
+			gistRedoPageUpdateRecord(record, false);
+			break;
+		case XLOG_GIST_PAGE_VACUUM:
+			gistRedoPageUpdateRecord(record, true);
 			break;
 		case XLOG_GIST_DELETE:
 			gistRedoDeleteRecord(record);
@@ -637,6 +641,33 @@ gistXLogUpdate(Buffer buffer,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a VACUUM deletion of leaf index tuples.
+ *
+ * This uses the same on-page representation as gistXLogUpdate() (the deletion
+ * of a set of items from a single leaf page), but is logged under a distinct
+ * record type so that replay knows to take a cleanup lock on the target page.
+ */
+XLogRecPtr
+gistXLogVacuum(Buffer buffer, OffsetNumber *todelete, int ntodelete)
+{
+	gistxlogPageUpdate xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.ntodelete = ntodelete;
+	xlrec.ntoinsert = 0;
+
+	XLogBeginInsert();
+	XLogRegisterData(&xlrec, sizeof(gistxlogPageUpdate));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBufData(0, todelete, sizeof(OffsetNumber) * ntodelete);
+
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_VACUUM);
+
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a delete of leaf index tuples marked as DEAD
  * during new tuple insertion.  One may think that this case is already covered
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 76e3193d9..103a0833b 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -118,6 +118,7 @@ hashhandler(PG_FUNCTION_ARGS)
 		.amgetbatch = hashgetbatch,
 		.amunguardbatch = hashunguardbatch,
 		.amkillitemsbatch = hashkillitemsbatch,
+		.amgettransform = NULL,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
 		.amposreset = NULL,
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index 323c245cd..5f04041df 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -956,11 +956,23 @@ heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexScanHeapData *hscan,
 								BatchRingItemPos *scanPos,
 								bool *all_visible)
 {
+	amgettransform_function amgettransform =
+		scan->indexRelation->rd_indam->amgettransform;
 	HeapBatchData *hbatch;
 
 	/* Set xs_heaptid, which caller (and core executor) will need */
 	scan->xs_heaptid = scanBatch->items[scanPos->item].tableTid;
 
+	/*
+	 * Let the index AM set this item's per-tuple output.  An AM that provides
+	 * amgettransform uses it to set the item's qual recheck flag
+	 * (scan->xs_recheck), an ordered scan's ORDER BY distances
+	 * (xs_orderbyvals/xs_recheckorderby), and an index-only scan's returnable
+	 * tuple (xs_hitup).
+	 */
+	if (amgettransform != NULL)
+		amgettransform(scan, scanBatch, scanPos->item);
+
 	if (all_visible == NULL)
 	{
 		/*
@@ -973,8 +985,14 @@ heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexScanHeapData *hscan,
 	/* Index-only scan */
 	Assert(scan->xs_want_itup);
 
-	scan->xs_itup = (IndexTuple) (scanBatch->currTuples +
-								  scanBatch->items[scanPos->item].tupleOffset);
+	/*
+	 * Unless the index AM already produced the returnable tuple via
+	 * amgettransform above (in xs_hitup), point xs_itup at the original
+	 * index tuple that amgetbatch stored in currTuples.
+	 */
+	if (amgettransform == NULL)
+		scan->xs_itup = (IndexTuple) (scanBatch->currTuples +
+									  scanBatch->items[scanPos->item].tupleOffset);
 
 	/*
 	 * Set visibility info for the current scanPos item (plus possibly some
diff --git a/src/backend/access/index/amapi.c b/src/backend/access/index/amapi.c
index d4adbbeb2..9886f49ff 100644
--- a/src/backend/access/index/amapi.c
+++ b/src/backend/access/index/amapi.c
@@ -58,6 +58,7 @@ GetIndexAmRoutine(Oid amhandler)
 	/* Assert that AM doesn't have an invalid combination of callbacks */
 	Assert((routine->amgetbatch != NULL) == (routine->amunguardbatch != NULL));
 	Assert(routine->amkillitemsbatch == NULL || routine->amgetbatch != NULL);
+	Assert(routine->amgettransform == NULL || routine->amgetbatch != NULL);
 	Assert(routine->amgetbatch != NULL || routine->amposreset == NULL);
 
 	return routine;
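
The assertions in this hunk amount to a small set of rules about which callbacks may appear together. A sketch of those rules as a standalone predicate follows; AmCallbacksSketch, am_callbacks_valid, and sketch_noop are hypothetical names used only here, and the real checks remain the Assert() calls in GetIndexAmRoutine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant callback slots of IndexAmRoutine */
typedef void (*am_callback) (void);

typedef struct AmCallbacksSketch
{
	am_callback amgetbatch;
	am_callback amunguardbatch;
	am_callback amkillitemsbatch;
	am_callback amgettransform;
	am_callback amposreset;
} AmCallbacksSketch;

/* Placeholder used to mark a callback slot as "provided" */
void
sketch_noop(void)
{
}

/* Returns true when the combination satisfies all four assertions */
bool
am_callbacks_valid(const AmCallbacksSketch *r)
{
	/* amgetbatch and amunguardbatch must be provided together */
	if ((r->amgetbatch != NULL) != (r->amunguardbatch != NULL))
		return false;

	/* The remaining callbacks are only meaningful alongside amgetbatch */
	if (r->amkillitemsbatch != NULL && r->amgetbatch == NULL)
		return false;
	if (r->amgettransform != NULL && r->amgetbatch == NULL)
		return false;
	if (r->amposreset != NULL && r->amgetbatch == NULL)
		return false;

	return true;
}
```

Under these rules an AM that leaves amgetbatch NULL (as bloom and the dummy test AM do in this patch) must also leave amgettransform NULL.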
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index ca9bae803..1927faeab 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -133,6 +133,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->batch_index_opaque_static = 0;
 	scan->batch_tuples_workspace = 0;
 	scan->batch_table_opaque_size = 0;
+	scan->batch_index_opaque_dyn = 0;
 	scan->batch_base_offset = 0;
 
 	scan->xs_name_cstring_attnums = NULL;
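
The new size field defaults to 0 here and is consumed when indexam_util_alloc_batch lazily computes batch_base_offset (see the indexbatch.c change in this patch). A rough sketch of that arithmetic follows, assuming an 8-byte maximum alignment; SKETCH_MAXALIGN and sketch_batch_base_offset are hypothetical names, and the real code uses PostgreSQL's MAXALIGN with the fields of IndexScanDesc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 8-byte maximum alignment, standing in for MAXALIGN */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/*
 * Sum the areas that precede the batch proper: an optional table AM area,
 * the optional dynamically sized index AM area, and the static index AM
 * area (added as-is, matching the hunk in indexbatch.c).
 */
size_t
sketch_batch_base_offset(bool usebatchring,
						 size_t table_opaque_size,
						 size_t index_opaque_dyn,
						 size_t index_opaque_static)
{
	size_t		table_area = 0;
	size_t		index_dyn_area = SKETCH_MAXALIGN(index_opaque_dyn);

	/* Only scans that use the batch ring need a table AM area */
	if (usebatchring)
		table_area = SKETCH_MAXALIGN(table_opaque_size);

	return table_area + index_dyn_area + index_opaque_static;
}
```

An AM that leaves the dynamic size at 0 contributes nothing extra, so existing B-tree and hash offsets are unchanged by the new term.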
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
index e58e09897..2e2ccf6a9 100644
--- a/src/backend/access/index/indexbatch.c
+++ b/src/backend/access/index/indexbatch.c
@@ -632,6 +632,7 @@ indexam_util_alloc_batch(IndexScanDesc scan)
 		{
 			/* We lazily compute batch_base_offset on scan's first call */
 			size_t		table_area = 0;
+			size_t		index_dyn_area = MAXALIGN(scan->batch_index_opaque_dyn);
 
 			if (scan->usebatchring)
 			{
@@ -642,8 +643,8 @@ indexam_util_alloc_batch(IndexScanDesc scan)
 				table_area = MAXALIGN(scan->batch_table_opaque_size);
 			}
 
-			/* ...though we always need an index AM area */
-			scan->batch_base_offset = table_area +
+			/* ...though we always need index AM areas */
+			scan->batch_base_offset = table_area + index_dyn_area +
 				scan->batch_index_opaque_static;
 		}
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b83926f9f..6ace65508 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -166,6 +166,7 @@ bthandler(PG_FUNCTION_ARGS)
 		.amgetbatch = btgetbatch,
 		.amunguardbatch = btunguardbatch,
 		.amkillitemsbatch = btkillitemsbatch,
+		.amgettransform = NULL,
 		.amgetbitmap = btgetbitmap,
 		.amendscan = btendscan,
 		.amposreset = btposreset,
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 67789e025..021f72fa0 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -66,6 +66,7 @@ gist_desc(StringInfo buf, XLogReaderState *record)
 	switch (info)
 	{
 		case XLOG_GIST_PAGE_UPDATE:
+		case XLOG_GIST_PAGE_VACUUM:
 			out_gistxlogPageUpdate(buf, (gistxlogPageUpdate *) rec);
 			break;
 		case XLOG_GIST_PAGE_REUSE:
@@ -93,6 +94,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_PAGE_UPDATE:
 			id = "PAGE_UPDATE";
 			break;
+		case XLOG_GIST_PAGE_VACUUM:
+			id = "PAGE_VACUUM";
+			break;
 		case XLOG_GIST_DELETE:
 			id = "DELETE";
 			break;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 745435da3..47153b4b0 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -92,6 +92,7 @@ spghandler(PG_FUNCTION_ARGS)
 		.amgetbatch = NULL,
 		.amunguardbatch = NULL,
 		.amkillitemsbatch = NULL,
+		.amgettransform = NULL,
 		.amgetbitmap = spggetbitmap,
 		.amendscan = spgendscan,
 		.amposreset = NULL,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 84a97b71d..a6a8a96e7 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -132,18 +132,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 			}
 		}
 
-		/*
-		 * We don't currently support rechecking ORDER BY distances.  (In
-		 * principle, if the index can support retrieval of the originally
-		 * indexed value, it should be able to produce an exact distance
-		 * calculation too.  So it's not clear that adding code here for
-		 * recheck/re-sort would be worth the trouble.  But we should at least
-		 * throw an error if someone tries it.)
-		 */
-		if (scandesc->numberOfOrderBys > 0 && scandesc->xs_recheckorderby)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("lossy distance functions are not supported in index-only scans")));
 		return slot;
 	}
 
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 94fedf32c..624b6d0f8 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -951,9 +951,12 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	/*
 	 * 3. Check if an index-only scan is possible.  If we're not building
 	 * plain indexscans, this isn't relevant since bitmap scans don't support
-	 * index data retrieval anyway.
+	 * index data retrieval anyway.  If there are ordering operators, we
+	 * assume that an index-only scan is unsafe, because it is difficult to
+	 * hold the index page pins needed to prevent concurrent TID recycling.
 	 */
 	index_only_scan = (scantype != ST_BITMAPSCAN &&
+					   orderbyclauses == NIL &&
 					   check_index_only(rel, index));
 
 	/*
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 249af48e6..168842bc7 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -150,6 +150,7 @@ blhandler(PG_FUNCTION_ARGS)
 		.amgetbatch = NULL,
 		.amunguardbatch = NULL,
 		.amkillitemsbatch = NULL,
+		.amgettransform = NULL,
 		.amgetbitmap = blgetbitmap,
 		.amendscan = blendscan,
 		.amposreset = NULL,
diff --git a/contrib/btree_gist/expected/cash.out b/contrib/btree_gist/expected/cash.out
index 7fbc73559..56fd1eb49 100644
--- a/contrib/btree_gist/expected/cash.out
+++ b/contrib/btree_gist/expected/cash.out
@@ -74,10 +74,10 @@ SELECT count(*) FROM moneytmp WHERE a >  '22649.64'::money;
 
 EXPLAIN (COSTS OFF)
 SELECT a, a <-> '21472.79' FROM moneytmp ORDER BY a <-> '21472.79' LIMIT 3;
-                    QUERY PLAN                    
---------------------------------------------------
+                  QUERY PLAN                   
+-----------------------------------------------
  Limit
-   ->  Index Only Scan using moneyidx on moneytmp
+   ->  Index Scan using moneyidx on moneytmp
          Order By: (a <-> '$21,472.79'::money)
 (3 rows)
 
diff --git a/contrib/btree_gist/expected/date.out b/contrib/btree_gist/expected/date.out
index 5db864bb8..4a360bea6 100644
--- a/contrib/btree_gist/expected/date.out
+++ b/contrib/btree_gist/expected/date.out
@@ -74,10 +74,10 @@ SELECT count(*) FROM datetmp WHERE a >  '2001-02-13'::date;
 
 EXPLAIN (COSTS OFF)
 SELECT a, a <-> '2001-02-13' FROM datetmp ORDER BY a <-> '2001-02-13' LIMIT 3;
-                   QUERY PLAN                   
-------------------------------------------------
+                  QUERY PLAN                  
+----------------------------------------------
  Limit
-   ->  Index Only Scan using dateidx on datetmp
+   ->  Index Scan using dateidx on datetmp
          Order By: (a <-> '02-13-2001'::date)
 (3 rows)
 
diff --git a/contrib/btree_gist/expected/float4.out b/contrib/btree_gist/expected/float4.out
index dfe732049..8878a317c 100644
--- a/contrib/btree_gist/expected/float4.out
+++ b/contrib/btree_gist/expected/float4.out
@@ -74,10 +74,10 @@ SELECT count(*) FROM float4tmp WHERE a >  -179.0::float4;
 
 EXPLAIN (COSTS OFF)
 SELECT a, a <-> '-179.0' FROM float4tmp ORDER BY a <-> '-179.0' LIMIT 3;
-                     QUERY PLAN                     
-----------------------------------------------------
+                  QUERY PLAN                   
+-----------------------------------------------
  Limit
-   ->  Index Only Scan using float4idx on float4tmp
+   ->  Index Scan using float4idx on float4tmp
          Order By: (a <-> '-179'::real)
 (3 rows)
 
diff --git a/contrib/btree_gist/expected/float8.out b/contrib/btree_gist/expected/float8.out
index ebd0ef3d6..763091b5c 100644
--- a/contrib/btree_gist/expected/float8.out
+++ b/contrib/btree_gist/expected/float8.out
@@ -77,7 +77,7 @@ SELECT a, a <-> '-1890.0' FROM float8tmp ORDER BY a <-> '-1890.0' LIMIT 3;
                      QUERY PLAN                      
 -----------------------------------------------------
  Limit
-   ->  Index Only Scan using float8idx on float8tmp
+   ->  Index Scan using float8idx on float8tmp
          Order By: (a <-> '-1890'::double precision)
 (3 rows)
 
diff --git a/contrib/btree_gist/expected/int2.out b/contrib/btree_gist/expected/int2.out
index 50a332939..245fa4be6 100644
--- a/contrib/btree_gist/expected/int2.out
+++ b/contrib/btree_gist/expected/int2.out
@@ -74,10 +74,10 @@ SELECT count(*) FROM int2tmp WHERE a >  237::int2;
 
 EXPLAIN (COSTS OFF)
 SELECT a, a <-> '237' FROM int2tmp ORDER BY a <-> '237' LIMIT 3;
-                   QUERY PLAN                   
-------------------------------------------------
+                QUERY PLAN                 
+-------------------------------------------
  Limit
-   ->  Index Only Scan using int2idx on int2tmp
+   ->  Index Scan using int2idx on int2tmp
          Order By: (a <-> '237'::smallint)
 (3 rows)
 
diff --git a/contrib/btree_gist/expected/int4.out b/contrib/btree_gist/expected/int4.out
index 6bbdc7c3f..41bed1f6e 100644
--- a/contrib/btree_gist/expected/int4.out
+++ b/contrib/btree_gist/expected/int4.out
@@ -74,10 +74,10 @@ SELECT count(*) FROM int4tmp WHERE a >  237::int4;
 
 EXPLAIN (COSTS OFF)
 SELECT a, a <-> '237' FROM int4tmp ORDER BY a <-> '237' LIMIT 3;
-                   QUERY PLAN                   
-------------------------------------------------
+                QUERY PLAN                 
+-------------------------------------------
  Limit
-   ->  Index Only Scan using int4idx on int4tmp
+   ->  Index Scan using int4idx on int4tmp
          Order By: (a <-> 237)
 (3 rows)
 
diff --git a/contrib/btree_gist/expected/int8.out b/contrib/btree_gist/expected/int8.out
index eff77c26b..2bbdd7657 100644
--- a/contrib/btree_gist/expected/int8.out
+++ b/contrib/btree_gist/expected/int8.out
@@ -77,7 +77,7 @@ SELECT a, a <-> '464571291354841' FROM int8tmp ORDER BY a <-> '464571291354841'
                      QUERY PLAN                      
 -----------------------------------------------------
  Limit
-   ->  Index Only Scan using int8idx on int8tmp
+   ->  Index Scan using int8idx on int8tmp
          Order By: (a <-> '464571291354841'::bigint)
 (3 rows)
 
diff --git a/contrib/btree_gist/expected/interval.out b/contrib/btree_gist/expected/interval.out
index 4c3d494e4..4ed196198 100644
--- a/contrib/btree_gist/expected/interval.out
+++ b/contrib/btree_gist/expected/interval.out
@@ -77,7 +77,7 @@ SELECT a, a <-> '199 days 21:21:23' FROM intervaltmp ORDER BY a <-> '199 days 21
                                 QUERY PLAN                                 
 ---------------------------------------------------------------------------
  Limit
-   ->  Index Only Scan using intervalidx on intervaltmp
+   ->  Index Scan using intervalidx on intervaltmp
          Order By: (a <-> '@ 199 days 21 hours 21 mins 23 secs'::interval)
 (3 rows)
 
diff --git a/contrib/btree_gist/expected/time.out b/contrib/btree_gist/expected/time.out
index ec95ef77c..1b9da4e19 100644
--- a/contrib/btree_gist/expected/time.out
+++ b/contrib/btree_gist/expected/time.out
@@ -77,7 +77,7 @@ SELECT a, a <-> '10:57:11' FROM timetmp ORDER BY a <-> '10:57:11' LIMIT 3;
                           QUERY PLAN                          
 --------------------------------------------------------------
  Limit
-   ->  Index Only Scan using timeidx on timetmp
+   ->  Index Scan using timeidx on timetmp
          Order By: (a <-> '10:57:11'::time without time zone)
 (3 rows)
 
diff --git a/contrib/btree_gist/expected/timestamp.out b/contrib/btree_gist/expected/timestamp.out
index 0d94f2f24..cc3624f08 100644
--- a/contrib/btree_gist/expected/timestamp.out
+++ b/contrib/btree_gist/expected/timestamp.out
@@ -77,7 +77,7 @@ SELECT a, a <-> '2004-10-26 08:55:08' FROM timestamptmp ORDER BY a <-> '2004-10-
                                     QUERY PLAN                                     
 -----------------------------------------------------------------------------------
  Limit
-   ->  Index Only Scan using timestampidx on timestamptmp
+   ->  Index Scan using timestampidx on timestamptmp
          Order By: (a <-> 'Tue Oct 26 08:55:08 2004'::timestamp without time zone)
 (3 rows)
 
diff --git a/contrib/btree_gist/expected/timestamptz.out b/contrib/btree_gist/expected/timestamptz.out
index 75a15a425..88d2404c4 100644
--- a/contrib/btree_gist/expected/timestamptz.out
+++ b/contrib/btree_gist/expected/timestamptz.out
@@ -197,7 +197,7 @@ SELECT a, a <-> '2018-12-18 10:59:54 GMT+2' FROM timestamptztmp ORDER BY a <-> '
                                      QUERY PLAN                                     
 ------------------------------------------------------------------------------------
  Limit
-   ->  Index Only Scan using timestamptzidx on timestamptztmp
+   ->  Index Scan using timestamptzidx on timestamptztmp
          Order By: (a <-> 'Tue Dec 18 04:59:54 2018 PST'::timestamp with time zone)
 (3 rows)
 
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 6e1e51169..75c0704cc 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -172,6 +172,7 @@ typedef struct IndexAmRoutine
     amgetbatch_function amgetbatch; /* can be NULL */
     amunguardbatch_function amunguardbatch; /* can be NULL */
     amkillitemsbatch_function amkillitemsbatch;	/* can be NULL */
+    amgettransform_function amgettransform; /* can be NULL */
     amgetbitmap_function amgetbitmap;   /* can be NULL */
     amendscan_function amendscan;
     amposreset_function amposreset; /* can be NULL */
@@ -716,27 +717,58 @@ ambeginscan (Relation indexRelation,
       and sibling page links).
      </para>
     </listitem>
+    <listitem>
+     <para>
+      <literal>scan-&gt;batch_index_opaque_dyn</literal>: the size of an
+      optional second per-batch opaque area, or 0 if the index AM does not need
+      one.  Unlike the area above, its size need not be known at compile time;
+      the index AM may choose it at the start of each scan.  It sits immediately
+      before the static area, and core code treats it as a single opaque
+      allocation that the index AM lays out however it likes (for example, to
+      carry per-item match metadata, such as a recheck flag or order-by
+      distances, that must travel with the batch).
+     </para>
+    </listitem>
     <listitem>
      <para>
       <literal>scan-&gt;batch_tuples_workspace</literal>: the size in bytes
       of the per-batch tuple storage workspace used for index-only scans
       (typically <literal>BLCKSZ</literal>), or 0 if the index AM does not
-      support index-only scans.  The workspace is accessible via
-      <structfield>batch-&gt;currTuples</structfield>.
+      support index-only scans.  The workspace is accessible via the batch's
+      <structfield>currTuples</structfield> field.  The index AM stores each
+      matching tuple here in its on-disk format (an
+      <structname>IndexTuple</structname>, or another on-disk tuple form used by
+      the AM); it is either exposed directly as
+      <literal>scan-&gt;xs_itup</literal>, or converted to the returnable tuple
+      later, by <function>amgettransform</function> (see below).
      </para>
     </listitem>
    </itemizedlist>
   </para>
 
+  <para>
+   These batch fields are usually set in <function>ambeginscan</function>, but an
+   index access method may instead set any of them in
+   <function>amrescan</function> when their value cannot be determined until then.
+   For example, the size of the dynamic opaque area might depend on whether this
+   is an <link linkend="indexes-index-only-scans">index-only scan</link>
+   (<literal>scan-&gt;xs_want_itup</literal>), which core code only sets after
+   <function>ambeginscan</function> has returned; such an access method sizes
+   <literal>scan-&gt;batch_index_opaque_dyn</literal> in
+   <function>amrescan</function> instead.  This is safe because no batch is ever
+   allocated before the first <function>amrescan</function> call.
+  </para>
+
   <para>
    An <function>amgetbatch</function> access method whose recheck requirement is
    a fixed property of the whole scan (rather than something that varies from
    one matching item to the next) should also set
    <literal>scan-&gt;xs_recheck</literal> here, in
-   <function>ambeginscan</function>, since the value applies to every item the
-   scan returns.  The value set here persists across any subsequent
-   <function>amrescan</function> calls.  B-tree (always false) and hash (always
-   true) work this way.
+   <function>ambeginscan</function>: the value then applies to every item the
+   scan returns, and persists across any subsequent
+   <function>amrescan</function> calls.  See <function>amgetbatch</function>
+   below, which describes both this whole-scan case and the per-item case in
+   detail.
   </para>
 
   <para>
@@ -758,6 +790,13 @@ amrescan (IndexScanDesc scan,
    remains the same.
   </para>
 
+  <para>
+   <function>amrescan</function> is also where an
+   <function>amgetbatch</function> access method sets any of the batch fields
+   described under <function>ambeginscan</function> above whose value could not
+   be determined until now.
+  </para>
+
   <para>
 <programlisting>
 bool
@@ -894,23 +933,75 @@ amgetbatch (IndexScanDesc scan,
   </para>
 
   <para>
-   Index access methods using <function>amgetbatch</function> must set
-   <literal>scan-&gt;xs_recheck</literal> to indicate whether rechecking of
-   scan keys is required, in the same way as <function>amgettuple</function>
-   does. However, <literal>scan-&gt;xs_recheck</literal> must be set consistently
-   for an entire scan rather than varying on a per-tuple basis. This is a key
-   difference from <function>amgettuple</function>, which can set
-   <literal>scan-&gt;xs_recheck</literal> independently for each tuple it returns.
-   Index access methods that require granular control over
-   <literal>scan-&gt;xs_recheck</literal> must use the <function>amgettuple</function>
-   interface instead of <function>amgetbatch</function>.
+   Index access methods using <function>amgetbatch</function> must convey
+   whether the scan keys need to be rechecked, via
+   <literal>scan-&gt;xs_recheck</literal>, just as
+   <function>amgettuple</function> access methods do.  Unlike
+   <function>amgettuple</function>, however, an
+   <function>amgetbatch</function> access method cannot set
+   <literal>scan-&gt;xs_recheck</literal> at the point an individual item is
+   returned, because the interface decouples the order of
+   <function>amgetbatch</function> calls from the order in which items are
+   later returned to the scan.  When the recheck requirement is a fixed
+   property of the whole scan, the index access method instead sets
+   <literal>scan-&gt;xs_recheck</literal> once, at scan start (in its
+   <function>ambeginscan</function> routine): B-tree always sets it false, and
+   hash always sets it true.  When the requirement instead varies from one
+   matching item to the next, the index access method records the per-item
+   value in the batch and provides an <function>amgettransform</function>
+   callback (see below), which the table AM invokes for each returned item to
+   set <literal>scan-&gt;xs_recheck</literal> from that recorded state; GiST
+   works this way.
   </para>
 
   <para>
-   Similarly, the <function>amgetbatch</function> interface does not currently
-   support index-only scans that return data in the form of a
-   <structname>HeapTuple</structname> pointer stored in
-   <literal>scan-&gt;xs_hitup</literal>.
+   An <function>amgetbatch</function> access method that supports index-only
+   scans must supply the scan's returnable tuple for each matching item, and
+   must do so in one of exactly two ways.  Which one applies is a fixed
+   property of the access method, determined by whether it provides an
+   <function>amgettransform</function> callback:
+   <itemizedlist>
+    <listitem>
+     <para>
+      If the index access method does <emphasis>not</emphasis> provide
+      <function>amgettransform</function>, it must store each matching tuple
+      in the batch's <structfield>currTuples</structfield> workspace as an
+      on-disk <structname>IndexTuple</structname> whose layout is exactly what
+      <literal>scan-&gt;xs_itupdesc</literal> describes.  The table AM exposes
+      those stored bytes directly as <literal>scan-&gt;xs_itup</literal> and
+      deforms them against <literal>xs_itupdesc</literal> (just as for an
+      <function>amgettuple</function> index-only scan); the index access
+      method must not set <literal>scan-&gt;xs_itup</literal> itself.  Among
+      the core access methods only B-tree uses this path, because it stores
+      the original indexed values unchanged, so the stored tuple already
+      matches <literal>xs_itupdesc</literal>.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Otherwise the index access method must provide an
+      <function>amgettransform</function> callback that produces the
+      returnable tuple in <literal>scan-&gt;xs_hitup</literal> (a
+      <structname>HeapTuple</structname> matching
+      <literal>scan-&gt;xs_hitupdesc</literal>).  This gives the access method
+      complete freedom to form that tuple from whatever it stored in
+      <structfield>currTuples</structfield>, in whatever on-disk format suits
+      it.  GiST uses this path, because the representation it
+      stores differs from the indexed value and so could not satisfy the
+      <literal>xs_itupdesc</literal> layout directly.
+     </para>
+    </listitem>
+   </itemizedlist>
+   The first path is generic, but useful only to an access method that &mdash;
+   like B-tree &mdash; already stores tuples in exactly the indexed-attribute
+   format; an access method that stores some other representation must take
+   the second path.  The two paths are mutually exclusive: an
+   <function>amgetbatch</function> access method takes one or the other, never
+   both.  (For historical reasons an <function>amgettuple</function> access
+   method is allowed to set both <literal>scan-&gt;xs_itup</literal> and
+   <literal>scan-&gt;xs_hitup</literal> for the same scan &mdash; the
+   heap-tuple form is then used &mdash; but that latitude is a legacy quirk
+   that <function>amgetbatch</function> deliberately does not repeat.)
   </para>
 
   <para>
@@ -940,8 +1031,10 @@ amunguardbatch (IndexScanDesc scan,
    is not even required to use the standard helper
    <function>indexam_util_unlock_batch</function> to manage it.  In practice,
    though, most or all index AMs will use that helper and hold the simplest
-   possible interlock: each guarded B-tree or hash batch keeps a single
-   buffer pin on the one index page the batch came from.  See <xref
+   possible interlock: each guarded B-tree, hash, or GiST batch keeps a
+   single buffer pin on the one index page the batch came from.  (The
+   <quote>virtual</quote> nearest-neighbor batches that GiST uses for ordered
+   scans are not guarded, and hold no such pin.)  See <xref
     linkend="index-locking"/> for details on buffer pin management during
    index scans.  This function will be called at most once for each guarded
    batch; it is not called when the index AM has already unguarded the batch
@@ -985,8 +1078,8 @@ amkillitemsbatch (IndexScanDesc scan,
    <function>amgetbatch</function> index AMs (those that don't can leave
    the field set to <literal>NULL</literal>), but doing so is recommended for
    performance, as it allows future scans to skip known-dead index entries.
-   Both core index access methods that currently support
-   <function>amgetbatch</function> (B-tree and hash) implement
+   All three core index access methods that currently support
+   <function>amgetbatch</function> (B-tree, hash, and GiST) implement
    <literal>LP_DEAD</literal> marking, though third-party index access methods
    are free to choose whether to implement this feature.  The table AM may
    call <function>tableam_util_scanpos_killitem</function> to mark dead items as
@@ -1028,7 +1121,7 @@ amkillitemsbatch (IndexScanDesc scan,
    <command>VACUUM</command> recycling table TIDs &mdash; so it would be
    unsafe to assume that index entries still point to the same heap/table
    tuples.  Since <literal>LP_DEAD</literal> marking is only an optimization
-   hint, it is always safe to skip it.  Both B-tree and hash use this
+   hint, it is always safe to skip it.  B-tree, hash, and GiST use this
    approach.
   </para>
 
@@ -1067,6 +1160,41 @@ amkillitemsbatch (IndexScanDesc scan,
 
   <para>
 <programlisting>
+void
+amgettransform (IndexScanDesc scan,
+                IndexScanBatch batch,
+                int item);
+</programlisting>
+   Called by the table AM as it returns each matching item
+   (<replaceable>item</replaceable> is an index into the batch's
+   <structfield>items</structfield> array) of an <function>amgetbatch</function>
+   scan, to set up the scan's per-tuple output from per-item state that the
+   access method recorded in the batch.  This is needed when that output cannot
+   be a fixed property of the whole scan.  An access method may use it to set
+   <literal>scan-&gt;xs_recheck</literal> (when the need to recheck the scan
+   conditions varies from one matching item to the next), to set
+   <structfield>xs_orderbyvals</structfield> and
+   <structfield>xs_recheckorderby</structfield> for an ordered
+   (nearest-neighbor) scan, and to set <literal>scan-&gt;xs_hitup</literal> for
+   an index-only scan whose returnable tuple must be reconstructed rather than
+   returned directly as a stored index tuple.
+  </para>
+
+  <para>
+   Implementing <function>amgettransform</function> is optional, and is only
+   meaningful together with <function>amgetbatch</function>.  An access method
+   need only provide it when some part of its per-tuple output varies from one
+   matching item to the next.  When every such output is instead a fixed
+   property of the whole scan &mdash; or, for index-only scans, is the on-disk
+   index tuple returned directly via <literal>scan-&gt;xs_itup</literal> &mdash;
+   the field can be left <literal>NULL</literal>, as B-tree and hash do.  GiST
+   provides one because parts of its per-tuple output (the recheck flag, the
+   <literal>ORDER BY</literal> distances, and the reconstructed index-only
+   tuples) vary per matching item, as described above.
+  </para>
+
+  <para>
+<programlisting>
 int64
 amgetbitmap (IndexScanDesc scan,
              TIDBitmap *tbm);
@@ -1364,8 +1492,26 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
   </para>
 
   <para>
-   Note that <function>amgetbatch</function> scans do not currently support
-   ordering operators.
+   An <function>amgetbatch</function> access method can support ordering
+   operators by providing an <function>amgettransform</function> callback: it
+   records each matching item's ordering values in the batch, and the table AM
+   calls <function>amgettransform</function> as it returns each item to set
+   <structfield>xs_orderbyvals</structfield> and
+   <structfield>xs_recheckorderby</structfield> from that recorded state.  GiST
+   uses this for nearest-neighbor scans.  As with
+   <literal>scan-&gt;xs_recheck</literal>, these values cannot be set directly as
+   items are returned.
+  </para>
+
+  <para>
+   Scans that use ordering operators are never planned as index-only scans.
+   Because an ordered scan can collect matching items from many index leaf
+   pages without retaining a buffer pin on any of them (GiST's
+   <quote>virtual</quote> nearest-neighbor batches work this way), it has no
+   pin to serve as the interlock against concurrent TID recycling that an
+   index-only scan depends on (see <xref linkend="index-locking"/>).  The
+   planner therefore plans such scans as plain index scans, which always
+   fetch the heap tuple and check its visibility.
   </para>
 
   <para>
diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c
index 3f5be6082..c6990cab5 100644
--- a/src/test/modules/dummy_index_am/dummy_index_am.c
+++ b/src/test/modules/dummy_index_am/dummy_index_am.c
@@ -338,6 +338,7 @@ dihandler(PG_FUNCTION_ARGS)
 		.amgetbatch = NULL,
 		.amunguardbatch = NULL,
 		.amkillitemsbatch = NULL,
+		.amgettransform = NULL,
 		.amgetbitmap = NULL,
 		.amendscan = diendscan,
 		.amposreset = NULL,
diff --git a/src/test/modules/index/expected/killtuples.out b/src/test/modules/index/expected/killtuples.out
index a3db2c409..110c3d445 100644
--- a/src/test/modules/index/expected/killtuples.out
+++ b/src/test/modules/index/expected/killtuples.out
@@ -152,6 +152,83 @@ f
 step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
 step drop_ext_btree_gist: DROP EXTENSION btree_gist;
 
+starting permutation: create_table fill_500 create_ext_btree_gist create_gist flush disable_seq disable_bitmap measure access_ordered flush result measure access_ordered flush result delete flush measure access_ordered flush result measure access_ordered flush result drop_table drop_ext_btree_gist
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_ext_btree_gist: CREATE EXTENSION btree_gist;
+step create_gist: CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple USING gist (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access_ordered: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF, BUFFERS OFF) SELECT * FROM kill_prior_tuple ORDER BY key <-> 1;
+QUERY PLAN                                                                             
+---------------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=500.00 loops=1)
+  Order By: (key <-> 1)                                                                
+  Index Searches: 1                                                                    
+(3 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT ((heap_blks_read + heap_blks_hit - counter.heap_accesses) > 0) AS has_new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+has_new_heap_accesses
+---------------------
+t                    
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access_ordered: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF, BUFFERS OFF) SELECT * FROM kill_prior_tuple ORDER BY key <-> 1;
+QUERY PLAN                                                                             
+---------------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=500.00 loops=1)
+  Order By: (key <-> 1)                                                                
+  Index Searches: 1                                                                    
+(3 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT ((heap_blks_read + heap_blks_hit - counter.heap_accesses) > 0) AS has_new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+has_new_heap_accesses
+---------------------
+t                    
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access_ordered: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF, BUFFERS OFF) SELECT * FROM kill_prior_tuple ORDER BY key <-> 1;
+QUERY PLAN                                                                           
+-------------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0.00 loops=1)
+  Order By: (key <-> 1)                                                              
+  Index Searches: 1                                                                  
+(3 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT ((heap_blks_read + heap_blks_hit - counter.heap_accesses) > 0) AS has_new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+has_new_heap_accesses
+---------------------
+t                    
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access_ordered: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF, BUFFERS OFF) SELECT * FROM kill_prior_tuple ORDER BY key <-> 1;
+QUERY PLAN                                                                           
+-------------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0.00 loops=1)
+  Order By: (key <-> 1)                                                              
+  Index Searches: 1                                                                  
+(3 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT ((heap_blks_read + heap_blks_hit - counter.heap_accesses) > 0) AS has_new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+has_new_heap_accesses
+---------------------
+t                    
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+step drop_ext_btree_gist: DROP EXTENSION btree_gist;
+
 starting permutation: create_table fill_10 create_ext_btree_gist create_gist flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table drop_ext_btree_gist
 step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
 step fill_10: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 10) g(i);
@@ -223,7 +300,7 @@ step flush: SELECT FROM pg_stat_force_next_flush();
 step result: SELECT ((heap_blks_read + heap_blks_hit - counter.heap_accesses) > 0) AS has_new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
 has_new_heap_accesses
 ---------------------
-t                    
+f                    
 (1 row)
 
 step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
diff --git a/src/test/modules/index/specs/killtuples.spec b/src/test/modules/index/specs/killtuples.spec
index 3b98ff9f7..f5d2fd773 100644
--- a/src/test/modules/index/specs/killtuples.spec
+++ b/src/test/modules/index/specs/killtuples.spec
@@ -47,6 +47,9 @@ step result { SELECT ((heap_blks_read + heap_blks_hit - counter.heap_accesses) >
 
 step access { EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF, BUFFERS OFF) SELECT * FROM kill_prior_tuple WHERE key = 1; }
 
+# nearest-neighbor (order-by operator) scan (cannot set LP_DEAD bits)
+step access_ordered { EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF, BUFFERS OFF) SELECT * FROM kill_prior_tuple ORDER BY key <-> 1; }
+
 step delete { DELETE FROM kill_prior_tuple; }
 
 step drop_table { DROP TABLE IF EXISTS kill_prior_tuple; }
@@ -96,7 +99,20 @@ permutation
   measure access flush result
   drop_table drop_ext_btree_gist
 
-# Test gist, but with fewer rows - shows that killitems doesn't work anymore!
+# GiST doesn't set LP_DEAD bits for ordered scans, so every access re-visits
+# the heap
+permutation
+  create_table fill_500 create_ext_btree_gist create_gist flush
+  disable_seq disable_bitmap
+  measure access_ordered flush result
+  measure access_ordered flush result
+  delete flush
+  measure access_ordered flush result
+  measure access_ordered flush result
+  drop_table drop_ext_btree_gist
+
+# Test gist with fewer rows, exercising the case where all the dead tuples are
+# on a single page
 permutation
   create_table fill_10 create_ext_btree_gist create_gist flush
   disable_seq disable_bitmap
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 55538c4c4..970b857c6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -475,9 +475,9 @@ SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
-                  QUERY PLAN                  
-----------------------------------------------
- Index Only Scan using gpointind on point_tbl
+               QUERY PLAN                
+-----------------------------------------
+ Index Scan using gpointind on point_tbl
    Order By: (f1 <-> '(0,1)'::point)
 (2 rows)
 
@@ -513,9 +513,9 @@ SELECT * FROM point_tbl WHERE f1 IS NULL;
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
-                  QUERY PLAN                  
-----------------------------------------------
- Index Only Scan using gpointind on point_tbl
+               QUERY PLAN                
+-----------------------------------------
+ Index Scan using gpointind on point_tbl
    Index Cond: (f1 IS NOT NULL)
    Order By: (f1 <-> '(0,1)'::point)
 (3 rows)
@@ -539,7 +539,7 @@ EXPLAIN (COSTS OFF)
 SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
                    QUERY PLAN                   
 ------------------------------------------------
- Index Only Scan using gpointind on point_tbl
+ Index Scan using gpointind on point_tbl
    Index Cond: (f1 <@ '(10,10),(-10,-10)'::box)
    Order By: (f1 <-> '(0,1)'::point)
 (3 rows)
diff --git a/src/test/regress/expected/create_index_spgist.out b/src/test/regress/expected/create_index_spgist.out
index c6beb0efa..ddffca2e7 100644
--- a/src/test/regress/expected/create_index_spgist.out
+++ b/src/test/regress/expected/create_index_spgist.out
@@ -333,7 +333,7 @@ FROM quad_point_tbl;
 ----------------------------------------------------------------------------
  WindowAgg
    Window: w1 AS (ORDER BY (p <-> '(0,0)'::point) ROWS UNBOUNDED PRECEDING)
-   ->  Index Only Scan using sp_quad_ind on quad_point_tbl
+   ->  Index Scan using sp_quad_ind on quad_point_tbl
          Order By: (p <-> '(0,0)'::point)
 (4 rows)
 
@@ -354,7 +354,7 @@ FROM quad_point_tbl WHERE p <@ box '(200,200,1000,1000)';
 ----------------------------------------------------------------------------
  WindowAgg
    Window: w1 AS (ORDER BY (p <-> '(0,0)'::point) ROWS UNBOUNDED PRECEDING)
-   ->  Index Only Scan using sp_quad_ind on quad_point_tbl
+   ->  Index Scan using sp_quad_ind on quad_point_tbl
          Index Cond: (p <@ '(1000,1000),(200,200)'::box)
          Order By: (p <-> '(0,0)'::point)
 (5 rows)
@@ -376,7 +376,7 @@ FROM quad_point_tbl WHERE p IS NOT NULL;
 --------------------------------------------------------------------------------
  WindowAgg
    Window: w1 AS (ORDER BY (p <-> '(333,400)'::point) ROWS UNBOUNDED PRECEDING)
-   ->  Index Only Scan using sp_quad_ind on quad_point_tbl
+   ->  Index Scan using sp_quad_ind on quad_point_tbl
          Index Cond: (p IS NOT NULL)
          Order By: (p <-> '(333,400)'::point)
 (5 rows)
@@ -503,7 +503,7 @@ FROM kd_point_tbl;
 ----------------------------------------------------------------------------
  WindowAgg
    Window: w1 AS (ORDER BY (p <-> '(0,0)'::point) ROWS UNBOUNDED PRECEDING)
-   ->  Index Only Scan using sp_kd_ind on kd_point_tbl
+   ->  Index Scan using sp_kd_ind on kd_point_tbl
          Order By: (p <-> '(0,0)'::point)
 (4 rows)
 
@@ -524,7 +524,7 @@ FROM kd_point_tbl WHERE p <@ box '(200,200,1000,1000)';
 ----------------------------------------------------------------------------
  WindowAgg
    Window: w1 AS (ORDER BY (p <-> '(0,0)'::point) ROWS UNBOUNDED PRECEDING)
-   ->  Index Only Scan using sp_kd_ind on kd_point_tbl
+   ->  Index Scan using sp_kd_ind on kd_point_tbl
          Index Cond: (p <@ '(1000,1000),(200,200)'::box)
          Order By: (p <-> '(0,0)'::point)
 (5 rows)
@@ -546,7 +546,7 @@ FROM kd_point_tbl WHERE p IS NOT NULL;
 --------------------------------------------------------------------------------
  WindowAgg
    Window: w1 AS (ORDER BY (p <-> '(333,400)'::point) ROWS UNBOUNDED PRECEDING)
-   ->  Index Only Scan using sp_kd_ind on kd_point_tbl
+   ->  Index Scan using sp_kd_ind on kd_point_tbl
          Index Cond: (p IS NOT NULL)
          Order By: (p <-> '(333,400)'::point)
 (5 rows)
@@ -567,10 +567,10 @@ SET extra_float_digits = 0;
 CREATE INDEX ON quad_point_tbl_ord_seq1 USING spgist(p) INCLUDE(dist);
 EXPLAIN (COSTS OFF)
 SELECT p, dist FROM quad_point_tbl_ord_seq1 ORDER BY p <-> '0,0' LIMIT 10;
-                                        QUERY PLAN                                         
--------------------------------------------------------------------------------------------
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
  Limit
-   ->  Index Only Scan using quad_point_tbl_ord_seq1_p_dist_idx on quad_point_tbl_ord_seq1
+   ->  Index Scan using quad_point_tbl_ord_seq1_p_dist_idx on quad_point_tbl_ord_seq1
          Order By: (p <-> '(0,0)'::point)
 (3 rows)
 
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index c75bbb23b..810db8b8f 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -74,13 +74,13 @@ select p from gist_tbl where p <@ box(point(0,0), point(0.5, 0.5));
  (0.5,0.5)
 (11 rows)
 
--- Also test an index-only knn-search
+-- Also test a knn-search
 explain (costs off)
 select p from gist_tbl where p <@ box(point(0,0), point(0.5, 0.5))
 order by p <-> point(0.201, 0.201);
-                       QUERY PLAN                       
---------------------------------------------------------
- Index Only Scan using gist_tbl_point_index on gist_tbl
+                    QUERY PLAN                     
+---------------------------------------------------
+ Index Scan using gist_tbl_point_index on gist_tbl
    Index Cond: (p <@ '(0.5,0.5),(0,0)'::box)
    Order By: (p <-> '(0.201,0.201)'::point)
 (3 rows)
@@ -106,9 +106,9 @@ order by p <-> point(0.201, 0.201);
 explain (costs off)
 select p from gist_tbl where p <@ box(point(0,0), point(0.5, 0.5))
 order by point(0.101, 0.101) <-> p;
-                       QUERY PLAN                       
---------------------------------------------------------
- Index Only Scan using gist_tbl_point_index on gist_tbl
+                    QUERY PLAN                     
+---------------------------------------------------
+ Index Scan using gist_tbl_point_index on gist_tbl
    Index Cond: (p <@ '(0.5,0.5),(0,0)'::box)
    Order By: (p <-> '(0.101,0.101)'::point)
 (3 rows)
@@ -138,12 +138,12 @@ select p from
           (box(point(0.8,0.8), point(1.0,1.0)))) as v(bb)
 cross join lateral
   (select p from gist_tbl where p <@ bb order by p <-> bb[0] limit 2) ss;
-                             QUERY PLAN                             
---------------------------------------------------------------------
+                          QUERY PLAN                           
+---------------------------------------------------------------
  Nested Loop
    ->  Values Scan on "*VALUES*"
    ->  Limit
-         ->  Index Only Scan using gist_tbl_point_index on gist_tbl
+         ->  Index Scan using gist_tbl_point_index on gist_tbl
                Index Cond: (p <@ "*VALUES*".column1)
                Order By: (p <-> ("*VALUES*".column1)[0])
 (6 rows)
@@ -203,13 +203,13 @@ select b from gist_tbl where b <@ box(point(5,5), point(6,6));
  (6,6),(6,6)
 (21 rows)
 
--- Also test an index-only knn-search
+-- Also test a knn-search
 explain (costs off)
 select b from gist_tbl where b <@ box(point(5,5), point(6,6))
 order by b <-> point(5.2, 5.91);
-                      QUERY PLAN                      
-------------------------------------------------------
- Index Only Scan using gist_tbl_box_index on gist_tbl
+                   QUERY PLAN                    
+-------------------------------------------------
+ Index Scan using gist_tbl_box_index on gist_tbl
    Index Cond: (b <@ '(6,6),(5,5)'::box)
    Order By: (b <-> '(5.2,5.91)'::point)
 (3 rows)
@@ -245,9 +245,9 @@ order by b <-> point(5.2, 5.91);
 explain (costs off)
 select b from gist_tbl where b <@ box(point(5,5), point(6,6))
 order by point(5.2, 5.91) <-> b;
-                      QUERY PLAN                      
-------------------------------------------------------
- Index Only Scan using gist_tbl_box_index on gist_tbl
+                   QUERY PLAN                    
+-------------------------------------------------
+ Index Scan using gist_tbl_box_index on gist_tbl
    Index Cond: (b <@ '(6,6),(5,5)'::box)
    Order By: (b <-> '(5.2,5.91)'::point)
 (3 rows)
@@ -373,20 +373,26 @@ select count(*) from gist_tbl;
  10001
 (1 row)
 
--- This case isn't supported, but it should at least EXPLAIN correctly.
+-- An ordering-operator (nearest-neighbor) scan is never planned as an
+-- index-only scan, so this lossy-distance case runs as a plain index scan that
+-- rechecks the distances against the heap tuple.
 explain (verbose, costs off)
 select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
-                                     QUERY PLAN                                     
-------------------------------------------------------------------------------------
+                                    QUERY PLAN                                    
+----------------------------------------------------------------------------------
  Limit
    Output: p, ((circle(p, '1'::double precision) <-> '(0,0)'::point))
-   ->  Index Only Scan using gist_tbl_multi_index on public.gist_tbl
+   ->  Index Scan using gist_tbl_multi_index on public.gist_tbl
          Output: p, (circle(p, '1'::double precision) <-> '(0,0)'::point)
-         Order By: ((circle(gist_tbl.p, '1'::double precision)) <-> '(0,0)'::point)
+         Order By: (circle(gist_tbl.p, '1'::double precision) <-> '(0,0)'::point)
 (5 rows)
 
 select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
-ERROR:  lossy distance functions are not supported in index-only scans
+   p   
+-------
+ (0,0)
+(1 row)
+
 -- Force an index build using buffering.
 create index gist_tbl_box_index_forcing_buffering on gist_tbl using gist (p)
   with (buffering=on, fillfactor=50);
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index 6f1fc65f1..369eb4576 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -65,7 +65,7 @@ select p from gist_tbl where p <@ box(point(0,0), point(0.5, 0.5));
 -- execute the same
 select p from gist_tbl where p <@ box(point(0,0), point(0.5, 0.5));
 
--- Also test an index-only knn-search
+-- Also test a knn-search
 explain (costs off)
 select p from gist_tbl where p <@ box(point(0,0), point(0.5, 0.5))
 order by p <-> point(0.201, 0.201);
@@ -109,7 +109,7 @@ select b from gist_tbl where b <@ box(point(5,5), point(6,6));
 -- execute the same
 select b from gist_tbl where b <@ box(point(5,5), point(6,6));
 
--- Also test an index-only knn-search
+-- Also test a knn-search
 explain (costs off)
 select b from gist_tbl where b <@ box(point(5,5), point(6,6))
 order by b <-> point(5.2, 5.91);
@@ -164,7 +164,9 @@ explain (verbose, costs off)
 select count(*) from gist_tbl;
 select count(*) from gist_tbl;
 
--- This case isn't supported, but it should at least EXPLAIN correctly.
+-- An ordering-operator (nearest-neighbor) scan is never planned as an
+-- index-only scan, so this lossy-distance case runs as a plain index scan that
+-- rechecks the distances against the heap tuple.
 explain (verbose, costs off)
 select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
 select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 446e68a84..d3ab27607 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1063,6 +1063,8 @@ GBT_NUMKEY_R
 GBT_VARKEY
 GBT_VARKEY_R
 GENERAL_NAME
+GISTBatchData
+GISTBatchItem
 GISTBuildBuffers
 GISTBuildState
 GISTDeletedPageContents
-- 
2.53.0



  [application/octet-stream] v28-0006-WIP-Adopt-amgetbatch-interface-in-SP-GiST-index-.patch (68.4K, 8-v28-0006-WIP-Adopt-amgetbatch-interface-in-SP-GiST-index-.patch)
From 7d85357c00b4b768af436bbfe38539bff75d1e4a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Thu, 4 Jun 2026 23:09:37 -0400
Subject: [PATCH v28 06/11] WIP: Adopt amgetbatch interface in SP-GiST index
 AM.

Replace spggettuple with spggetbatch, which implements the amgetbatch
interface added by commit FIXME.  Plain index scans of SP-GiST indexes
now return matching items in batches consisting of all of the matches
from a given leaf page, giving the table AM the ability to perform
optimizations like index prefetching during SP-GiST index scans.

As in nbtree, hash, and GiST, an ordinary batch's only retained buffer
pin is the one on its single leaf page, held as the standardized
interlock against unsafe concurrent TID recycling by VACUUM, for as long
as the table AM still needs it.  Nearest-neighbor (ordered) scans work
as in GiST: spggetbatch drains the distance-ordered queue into one
"virtual" batch spanning many leaf pages.

The interlock pin also fixes a pre-existing bug in which SP-GiST
index-only scans could return wrong answers.  This is exactly the same
race condition that commit FIXME (which taught GiST to use the
amgetbatch interface) fixed in GiST.  As with GiST, we rely on the
planner disallowing ordered SP-GiST scans to close the gap there
(SP-GiST also uses "virtual batches" during ordered scans, which make a
conventional leaf page pin interlock impractical, just like in GiST).

There is an additional restriction on index-only scans, which is a
separate issue that is peculiar to SP-GiST: index-only scans are now
disabled for "long values" opclasses such as the text radix opclass.
These opclasses use reconstructed values whose size is essentially
unbounded.  The prefix cannot reliably fit into a fixed per-batch
reconstruction workspace.  There doesn't appear to be a simple way to
solve that resource management problem within the confines of the
amgetbatch design, and inventing new infrastructure to make it work
doesn't seem likely to pay for itself.  This warrants a separate
SP-GiST-only incompatibility item in the Postgres 20 release notes (in
addition to an item about GiST _and_ SP-GiST not supporting ordered
index-only scans anymore).

Author: Peter Geoghegan <[email protected]>
---
 src/include/access/spgist.h                   |   5 +-
 src/include/access/spgist_private.h           | 102 +-
 src/backend/access/spgist/README              |  11 +-
 src/backend/access/spgist/spgscan.c           | 908 +++++++++++++-----
 src/backend/access/spgist/spgutils.c          |   8 +-
 src/backend/access/spgist/spgvacuum.c         |  68 +-
 src/backend/access/spgist/spgxlog.c           |  12 +-
 doc/src/sgml/indexam.sgml                     |  40 +-
 doc/src/sgml/spgist.sgml                      |   7 +-
 .../expected/spgist_name_ops.out              |   4 +-
 src/test/regress/expected/amutils.out         |   2 +-
 .../regress/expected/create_index_spgist.out  |  50 +-
 src/tools/pgindent/typedefs.list              |   3 +-
 13 files changed, 853 insertions(+), 367 deletions(-)

diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index 083d93f8f..3c2582e76 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -208,7 +208,10 @@ extern void spgendscan(IndexScanDesc scan);
 extern void spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					  ScanKey orderbys, int norderbys);
 extern int64 spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
-extern bool spggettuple(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch spggetbatch(IndexScanDesc scan, IndexScanBatch priorbatch,
+								  ScanDirection dir);
+extern void spgunguardbatch(IndexScanDesc scan, IndexScanBatch batch);
+extern void spggettransform(IndexScanDesc scan, IndexScanBatch batch, int item);
 extern bool spgcanreturn(Relation index, int attno);
 
 /* spgvacuum.c */
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index ec6d6f5f7..ff8920140 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -14,6 +14,7 @@
 #ifndef SPGIST_PRIVATE_H
 #define SPGIST_PRIVATE_H
 
+#include "access/indexbatch.h"
 #include "access/itup.h"
 #include "access/spgist.h"
 #include "catalog/pg_am_d.h"
@@ -183,6 +184,81 @@ typedef struct SpGistSearchItem
 #define SizeOfSpGistSearchItem(n_distances) \
 	(offsetof(SpGistSearchItem, distances) + sizeof(double) * (n_distances))
 
+/*
+ * Per-batch data private to the SP-GiST index AM (the static index AM opaque
+ * area of an IndexScanBatch).
+ *
+ * A non-ordered batch holds all matches from a single leaf page, and its buffer
+ * pin is the TID recycling interlock.  An ordered (nearest-neighbor) scan
+ * instead returns a "virtual" batch, drained from the distance-ordered queue
+ * and spanning many leaf pages; it holds no pin (blkno == InvalidBlockNumber).
+ *
+ * reconValue/level/isNull are the shared inputs spggettransform uses to
+ * reconstruct an index-only scan's values; the prefix is the same for every
+ * match in a non-ordered batch.  (Ordered scans are never index-only.)
+ */
+typedef struct SpGistBatchData
+{
+	Buffer		buf;			/* leaf page's pin (InvalidBuffer if virtual) */
+	BlockNumber blkno;			/* leaf blkno (InvalidBlockNumber == virtual) */
+	Datum		reconValue;		/* prefix; points into the recon area
+								 * when by-reference */
+	int			level;
+	bool		isNull;			/* batch came from a nulls page */
+} SpGistBatchData;
+
+#define SpGistBatchGetData(scan, batch) \
+	index_scan_batch_index_opaque_static(scan, batch, SpGistBatchData)
+
+/*
+ * Per-item data for an ordered (virtual) batch: an array in the dynamic opaque
+ * area, subscripted via SpGistBatchGetItem.  Each item has its own recheck flag
+ * (SP-GiST matching is lossy, varying per item) plus its ORDER BY distances.
+ *
+ * A non-ordered batch needs only a recheck flag per item, so its dynamic opaque
+ * area is a plain bool array, subscripted via SpGistBatchGetRecheck.
+ */
+typedef struct SpGistBatchItem
+{
+	bool		recheck;		/* T if quals must be rechecked */
+	bool		recheckDistances;	/* T if distances are lossy lower bounds */
+	IndexOrderByDistance distances[FLEXIBLE_ARRAY_MEMBER];	/* numberOfOrderBys */
+} SpGistBatchItem;
+
+#define SizeOfSpGistBatchItem(n_distances) \
+	(offsetof(SpGistBatchItem, distances) + \
+	 sizeof(IndexOrderByDistance) * (n_distances))
+
+/* Subscript an ordered (virtual) batch's item array */
+#define SpGistBatchGetItem(scan, batch, item) \
+	(AssertMacro(((SpGistScanOpaque) (scan)->opaque)->numberOfNonNullOrderBys > 0), \
+	 AssertMacro((item) >= 0 && (item) < MaxIndexTuplesPerPage), \
+	 (SpGistBatchItem *) ((char *) index_scan_batch_index_opaque_dyn((scan), (batch)) + \
+						  (Size) (item) * SizeOfSpGistBatchItem((scan)->numberOfOrderBys)))
+
+/* Subscript a non-ordered batch's recheck-flag array */
+#define SpGistBatchGetRecheck(scan, batch) \
+	(AssertMacro(((SpGistScanOpaque) (scan)->opaque)->numberOfNonNullOrderBys == 0), \
+	 (bool *) index_scan_batch_index_opaque_dyn((scan), (batch)))
+
+/* Size of each layout's per-item array within the dynamic opaque area */
+#define SpGistBatchItemArraySize(scan) \
+	MAXALIGN(SizeOfSpGistBatchItem((scan)->numberOfOrderBys) * (scan)->maxitemsbatch)
+#define SpGistBatchRecheckArraySize(scan) \
+	MAXALIGN(sizeof(bool) * (scan)->maxitemsbatch)
+
+/*
+ * For an index-only scan, the shared by-reference reconstruction prefix
+ * (SpGistBatchData.reconValue) is stored after the per-item array in the
+ * dynamic opaque area, not in currTuples: the prefix is reconstructed from
+ * ancestor inner pages, so it isn't bounded by the one leaf page that
+ * currTuples is sized for.  Index-only scans are always non-ordered, so it
+ * follows the recheck array.
+ */
+#define SpGistBatchGetReconArea(scan, batch) \
+	((char *) index_scan_batch_index_opaque_dyn((scan), (batch)) + \
+	 SpGistBatchRecheckArraySize(scan))
+
 /*
  * Private state of an index scan
  */
@@ -217,29 +293,9 @@ typedef struct SpGistScanOpaqueData
 	double	   *zeroDistances;
 	double	   *infDistances;
 
-	/* These fields are only used in amgetbitmap scans: */
-	TIDBitmap  *tbm;			/* bitmap being filled */
-	int64		ntids;			/* number of TIDs passed to bitmap */
-
-	/* These fields are only used in amgettuple scans: */
-	bool		want_itup;		/* are we reconstructing tuples? */
-	TupleDesc	reconTupDesc;	/* if so, descriptor for reconstructed tuples */
-	int			nPtrs;			/* number of TIDs found on current page */
-	int			iPtr;			/* index for scanning through same */
-	ItemPointerData heapPtrs[MaxIndexTuplesPerPage];	/* TIDs from cur page */
-	bool		recheck[MaxIndexTuplesPerPage]; /* their recheck flags */
-	bool		recheckDistances[MaxIndexTuplesPerPage];	/* distance recheck
-															 * flags */
-	HeapTuple	reconTups[MaxIndexTuplesPerPage];	/* reconstructed tuples */
-
-	/* distances (for recheck) */
-	IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
-
-	/*
-	 * Note: using MaxIndexTuplesPerPage above is a bit hokey since
-	 * SpGistLeafTuples aren't exactly IndexTuples; however, they are larger,
-	 * so this is safe.
-	 */
+	/* These fields are only used in amgetbatch scans: */
+	TupleDesc	reconTupDesc;	/* descriptor for reconstructed tuples */
+	MemoryContext reconCxt;		/* context for lazily reconstructed xs_hitup */
 } SpGistScanOpaqueData;
 
 typedef SpGistScanOpaqueData *SpGistScanOpaque;
diff --git a/src/backend/access/spgist/README b/src/backend/access/spgist/README
index 7117e02c7..e37240992 100644
--- a/src/backend/access/spgist/README
+++ b/src/backend/access/spgist/README
@@ -352,8 +352,8 @@ target TID is not acceptable, so we have to extend the algorithm to cope
 with such cases.  We recognize that such a move might have occurred when
 we see a leaf-page REDIRECT tuple whose XID indicates it might have been
 created after the VACUUM scan started.  We add the redirection target TID
-to a "pending list" of places we need to recheck.  Between pages of the
-main sequential scan, we empty the pending list by visiting each listed
+to a "pending list" of places we need to recheck.  During the main
+sequential scan, we empty the pending list by visiting each listed
 TID.  If it points to an inner tuple (from a PickSplit), add each downlink
 TID to the pending list.  If it points to a leaf page, vacuum that page.
 (We could just vacuum the single pointed-to chain, but vacuuming the
@@ -365,6 +365,13 @@ only after we've completed all pending-list processing; instead we just
 mark items as done after processing them.  Adding a TID that's already in
 the list is a no-op, whether or not that item is marked done yet.
 
+On a leaf page, VACUUM takes a cleanup lock rather than a plain exclusive lock.
+This is the interlock that makes index-only scans safe against concurrent TID
+recycling: such a scan keeps a pin on the leaf page it read heap TIDs from until
+it has consulted the visibility map, and VACUUM cannot make any of that page's
+TIDs recyclable until spgbulkdelete returns, which it cannot do until it has
+cleanup-locked that page behind the scan's pin.
+
 spgbulkdelete also updates the index's free space map.
 
 Currently, spgvacuumcleanup has nothing to do if spgbulkdelete was
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 2cc5f06f5..e6b15d2cc 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -28,10 +28,12 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
-typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
-							   Datum leafValue, bool isNull,
-							   SpGistLeafTuple leafTuple, bool recheck,
-							   bool recheckDistances, double *distances);
+static Buffer spgReadItemPage(IndexScanDesc scan, SpGistSearchItem *item,
+							  Buffer buffer);
+static void spgProcessInnerPage(IndexScanDesc scan, SpGistSearchItem *item,
+								Page page);
+static void spgProcessLeafPage(IndexScanDesc scan, SpGistSearchItem *item,
+							   Page page, IndexScanBatch batch);
 
 /*
  * Pairing heap comparison function for the SpGistSearchItem queue.
@@ -172,26 +174,6 @@ resetSpGistScanOpaque(SpGistScanOpaque so)
 		spgAddStartItem(so, false);
 
 	MemoryContextSwitchTo(oldCtx);
-
-	if (so->numberOfOrderBys > 0)
-	{
-		/* Must pfree distances to avoid memory leak */
-		int			i;
-
-		for (i = 0; i < so->nPtrs; i++)
-			if (so->distances[i])
-				pfree(so->distances[i]);
-	}
-
-	if (so->want_itup)
-	{
-		/* Must pfree reconstructed tuples to avoid memory leak */
-		int			i;
-
-		for (i = 0; i < so->nPtrs; i++)
-			pfree(so->reconTups[i]);
-	}
-	so->iPtr = so->nPtrs = 0;
 }
 
 /*
@@ -332,6 +314,9 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
 	 */
 	so->reconTupDesc = scan->xs_hitupdesc =
 		getSpGistTupleDesc(rel, &so->state.attType);
+	so->reconCxt = AllocSetContextCreate(CurrentMemoryContext,
+										 "SP-GiST reconstruction context",
+										 ALLOCSET_SMALL_SIZES);
 
 	/* Allocate various arrays needed for order-by scans */
 	if (scan->numberOfOrderBys > 0)
@@ -354,6 +339,27 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
 		scan->xs_orderbynulls = palloc_array(bool, scan->numberOfOrderBys);
 		memset(scan->xs_orderbynulls, true,
 			   sizeof(bool) * scan->numberOfOrderBys);
+
+		/*
+		 * Ordered scans fill a "virtual" batch by draining the
+		 * distance-ordered queue, so the batch size is a tuning knob with no
+		 * natural value. Testing has shown that a very small size will
+		 * increase per-batch overhead (and likely instruction-cache misses),
+		 * while a large size (such as MaxIndexTuplesPerPage) risks producing
+		 * many tuples that a LIMIT node never consumes.  This maxitemsbatch
+		 * is a compromise.
+		 */
+		scan->maxitemsbatch = MaxIndexTuplesPerPage / 32;
+	}
+	else
+	{
+		/*
+		 * A non-ordered batch holds all of the matches from a single leaf
+		 * page, so one page's worth of items is the natural cap.  Using
+		 * MaxIndexTuplesPerPage is a bit hokey since SpGistLeafTuples aren't
+		 * exactly IndexTuples; however, they are larger, so this is safe.
+		 */
+		scan->maxitemsbatch = MaxIndexTuplesPerPage;
 	}
 
 	fmgr_info_copy(&so->innerConsistentFn,
@@ -366,6 +372,9 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
 
 	so->indexCollation = rel->rd_indcollation[0];
 
+	scan->batch_index_opaque_static = MAXALIGN(sizeof(SpGistBatchData));
+	scan->batch_tuples_workspace = BLCKSZ;
+
 	scan->opaque = so;
 
 	return scan;
@@ -411,9 +420,31 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	/* preprocess scankeys, set up the representation in *so */
 	spgPrepareScanKeys(scan);
 
+	/*
+	 * Size the dynamic opaque area now that the scan keys (and xs_want_itup)
+	 * are known.  Ordered (virtual) batches need a full SpGistBatchItem array
+	 * (each item's ORDER BY distances included); non-ordered batches need
+	 * only a recheck flag per item.  Index-only scans (always non-ordered)
+	 * need extra room after the array for the reconstruction prefix (see
+	 * SpGistBatchGetReconArea and spgcanreturn).
+	 *
+	 * We do this here rather than in spgbeginscan because xs_want_itup is set
+	 * by index_beginscan only after ambeginscan returns.
+	 */
+	if (so->numberOfNonNullOrderBys > 0)
+		scan->batch_index_opaque_dyn = SpGistBatchItemArraySize(scan);
+	else
+		scan->batch_index_opaque_dyn = SpGistBatchRecheckArraySize(scan);
+	if (scan->xs_want_itup)
+		scan->batch_index_opaque_dyn += BLCKSZ;
+
 	/* set up starting queue entries */
 	resetSpGistScanOpaque(so);
 
+	/* discard any index-only tuple reconstructed by a previous scan */
+	MemoryContextReset(so->reconCxt);
+	scan->xs_hitup = NULL;
+
 	/* count an indexscan for stats */
 	pgstat_count_index_scan(scan->indexRelation);
 	if (scan->instrument)
@@ -427,6 +458,7 @@ spgendscan(IndexScanDesc scan)
 
 	MemoryContextDelete(so->tempCxt);
 	MemoryContextDelete(so->traversalCxt);
+	MemoryContextDelete(so->reconCxt);
 
 	if (so->keyData)
 		pfree(so->keyData);
@@ -455,10 +487,11 @@ spgendscan(IndexScanDesc scan)
  * Leaf SpGistSearchItem constructor, called in queue context
  */
 static SpGistSearchItem *
-spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
+spgNewHeapItem(IndexScanDesc scan, int level, SpGistLeafTuple leafTuple,
 			   Datum leafValue, bool recheck, bool recheckDistances,
 			   bool isnull, double *distances)
 {
+	SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
 	SpGistSearchItem *item = spgAllocSearchItem(so, isnull, distances);
 
 	item->level = level;
@@ -470,7 +503,7 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
 	 * if we didn't ask it to, and mildly-broken methods might supply one of
 	 * the wrong type.  The correct leafValue type is attType not leafType.
 	 */
-	if (so->want_itup)
+	if (scan->xs_want_itup)
 	{
 		item->value = isnull ? (Datum) 0 :
 			datumCopy(leafValue, so->state.attType.attbyval,
@@ -502,16 +535,18 @@ spgNewHeapItem(SpGistScanOpaque so, int level, SpGistLeafTuple leafTuple,
 }
 
 /*
- * Test whether a leaf tuple satisfies all the scan keys
+ * Test whether a leaf tuple satisfies all the scan keys.
  *
- * *reportedSome is set to true if:
- *		the scan is not ordered AND the item satisfies the scankeys
+ * When a match is found, an ordered scan queues the heap tuple for later
+ * distance-ordered draining.  A non-ordered scan appends it to batch.
+ *
+ * 'batch' arg is NULL for ordered scans.
  */
 static bool
-spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
-			SpGistLeafTuple leafTuple, bool isnull,
-			bool *reportedSome, storeRes_func storeRes)
+spgLeafTest(IndexScanDesc scan, SpGistSearchItem *item,
+			SpGistLeafTuple leafTuple, bool isnull, IndexScanBatch batch)
 {
+	SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
 	Datum		leafValue;
 	double	   *distances;
 	bool		result;
@@ -544,7 +579,7 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
 		in.reconstructedValue = item->value;
 		in.traversalValue = item->traversalValue;
 		in.level = item->level;
-		in.returnData = so->want_itup;
+		in.returnData = scan->xs_want_itup;
 		in.leafDatum = SGLTDATUM(leafTuple, &so->state);
 
 		out.leafValue = (Datum) 0;
@@ -569,15 +604,16 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
 		/* item passes the scankeys */
 		if (so->numberOfNonNullOrderBys > 0)
 		{
-			/* the scan is ordered -> add the item to the queue */
-			MemoryContext oldCxt = MemoryContextSwitchTo(so->traversalCxt);
-			SpGistSearchItem *heapItem = spgNewHeapItem(so, item->level,
-														leafTuple,
-														leafValue,
-														recheck,
-														recheckDistances,
-														isnull,
-														distances);
+			/* The scan is ordered; add the item to the queue */
+			MemoryContext oldCxt;
+			SpGistSearchItem *heapItem;
+
+			Assert(scan->batchImmediateUnguard);
+
+			oldCxt = MemoryContextSwitchTo(so->traversalCxt);
+			heapItem = spgNewHeapItem(scan, item->level, leafTuple, leafValue,
+									  recheck, recheckDistances, isnull,
+									  distances);
 
 			spgAddSearchItemToQueue(so, heapItem);
 
@@ -585,11 +621,41 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
 		}
 		else
 		{
-			/* non-ordered scan, so report the item right away */
+			/*
+			 * The scan is non-ordered; add the item to caller's batch
+			 * directly.
+			 */
+			int			i = ++batch->lastItem;
+
 			Assert(!recheckDistances);
-			storeRes(so, &leafTuple->heapPtr, leafValue, isnull,
-					 leafTuple, recheck, false, NULL);
-			*reportedSome = true;
+			Assert(i < scan->maxitemsbatch);
+
+			batch->items[i].tableTid = leafTuple->heapPtr;
+			batch->items[i].indexOffset = InvalidOffsetNumber;	/* meaningless */
+			batch->items[i].tupleOffset = 0;
+
+			SpGistBatchGetRecheck(scan, batch)[i] = recheck;
+
+			if (scan->xs_want_itup)
+			{
+				Size		sz = leafTuple->size;
+				int			off = 0;
+
+				if (i > batch->firstItem)
+				{
+					int			prev = batch->items[i - 1].tupleOffset;
+
+					/*
+					 * Copy tuple to point immediately after most recently
+					 * appended tuple
+					 */
+					off = prev + ((SpGistLeafTuple) (batch->currTuples + prev))->size;
+				}
+
+				batch->items[i].tupleOffset = off;
+				Assert(off + sz <= scan->batch_tuples_workspace);
+				memcpy(batch->currTuples + off, leafTuple, sz);
+			}
 		}
 	}
 
@@ -599,10 +665,12 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
 /* A bundle initializer for inner_consistent methods */
 static void
 spgInitInnerConsistentIn(spgInnerConsistentIn *in,
-						 SpGistScanOpaque so,
+						 IndexScanDesc scan,
 						 SpGistSearchItem *item,
 						 SpGistInnerTuple innerTuple)
 {
+	SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+
 	in->scankeys = so->keyData;
 	in->orderbys = so->orderByData;
 	in->nkeys = so->numberOfKeys;
@@ -612,7 +680,7 @@ spgInitInnerConsistentIn(spgInnerConsistentIn *in,
 	in->traversalMemoryContext = so->traversalCxt;
 	in->traversalValue = item->traversalValue;
 	in->level = item->level;
-	in->returnData = so->want_itup;
+	in->returnData = scan->xs_want_itup;
 	in->allTheSame = innerTuple->allTheSame;
 	in->hasPrefix = (innerTuple->prefixSize > 0);
 	in->prefixDatum = SGITDATUM(innerTuple, &so->state);
@@ -659,9 +727,10 @@ spgMakeInnerItem(SpGistScanOpaque so,
 }
 
 static void
-spgInnerTest(SpGistScanOpaque so, SpGistSearchItem *item,
+spgInnerTest(IndexScanDesc scan, SpGistSearchItem *item,
 			 SpGistInnerTuple innerTuple, bool isnull)
 {
+	SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
 	MemoryContext oldCxt = MemoryContextSwitchTo(so->tempCxt);
 	spgInnerConsistentOut out;
 	int			nNodes = innerTuple->nNodes;
@@ -673,7 +742,7 @@ spgInnerTest(SpGistScanOpaque so, SpGistSearchItem *item,
 	{
 		spgInnerConsistentIn in;
 
-		spgInitInnerConsistentIn(&in, so, item, innerTuple);
+		spgInitInnerConsistentIn(&in, scan, item, innerTuple);
 
 		/* use user-defined inner consistent method */
 		FunctionCall2Coll(&so->innerConsistentFn,
@@ -755,12 +824,11 @@ enum SpGistSpecialOffsetNumbers
 };
 
 static OffsetNumber
-spgTestLeafTuple(SpGistScanOpaque so,
+spgTestLeafTuple(IndexScanDesc scan,
 				 SpGistSearchItem *item,
 				 Page page, OffsetNumber offset,
 				 bool isnull, bool isroot,
-				 bool *reportedSome,
-				 storeRes_func storeRes)
+				 IndexScanBatch batch)
 {
 	SpGistLeafTuple leafTuple = (SpGistLeafTuple)
 		PageGetItem(page, PageGetItemId(page, offset));
@@ -796,117 +864,91 @@ spgTestLeafTuple(SpGistScanOpaque so,
 
 	Assert(ItemPointerIsValid(&leafTuple->heapPtr));
 
-	spgLeafTest(so, item, leafTuple, isnull, reportedSome, storeRes);
+	spgLeafTest(scan, item, leafTuple, isnull, batch);
 
 	return SGLT_GET_NEXTOFFSET(leafTuple);
 }
 
 /*
- * Walk the tree and report all tuples passing the scan quals to the storeRes
- * subroutine.
+ * Walk the tree and return the next batch of matching tuples.
  *
- * If scanWholeIndex is true, we'll do just that.  If not, we'll stop at the
- * next page boundary once we have reported at least one tuple.
+ * Main driver of spggetbitmap and non-ordered spggetbatch scans.
  */
-static void
-spgWalk(Relation index, SpGistScanOpaque so, bool scanWholeIndex,
-		storeRes_func storeRes)
+static IndexScanBatch
+spgWalk(IndexScanDesc scan)
 {
+	SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+	IndexScanBatch batch;
 	Buffer		buffer = InvalidBuffer;
-	bool		reportedSome = false;
 
-	while (scanWholeIndex || !reportedSome)
+	batch = indexam_util_alloc_batch(scan);
+
+	/* SP-GiST only ever scans forward; set the batch's direction up front */
+	batch->dir = ForwardScanDirection;
+
+	/* Walk until a leaf page yields matches, or the index is exhausted */
+	while (batch->firstItem > batch->lastItem)
 	{
 		SpGistSearchItem *item = spgGetNextQueueItem(so);
+		Page		page;
 
 		if (item == NULL)
 			break;				/* No more items in queue -> done */
 
-redirect:
-		/* Check for interrupts, just in case of infinite loop */
-		CHECK_FOR_INTERRUPTS();
+		/* Heap items only occur in ordered scans (see spgWalkOrdered) */
+		Assert(!item->isLeaf);
 
-		if (item->isLeaf)
-		{
-			/* We store heap items in the queue only in case of ordered search */
-			Assert(so->numberOfNonNullOrderBys > 0);
-			storeRes(so, &item->heapPtr, item->value, item->isNull,
-					 item->leafTuple, item->recheck,
-					 item->recheckDistances, item->distances);
-			reportedSome = true;
-		}
+		/*
+		 * Navigate to the item's live page, then process its contents.
+		 *
+		 * Note: spgReadItemPage calls CHECK_FOR_INTERRUPTS().
+		 */
+		buffer = spgReadItemPage(scan, item, buffer);
+		page = BufferGetPage(buffer);
+
+		if (SpGistPageIsLeaf(page))
+			spgProcessLeafPage(scan, item, page, batch);
 		else
+			spgProcessInnerPage(scan, item, page);
+
+		if (batch->firstItem <= batch->lastItem)
 		{
-			BlockNumber blkno = ItemPointerGetBlockNumber(&item->heapPtr);
-			OffsetNumber offset = ItemPointerGetOffsetNumber(&item->heapPtr);
-			Page		page;
-			bool		isnull;
+			/* batch has matching items to return */
+			SpGistBatchData *sbatch = SpGistBatchGetData(scan, batch);
 
-			if (buffer == InvalidBuffer)
+			Assert(BufferIsValid(buffer));
+			Assert(SpGistPageIsLeaf(BufferGetPage(buffer)));
+
+			sbatch->buf = buffer;
+			sbatch->blkno = BufferGetBlockNumber(buffer);
+
+			if (scan->xs_want_itup)
 			{
-				buffer = ReadBuffer(index, blkno);
-				LockBuffer(buffer, BUFFER_LOCK_SHARE);
-			}
-			else if (blkno != BufferGetBlockNumber(buffer))
-			{
-				UnlockReleaseBuffer(buffer);
-				buffer = ReadBuffer(index, blkno);
-				LockBuffer(buffer, BUFFER_LOCK_SHARE);
-			}
+				/*
+				 * Stash the shared reconstruction prefix for spggettransform,
+				 * which runs after item is freed.  The prefix (item->value)
+				 * is the same for every match in the batch.  It can be NULL
+				 * when the opclass reconstructs entirely from the leaf datum
+				 * (e.g. quad/kd-tree) or at the root level.
+				 */
+				sbatch->level = item->level;
+				sbatch->isNull = item->isNull;
 
-			/* else new pointer points to the same page, no work needed */
-
-			page = BufferGetPage(buffer);
-
-			isnull = SpGistPageStoresNulls(page) ? true : false;
-
-			if (SpGistPageIsLeaf(page))
-			{
-				/* Page is a leaf - that is, all its tuples are heap items */
-				OffsetNumber max = PageGetMaxOffsetNumber(page);
-
-				if (SpGistBlockIsRoot(blkno))
+				if (so->state.attLeafType.attbyval || item->isNull ||
+					DatumGetPointer(item->value) == NULL)
 				{
-					/* When root is a leaf, examine all its tuples */
-					for (offset = FirstOffsetNumber; offset <= max; offset++)
-						(void) spgTestLeafTuple(so, item, page, offset,
-												isnull, true,
-												&reportedSome, storeRes);
+					sbatch->reconValue = item->value;
 				}
 				else
 				{
-					/* Normal case: just examine the chain we arrived at */
-					while (offset != InvalidOffsetNumber)
-					{
-						Assert(offset >= FirstOffsetNumber && offset <= max);
-						offset = spgTestLeafTuple(so, item, page, offset,
-												  isnull, false,
-												  &reportedSome, storeRes);
-						if (offset == SpGistRedirectOffsetNumber)
-							goto redirect;
-					}
-				}
-			}
-			else				/* page is inner */
-			{
-				SpGistInnerTuple innerTuple = (SpGistInnerTuple)
-					PageGetItem(page, PageGetItemId(page, offset));
+					/* pass-by-reference prefix: copy it into the recon area */
+					Size		sz = datumGetSize(item->value, false,
+												  so->state.attLeafType.attlen);
+					char	   *dest = SpGistBatchGetReconArea(scan, batch);
 
-				if (innerTuple->tupstate != SPGIST_LIVE)
-				{
-					if (innerTuple->tupstate == SPGIST_REDIRECT)
-					{
-						/* transfer attention to redirect point */
-						item->heapPtr = ((SpGistDeadTuple) innerTuple)->pointer;
-						Assert(ItemPointerGetBlockNumber(&item->heapPtr) !=
-							   SPGIST_METAPAGE_BLKNO);
-						goto redirect;
-					}
-					elog(ERROR, "unexpected SPGiST tuple state: %d",
-						 innerTuple->tupstate);
+					memcpy(dest, DatumGetPointer(item->value), sz);
+					sbatch->reconValue = PointerGetDatum(dest);
 				}
-
-				spgInnerTest(so, item, innerTuple, isnull);
 			}
 		}
 
@@ -916,175 +958,511 @@ redirect:
 		MemoryContextReset(so->tempCxt);
 	}
 
-	if (buffer != InvalidBuffer)
-		UnlockReleaseBuffer(buffer);
+	if (batch->firstItem > batch->lastItem)
+	{
+		/* queue exhausted without finding any matches: end of scan */
+		if (buffer != InvalidBuffer)
+			UnlockReleaseBuffer(buffer);
+		indexam_util_release_batch(scan, batch);
+		return NULL;
+	}
+
+	indexam_util_unlock_batch(scan, batch, buffer);
+
+	return batch;
 }
 
-
-/* storeRes subroutine for getbitmap case */
+/*
+ * Convert an ordered heap item's flattened distances into the batch item's
+ * IndexOrderByDistance array, honoring nonNullOrderByOffsets.
+ */
 static void
-storeBitmap(SpGistScanOpaque so, ItemPointer heapPtr,
-			Datum leafValue, bool isnull,
-			SpGistLeafTuple leafTuple, bool recheck,
-			bool recheckDistances, double *distances)
+spgFillBatchItemDistances(SpGistScanOpaque so, SpGistBatchItem *bitem,
+						  SpGistSearchItem *item)
 {
-	Assert(!recheckDistances && !distances);
-	tbm_add_tuples(so->tbm, heapPtr, 1, recheck);
-	so->ntids++;
+	if (item->isNull || so->numberOfNonNullOrderBys <= 0)
+	{
+		for (int i = 0; i < so->numberOfOrderBys; i++)
+		{
+			bitem->distances[i].value = 0.0;
+			bitem->distances[i].isnull = true;
+		}
+		return;
+	}
+
+	for (int i = 0; i < so->numberOfOrderBys; i++)
+	{
+		int			offset = so->nonNullOrderByOffsets[i];
+
+		if (offset >= 0)
+		{
+			bitem->distances[i].value = item->distances[offset];
+			bitem->distances[i].isnull = false;
+		}
+		else
+		{
+			bitem->distances[i].value = 0.0;
+			bitem->distances[i].isnull = true;
+		}
+	}
+}
+
+/*
+ * spgWalkOrdered() -- drain the distance queue into one virtual batch
+ *
+ * Pop items from so->scanQueue in (lower-bound) distance order: index pages are
+ * scanned (pushing children and matching heap tuples back onto the queue), heap
+ * tuples are appended to the batch, until the batch fills or the queue empties.
+ * The result is a "virtual" batch spanning many leaf pages, holding no pin (and
+ * never index-only, which the planner forbids for ordered scans).
+ */
+static IndexScanBatch
+spgWalkOrdered(IndexScanDesc scan)
+{
+	SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+	IndexScanBatch batch = indexam_util_alloc_batch(scan);
+	SpGistBatchData *sbatch;
+	Buffer		buffer = InvalidBuffer;
+	int			nitems = 0;
+
+	/* SP-GiST only ever scans forward; set the batch's direction up front */
+	batch->dir = ForwardScanDirection;
+
+	for (;;)
+	{
+		SpGistSearchItem *item = spgGetNextQueueItem(so);
+
+		if (item == NULL)
+			break;				/* queue exhausted (end of scan) */
+
+		if (item->isLeaf)
+		{
+			/* matching heap tuple: append to the batch in distance order */
+			SpGistBatchItem *bitem = SpGistBatchGetItem(scan, batch, nitems);
+
+			batch->items[nitems].tableTid = item->heapPtr;
+			batch->items[nitems].indexOffset = InvalidOffsetNumber;
+			batch->items[nitems].tupleOffset = 0;
+
+			bitem->recheck = item->recheck;
+			bitem->recheckDistances = item->recheckDistances;
+			spgFillBatchItemDistances(so, bitem, item);
+
+			spgFreeSearchItem(so, item);
+			MemoryContextReset(so->tempCxt);
+
+			if (++nitems == scan->maxitemsbatch)
+				break;			/* batch full; remaining items stay queued */
+		}
+		else
+		{
+			Page		page;
+
+			/*
+			 * Index page: scan it, pushing children/heap items onto the
+			 * queue.
+			 *
+			 * Note: spgReadItemPage calls CHECK_FOR_INTERRUPTS().
+			 */
+			buffer = spgReadItemPage(scan, item, buffer);
+			page = BufferGetPage(buffer);
+
+			if (SpGistPageIsLeaf(page))
+			{
+				/* root-as-leaf: queue matching heap items (batch unused) */
+				spgProcessLeafPage(scan, item, page, NULL);
+			}
+			else
+				spgProcessInnerPage(scan, item, page);
+
+			spgFreeSearchItem(so, item);
+			MemoryContextReset(so->tempCxt);
+		}
+	}
+
+	if (buffer != InvalidBuffer)
+		UnlockReleaseBuffer(buffer);
+
+	if (nitems == 0)
+	{
+		/* no matching items remain: the scan is exhausted */
+		indexam_util_release_batch(scan, batch);
+		return NULL;
+	}
+
+	/* an ordered batch is "virtual" and holds no interlock pin */
+	sbatch = SpGistBatchGetData(scan, batch);
+	sbatch->buf = InvalidBuffer;
+	sbatch->blkno = InvalidBlockNumber;
+
+	batch->firstItem = 0;
+	batch->lastItem = nitems - 1;
+
+	Assert(!batch->isGuarded);
+
+	return batch;
+}
+
+/*
+ * Navigate to the live page that 'item' points at, following inner-tuple and
+ * leaf-head REDIRECTs.
+ *
+ * 'buffer' is the buffer the caller already holds share-locked (or
+ * InvalidBuffer).  We keep it locked while the item stays on the same block,
+ * releasing it and acquiring a new one only when the block changes.
+ *
+ * The returned buffer is pinned and share-locked, holding either a live inner
+ * tuple or a leaf page (whose chain head is not a redirect) at item->heapPtr.
+ */
+static Buffer
+spgReadItemPage(IndexScanDesc scan, SpGistSearchItem *item, Buffer buffer)
+{
+	Relation	index = scan->indexRelation;
+
+	Assert(!item->isLeaf);		/* heap items are handled by the caller */
+
+	for (;;)
+	{
+		BlockNumber blkno = ItemPointerGetBlockNumber(&item->heapPtr);
+		OffsetNumber offset = ItemPointerGetOffsetNumber(&item->heapPtr);
+		Page		page;
+
+		/* Release the page we hold if the item moved to a different block */
+		if (buffer != InvalidBuffer && blkno != BufferGetBlockNumber(buffer))
+		{
+			UnlockReleaseBuffer(buffer);
+			buffer = InvalidBuffer;
+		}
+
+		/* Acquire the page if we're not already holding it */
+		if (buffer == InvalidBuffer)
+		{
+			CHECK_FOR_INTERRUPTS();
+			buffer = ReadBuffer(index, blkno);
+			LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		}
+
+		page = BufferGetPage(buffer);
+
+		if (SpGistPageIsLeaf(page))
+		{
+			ItemId		iid;
+			SpGistLeafTuple head;
+
+			/* When root is a leaf, all its tuples are live: no redirect */
+			if (SpGistBlockIsRoot(blkno))
+				return buffer;
+
+			/*
+			 * A leaf REDIRECT is always the head of its chain; follow it to
+			 * the live tuples' page before the caller reports any match.  A
+			 * live or dead head is left for spgProcessLeafPage to deal with.
+			 */
+			iid = PageGetItemId(page, offset);
+			head = (SpGistLeafTuple) PageGetItem(page, iid);
+
+			if (head->tupstate == SPGIST_REDIRECT)
+			{
+				item->heapPtr = ((SpGistDeadTuple) head)->pointer;
+				Assert(ItemPointerGetBlockNumber(&item->heapPtr) !=
+					   SPGIST_METAPAGE_BLKNO);
+				continue;
+			}
+
+			return buffer;
+		}
+		else					/* page is inner */
+		{
+			ItemId		iid;
+			SpGistInnerTuple innerTuple;
+
+			iid = PageGetItemId(page, offset);
+			innerTuple = (SpGistInnerTuple) PageGetItem(page, iid);
+
+			if (innerTuple->tupstate != SPGIST_LIVE)
+			{
+				if (innerTuple->tupstate == SPGIST_REDIRECT)
+				{
+					/* transfer attention to redirect point */
+					item->heapPtr = ((SpGistDeadTuple) innerTuple)->pointer;
+					Assert(ItemPointerGetBlockNumber(&item->heapPtr) !=
+						   SPGIST_METAPAGE_BLKNO);
+					continue;
+				}
+				elog(ERROR, "unexpected SPGiST tuple state: %d",
+					 innerTuple->tupstate);
+			}
+
+			return buffer;
+		}
+	}
+}
+
+/*
+ * Descend a live inner tuple reached by spgReadItemPage: run inner_consistent
+ * and push the matching child nodes onto the scan queue.
+ *
+ * When we're called, the buffer containing 'page' is share-locked.  The
+ * tuple at item->heapPtr must be live.
+ */
+static void
+spgProcessInnerPage(IndexScanDesc scan, SpGistSearchItem *item, Page page)
+{
+	OffsetNumber offset = ItemPointerGetOffsetNumber(&item->heapPtr);
+	ItemId		iid;
+	SpGistInnerTuple innerTuple;
+
+	Assert(!SpGistPageIsLeaf(page));
+
+	iid = PageGetItemId(page, offset);
+	innerTuple = (SpGistInnerTuple) PageGetItem(page, iid);
+	Assert(innerTuple->tupstate == SPGIST_LIVE);
+
+	spgInnerTest(scan, item, innerTuple, SpGistPageStoresNulls(page));
+}
+
+/*
+ * Examine a leaf page reached by spgReadItemPage, acting on matching tuples:
+ * a non-ordered scan appends them to batch; an ordered scan queues them, with
+ * batch NULL.
+ *
+ * When we're called, the buffer containing 'page' is share-locked.
+ * spgReadItemPage must have already followed any leaf-head redirect, so the
+ * chain examined here contains no redirect.
+ */
+static void
+spgProcessLeafPage(IndexScanDesc scan, SpGistSearchItem *item, Page page,
+				   IndexScanBatch batch)
+{
+	BlockNumber blkno = ItemPointerGetBlockNumber(&item->heapPtr);
+	OffsetNumber offset = ItemPointerGetOffsetNumber(&item->heapPtr);
+	bool		isnull = SpGistPageStoresNulls(page);
+	OffsetNumber max = PageGetMaxOffsetNumber(page);
+
+	Assert(SpGistPageIsLeaf(page));
+
+	if (SpGistBlockIsRoot(blkno))
+	{
+		/* When root is a leaf, examine all its tuples */
+		for (offset = FirstOffsetNumber; offset <= max; offset++)
+			(void) spgTestLeafTuple(scan, item, page, offset,
+									isnull, true, batch);
+	}
+	else
+	{
+		/* Normal case: just examine the chain we arrived at */
+		while (offset != InvalidOffsetNumber)
+		{
+			Assert(offset >= FirstOffsetNumber && offset <= max);
+			offset = spgTestLeafTuple(scan, item, page, offset,
+									  isnull, false, batch);
+			/* spgReadItemPage already resolved any leaf-head redirect */
+			Assert(offset != SpGistRedirectOffsetNumber);
+		}
+	}
 }
 
 int64
 spggetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
-	SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+	int64		ntids = 0;
+	IndexScanBatch batch;
 
-	/* Copy want_itup to *so so we don't need to pass it around separately */
-	so->want_itup = false;
-
-	so->tbm = tbm;
-	so->ntids = 0;
-
-	spgWalk(scan->indexRelation, so, true, storeBitmap);
-
-	return so->ntids;
-}
-
-/* storeRes subroutine for gettuple case */
-static void
-storeGettuple(SpGistScanOpaque so, ItemPointer heapPtr,
-			  Datum leafValue, bool isnull,
-			  SpGistLeafTuple leafTuple, bool recheck,
-			  bool recheckDistances, double *nonNullDistances)
-{
-	Assert(so->nPtrs < MaxIndexTuplesPerPage);
-	so->heapPtrs[so->nPtrs] = *heapPtr;
-	so->recheck[so->nPtrs] = recheck;
-	so->recheckDistances[so->nPtrs] = recheckDistances;
-
-	if (so->numberOfOrderBys > 0)
+	/*
+	 * Drive spgWalk one leaf page at a time, draining each batch into the
+	 * bitmap and releasing it before fetching the next, so only one batch is
+	 * ever live (cf. hashgetbitmap).
+	 */
+	while ((batch = spgWalk(scan)) != NULL)
 	{
-		if (isnull || so->numberOfNonNullOrderBys <= 0)
-			so->distances[so->nPtrs] = NULL;
-		else
+		bool	   *recheck = SpGistBatchGetRecheck(scan, batch);
+
+		for (int i = batch->firstItem; i <= batch->lastItem; i++)
 		{
-			IndexOrderByDistance *distances = palloc_array(IndexOrderByDistance,
-														   so->numberOfOrderBys);
-			int			i;
-
-			for (i = 0; i < so->numberOfOrderBys; i++)
-			{
-				int			offset = so->nonNullOrderByOffsets[i];
-
-				if (offset >= 0)
-				{
-					/* Copy non-NULL distance value */
-					distances[i].value = nonNullDistances[offset];
-					distances[i].isnull = false;
-				}
-				else
-				{
-					/* Set distance's NULL flag. */
-					distances[i].value = 0.0;
-					distances[i].isnull = true;
-				}
-			}
-
-			so->distances[so->nPtrs] = distances;
+			tbm_add_tuples(tbm, &batch->items[i].tableTid, 1, recheck[i]);
+			ntids++;
 		}
-	}
 
-	if (so->want_itup)
-	{
 		/*
-		 * Reconstruct index data.  We have to copy the datum out of the temp
-		 * context anyway, so we may as well create the tuple here.
+		 * Return the batch to the single-slot bitmap cache, to be reused by
+		 * the next spgWalk call.
 		 */
-		Datum		leafDatums[INDEX_MAX_KEYS];
-		bool		leafIsnulls[INDEX_MAX_KEYS];
-
-		/* We only need to deform the old tuple if it has INCLUDE attributes */
-		if (so->state.leafTupDesc->natts > 1)
-			spgDeformLeafTuple(leafTuple, so->state.leafTupDesc,
-							   leafDatums, leafIsnulls, isnull);
-
-		leafDatums[spgKeyColumn] = leafValue;
-		leafIsnulls[spgKeyColumn] = isnull;
-
-		so->reconTups[so->nPtrs] = heap_form_tuple(so->reconTupDesc,
-												   leafDatums,
-												   leafIsnulls);
+		indexam_util_release_batch(scan, batch);
 	}
-	so->nPtrs++;
+
+	return ntids;
 }
 
-bool
-spggettuple(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+spggetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir)
 {
 	SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
 
+	/*
+	 * Note: Persistent traversal state lives in so->scanQueue, so we have no
+	 * use for priorbatch here
+	 */
 	if (dir != ForwardScanDirection)
 		elog(ERROR, "SP-GiST only supports forward scan direction");
 
-	/* Copy want_itup to *so so we don't need to pass it around separately */
-	so->want_itup = scan->xs_want_itup;
+	if (so->numberOfNonNullOrderBys > 0)
+		return spgWalkOrdered(scan);
 
-	for (;;)
+	return spgWalk(scan);
+}
+
+/*
+ * spgunguardbatch() -- Drop a batch's TID recycling interlock (buffer pin)
+ */
+void
+spgunguardbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	SpGistBatchData *sbatch = SpGistBatchGetData(scan, batch);
+
+	/* Should be called exactly once iff !batchImmediateUnguard */
+	Assert(!scan->batchImmediateUnguard);
+	Assert(batch->isGuarded);
+
+	ReleaseBuffer(sbatch->buf);
+}
+
+/*
+ * spggettransform() -- Set up the scan's per-tuple output for one batch item
+ *
+ * Applies the item's recheck flag, and either reconstructs the index-only heap
+ * tuple (xs_hitup) or reports the item's ORDER BY distances.
+ */
+void
+spggettransform(IndexScanDesc scan, IndexScanBatch batch, int item)
+{
+	SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+
+	Assert(item >= batch->firstItem && item <= batch->lastItem);
+
+	/* Ordered (virtual) batch: recheck flag and distances live in the item */
+	if (so->numberOfNonNullOrderBys > 0)
 	{
-		if (so->iPtr < so->nPtrs)
-		{
-			/* continuing to return reported tuples */
-			scan->xs_heaptid = so->heapPtrs[so->iPtr];
-			scan->xs_recheck = so->recheck[so->iPtr];
-			scan->xs_hitup = so->reconTups[so->iPtr];
+		SpGistBatchItem *bitem = SpGistBatchGetItem(scan, batch, item);
 
-			if (so->numberOfOrderBys > 0)
-				index_store_float8_orderby_distances(scan, so->orderByTypes,
-													 so->distances[so->iPtr],
-													 so->recheckDistances[so->iPtr]);
-			so->iPtr++;
-			return true;
-		}
+		Assert(!scan->xs_want_itup);
+		Assert(SpGistBatchGetData(scan, batch)->blkno == InvalidBlockNumber);
 
-		if (so->numberOfOrderBys > 0)
-		{
-			/* Must pfree distances to avoid memory leak */
-			int			i;
-
-			for (i = 0; i < so->nPtrs; i++)
-				if (so->distances[i])
-					pfree(so->distances[i]);
-		}
-
-		if (so->want_itup)
-		{
-			/* Must pfree reconstructed tuples to avoid memory leak */
-			int			i;
-
-			for (i = 0; i < so->nPtrs; i++)
-				pfree(so->reconTups[i]);
-		}
-		so->iPtr = so->nPtrs = 0;
-
-		spgWalk(scan->indexRelation, so, false, storeGettuple);
-
-		if (so->nPtrs == 0)
-			break;				/* must have completed scan */
+		scan->xs_recheck = bitem->recheck;
+		index_store_float8_orderby_distances(scan, so->orderByTypes,
+											 bitem->distances,
+											 bitem->recheckDistances);
+		return;
 	}
 
-	return false;
+	/* Non-ordered batch: recheck flags live in a bool array */
+	scan->xs_recheck = SpGistBatchGetRecheck(scan, batch)[item];
+
+	if (scan->xs_want_itup)
+	{
+		/* Index-only scan */
+		SpGistBatchData *sbatch = SpGistBatchGetData(scan, batch);
+		SpGistLeafTuple leafTuple;
+		Datum		leafDatums[INDEX_MAX_KEYS];
+		bool		leafIsnulls[INDEX_MAX_KEYS];
+		Datum		leafValue = (Datum) 0;
+		MemoryContext oldcxt;
+
+		Assert(scan->numberOfOrderBys == 0);
+		Assert(sbatch->blkno != InvalidBlockNumber);
+
+		/* Reconstruct the key value via leaf_consistent */
+		leafTuple = (SpGistLeafTuple) (batch->currTuples +
+									   batch->items[item].tupleOffset);
+		if (!sbatch->isNull)
+		{
+			spgLeafConsistentIn in;
+			spgLeafConsistentOut out;
+
+			oldcxt = MemoryContextSwitchTo(so->tempCxt);
+
+			in.scankeys = so->keyData;
+			in.orderbys = NULL;
+			in.nkeys = so->numberOfKeys;
+			in.norderbys = 0;
+			in.reconstructedValue = sbatch->reconValue;
+			in.traversalValue = NULL;
+			in.level = sbatch->level;
+			in.returnData = true;
+			in.leafDatum = SGLTDATUM(leafTuple, &so->state);
+
+			out.leafValue = (Datum) 0;
+			out.recheck = false;
+			out.distances = NULL;
+			out.recheckDistances = false;
+
+			(void) FunctionCall2Coll(&so->leafConsistentFn, so->indexCollation,
+									 PointerGetDatum(&in), PointerGetDatum(&out));
+			leafValue = out.leafValue;
+
+			MemoryContextSwitchTo(oldcxt);
+		}
+
+		/* free the previously returned reconstructed tuple, if any */
+		if (scan->xs_hitup)
+		{
+			pfree(scan->xs_hitup);
+			scan->xs_hitup = NULL;
+		}
+
+		/* build the returnable heap tuple in the scan-lifetime context */
+		oldcxt = MemoryContextSwitchTo(so->reconCxt);
+
+		/* Only deform the leaf tuple if it has INCLUDE attributes */
+		if (so->state.leafTupDesc->natts > 1)
+			spgDeformLeafTuple(leafTuple, so->state.leafTupDesc,
+							   leafDatums, leafIsnulls, sbatch->isNull);
+
+		leafDatums[spgKeyColumn] = leafValue;
+		leafIsnulls[spgKeyColumn] = sbatch->isNull;
+
+		scan->xs_hitup = heap_form_tuple(so->reconTupDesc,
+										 leafDatums, leafIsnulls);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		/* clean up after the leaf_consistent call */
+		MemoryContextReset(so->tempCxt);
+	}
+	else if (scan->numberOfOrderBys > 0)
+	{
+		/* all-NULL order-by arguments: report NULL distances */
+		index_store_float8_orderby_distances(scan, so->orderByTypes,
+											 NULL, false);
+	}
 }
 
 bool
 spgcanreturn(Relation index, int attno)
 {
-	SpGistCache *cache;
+	SpGistCache *cache = spgGetCache(index);
 
-	/* INCLUDE attributes can always be fetched for index-only scans */
+	/*
+	 * Forbid index-only scans for "long values" opclasses (e.g. text radix):
+	 * the key is reconstructed from the prefix accumulated during the descent
+	 * (see spggettransform) and is bounded only by the field-size limit, so
+	 * it won't fit the fixed per-batch reconstruction workspace
+	 * (spgrescan).
+	 */
+	if (cache->config.longValuesOK)
+		return false;
+
+	/*
+	 * else INCLUDE attributes can always be fetched for index-only scans.
+	 *
+	 * Note: We deliberately give up on INCLUDE-only index-only scans too,
+	 * even though an INCLUDE column comes straight from the bounded leaf
+	 * tuple and needs no key reconstruction: spggettransform reconstructs the
+	 * key unconditionally, and recheck of a key-column qual would need the
+	 * key value regardless.
+	 */
 	if (attno > 1)
 		return true;
 
 	/* We can do it if the opclass config function says so */
-	cache = spgGetCache(index);
-
 	return cache->config.canReturnData;
 }
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 47153b4b0..260a16490 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -88,11 +88,11 @@ spghandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = spgadjustmembers,
 		.ambeginscan = spgbeginscan,
 		.amrescan = spgrescan,
-		.amgettuple = spggettuple,
-		.amgetbatch = NULL,
-		.amunguardbatch = NULL,
+		.amgettuple = NULL,
+		.amgetbatch = spggetbatch,
+		.amunguardbatch = spgunguardbatch,
 		.amkillitemsbatch = NULL,
-		.amgettransform = NULL,
+		.amgettransform = spggettransform,
 		.amgetbitmap = spggetbitmap,
 		.amendscan = spgendscan,
 		.amposreset = NULL,
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index c461f8dc0..39b9bfcbd 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -625,7 +625,18 @@ spgvacuumpage(spgBulkDeleteState *bds, Buffer buffer)
 	BlockNumber blkno = BufferGetBlockNumber(buffer);
 	Page		page;
 
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+	/*
+	 * Get a full cleanup lock on this page.  We must get such a lock on every
+	 * leaf page over the course of the vacuum scan, whether or not it
+	 * actually contains any deletable tuples.
+	 *
+	 * Note: we could avoid this for inner pages, but not for the root page.
+	 * The root page can start out as a leaf page, but subsequently become an
+	 * inner page, even while a scan holds an interlock pin on that page (this
+	 * isn't possible in nbtree because root splits always create a new root
+	 * page, stored within a separate block number).
+	 */
+	LockBufferForCleanup(buffer);
 	page = BufferGetPage(buffer);
 
 	if (PageIsNew(page))
@@ -706,7 +717,7 @@ spgprocesspending(spgBulkDeleteState *bds)
 		blkno = ItemPointerGetBlockNumber(&pitem->tid);
 		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
 									RBM_NORMAL, bds->info->strategy);
-		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		LockBufferForCleanup(buffer);
 		page = BufferGetPage(buffer);
 
 		if (PageIsNew(page) || SpGistPageIsDeleted(page))
@@ -793,6 +804,16 @@ spgprocesspending(spgBulkDeleteState *bds)
 	spgClearPendingList(bds);
 }
 
+/*
+ * Chunk size for the main bulkdelete scan: the pending list is drained at each
+ * chunk boundary, the only point where the read stream is idle (see
+ * spgvacuumscan).  This trades read-ahead against memory -- a larger interval
+ * lets the stream prefetch further between resets, but lets the pending list
+ * grow larger before it is bounded.  4096 keeps prefetch effective while
+ * capping the list at a few thousand entries under heavy concurrent insertion.
+ */
+#define SPGIST_VACUUM_DRAIN_INTERVAL	4096
+
 /*
  * Perform a bulkdelete scan
  */
@@ -845,22 +866,29 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	 * delete some deletable tuples.  See more extensive comments about this
 	 * in btvacuumscan().
 	 */
+	num_pages = 0;				/* 0 forces an initial length check below */
 	for (;;)
 	{
-		/* Get the current relation length */
-		if (needLock)
-			LockRelationForExtension(index, ExclusiveLock);
-		num_pages = RelationGetNumberOfBlocks(index);
-		if (needLock)
-			UnlockRelationForExtension(index, ExclusiveLock);
-
-		/* Quit if we've scanned the whole relation */
+		/* Refresh the relation length once we have caught up to it */
 		if (p.current_blocknum >= num_pages)
-			break;
+		{
+			if (needLock)
+				LockRelationForExtension(index, ExclusiveLock);
+			num_pages = RelationGetNumberOfBlocks(index);
+			if (needLock)
+				UnlockRelationForExtension(index, ExclusiveLock);
 
-		p.last_exclusive = num_pages;
+			/* Quit if we've scanned the whole relation */
+			if (p.current_blocknum >= num_pages)
+				break;
+		}
+
+		/* Give the stream the next chunk; see SPGIST_VACUUM_DRAIN_INTERVAL */
+		if (num_pages - p.current_blocknum > SPGIST_VACUUM_DRAIN_INTERVAL)
+			p.last_exclusive = p.current_blocknum + SPGIST_VACUUM_DRAIN_INTERVAL;
+		else
+			p.last_exclusive = num_pages;
 
-		/* Iterate over pages, then loop back to recheck length */
 		while (true)
 		{
 			Buffer		buf;
@@ -874,18 +902,18 @@ spgvacuumscan(spgBulkDeleteState *bds)
 				break;
 
 			spgvacuumpage(bds, buf);
-
-			/* empty the pending-list after each page */
-			if (bds->pendingList != NULL)
-				spgprocesspending(bds);
 		}
 
 		/*
-		 * We have to reset the read stream to use it again. After returning
-		 * InvalidBuffer, the read stream API won't invoke our callback again
-		 * until the stream has been reset.
+		 * Reset the read stream for the next chunk (after returning
+		 * InvalidBuffer it won't call our callback again until reset).  Now
+		 * that it is idle, drain the pending list: spgprocesspending revisits
+		 * redirect-relocated tuples under the same cleanup-lock interlock.
 		 */
 		read_stream_reset(stream);
+
+		if (bds->pendingList != NULL)
+			spgprocesspending(bds);
 	}
 
 	read_stream_end(stream);
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index 55e8066a7..0eac1d1a1 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -771,7 +771,9 @@ spgRedoVacuumLeaf(XLogReaderState *record)
 	ptr += sizeof(OffsetNumber) * xldata->nChain;
 	chainDest = (OffsetNumber *) ptr;
 
-	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	/* We must take a cleanup lock here, just like spgvacuumpage() */
+	if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer)
+		== BLK_NEEDS_REDO)
 	{
 		page = BufferGetPage(buffer);
 
@@ -834,7 +836,9 @@ spgRedoVacuumRoot(XLogReaderState *record)
 
 	toDelete = xldata->offsets;
 
-	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	/* Take a cleanup lock, as in spgRedoVacuumLeaf() */
+	if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer)
+		== BLK_NEEDS_REDO)
 	{
 		page = BufferGetPage(buffer);
 
@@ -873,7 +877,9 @@ spgRedoVacuumRedirect(XLogReaderState *record)
 											locator);
 	}
 
-	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	/* Take a cleanup lock, as in spgRedoVacuumLeaf() */
+	if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer)
+		== BLK_NEEDS_REDO)
 	{
 		Page		page = BufferGetPage(buffer);
 		SpGistPageOpaque opaque = SpGistPageGetOpaque(page);
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 75c0704cc..490431f70 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -951,7 +951,7 @@ amgetbatch (IndexScanDesc scan,
    value in the batch and provides an <function>amgettransform</function>
    callback (see below), which the table AM invokes for each returned item to
    set <literal>scan-&gt;xs_recheck</literal> from that recorded state; GiST
-   works this way.
+   and SP-GiST work this way.
   </para>
 
   <para>
@@ -986,8 +986,8 @@ amgetbatch (IndexScanDesc scan,
       <literal>scan-&gt;xs_hitupdesc</literal>).  This gives the access method
       complete freedom to form that tuple from whatever it stored in
       <structfield>currTuples</structfield>, in whatever on-disk format suits
-      it.  GiST uses this path, because the representation it
-      stores differs from the indexed value and so could not satisfy the
+      it.  GiST and SP-GiST use this path, because the representation they
+      store differs from the indexed value and so could not satisfy the
       <literal>xs_itupdesc</literal> layout directly.
      </para>
     </listitem>
@@ -1031,10 +1031,10 @@ amunguardbatch (IndexScanDesc scan,
    is not even required to use the standard helper
    <function>indexam_util_unlock_batch</function> to manage it.  In practice,
    though, most or all index AMs will use that helper and hold the simplest
-   possible interlock: each guarded B-tree, hash, or GiST batch keeps a
-   single buffer pin on the one index page the batch came from.  (The
-   <quote>virtual</quote> nearest-neighbor batches that GiST uses for ordered
-   scans are not guarded, and hold no such pin.)  See <xref
+   possible interlock: each guarded B-tree, hash, GiST, or SP-GiST batch keeps
+   a single buffer pin on the one index page the batch came from.  (The
+   <quote>virtual</quote> nearest-neighbor batches that GiST and SP-GiST use
+   for ordered scans are not guarded, and hold no such pin.)  See <xref
     linkend="index-locking"/> for details on buffer pin management during
    index scans.  This function will be called at most once for each guarded
    batch; it is not called when the index AM has already unguarded the batch
@@ -1078,10 +1078,12 @@ amkillitemsbatch (IndexScanDesc scan,
    <function>amgetbatch</function> index AMs (those that don't can leave
    the field set to <literal>NULL</literal>), but doing so is recommended for
    performance, as it allows future scans to skip known-dead index entries.
-   All three core index access methods that currently support
-   <function>amgetbatch</function> (B-tree, hash, and GiST) implement
-   <literal>LP_DEAD</literal> marking, though third-party index access methods
-   are free to choose whether to implement this feature.  The table AM may
+   B-tree, hash, and GiST implement <literal>LP_DEAD</literal> marking; SP-GiST
+   is an example of a core <function>amgetbatch</function> access method that
+   leaves it unimplemented (it still holds the leaf-page interlock pin for
+   index-only scans, but never sets <literal>LP_DEAD</literal> bits), and
+   third-party index access methods are likewise free to choose whether to
+   implement this feature.  The table AM may
    call <function>tableam_util_scanpos_killitem</function> to mark dead items as
    the scan progresses.  If the batch contains any such dead items, the batch's
    <structfield>deadItems</structfield> array will have been sorted and
@@ -1188,9 +1190,9 @@ amgettransform (IndexScanDesc scan,
    property of the whole scan &mdash; or, for index-only scans, is the on-disk
    index tuple returned directly via <literal>scan-&gt;xs_itup</literal> &mdash;
    the field can be left <literal>NULL</literal>, as B-tree and hash do.  GiST
-   provides one because parts of its per-tuple output (the recheck flag, the
-   <literal>ORDER BY</literal> distances, and the reconstructed index-only
-   tuples) vary per matching item, as described above.
+   and SP-GiST provide one because parts of their per-tuple output (the recheck
+   flag, the <literal>ORDER BY</literal> distances, and the reconstructed
+   index-only tuples) vary per matching item, as described above.
   </para>
 
   <para>
@@ -1498,7 +1500,7 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    calls <function>amgettransform</function> as it returns each item to set
    <structfield>xs_orderbyvals</structfield> and
    <structfield>xs_recheckorderby</structfield> from that recorded state.  GiST
-   uses this for nearest-neighbor scans.  As with
+   and SP-GiST use this for nearest-neighbor scans.  As with
    <literal>scan-&gt;xs_recheck</literal>, these values cannot be set directly as
    items are returned.
   </para>
@@ -1506,10 +1508,10 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
   <para>
    Scans that use ordering operators are never planned as index-only scans.
    Because an ordered scan can collect matching items from many index leaf
-   pages without retaining a buffer pin on any of them (GiST's
-   <quote>virtual</quote> nearest-neighbor batches work this way), it has no
-   pin to serve as the interlock against concurrent TID recycling that an
-   index-only scan depends on (see <xref linkend="index-locking"/>).  The
+   pages without retaining a buffer pin on any of them (the
+   <quote>virtual</quote> nearest-neighbor batches of GiST and SP-GiST work
+   this way), it has no pin to serve as the interlock against concurrent TID
+   recycling that an index-only scan depends on (see <xref linkend="index-locking"/>).  The
    planner therefore costs and executes such scans as plain index scans, which
    always fetch and recheck the heap tuple.
   </para>
diff --git a/doc/src/sgml/spgist.sgml b/doc/src/sgml/spgist.sgml
index 6af93719b..0011a458b 100644
--- a/doc/src/sgml/spgist.sgml
+++ b/doc/src/sgml/spgist.sgml
@@ -336,7 +336,12 @@ typedef struct spgConfigOut
       <structfield>longValuesOK</structfield> should be set true only when the
       <structfield>attType</structfield> is of variable length and the operator
       class is capable of segmenting long values by repeated suffixing
-      (see <xref linkend="spgist-limits"/>).
+      (see <xref linkend="spgist-limits"/>).  Setting it true disables
+      index-only scans for the operator class, even if
+      <structfield>canReturnData</structfield> is also set: reconstructing the
+      indexed value can then require materializing an arbitrarily large prefix
+      (a long value is stored by spreading its prefix across many inner tuples),
+      so such queries are executed as regular index scans instead.
      </para>
 
      <para>
diff --git a/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out b/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
index 1ee65ede2..ae0ef9933 100644
--- a/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
+++ b/src/test/modules/spgist_name_ops/expected/spgist_name_ops.out
@@ -41,7 +41,7 @@ select * from t
 ---------------------------------------------------------------------------------------------------
  Sort
    Sort Key: f1
-   ->  Index Only Scan using t_f1_f2_f3_idx on t
+   ->  Index Scan using t_f1_f2_f3_idx on t
          Index Cond: ((f1 > 'binary_upgrade_set_n'::name) AND (f1 < 'binary_upgrade_set_p'::name))
 (4 rows)
 
@@ -90,7 +90,7 @@ select * from t
 ---------------------------------------------------------------------------------------------------
  Sort
    Sort Key: f1
-   ->  Index Only Scan using t_f1_f2_f3_idx on t
+   ->  Index Scan using t_f1_f2_f3_idx on t
          Index Cond: ((f1 > 'binary_upgrade_set_n'::name) AND (f1 < 'binary_upgrade_set_p'::name))
 (4 rows)
 
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c6..2d26f7f99 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -101,7 +101,7 @@ select prop,
  nulls_last         | t     | f    | f    | f            | f           | f   | f
  orderable          | t     | f    | f    | f            | f           | f   | f
  distance_orderable | f     | f    | t    | f            | t           | f   | f
- returnable         | t     | f    | f    | t            | t           | f   | f
+ returnable         | t     | f    | f    | f            | t           | f   | f
  search_array       | t     | f    | f    | f            | f           | f   | f
  search_nulls       | t     | f    | t    | t            | t           | f   | t
  bogus              |       |      |      |              |             |     | 
diff --git a/src/test/regress/expected/create_index_spgist.out b/src/test/regress/expected/create_index_spgist.out
index ddffca2e7..da730dd47 100644
--- a/src/test/regress/expected/create_index_spgist.out
+++ b/src/test/regress/expected/create_index_spgist.out
@@ -602,10 +602,10 @@ FROM (VALUES (point '1,2'), (NULL), ('1234,5678')) pts(pt);
 
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM radix_text_tbl WHERE t = 'P0123456789abcdef';
-                         QUERY PLAN                         
-------------------------------------------------------------
+                      QUERY PLAN                       
+-------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t = 'P0123456789abcdef'::text)
 (3 rows)
 
@@ -617,10 +617,10 @@ SELECT count(*) FROM radix_text_tbl WHERE t = 'P0123456789abcdef';
 
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM radix_text_tbl WHERE t = 'P0123456789abcde';
-                         QUERY PLAN                         
-------------------------------------------------------------
+                      QUERY PLAN                       
+-------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t = 'P0123456789abcde'::text)
 (3 rows)
 
@@ -632,10 +632,10 @@ SELECT count(*) FROM radix_text_tbl WHERE t = 'P0123456789abcde';
 
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM radix_text_tbl WHERE t = 'P0123456789abcdefF';
-                         QUERY PLAN                         
-------------------------------------------------------------
+                      QUERY PLAN                       
+-------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t = 'P0123456789abcdefF'::text)
 (3 rows)
 
@@ -650,7 +650,7 @@ SELECT count(*) FROM radix_text_tbl WHERE t <    'Aztec
                               QUERY PLAN                              
 ----------------------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t < 'Aztec                         Ct  '::text)
 (3 rows)
 
@@ -665,7 +665,7 @@ SELECT count(*) FROM radix_text_tbl WHERE t ~<~  'Aztec
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t ~<~ 'Aztec                         Ct  '::text)
 (3 rows)
 
@@ -680,7 +680,7 @@ SELECT count(*) FROM radix_text_tbl WHERE t <=   'Aztec
                               QUERY PLAN                               
 -----------------------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t <= 'Aztec                         Ct  '::text)
 (3 rows)
 
@@ -695,7 +695,7 @@ SELECT count(*) FROM radix_text_tbl WHERE t ~<=~ 'Aztec
                                QUERY PLAN                                
 -------------------------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t ~<=~ 'Aztec                         Ct  '::text)
 (3 rows)
 
@@ -710,7 +710,7 @@ SELECT count(*) FROM radix_text_tbl WHERE t =    'Aztec
                               QUERY PLAN                              
 ----------------------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t = 'Aztec                         Ct  '::text)
 (3 rows)
 
@@ -725,7 +725,7 @@ SELECT count(*) FROM radix_text_tbl WHERE t =    'Worth
                               QUERY PLAN                              
 ----------------------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t = 'Worth                         St  '::text)
 (3 rows)
 
@@ -740,7 +740,7 @@ SELECT count(*) FROM radix_text_tbl WHERE t >=   'Worth
                               QUERY PLAN                               
 -----------------------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t >= 'Worth                         St  '::text)
 (3 rows)
 
@@ -755,7 +755,7 @@ SELECT count(*) FROM radix_text_tbl WHERE t ~>=~ 'Worth
                                QUERY PLAN                                
 -------------------------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t ~>=~ 'Worth                         St  '::text)
 (3 rows)
 
@@ -770,7 +770,7 @@ SELECT count(*) FROM radix_text_tbl WHERE t >    'Worth
                               QUERY PLAN                              
 ----------------------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t > 'Worth                         St  '::text)
 (3 rows)
 
@@ -785,7 +785,7 @@ SELECT count(*) FROM radix_text_tbl WHERE t ~>~  'Worth
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t ~>~ 'Worth                         St  '::text)
 (3 rows)
 
@@ -797,10 +797,10 @@ SELECT count(*) FROM radix_text_tbl WHERE t ~>~  'Worth
 
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM radix_text_tbl WHERE t ^@	 'Worth';
-                         QUERY PLAN                         
-------------------------------------------------------------
+                      QUERY PLAN                       
+-------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t ^@ 'Worth'::text)
 (3 rows)
 
@@ -812,10 +812,10 @@ SELECT count(*) FROM radix_text_tbl WHERE t ^@	 'Worth';
 
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM radix_text_tbl WHERE starts_with(t, 'Worth');
-                         QUERY PLAN                         
-------------------------------------------------------------
+                      QUERY PLAN                       
+-------------------------------------------------------
  Aggregate
-   ->  Index Only Scan using sp_radix_ind on radix_text_tbl
+   ->  Index Scan using sp_radix_ind on radix_text_tbl
          Index Cond: (t ^@ 'Worth'::text)
          Filter: starts_with(t, 'Worth'::text)
 (4 rows)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d3ab27607..cbfcde303 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2971,6 +2971,8 @@ SortSupportData
 SortTuple
 SortTupleComparator
 SortedPoint
+SpGistBatchData
+SpGistBatchItem
 SpGistBuildState
 SpGistCache
 SpGistDeadTuple
@@ -4345,7 +4347,6 @@ standard_qp_extra
 stemmer_module
 stmtCacheEntry
 storeInfo
-storeRes_func
 stream_stop_callback
 string
 substitute_actual_parameters_context
-- 
2.53.0



  [application/octet-stream] v28-0007-heapam-Optimize-pin-transfers-during-index-scans.patch (6.2K, 9-v28-0007-heapam-Optimize-pin-transfers-during-index-scans.patch)
From e409526a007dff663b65f6deeeeec09190dbb6f3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sun, 22 Mar 2026 02:22:06 -0400
Subject: [PATCH v28 07/11] heapam: Optimize pin transfers during index scans.

Add an xs_lastinblock flag to IndexScanHeapData, to track whether the
current item's heap block differs from the next item's heap block
("next" in terms of the current scan direction).  When these adjacent
blocks differ, heapam_index_heap_fetch will transfer its buffer pin to
its table slot instead of incrementing the pin count.  This avoids an
immediate IncrBufferRefCount call.  It also avoids a ReleaseBuffer call
later on, during the next call to heapam_index_heap_fetch (when the scan
has to return the aforementioned "next" item).
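
As a rough illustration of the xs_lastinblock test described above, here
is a minimal standalone sketch.  It is not the patch's code: SketchBatch
and sketch_lastinblock are hypothetical names, and the batch is reduced
to an array of heap block numbers.  The flag is set only when a next
item exists in the current batch (in the scan direction) and that item
lives on a different heap block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a batch: one heap block number per item */
typedef struct SketchBatch
{
	int			firstItem;		/* first valid index in blocks[] */
	int			lastItem;		/* last valid index in blocks[] */
	const uint32_t *blocks;		/* heap block number of each item's TID */
} SketchBatch;

/*
 * Return true when the item after 'item' (in the scan direction) exists in
 * this batch and is on a different heap block.  At either end of the batch
 * the answer is false, since the following batch is never consulted.
 */
bool
sketch_lastinblock(const SketchBatch *batch, int item, bool forward)
{
	int			nextItem;
	bool		hasNext;

	assert(item >= batch->firstItem && item <= batch->lastItem);

	if (forward)
	{
		nextItem = item + 1;
		hasNext = (nextItem <= batch->lastItem);
	}
	else
	{
		nextItem = item - 1;
		hasNext = (nextItem >= batch->firstItem);
	}

	/* short-circuit keeps us from reading blocks[] out of bounds */
	return hasNext && batch->blocks[nextItem] != batch->blocks[item];
}
```

A false result at the batch boundary only means that the pin transfer is
skipped for that item; it does not affect correctness.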

Also add an explicit ExecClearTuple to the block-switch path in
heapam_index_heap_fetch to release the pin held by the slot (which is
often the pin transferred to the slot during the previous call).  This
fixes a performance problem where GetPrivateRefCountEntrySlow was called
more often than necessary.  The underlying issue is that the slot keeps
holding its old pin even after we decide to release the buffer and move
on: when ExecStoreBufferHeapTuple later releases that old pin, the
lookup misses the backend-local refcount cache (because we have just
pinned and locked the new buffer).

Author: Peter Geoghegan <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wz=D4Lru9BkvqaRnFRPDaZbfTOdWcxw13zyG6GVFTtz_vw@mail.gmail.com
---
 src/include/access/heapam.h                |  3 +
 src/backend/access/heap/heapam_indexscan.c | 73 +++++++++++++++++++++-
 2 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 5364ce27b..71b6420c9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -132,6 +132,9 @@ typedef struct IndexScanHeapData
 
 	bool		xs_readonly;	/* scan is read-only? */
 
+	/* Plain index scan xs_lastinblock optimization */
+	bool		xs_lastinblock; /* last TID on this block in current batch? */
+
 	uint16		xs_blkswitch_count; /* number of heap blocks fetched */
 
 	/* Per-tuple context for padding "name" columns during index-only scans */
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index 5f04041df..e9b1ea851 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -154,6 +154,9 @@ heapam_index_scan_begin(IndexScanDesc scan, uint32 flags)
 	/* Remember if scan is read-only */
 	hscan->xs_readonly = (flags & SO_HINT_REL_READ_ONLY) != 0;
 
+	/* xs_lastinblock optimization state */
+	Assert(!hscan->xs_lastinblock);
+
 	/* Resolve which xs_getnext_slot implementation to use for this scan */
 	if (scan->indexRelation->rd_indam->amgetbatch != NULL)
 	{
@@ -788,6 +791,15 @@ heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan,
 		 */
 		hscan->xs_blkswitch_count++;
 
+		/*
+		 * Drop the xs_blk pin independently held by the slot (if any) now,
+		 * before calling ReleaseBuffer.  This avoids expensive calls to
+		 * GetPrivateRefCountEntrySlow caused by ExecStoreBufferHeapTuple
+		 * failing to hit the backend's cache for the release of the old pin.
+		 */
+		if (!index_only)
+			ExecClearTuple(slot);
+
 		if (BufferIsValid(hscan->xs_cbuf))
 			ReleaseBuffer(hscan->xs_cbuf);
 
@@ -826,7 +838,36 @@ heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan,
 			*heap_continue = !scan->MVCCScan;
 
 			slot->tts_tableOid = RelationGetRelid(rel);
-			ExecStoreBufferHeapTuple(heapTuple, slot, hscan->xs_cbuf);
+
+			/*
+			 * If xs_lastinblock indicates that `tid` is the last TID on the
+			 * current heap block, transfer our buffer pin to the slot rather
+			 * than having the slot increment the pin count.  This saves a
+			 * pair of IncrBufferRefCount and ReleaseBuffer calls, since the
+			 * next call here would just release the same pin on xs_cbuf
+			 * anyway.  (Actually, this is only true if you assume that the
+			 * scan will continue in the current direction, but it generally
+			 * does.  An incorrect prediction costs us little.)
+			 *
+			 * We can only safely do this when heap_continue is false, since
+			 * otherwise the caller will need xs_cbuf to remain valid for the
+			 * next call.
+			 */
+			if (hscan->xs_lastinblock && !*heap_continue)
+			{
+				ExecStorePinnedBufferHeapTuple(heapTuple, slot, hscan->xs_cbuf);
+				hscan->xs_cbuf = InvalidBuffer;
+				hscan->xs_blk = InvalidBlockNumber;
+
+				/*
+				 * Note: the pin now owned by the slot is expected to be
+				 * released on the next call here, via an explicit
+				 * ExecClearTuple.  This avoids churn in the backend's private
+				 * refcount cache.
+				 */
+			}
+			else
+				ExecStoreBufferHeapTuple(heapTuple, slot, hscan->xs_cbuf);
 		}
 		else
 		{
@@ -975,10 +1016,40 @@ heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexScanHeapData *hscan,
 
 	if (all_visible == NULL)
 	{
+		int			nextItem;
+		bool		hasNext;
+
 		/*
 		 * Plain index scan.
+		 *
+		 * Set xs_lastinblock to indicate whether the next item in the current
+		 * scan direction is on a different heap block to the current item.
+		 * heapam_index_heap_fetch will apply this information about
+		 * scanPos.item's tableTID before we return to the core executor.
+		 *
+		 * Note: We don't set this for index-only scans because it doesn't seem
+		 * worth the trouble of reasoning about all-visible items.
+		 *
+		 * Note: We deliberately don't consider the batch after scanBatch,
+		 * because doing so would add complexity for little benefit.  It's
+		 * okay if xs_lastinblock is spuriously set to false.
 		 */
 		Assert(!scan->xs_want_itup);
+		if (ScanDirectionIsForward(direction))
+		{
+			nextItem = scanPos->item + 1;
+			hasNext = (nextItem <= scanBatch->lastItem);
+		}
+		else
+		{
+			nextItem = scanPos->item - 1;
+			hasNext = (nextItem >= scanBatch->firstItem);
+		}
+
+		hscan->xs_lastinblock = hasNext &&
+			ItemPointerGetBlockNumber(&scanBatch->items[nextItem].tableTid) !=
+			ItemPointerGetBlockNumber(&scan->xs_heaptid);
+
 		return &scan->xs_heaptid;
 	}
 
-- 
2.53.0



  [application/octet-stream] v28-0004-Adopt-amgetbatch-interface-in-hash-index-AM.patch (47.4K, 10-v28-0004-Adopt-amgetbatch-interface-in-hash-index-AM.patch)
From 2e74accffed300a8f544ca65c51547a5f0372641 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Tue, 25 Nov 2025 18:03:15 -0500
Subject: [PATCH v28 04/11] Adopt amgetbatch interface in hash index AM.

Replace hashgettuple with hashgetbatch, a function that implements the
new amgetbatch interface added by commit FIXME.  Plain index scans of
hash indexes now return matching items in batches consisting of all of
the matches from a given bucket page or overflow page.  This gives the
table AM the ability to perform optimizations like index prefetching
during hash index scans.

The amgetbatch interface requires that index AMs take the same
standardized approach to pin management for pins that are used to
prevent unsafe concurrent TID recycling by VACUUM (that way prefetching
can hold open multiple batches without it affecting the read stream).
Note, however, that hash still holds on to pins needed for its own
internal purposes (e.g., it'll still hold onto a pin during a bucket
split).

hashkillitemsbatch (the hash implementation of the new amkillitemsbatch
interface) performs LP_DEAD marking of dead index entries, while
following slightly different rules to the old approach.  It relies on
comparing the batch's saved LSN against the current page LSN to detect
concurrent page modifications, which in turn requires fake LSN support
for unlogged relations.  Preparatory commit e5836f7b added that support
to the hash index AM.
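
The LSN check described above can be pictured with the following
standalone sketch.  It is illustrative only, not the code in
hashkillitemsbatch: SketchPage, SketchLSN, and sketch_kill_items are
hypothetical names, and giving up on all marking when the LSNs differ is
an assumption made here for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_ITEMS 8

typedef uint64_t SketchLSN;		/* hypothetical stand-in for a page LSN */

typedef struct SketchPage
{
	SketchLSN	lsn;			/* advanced on every page modification */
	bool		dead[SKETCH_MAX_ITEMS]; /* stand-in for LP_DEAD bits */
} SketchPage;

/*
 * Mark the given items dead, but only if the page LSN still matches the LSN
 * saved when the batch was read.  A mismatch means the page was modified
 * concurrently, so the saved offsets can no longer be trusted (assumption:
 * in that case we mark nothing at all).  Returns the number of items marked.
 */
int
sketch_kill_items(SketchPage *page, SketchLSN savedLsn,
				  const int *deadItems, int ndead)
{
	if (page->lsn != savedLsn)
		return 0;

	for (int i = 0; i < ndead; i++)
	{
		assert(deadItems[i] >= 0 && deadItems[i] < SKETCH_MAX_ITEMS);
		page->dead[deadItems[i]] = true;
	}

	return ndead;
}
```

This also shows why unlogged relations need fake LSNs: without an LSN
that advances on each modification, the comparison could not detect a
concurrent change to the page.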

Author: Peter Geoghegan <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com
---
 src/include/access/hash.h            |  81 ++-----
 src/backend/access/hash/README       |  31 +--
 src/backend/access/hash/hash.c       | 210 ++++++++++------
 src/backend/access/hash/hash_xlog.c  |   4 +-
 src/backend/access/hash/hashpage.c   |  21 +-
 src/backend/access/hash/hashsearch.c | 345 ++++++++++++---------------
 src/backend/access/hash/hashutil.c   | 129 +---------
 doc/src/sgml/indexam.sgml            |  30 ++-
 src/tools/pgindent/typedefs.list     |   3 +-
 9 files changed, 344 insertions(+), 510 deletions(-)
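
Reviewer note (not part of the commit message): the LSN check that gates
LP_DEAD marking can be illustrated with a toy model.  This is a minimal
sketch under stated assumptions: ToyPage, ToyBatch and the toy_* functions
are hypothetical and do not exist in PostgreSQL.  It assumes that every page
modification advances the page LSN, which is presumably why unlogged
relations need fake LSNs for this scheme to work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ITEMS 8

/* Hypothetical stand-in for an index page: an LSN plus per-item dead flags */
typedef struct ToyPage
{
	uint64_t	lsn;
	bool		dead[TOY_MAX_ITEMS];
} ToyPage;

/* Hypothetical stand-in for a batch: LSN saved at read time, dead items */
typedef struct ToyBatch
{
	uint64_t	lsn;
	int			numDead;
	int			deadItems[TOY_MAX_ITEMS];	/* offsets into ToyPage.dead[] */
} ToyBatch;

/* Models any concurrent modification: the page LSN advances */
static void
toy_modify_page(ToyPage *page)
{
	page->lsn++;
}

/* Models reading a page into a batch: remember the page LSN */
static void
toy_read_batch(const ToyPage *page, ToyBatch *batch)
{
	batch->lsn = page->lsn;
	batch->numDead = 0;
}

/*
 * Mark the batch's dead items on the page, but only when the page LSN still
 * matches the LSN saved in the batch.  Returns the number of items newly
 * marked dead (0 when the page changed and hinting was abandoned).
 */
static int
toy_kill_items(ToyPage *page, const ToyBatch *batch)
{
	int			killed = 0;

	assert(batch->lsn <= page->lsn);
	if (batch->lsn != page->lsn)
		return 0;				/* modified since read, give up on hinting */

	for (int i = 0; i < batch->numDead; i++)
	{
		int			off = batch->deadItems[i];

		assert(off >= 0 && off < TOY_MAX_ITEMS);
		if (!page->dead[off])
		{
			page->dead[off] = true;
			killed++;
		}
	}

	return killed;
}
```

Giving up whenever the LSNs differ is conservative: some hints are lost, but
the saved offsets are never applied to a page whose contents may have moved.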

diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index a8702f0e5..fe6048422 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -18,6 +18,7 @@
 #define HASH_H
 
 #include "access/amapi.h"
+#include "access/indexbatch.h"
 #include "access/itup.h"
 #include "access/sdir.h"
 #include "catalog/pg_am_d.h"
@@ -100,57 +101,18 @@ typedef HashPageOpaqueData *HashPageOpaque;
  */
 #define HASHO_PAGE_ID		0xFF80
 
-typedef struct HashScanPosItem	/* what we remember about each match */
+/* Per-batch data private to the hash index AM */
+typedef struct HashBatchData
 {
-	ItemPointerData heapTid;	/* TID of referenced heap item */
-	OffsetNumber indexOffset;	/* index item's location within page */
-} HashScanPosItem;
+	Buffer		buf;			/* index page's buffer pin */
+	BlockNumber currPage;		/* index page's block number */
+	BlockNumber prevPage;		/* index page's left sibling */
+	BlockNumber nextPage;		/* index page's right sibling */
+} HashBatchData;
 
-typedef struct HashScanPosData
-{
-	Buffer		buf;			/* if valid, the buffer is pinned */
-	BlockNumber currPage;		/* current hash index page */
-	BlockNumber nextPage;		/* next overflow page */
-	BlockNumber prevPage;		/* prev overflow or bucket page */
-
-	/*
-	 * The items array is always ordered in index order (ie, increasing
-	 * indexoffset).  When scanning backwards it is convenient to fill the
-	 * array back-to-front, so we start at the last slot and fill downwards.
-	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
-	 * itemIndex is a cursor showing which entry was last returned to caller.
-	 */
-	int			firstItem;		/* first valid index in items[] */
-	int			lastItem;		/* last valid index in items[] */
-	int			itemIndex;		/* current index in items[] */
-
-	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
-} HashScanPosData;
-
-#define HashScanPosIsPinned(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BufferIsValid((scanpos).buf) \
-)
-
-#define HashScanPosIsValid(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BlockNumberIsValid((scanpos).currPage) \
-)
-
-#define HashScanPosInvalidate(scanpos) \
-	do { \
-		(scanpos).buf = InvalidBuffer; \
-		(scanpos).currPage = InvalidBlockNumber; \
-		(scanpos).nextPage = InvalidBlockNumber; \
-		(scanpos).prevPage = InvalidBlockNumber; \
-		(scanpos).firstItem = 0; \
-		(scanpos).lastItem = 0; \
-		(scanpos).itemIndex = 0; \
-	} while (0)
+/* Access the hash-private per-batch data from an IndexScanBatch pointer */
+#define HashBatchGetData(scan, batch) \
+	index_scan_batch_index_opaque_static(scan, batch, HashBatchData)
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -178,15 +140,6 @@ typedef struct HashScanOpaqueData
 	 * referred only when hashso_buc_populated is true.
 	 */
 	bool		hashso_buc_split;
-	/* info about killed items if any (killedItems is NULL if never used) */
-	int		   *killedItems;	/* currPos.items indexes of killed items */
-	int			numKilled;		/* number of currently stored items */
-
-	/*
-	 * Identify all the matching items on a page and save them in
-	 * HashScanPosData
-	 */
-	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -368,11 +321,15 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
 					   IndexUniqueCheck checkUnique,
 					   bool indexUnchanged,
 					   struct IndexInfo *indexInfo);
-extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch hashgetbatch(IndexScanDesc scan,
+								   IndexScanBatch priorbatch,
+								   ScanDirection dir);
 extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
+extern void hashunguardbatch(IndexScanDesc scan, IndexScanBatch batch);
+extern void hashkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch);
 extern void hashendscan(IndexScanDesc scan);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
@@ -445,8 +402,9 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 							   uint32 lowmask);
 
 /* hashsearch.c */
-extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _hash_next(IndexScanDesc scan, ScanDirection dir,
+								 IndexScanBatch priorbatch);
+extern IndexScanBatch _hash_first(IndexScanDesc scan, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
@@ -476,7 +434,6 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
 extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
 extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 												 uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan);
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index fc9031117..972bb666b 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -255,28 +255,29 @@ The reader algorithm is:
 		retake the buffer content lock on new bucket
 		arrange to scan the old bucket normally and the new bucket for
          tuples which are not moved-by-split
--- then, per read request:
+-- then, per batch (page) request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
 	the pin on the primary bucket throughout the scan)
-	save all the matching tuples from current index page into an items array
-	release pin and content lock (but if it is primary bucket page retain
-	its pin till the end of the scan)
-	get tuple from an item array
+	save all the matching tuples from current index page into a batch
+	release content lock on current page, then return batch to table AM
+	(table AM will drop batch's buffer pin, though primary bucket page pin
+	is kept until the end of the scan)
 -- at scan shutdown:
-	release all pins still held
+	release scan-owned pins (e.g., primary bucket page pin) as needed
 
 Holding the buffer pin on the primary bucket page for the whole scan prevents
-the reader's current-tuple pointer from being invalidated by splits or
-compactions.  (Of course, other buckets can still be split or compacted.)
+the bucket from being reorganized by splits or compactions while the scan is
+in progress.  (Of course, other buckets can still be split or compacted.)
 
-To minimize lock/unlock traffic, hash index scan always searches the entire
-hash page to identify all the matching items at once, copying their heap tuple
-IDs into backend-local storage. The heap tuple IDs are then processed while not
-holding any page lock within the index thereby, allowing concurrent insertion
-to happen on the same index page without any requirement of re-finding the
-current scan position for the reader. We do continue to hold a pin on the
-bucket page, to protect against concurrent deletions and bucket split.
+To minimize lock/unlock traffic, hash index scans always search the entire
+hash page to identify all the matching items at once, returning them in
+batches to the table AM.  The table AM processes batches while no page lock
+is held within the index, allowing concurrent insertion to happen on the
+same index page without any requirement of re-finding the current scan
+position for the reader.  The table AM controls when batch buffer pins are
+dropped.  We do continue to hold a pin on the primary bucket page, to
+protect against concurrent bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 540f2bcd4..76e3193d9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -114,10 +114,10 @@ hashhandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = hashadjustmembers,
 		.ambeginscan = hashbeginscan,
 		.amrescan = hashrescan,
-		.amgettuple = hashgettuple,
-		.amgetbatch = NULL,
-		.amunguardbatch = NULL,
-		.amkillitemsbatch = NULL,
+		.amgettuple = NULL,
+		.amgetbatch = hashgetbatch,
+		.amunguardbatch = hashunguardbatch,
+		.amkillitemsbatch = hashkillitemsbatch,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
 		.amposreset = NULL,
@@ -300,53 +300,28 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
 
 
 /*
- *	hashgettuple() -- Get the next tuple in the scan.
+ *	hashgetbatch() -- Get the first or next batch of tuples in the scan
  */
-bool
-hashgettuple(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+hashgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
-	scan->xs_recheck = true;
+	Assert(scan->xs_recheck);
 
-	/*
-	 * If we've already initialized this scan, we can just advance it in the
-	 * appropriate direction.  If we haven't done so yet, we call a routine to
-	 * get the first item in the scan.
-	 */
-	if (!HashScanPosIsValid(so->currPos))
-		res = _hash_first(scan, dir);
-	else
+	if (priorbatch == NULL)
 	{
-		/*
-		 * Check to see if we should kill the previously-fetched tuple.
-		 */
-		if (scan->kill_prior_tuple)
-		{
-			/*
-			 * Yes, so remember it for later. (We'll deal with all such tuples
-			 * at once right after leaving the index page or at end of scan.)
-			 * In case if caller reverses the indexscan direction it is quite
-			 * possible that the same item might get entered multiple times.
-			 * But, we don't detect that; instead, we just forget any excess
-			 * entries.
-			 */
-			if (so->killedItems == NULL)
-				so->killedItems = palloc_array(int, MaxIndexTuplesPerPage);
+		Relation	rel = scan->indexRelation;
 
-			if (so->numKilled < MaxIndexTuplesPerPage)
-				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
-		}
+		_hash_dropscanbuf(rel, so);
 
-		/*
-		 * Now continue the scan.
-		 */
-		res = _hash_next(scan, dir);
+		/* Initialize the scan, and return first batch of matching items */
+		return _hash_first(scan, dir);
 	}
 
-	return res;
+	/* Return batch positioned after caller's batch (in direction 'dir') */
+	return _hash_next(scan, dir, priorbatch);
 }
 
 
@@ -356,26 +331,26 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 int64
 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	bool		res;
+	IndexScanBatch batch;
 	int64		ntids = 0;
-	HashScanPosItem *currItem;
 
-	res = _hash_first(scan, ForwardScanDirection);
+	batch = _hash_first(scan, ForwardScanDirection);
 
-	while (res)
+	while (batch != NULL)
 	{
-		currItem = &so->currPos.items[so->currPos.itemIndex];
+		for (int itemIndex = batch->firstItem;
+			 itemIndex <= batch->lastItem;
+			 itemIndex++)
+		{
+			tbm_add_tuples(tbm, &batch->items[itemIndex].tableTid, 1, true);
+			ntids++;
+		}
 
 		/*
-		 * _hash_first and _hash_next handle eliminate dead index entries
-		 * whenever scan->ignore_killed_tuples is true.  Therefore, there's
-		 * nothing to do here except add the results to the TIDBitmap.
+		 * _hash_next releases the prior batch for bitmap callers before
+		 * allocating the next one, so only one batch is ever used at a time
 		 */
-		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
-		ntids++;
-
-		res = _hash_next(scan, ForwardScanDirection);
+		batch = _hash_next(scan, ForwardScanDirection, batch);
 	}
 
 	return ntids;
@@ -397,17 +372,17 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc_object(HashScanOpaqueData);
-	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
 
-	so->killedItems = NULL;
-	so->numKilled = 0;
-
 	scan->opaque = so;
+	scan->xs_recheck = true;
+	scan->maxitemsbatch = MaxIndexTuplesPerPage;
+	scan->batch_index_opaque_static = MAXALIGN(sizeof(HashBatchData));
+	scan->batch_tuples_workspace = 0;
 
 	return scan;
 }
@@ -422,18 +397,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	if (HashScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_hash_kill_items(scan);
-	}
-
 	_hash_dropscanbuf(rel, so);
 
-	/* set position invalid (this will cause _hash_first call) */
-	HashScanPosInvalidate(so->currPos);
-
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
@@ -442,6 +407,108 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	so->hashso_buc_split = false;
 }
 
+/*
+ *	hashunguardbatch() -- Drop batch's TID recycling interlock (buffer pin)
+ *
+ * Called by the table AM when it's safe to drop the buffer pin held to
+ * prevent concurrent TID recycling by VACUUM.
+ */
+void
+hashunguardbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	HashBatchData *hashbatch = HashBatchGetData(scan, batch);
+
+	/* Should be called exactly once iff !batchImmediateUnguard */
+	Assert(!scan->batchImmediateUnguard);
+	Assert(batch->isGuarded);
+
+	ReleaseBuffer(hashbatch->buf);
+}
+
+/*
+ *	hashkillitemsbatch() -- Mark dead items' index tuples LP_DEAD
+ */
+void
+hashkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	Relation	rel = scan->indexRelation;
+	HashBatchData *hashbatch = HashBatchGetData(scan, batch);
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	bool		killedsomething = false;
+	XLogRecPtr	latestlsn;
+
+	Assert(batch->numDead > 0);
+
+	buf = _hash_getbuf(rel, hashbatch->currPage, HASH_READ,
+					   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+
+	latestlsn = BufferGetLSNAtomic(buf);
+	Assert(batch->lsn <= latestlsn);
+	if (batch->lsn != latestlsn)
+	{
+		/* Modified, give up on hinting */
+		_hash_relbuf(rel, buf);
+		return;
+	}
+
+	page = BufferGetPage(buf);
+	opaque = HashPageGetOpaque(page);
+
+	/* Iterate through batch->deadItems[] in index page order */
+	for (int i = 0; i < batch->numDead; i++)
+	{
+		int			itemIndex = batch->deadItems[i];
+		BatchMatchingItem *currItem = &batch->items[itemIndex];
+		OffsetNumber offnum = currItem->indexOffset;
+		ItemId		iid = PageGetItemId(page, offnum);
+
+		Assert(itemIndex >= batch->firstItem &&
+			   itemIndex <= batch->lastItem);
+		Assert(i == 0 ||
+			   offnum > batch->items[batch->deadItems[i - 1]].indexOffset);
+		Assert(offnum <= PageGetMaxOffsetNumber(page));
+		Assert(ItemPointerEquals(&((IndexTuple) PageGetItem(page, iid))->t_tid,
+								 &currItem->tableTid));
+
+		/* Mark index item as dead, if it isn't already */
+		if (!ItemIdIsDead(iid))
+		{
+			if (!killedsomething)
+			{
+				/*
+				 * Use the hint bit infrastructure to check if we can update
+				 * the page while just holding a share lock. If we are not
+				 * allowed, there's no point continuing.
+				 */
+				if (!BufferBeginSetHintBits(buf))
+				{
+					_hash_relbuf(rel, buf);
+					return;
+				}
+			}
+
+			/* found the item */
+			ItemIdMarkDead(iid);
+			killedsomething = true;
+		}
+	}
+
+	/*
+	 * Since this can be redone later if needed, mark as dirty hint. Whenever
+	 * we mark anything LP_DEAD, we also set the page's
+	 * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+	 */
+	if (killedsomething)
+	{
+		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+		BufferFinishSetHintBits(buf, true, true);
+	}
+
+	_hash_relbuf(rel, buf);
+}
+
 /*
  *	hashendscan() -- close down a scan
  */
@@ -451,17 +518,8 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	if (HashScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_hash_kill_items(scan);
-	}
-
 	_hash_dropscanbuf(rel, so);
 
-	if (so->killedItems != NULL)
-		pfree(so->killedItems);
 	pfree(so);
 	scan->opaque = NULL;
 }
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 2060620c7..e26ee8bb9 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1141,14 +1141,14 @@ hash_mask(char *pagedata, BlockNumber blkno)
 		/*
 		 * In hash bucket and overflow pages, it is possible to modify the
 		 * LP_FLAGS without emitting any WAL record. Hence, mask the line
-		 * pointer flags. See hashgettuple(), _hash_kill_items() for details.
+		 * pointer flags. See hashkillitemsbatch() for details.
 		 */
 		mask_lp_flags(page);
 	}
 
 	/*
 	 * It is possible that the hint bit LH_PAGE_HAS_DEAD_TUPLES may remain
-	 * unlogged. So, mask it. See _hash_kill_items() for details.
+	 * unlogged. So, mask it. See hashkillitemsbatch() for details.
 	 */
 	opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
 }
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 8099b0d02..11e3db472 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -280,31 +280,24 @@ _hash_dropbuf(Relation rel, Buffer buf)
 }
 
 /*
- *	_hash_dropscanbuf() -- release buffers used in scan.
+ *	_hash_dropscanbuf() -- release buffers owned by scan.
  *
- * This routine unpins the buffers used during scan on which we
- * hold no lock.
+ * This routine unpins the buffers for the primary bucket page and for the
+ * bucket page of a bucket being split as needed.
  */
 void
 _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
-	/* release pin we hold on primary bucket page */
-	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->currPos.buf)
+	/* release pin held on primary bucket page */
+	if (BufferIsValid(so->hashso_bucket_buf))
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
-	/* release pin we hold on primary bucket page  of bucket being split */
-	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->currPos.buf)
+	/* release pin held on primary bucket page of bucket being split */
+	if (BufferIsValid(so->hashso_split_bucket_buf))
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
-	/* release any pin we still hold */
-	if (BufferIsValid(so->currPos.buf))
-		_hash_dropbuf(rel, so->currPos.buf);
-	so->currPos.buf = InvalidBuffer;
-
 	/* reset split scan */
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 89d1c5bc6..5a58c040b 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -22,105 +22,94 @@
 #include "storage/predicate.h"
 #include "utils/rel.h"
 
-static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
-						   ScanDirection dir);
+static bool _hash_readpage(IndexScanDesc scan, Buffer buf, ScanDirection dir,
+						   IndexScanBatch batch);
 static int	_hash_load_qualified_items(IndexScanDesc scan, Page page,
-									   OffsetNumber offnum, ScanDirection dir);
-static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ScanDirection dir,
+									   IndexScanBatch batch);
+static inline void _hash_saveitem(IndexScanBatch batch, int itemIndex,
 								  OffsetNumber offnum, IndexTuple itup);
 static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
 						   Page *pagep, HashPageOpaque *opaquep);
 
 /*
- *	_hash_next() -- Get the next item in a scan.
+ *	_hash_next() -- Get the next batch of items in a scan.
  *
- *		On entry, so->currPos describes the current page, which may
- *		be pinned but not locked, and so->currPos.itemIndex identifies
- *		which item was previously returned.
+ *		On entry, priorbatch describes the current page batch with items
+ *		already returned.
  *
- *		On successful exit, scan->xs_heaptid is set to the TID of the next
- *		heap tuple.  so->currPos is updated as needed.
+ *		On successful exit, returns a batch containing matching items from
+ *		next page.  Otherwise returns NULL, indicating that there are no
+ *		further matches.  No locks are ever held when we return.
  *
- *		On failure exit (no more tuples), we return false with pin
- *		held on bucket page but no pins or locks held on overflow
- *		page.
+ *		Retains pins according to the same rules as _hash_first.
  */
-bool
-_hash_next(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+_hash_next(IndexScanDesc scan, ScanDirection dir, IndexScanBatch priorbatch)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	HashScanPosItem *currItem;
+	HashBatchData *hashpriorbatch = HashBatchGetData(scan, priorbatch);
 	BlockNumber blkno;
 	Buffer		buf;
-	bool		end_of_scan = false;
+	IndexScanBatch batch;
 
 	/*
-	 * Advance to the next tuple on the current page; or if done, try to read
-	 * data from the next or previous page based on the scan direction. Before
-	 * moving to the next or previous page make sure that we deal with all the
-	 * killed items.
+	 * The core code must deal with cross-batch scan direction changes for us.
+	 * A batch management routine that flips priorbatch's scan direction is
+	 * used for this.
+	 */
+	Assert(priorbatch->dir == dir);
+
+	/*
+	 * Determine which page to read next based on scan direction and details
+	 * taken from the prior batch
 	 */
 	if (ScanDirectionIsForward(dir))
-	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
-		{
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
+		blkno = hashpriorbatch->nextPage;
+	else
+		blkno = hashpriorbatch->prevPage;
 
-			blkno = so->currPos.nextPage;
-			if (BlockNumberIsValid(blkno))
-			{
-				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
-				if (!_hash_readpage(scan, &buf, dir))
-					end_of_scan = true;
-			}
-			else
-				end_of_scan = true;
-		}
-	}
+	/*
+	 * For bitmap scan callers, release the prior batch now so that the
+	 * allocation below can reuse its memory.  That way bitmap scans never
+	 * need more than one batch allocation.
+	 */
+	if (!scan->usebatchring)
+		indexam_util_release_batch(scan, priorbatch);
+
+	if (!BlockNumberIsValid(blkno))
+		return NULL;
+
+	/* Allocate space for next batch */
+	batch = indexam_util_alloc_batch(scan);
+
+	/* Get the buffer for next batch */
+	if (ScanDirectionIsForward(dir))
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
 	else
 	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
-		{
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
+		buf = _hash_getbuf(rel, blkno, HASH_READ,
+						   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 
-			blkno = so->currPos.prevPage;
-			if (BlockNumberIsValid(blkno))
-			{
-				buf = _hash_getbuf(rel, blkno, HASH_READ,
-								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-
-				/*
-				 * We always maintain the pin on bucket page for whole scan
-				 * operation, so releasing the additional pin we have acquired
-				 * here.
-				 */
-				if (buf == so->hashso_bucket_buf ||
-					buf == so->hashso_split_bucket_buf)
-					_hash_dropbuf(rel, buf);
-
-				if (!_hash_readpage(scan, &buf, dir))
-					end_of_scan = true;
-			}
-			else
-				end_of_scan = true;
-		}
+		/*
+		 * We always maintain a pin on the bucket page for the whole scan, so
+		 * release the additional pin that we acquired here.
+		 */
+		if (buf == so->hashso_bucket_buf ||
+			buf == so->hashso_split_bucket_buf)
+			_hash_dropbuf(rel, buf);
 	}
 
-	if (end_of_scan)
+	/* Read the next page and load items into allocated batch */
+	if (!_hash_readpage(scan, buf, dir, batch))
 	{
-		_hash_dropscanbuf(rel, so);
-		HashScanPosInvalidate(so->currPos);
-		return false;
+		indexam_util_release_batch(scan, batch);
+		return NULL;
 	}
 
-	/* OK, itemIndex says what to return */
-	currItem = &so->currPos.items[so->currPos.itemIndex];
-	scan->xs_heaptid = currItem->heapTid;
-
-	return true;
+	/* Return the batch containing matched items from next page */
+	return batch;
 }
 
 /*
@@ -270,22 +259,20 @@ _hash_readprev(IndexScanDesc scan,
 }
 
 /*
- *	_hash_first() -- Find the first item in a scan.
+ *	_hash_first() -- Find the first batch of items in a scan.
  *
- *		We find the first item (or, if backward scan, the last item) in the
- *		index that satisfies the qualification associated with the scan
- *		descriptor.
+ *		We find the first batch of items (or, if backward scan, the last
+ *		batch) in the index that satisfies the qualification associated with
+ *		the scan descriptor.
  *
- *		On successful exit, if the page containing current index tuple is an
- *		overflow page, both pin and lock are released whereas if it is a bucket
- *		page then it is pinned but not locked and data about the matching
- *		tuple(s) on the page has been loaded into so->currPos,
- *		scan->xs_heaptid is set to the heap TID of the current tuple.
+ *		On successful exit, returns a batch containing matching items.
+ *		Otherwise returns NULL, indicating that there are no further matches.
+ *		No locks are ever held when we return.
  *
- *		On failure exit (no more tuples), we return false, with pin held on
- *		bucket page but no pins or locks held on overflow page.
+ *		We always retain our own pin on the bucket page.  When we return a
+ *		batch with a bucket page, it will retain its own reference pin.
  */
-bool
+IndexScanBatch
 _hash_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
@@ -296,7 +283,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	HashScanPosItem *currItem;
+	IndexScanBatch batch;
 
 	pgstat_count_index_scan(rel);
 	if (scan->instrument)
@@ -326,7 +313,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	 * items in the index.
 	 */
 	if (cur->sk_flags & SK_ISNULL)
-		return false;
+		return NULL;
 
 	/*
 	 * Okay to compute the hash key.  We want to do this before acquiring any
@@ -419,191 +406,152 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* remember which buffer we have pinned, if any */
-	Assert(BufferIsInvalid(so->currPos.buf));
-	so->currPos.buf = buf;
+	/* Allocate space for first batch */
+	batch = indexam_util_alloc_batch(scan);
 
-	/* Now find all the tuples satisfying the qualification from a page */
-	if (!_hash_readpage(scan, &buf, dir))
-		return false;
+	/* Read the first page and load items into allocated batch */
+	if (!_hash_readpage(scan, buf, dir, batch))
+	{
+		indexam_util_release_batch(scan, batch);
+		return NULL;
+	}
 
-	/* OK, itemIndex says what to return */
-	currItem = &so->currPos.items[so->currPos.itemIndex];
-	scan->xs_heaptid = currItem->heapTid;
-
-	/* if we're here, _hash_readpage found a valid tuples */
-	return true;
+	/* Return the batch containing matched items */
+	return batch;
 }
 
 /*
- *	_hash_readpage() -- Load data from current index page into so->currPos
+ *	_hash_readpage() -- Load data from current index page into batch
  *
  *	We scan all the items in the current index page and save them into
- *	so->currPos if it satisfies the qualification. If no matching items
+ *	the batch if they satisfy the qualification. If no matching items
  *	are found in the current page, we move to the next or previous page
  *	in a bucket chain as indicated by the direction.
  *
  *	Return true if any matching items are found else return false.
  */
 static bool
-_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+_hash_readpage(IndexScanDesc scan, Buffer buf, ScanDirection dir,
+			   IndexScanBatch batch)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Buffer		buf;
+	HashBatchData *hashbatch = HashBatchGetData(scan, batch);
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum;
 	uint16		itemIndex;
 
-	buf = *bufP;
 	Assert(BufferIsValid(buf));
 	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 	page = BufferGetPage(buf);
 	opaque = HashPageGetOpaque(page);
 
-	so->currPos.buf = buf;
-	so->currPos.currPage = BufferGetBlockNumber(buf);
+	hashbatch->buf = buf;
+	hashbatch->currPage = BufferGetBlockNumber(buf);
+	batch->dir = dir;
 
 	if (ScanDirectionIsForward(dir))
 	{
-		BlockNumber prev_blkno = InvalidBlockNumber;
-
 		for (;;)
 		{
 			/* new page, locate starting position by binary search */
 			offnum = _hash_binsearch(page, so->hashso_sk_hash);
 
-			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir,
+												   batch);
 
 			if (itemIndex != 0)
 				break;
 
 			/*
-			 * Could not find any matching tuples in the current page, move to
-			 * the next page. Before leaving the current page, deal with any
-			 * killed items.
+			 * Could not find any matching tuples in the current page, try to
+			 * move to the next page
 			 */
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			/*
-			 * If this is a primary bucket page, hasho_prevblkno is not a real
-			 * block number.
-			 */
-			if (so->currPos.buf == so->hashso_bucket_buf ||
-				so->currPos.buf == so->hashso_split_bucket_buf)
-				prev_blkno = InvalidBlockNumber;
-			else
-				prev_blkno = opaque->hasho_prevblkno;
-
 			_hash_readnext(scan, &buf, &page, &opaque);
-			if (BufferIsValid(buf))
-			{
-				so->currPos.buf = buf;
-				so->currPos.currPage = BufferGetBlockNumber(buf);
-			}
-			else
-			{
-				/*
-				 * Remember next and previous block numbers for scrollable
-				 * cursors to know the start position and return false
-				 * indicating that no more matching tuples were found. Also,
-				 * don't reset currPage or lsn, because we expect
-				 * _hash_kill_items to be called for the old page after this
-				 * function returns.
-				 */
-				so->currPos.prevPage = prev_blkno;
-				so->currPos.nextPage = InvalidBlockNumber;
-				so->currPos.buf = buf;
+			if (!BufferIsValid(buf))
 				return false;
-			}
+
+			hashbatch->buf = buf;
+			hashbatch->currPage = BufferGetBlockNumber(buf);
 		}
 
-		so->currPos.firstItem = 0;
-		so->currPos.lastItem = itemIndex - 1;
-		so->currPos.itemIndex = 0;
+		batch->firstItem = 0;
+		batch->lastItem = itemIndex - 1;
 	}
 	else
 	{
-		BlockNumber next_blkno = InvalidBlockNumber;
-
 		for (;;)
 		{
 			/* new page, locate starting position by binary search */
 			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 
-			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir,
+												   batch);
 
 			if (itemIndex != MaxIndexTuplesPerPage)
 				break;
 
 			/*
-			 * Could not find any matching tuples in the current page, move to
-			 * the previous page. Before leaving the current page, deal with
-			 * any killed items.
+			 * Could not find any matching tuples in the current page, try to
+			 * move to the previous page
 			 */
-			if (so->numKilled > 0)
-				_hash_kill_items(scan);
-
-			if (so->currPos.buf == so->hashso_bucket_buf ||
-				so->currPos.buf == so->hashso_split_bucket_buf)
-				next_blkno = opaque->hasho_nextblkno;
-
 			_hash_readprev(scan, &buf, &page, &opaque);
-			if (BufferIsValid(buf))
-			{
-				so->currPos.buf = buf;
-				so->currPos.currPage = BufferGetBlockNumber(buf);
-			}
-			else
-			{
-				/*
-				 * Remember next and previous block numbers for scrollable
-				 * cursors to know the start position and return false
-				 * indicating that no more matching tuples were found. Also,
-				 * don't reset currPage or lsn, because we expect
-				 * _hash_kill_items to be called for the old page after this
-				 * function returns.
-				 */
-				so->currPos.prevPage = InvalidBlockNumber;
-				so->currPos.nextPage = next_blkno;
-				so->currPos.buf = buf;
+			if (!BufferIsValid(buf))
 				return false;
-			}
+
+			hashbatch->buf = buf;
+			hashbatch->currPage = BufferGetBlockNumber(buf);
 		}
 
-		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		batch->firstItem = itemIndex;
+		batch->lastItem = MaxIndexTuplesPerPage - 1;
 	}
 
-	if (so->currPos.buf == so->hashso_bucket_buf ||
-		so->currPos.buf == so->hashso_split_bucket_buf)
+	/*
+	 * Saved at least one match in batch.items[].  Prepare for hashgetbatch to
+	 * return it by initializing remaining uninitialized fields.
+	 */
+	if (hashbatch->buf == so->hashso_bucket_buf ||
+		hashbatch->buf == so->hashso_split_bucket_buf)
 	{
-		so->currPos.prevPage = InvalidBlockNumber;
-		so->currPos.nextPage = opaque->hasho_nextblkno;
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		/*
+		 * Batch's buffer is either the primary bucket, or a bucket being
+		 * populated due to a split.
+		 *
+		 * Increment local reference count so that batch gets its own buffer
+		 * reference that can be independently released by hashunguardbatch.
+		 * The original hashso_bucket_buf/hashso_split_bucket_buf references
+		 * belong to us.
+		 */
+		IncrBufferRefCount(hashbatch->buf);
+
+		/* Can only use opaque->hasho_nextblkno */
+		hashbatch->prevPage = InvalidBlockNumber;
+		hashbatch->nextPage = opaque->hasho_nextblkno;
 	}
 	else
 	{
-		so->currPos.prevPage = opaque->hasho_prevblkno;
-		so->currPos.nextPage = opaque->hasho_nextblkno;
-		_hash_relbuf(rel, so->currPos.buf);
-		so->currPos.buf = InvalidBuffer;
+		/* Can use opaque->hasho_prevblkno and opaque->hasho_nextblkno */
+		hashbatch->prevPage = opaque->hasho_prevblkno;
+		hashbatch->nextPage = opaque->hasho_nextblkno;
 	}
 
-	Assert(so->currPos.firstItem <= so->currPos.lastItem);
+	/* we saved one or more matches in batch.items[] */
+	indexam_util_unlock_batch(scan, batch, hashbatch->buf);
+
+	Assert(batch->firstItem <= batch->lastItem);
 	return true;
 }
 
 /*
  * Load all the qualified items from a current index page
- * into so->currPos. Helper function for _hash_readpage.
+ * into batch. Helper function for _hash_readpage.
  */
 static int
 _hash_load_qualified_items(IndexScanDesc scan, Page page,
-						   OffsetNumber offnum, ScanDirection dir)
+						   OffsetNumber offnum, ScanDirection dir,
+						   IndexScanBatch batch)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	IndexTuple	itup;
@@ -640,7 +588,7 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 				_hash_checkqual(scan, itup))
 			{
 				/* tuple is qualified, so remember it */
-				_hash_saveitem(so, itemIndex, offnum, itup);
+				_hash_saveitem(batch, itemIndex, offnum, itup);
 				itemIndex++;
 			}
 			else
@@ -687,7 +635,7 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 			{
 				itemIndex--;
 				/* tuple is qualified, so remember it */
-				_hash_saveitem(so, itemIndex, offnum, itup);
+				_hash_saveitem(batch, itemIndex, offnum, itup);
 			}
 			else
 			{
@@ -706,13 +654,14 @@ _hash_load_qualified_items(IndexScanDesc scan, Page page,
 	}
 }
 
-/* Save an index item into so->currPos.items[itemIndex] */
+/* Save an index item into batch->items[itemIndex] */
 static inline void
-_hash_saveitem(HashScanOpaque so, int itemIndex,
+_hash_saveitem(IndexScanBatch batch, int itemIndex,
 			   OffsetNumber offnum, IndexTuple itup)
 {
-	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *currItem = &batch->items[itemIndex];
 
-	currItem->heapTid = itup->t_tid;
+	currItem->tableTid = itup->t_tid;
 	currItem->indexOffset = offnum;
+	currItem->tupleOffset = 0;
 }
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 1d1b05f87..88a3f3ad8 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -16,10 +16,8 @@
 
 #include "access/hash.h"
 #include "access/reloptions.h"
-#include "access/relscan.h"
 #include "port/pg_bitutils.h"
 #include "utils/lsyscache.h"
-#include "utils/rel.h"
 
 #define CALC_NEW_BUCKET(old_bucket, lowmask) \
 			old_bucket | (lowmask + 1)
@@ -33,7 +31,7 @@ _hash_checkqual(IndexScanDesc scan, IndexTuple itup)
 	/*
 	 * Currently, we can't check any of the scan conditions since we do not
 	 * have the original index entry value to supply to the sk_func. Always
-	 * return true; we expect that hashgettuple already set the recheck flag
+	 * return true; we expect that hashgetbatch already set the recheck flag
 	 * to make the main indexscan code do it.
 	 */
 #ifdef NOT_USED
@@ -505,128 +503,3 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 
 	return new_bucket;
 }
-
-/*
- * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
- * told us were killed.
- *
- * scan->opaque, referenced locally through so, contains information about the
- * current page and killed tuples thereon (generally, this should only be
- * called if so->numKilled > 0).
- *
- * The caller does not have a lock on the page and may or may not have the
- * page pinned in a buffer.  Note that read-lock is sufficient for setting
- * LP_DEAD status (which is only a hint).
- *
- * The caller must have pin on bucket buffer, but may or may not have pin
- * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
- *
- * We match items by heap TID before assuming they are the right ones to
- * delete.
- *
- * There are never any scans active in a bucket at the time VACUUM begins,
- * because VACUUM takes a cleanup lock on the primary bucket page and scans
- * hold a pin.  A scan can begin after VACUUM leaves the primary bucket page
- * but before it finishes the entire bucket, but it can never pass VACUUM,
- * because VACUUM always locks the next page before releasing the lock on
- * the previous one.  Therefore, we don't have to worry about accidentally
- * killing a TID that has been reused for an unrelated tuple.
- */
-void
-_hash_kill_items(IndexScanDesc scan)
-{
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	BlockNumber blkno;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber offnum,
-				maxoff;
-	int			numKilled = so->numKilled;
-	int			i;
-	bool		killedsomething = false;
-	bool		havePin = false;
-
-	Assert(so->numKilled > 0);
-	Assert(so->killedItems != NULL);
-	Assert(HashScanPosIsValid(so->currPos));
-
-	/*
-	 * Always reset the scan state, so we don't look for same items on other
-	 * pages.
-	 */
-	so->numKilled = 0;
-
-	blkno = so->currPos.currPage;
-	if (HashScanPosIsPinned(so->currPos))
-	{
-		/*
-		 * We already have pin on this buffer, so, all we need to do is
-		 * acquire lock on it.
-		 */
-		havePin = true;
-		buf = so->currPos.buf;
-		LockBuffer(buf, BUFFER_LOCK_SHARE);
-	}
-	else
-		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
-
-	page = BufferGetPage(buf);
-	opaque = HashPageGetOpaque(page);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (i = 0; i < numKilled; i++)
-	{
-		int			itemIndex = so->killedItems[i];
-		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
-
-		offnum = currItem->indexOffset;
-
-		Assert(itemIndex >= so->currPos.firstItem &&
-			   itemIndex <= so->currPos.lastItem);
-
-		while (offnum <= maxoff)
-		{
-			ItemId		iid = PageGetItemId(page, offnum);
-			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
-
-			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
-			{
-				if (!killedsomething)
-				{
-					/*
-					 * Use the hint bit infrastructure to check if we can
-					 * update the page while just holding a share lock. If we
-					 * are not allowed, there's no point continuing.
-					 */
-					if (!BufferBeginSetHintBits(buf))
-						goto unlock_page;
-				}
-
-				/* found the item */
-				ItemIdMarkDead(iid);
-				killedsomething = true;
-				break;			/* out of inner search loop */
-			}
-			offnum = OffsetNumberNext(offnum);
-		}
-	}
-
-	/*
-	 * Since this can be redone later if needed, mark as dirty hint. Whenever
-	 * we mark anything LP_DEAD, we also set the page's
-	 * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
-	 */
-	if (killedsomething)
-	{
-		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		BufferFinishSetHintBits(buf, true, true);
-	}
-
-unlock_page:
-	if (havePin)
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
-	else
-		_hash_relbuf(rel, buf);
-}
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 8725fa36f..6e1e51169 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -735,7 +735,8 @@ ambeginscan (Relation indexRelation,
    <literal>scan-&gt;xs_recheck</literal> here, in
    <function>ambeginscan</function>, since the value applies to every item the
    scan returns.  The value set here persists across any subsequent
-   <function>amrescan</function> calls.  B-tree (always false) works this way.
+   <function>amrescan</function> calls.  B-tree (always false) and hash (always
+   true) work this way.
   </para>
 
   <para>
@@ -880,7 +881,8 @@ amgetbatch (IndexScanDesc scan,
    method.  It is up to the table AM caller to decide when it should be
    released.  Note also that <function>amgetbatch</function> functions must
    never modify the <literal>priorbatch</literal> parameter.  The core
-   <filename>src/backend/access/nbtree/</filename> implementation provides a
+   <filename>src/backend/access/nbtree/</filename> and
+   <filename>src/backend/access/hash/</filename> implementations provide
    reference examples of the <function>amgetbatch</function> interface.
   </para>
 
@@ -938,8 +940,8 @@ amunguardbatch (IndexScanDesc scan,
    is not even required to use the standard helper
    <function>indexam_util_unlock_batch</function> to manage it.  In practice,
    though, most or all index AMs will use that helper and hold the simplest
-   possible interlock: each guarded B-tree batch keeps a single buffer pin
-   on the one index page the batch came from.  See <xref
+   possible interlock: each guarded B-tree or hash batch keeps a single
+   buffer pin on the one index page the batch came from.  See <xref
     linkend="index-locking"/> for details on buffer pin management during
    index scans.  This function will be called at most once for each guarded
    batch; it is not called when the index AM has already unguarded the batch
@@ -950,10 +952,11 @@ amunguardbatch (IndexScanDesc scan,
   <note>
    <para>
     The index AM may choose to retain its own buffer pins when this serves an
-    internal purpose (for example, maintaining a descent stack of pinned index
-    pages for reuse across <function>amgetbatch</function> calls).  However,
-    any scheme that retains buffer pins managed by the index AM must be sure
-    to free the pins at an opportune point (at a minimum whenever
+    internal purpose (for example, the hash access method keeps a pin on the
+    scan's primary bucket page for the duration of the scan, which blocks the
+    concurrent bucket splits and compactions that would otherwise disrupt it).
+    However, any scheme that retains buffer pins managed by the index AM must
+    be sure to free the pins at an opportune point (at a minimum whenever
     <function>amendscan</function> is called, and typically when
     <function>amrescan</function> is called).  It must also keep the number of
     retained pins fixed and small.
@@ -982,8 +985,8 @@ amkillitemsbatch (IndexScanDesc scan,
    <function>amgetbatch</function> index AMs (those that don't can leave
    the field set to <literal>NULL</literal>), but doing so is recommended for
    performance, as it allows future scans to skip known-dead index entries.
-   The core index access method that currently supports
-   <function>amgetbatch</function> (B-tree) implements
+   Both core index access methods that currently support
+   <function>amgetbatch</function> (B-tree and hash) implement
    <literal>LP_DEAD</literal> marking, though third-party index access methods
    are free to choose whether to implement this feature.  The table AM may
    call <function>tableam_util_scanpos_killitem</function> to mark dead items as
@@ -1025,7 +1028,8 @@ amkillitemsbatch (IndexScanDesc scan,
    <command>VACUUM</command> recycling table TIDs &mdash; so it would be
    unsafe to assume that index entries still point to the same heap/table
    tuples.  Since <literal>LP_DEAD</literal> marking is only an optimization
-   hint, it is always safe to skip it.  B-tree uses this approach.
+   hint, it is always safe to skip it.  Both B-tree and hash use this
+   approach.
   </para>
 
   <warning>
@@ -1115,8 +1119,8 @@ amgetbitmap (IndexScanDesc scan,
    <function>amgetbitmap</function> scans; during
    <function>amgetbatch</function> scans the <literal>priorbatch</literal>
    is strictly owned by the caller (the table AM), and the index AM must
-   never release it.  See <function>_bt_next</function> for a reference
-   example.
+   never release it.  See <function>_bt_next</function> and
+   <function>_hash_next</function> for reference examples.
   </para>
 
   <para>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 84659b17a..446e68a84 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1233,6 +1233,7 @@ Hash
 HashAggBatch
 HashAggSpill
 HashAllocFunc
+HashBatchData
 HashBuildState
 HashBulkDeleteStreamPrivate
 HashCompareFunc
@@ -1254,8 +1255,6 @@ HashPageStat
 HashPath
 HashScanOpaque
 HashScanOpaqueData
-HashScanPosData
-HashScanPosItem
 HashSkewBucket
 HashState
 HashValueFunc
-- 
2.53.0



  [application/octet-stream] v28-0003-Limit-get_actual_variable_range-leaf-page-reads.patch (7.6K, 11-v28-0003-Limit-get_actual_variable_range-leaf-page-reads.patch)
From 7b4b492372a667b18dc598dd47f05b69281549e3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sat, 20 Jun 2026 11:18:59 -0400
Subject: [PATCH v28 03/11] Limit get_actual_variable_range leaf page reads.

get_actual_variable_range scans an index to find actual min/max values
for planner selectivity estimation.  Since this happens during planning,
we can't afford to spend too much time on it.  Commit 9c6ad5eaa9 added
VISITED_PAGES_LIMIT (a limit of 100 heap page visits) to bound the
amount of work performed, giving up and falling back to the pg_statistic
extremal value when the limit is exceeded.  But that isn't effective in
cases with more extreme concentrations of dead index tuples.

Benchmark results from Mark Callaghan show that VISITED_PAGES_LIMIT
stops being effective once the dead index tuple problem gets out of hand
(which is expected with queue-like tables that continually delete older
records and insert newer ones).  VISITED_PAGES_LIMIT counts heap page
visits, but when many index tuples are marked LP_DEAD, _bt_readpage
traverses arbitrarily many index pages without returning any tuples.
The heap page counter never gets a chance to increment, so
VISITED_PAGES_LIMIT never triggers.  The more LP_DEAD bits we set, the
less effective the limit becomes at bailing out early.

Add a complementary mechanism, INDEX_PAGES_LIMIT, that lets
get_actual_variable_range read at most three index leaf pages that have
exactly zero matching items (pages that won't return a batch).  When the
limit is exceeded, the scan ends without returning any matches, forcing
get_actual_variable_range to give up.

INDEX_PAGES_LIMIT provides a backstop against reading an excessive
number of leaf pages, without fundamentally altering the existing
VISITED_PAGES_LIMIT design.  Leaf page reads that return a batch with at
least one matching item aren't tallied against the new limit.  This
balances the need for get_actual_variable_range to locate a min/max
value when that's feasible against the need to bound the amount of
work it must perform to do so.

Author: Peter Geoghegan <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wzkt1WkKp4VRJu3qHfmKXc8W+XYv1RXg5d2d3fSvAeO=rg@mail.gmail.com
---
 src/include/access/nbtree.h           |  6 ++++++
 src/include/access/relscan.h          |  3 ++-
 src/backend/access/index/genam.c      |  1 +
 src/backend/access/nbtree/nbtree.c    |  1 +
 src/backend/access/nbtree/nbtsearch.c | 13 +++++++++++--
 src/backend/utils/adt/selfuncs.c      | 10 ++++++++++
 6 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 0254a223e..e7064813a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -998,6 +998,12 @@ typedef struct BTScanOpaqueData
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	FmgrInfo   *orderProcs;		/* ORDER procs for required equality keys */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
+
+	/*
+	 * Running count of leaf pages read without finding a match, compared
+	 * against scan->xs_index_pages_limit to bound planner scans
+	 */
+	int			numNoMatchPages;	/* no-batch-returned leaf page count */
 } BTScanOpaqueData;
 
 typedef BTScanOpaqueData *BTScanOpaque;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 2f2314843..f2f66e367 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -404,12 +404,13 @@ typedef struct IndexScanDescData
 	int			xs_name_cstring_count;
 
 	/*
-	 * An approximate limit on the amount of work, measured in pages touched,
+	 * Approximate limits on the amount of work, measured in pages touched,
 	 * imposed on the index scan.  The default, 0, means no limit.  Only
 	 * honored during index-only scans.  Used by selfuncs.c to bound the cost
 	 * of get_actual_variable_endpoint().
 	 */
 	uint8		xs_visited_pages_limit;
+	uint8		xs_index_pages_limit;
 
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 54042f6f5..ca9bae803 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -139,6 +139,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_name_cstring_count = 0;
 
 	scan->xs_visited_pages_limit = 0;
+	scan->xs_index_pages_limit = 0;
 
 	return scan;
 }
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 5939e728f..b83926f9f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -344,6 +344,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->arrayKeys = NULL;
 	so->orderProcs = NULL;
 	so->arrayContext = NULL;
+	so->numNoMatchPages = 0;
 
 	scan->opaque = so;
 	scan->xs_recheck = false;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 4c94b9e59..88c87781c 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1750,6 +1750,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 				 BlockNumber lastcurrblkno, ScanDirection dir, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	IndexScanBatch newbatch;
 	BTBatchData *btnewbatch;
 
@@ -1832,10 +1833,18 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 		/* no matching tuples on this page */
 		_bt_relbuf(rel, btnewbatch->buf);
 
-		/* Continue the scan in this direction? */
+		/*
+		 * Continue the scan in this direction?
+		 *
+		 * Also give up if an opted-in planner scan (selfuncs.c) has now read
+		 * too many leaf pages without a match.  This bounds planning time
+		 * when the scanned end of the index is full of LP_DEAD-marked items.
+		 */
 		if (blkno == P_NONE ||
 			(ScanDirectionIsForward(dir) ?
-			 !btnewbatch->moreRight : !btnewbatch->moreLeft))
+			 !btnewbatch->moreRight : !btnewbatch->moreLeft) ||
+			(unlikely(scan->xs_index_pages_limit > 0) &&
+			 ++so->numNoMatchPages > scan->xs_index_pages_limit))
 		{
 			/*
 			 * blkno _bt_readpage call ended scan in this direction (though if
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fb978f0cf..1366f4988 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7185,8 +7185,17 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * We set xs_visited_pages_limit to tell the table AM to count distinct
 	 * heap pages visited for non-visible tuples and give up after the limit
 	 * is exceeded.
+	 *
+	 * We also set xs_index_pages_limit to independently tell the index AM to
+	 * give up when this many leaf pages that lack even one matching index
+	 * tuple have been read.  This acts as a backstop against pages entirely
+	 * full of index entries that were already marked killed (typically by
+	 * prior calls here).  That way we avoid hopelessly searching through an
+	 * unbounded number of index leaf pages that don't contain even a single
+	 * still-live entry (which can't trigger xs_visited_pages_limit).
 	 */
 #define VISITED_PAGES_LIMIT 100
+#define INDEX_PAGES_LIMIT 3
 	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
 							  GlobalVisTestFor(heapRel));
 
@@ -7196,6 +7205,7 @@ get_actual_variable_endpoint(Relation heapRel,
 								 SO_NONE);
 	Assert(index_scan->xs_want_itup);
 	index_scan->xs_visited_pages_limit = VISITED_PAGES_LIMIT;
+	index_scan->xs_index_pages_limit = INDEX_PAGES_LIMIT;
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-- 
2.53.0



  [application/octet-stream] v28-0002-Add-amgetbatch-interface-and-adopt-it-in-nbtree.patch (268.7K, 12-v28-0002-Add-amgetbatch-interface-and-adopt-it-in-nbtree.patch)
From 42c136761d2735e692896a9e7b31789089135ec5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Wed, 25 Mar 2026 16:48:43 -0400
Subject: [PATCH v28 02/11] Add amgetbatch interface and adopt it in nbtree.

Add a new amgetbatch index AM interface that allows index access methods
to implement plain index scans and index-only scans that return index
entries in batches comprising all matching items from an index page,
rather than one match at a time.  Also switch nbtree over from
amgettuple to the new amgetbatch interface.

The new interface allows the table AM to apply knowledge of which TIDs
will be returned to the scan in the near future to perform optimizations
like I/O prefetching.  Prefetching is set to be added by an upcoming commit.

With amgetbatch, a scan-level policy determines whether each batch's
index page buffer pin is dropped eagerly by the index AM (for plain
scans with an MVCC snapshot, where the snapshot itself prevents TID
recycling problems) or retained as an interlock against concurrent TID
recycling by VACUUM.  The interlock is retained for plain non-MVCC scans
and for index-only scans, and is dropped by the table AM via the new
amunguardbatch callback when it is safe to do so.  (In practice, index
AMs are usually able to drop the pin at the same time that they release
the lock, so the amunguardbatch callback is only really needed during
index-only scans, where dropping the pin interlock might need to be
delayed slightly, as explained below.)
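
A minimal sketch of the scan-level policy described above, assuming it
reduces to two inputs.  The function name sketch_drop_pin_eagerly is
hypothetical and does not appear in the patch, which may well structure
this decision differently.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: should the index AM drop each batch's index page
 * pin eagerly, rather than retaining it as a TID recycling interlock?
 *
 * Plain scans with an MVCC snapshot can drop the pin eagerly, because the
 * snapshot itself prevents TID recycling problems.  Plain non-MVCC scans
 * and index-only scans retain the pin until the batch is unguarded.
 */
static bool
sketch_drop_pin_eagerly(bool mvcc_snapshot, bool index_only_scan)
{
	return mvcc_snapshot && !index_only_scan;
}
```

Only one of the four input combinations drops the pin eagerly; in the
other three the pin is kept until the table AM unguards the batch.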

This extends the dropPin mechanism added to nbtree by commit 2ed5b87f,
and generalizes it to work with all index AMs that support the new
amgetbatch interface (LP_DEAD marking of index entries must be performed
by implementing the new amkillitemsbatch callback, which has a
documented contract describing how index AMs must reason about
concurrent TID recycling).  Scans can always safely drop index page pins
eagerly, provided the scan uses an MVCC snapshot (unlike the nbtree
dropPin optimization, which had no way of doing this safely during
index-only scans due to how amgettuple works, and only gained support
for scans of unlogged relations in recent commit 8a879119).

The old ammarkpos and amrestrpos index AM callbacks are removed.  With
amgetbatch, mark/restore of scan positions is managed by the table AM,
with help from indexbatch.c utility functions, rather than being wholly
delegated to the index AM.  The new index_scan_markpos and
index_scan_restrpos table AM callbacks must be implemented to make all
this work.  As a further condition, mark/restore is only supported by
index AMs that opt in by setting the new amcanmarkpos flag (only nbtree
sets this to true).  This amcanmarkpos flag scheme avoids the assumption
that every index AM is capable of picking up a scan from a previously
saved markBatch.

An upcoming commit that will add index prefetching will use a read
stream to read heap pages during index scans.  The read stream is careful
to limit how many buffers it pins, lest we run into problems due to having
too many buffers pinned.  Simply never holding on to index page buffer
pins greatly simplifies resource management for index prefetching;
there's no risk of unintended interactions between the read stream and
index AM.  The only downside is that we cannot support prefetching
during scans that use a non-MVCC snapshot, which seems quite acceptable.

In practice, heapam doesn't drop each batch's index page buffer pin at
the earliest opportunity during index-only scans.  This was deemed
necessary to avoid regressing index-only scans with a LIMIT, in
particular with nestloop anti-joins and nestloop semi-joins; eagerly
loading all the visibility information up front regressed such queries.
The new amgetbatch interface gives table AMs the authority to decide
when to drop index page pins/unguard batches, so this can be considered
a heapam implementation detail (index AMs don't need to know about it).
This scheme enables index prefetching to acquire and then drop any extra
batch index page pin within its read stream callback -- even when an
index-only scan (that must perform some heap fetches) holds open several
index batches at once in order to maintain an adequate prefetch
distance.  The read stream cannot observe any change in the backend's
buffer pin limit.

Index access methods that support plain index scans must now implement
either the amgetbatch interface or the amgettuple interface (not both).
Upcoming patches will add support for amgetbatch to the hash, GiST, and
SP-GiST index AMs.
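
The "not both" rule could be checked along the following lines.  This is
a sketch under stated assumptions: SketchAmRoutine, sketch_callback,
sketch_noop, and sketch_plain_scan_callbacks_ok are hypothetical names,
and the real IndexAmRoutine callbacks have different signatures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified callback type (real signatures differ) */
typedef void (*sketch_callback) (void);

/* Hypothetical cut-down version of an index AM routine struct */
typedef struct SketchAmRoutine
{
	sketch_callback amgettuple; /* can be NULL */
	sketch_callback amgetbatch; /* can be NULL */
} SketchAmRoutine;

/* Placeholder callback, only used to produce a non-NULL pointer */
static void
sketch_noop(void)
{
}

/*
 * An AM may provide amgettuple, amgetbatch, or neither (when it doesn't
 * support plain index scans at all), but never both at once.
 */
static bool
sketch_plain_scan_callbacks_ok(const SketchAmRoutine *am)
{
	return !(am->amgettuple != NULL && am->amgetbatch != NULL);
}
```

An AM that sets neither callback passes the check, since it simply does
not offer plain index scans.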

Author: Tomas Vondra <[email protected]>
Author: Peter Geoghegan <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/[email protected]
Discussion: https://postgr.es/m/efac3238-6f34-41ea-a393-26cc0441b506%40vondra.me
---
 src/include/access/amapi.h                    |  29 +-
 src/include/access/genam.h                    |   5 +-
 src/include/access/heapam.h                   |   7 +-
 src/include/access/indexbatch.h               | 535 ++++++++++++
 src/include/access/nbtree.h                   | 190 ++---
 src/include/access/relscan.h                  | 206 ++++-
 src/include/access/tableam.h                  | 102 ++-
 src/include/nodes/pathnodes.h                 |   6 +-
 src/backend/access/brin/brin.c                |   7 +-
 src/backend/access/gin/ginget.c               |   6 +-
 src/backend/access/gin/ginutil.c              |   7 +-
 src/backend/access/gist/gist.c                |   7 +-
 src/backend/access/hash/hash.c                |   7 +-
 src/backend/access/heap/heapam_handler.c      |   5 +-
 src/backend/access/heap/heapam_indexscan.c    | 525 +++++++++++-
 src/backend/access/index/Makefile             |   3 +-
 src/backend/access/index/amapi.c              |   5 +
 src/backend/access/index/genam.c              |   7 +
 src/backend/access/index/indexam.c            | 138 +--
 src/backend/access/index/indexbatch.c         | 798 ++++++++++++++++++
 src/backend/access/index/meson.build          |   1 +
 src/backend/access/nbtree/README              |  74 +-
 src/backend/access/nbtree/nbtpage.c           |  13 +-
 src/backend/access/nbtree/nbtreadpage.c       | 207 +++--
 src/backend/access/nbtree/nbtree.c            | 465 +++++-----
 src/backend/access/nbtree/nbtsearch.c         | 566 ++++++-------
 src/backend/access/nbtree/nbtutils.c          | 245 ------
 src/backend/access/nbtree/nbtxlog.c           |   6 +-
 src/backend/access/spgist/spgutils.c          |   7 +-
 src/backend/access/table/tableamapi.c         |   5 +-
 src/backend/commands/indexcmds.c              |   2 +-
 src/backend/executor/nodeIndexonlyscan.c      |   4 +-
 src/backend/executor/nodeIndexscan.c          |   4 +-
 src/backend/executor/nodeMergejoin.c          |   4 +-
 src/backend/optimizer/path/indxpath.c         |   6 +-
 src/backend/optimizer/util/plancat.c          |   8 +-
 src/backend/replication/logical/relation.c    |   9 +-
 src/backend/utils/adt/amutils.c               |   8 +-
 contrib/amcheck/verify_nbtree.c               |   2 +-
 contrib/bloom/blutils.c                       |   7 +-
 doc/src/sgml/indexam.sgml                     | 588 +++++++++++--
 doc/src/sgml/ref/create_table.sgml            |  13 +-
 .../modules/dummy_index_am/dummy_index_am.c   |   7 +-
 src/test/regress/expected/join.out            |  63 +-
 src/test/regress/expected/portals.out         |  57 ++
 src/test/regress/sql/join.sql                 |  38 +-
 src/test/regress/sql/portals.sql              |  34 +
 src/tools/pgindent/typedefs.list              |  15 +-
 48 files changed, 3614 insertions(+), 1439 deletions(-)
 create mode 100644 src/include/access/indexbatch.h
 create mode 100644 src/backend/access/index/indexbatch.c

diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 792403335..02793a115 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -199,6 +199,19 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 									 ScanDirection direction);
 
+/* next batch of valid tuples */
+typedef IndexScanBatch (*amgetbatch_function) (IndexScanDesc scan,
+											   IndexScanBatch priorbatch,
+											   ScanDirection direction);
+
+/* drop TID recycling interlock held to prevent concurrent VACUUM recycling */
+typedef void (*amunguardbatch_function) (IndexScanDesc scan,
+										 IndexScanBatch batch);
+
+/* mark dead items in index page */
+typedef void (*amkillitemsbatch_function) (IndexScanDesc scan,
+										   IndexScanBatch batch);
+
 /* fetch all valid tuples */
 typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 									   TIDBitmap *tbm);
@@ -206,11 +219,9 @@ typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 /* end index scan */
 typedef void (*amendscan_function) (IndexScanDesc scan);
 
-/* mark current scan position */
-typedef void (*ammarkpos_function) (IndexScanDesc scan);
-
-/* restore marked scan position */
-typedef void (*amrestrpos_function) (IndexScanDesc scan);
+/* invalidate index AM state that independently tracks scan's position */
+typedef void (*amposreset_function) (IndexScanDesc scan,
+									 IndexScanBatch batch);
 
 /*
  * Callback function signatures - for parallel index scans.
@@ -255,6 +266,8 @@ typedef struct IndexAmRoutine
 	bool		amconsistentordering;
 	/* does AM support backward scanning? */
 	bool		amcanbackward;
+	/* does AM support mark/restore of a scan position? */
+	bool		amcanmarkpos;
 	/* does AM support UNIQUE indexes? */
 	bool		amcanunique;
 	/* does AM support multi-column indexes? */
@@ -310,10 +323,12 @@ typedef struct IndexAmRoutine
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
 	amgettuple_function amgettuple; /* can be NULL */
+	amgetbatch_function amgetbatch; /* can be NULL */
+	amunguardbatch_function amunguardbatch; /* can be NULL */
+	amkillitemsbatch_function amkillitemsbatch; /* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
-	ammarkpos_function ammarkpos;	/* can be NULL */
-	amrestrpos_function amrestrpos; /* can be NULL */
+	amposreset_function amposreset; /* can be NULL */
 
 	/* interface functions to support parallel index scans */
 	amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 05eec9204..615686dc2 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -96,6 +96,7 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 
 /* struct definitions appear in relscan.h */
 typedef struct IndexScanDescData *IndexScanDesc;
+typedef struct IndexScanBatchData *IndexScanBatch;
 typedef struct SysScanDescData *SysScanDesc;
 
 typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
@@ -169,8 +170,6 @@ extern void index_rescan(IndexScanDesc scan,
 						 ScanKey keys, int nkeys,
 						 ScanKey orderbys, int norderbys);
 extern void index_endscan(IndexScanDesc scan);
-extern void index_markpos(IndexScanDesc scan);
-extern void index_restrpos(IndexScanDesc scan);
 extern Size index_parallelscan_estimate(Relation indexRelation,
 										int nkeys, int norderbys, Snapshot snapshot);
 extern void index_parallelscan_initialize(Relation heapRelation,
@@ -184,8 +183,6 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan,
 											  uint32 flags);
-extern ItemPointer index_getnext_tid(IndexScanDesc scan,
-									 ScanDirection direction);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4fea51761..5364ce27b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -128,6 +128,7 @@ typedef struct IndexScanHeapData
 
 	/* For visibility map checks (index-only scans and on-access pruning) */
 	Buffer		xs_vmbuffer;	/* visibility map buffer */
+	int			xs_vm_items;	/* # items to resolve visibility info for */
 
 	bool		xs_readonly;	/* scan is read-only? */
 
@@ -438,8 +439,12 @@ extern TransactionId heap_index_delete_tuples(Relation rel,
 extern bool heapam_fetch_tid(Relation rel, ItemPointer tid, Snapshot snapshot,
 							 TupleTableSlot *slot, bool *all_dead);
 extern void heapam_index_scan_begin(IndexScanDesc scan, uint32 flags);
-extern void heapam_index_scan_reset(IndexScanDesc scan);
+extern void heapam_index_scan_batch_init(IndexScanDesc scan,
+										 IndexScanBatch batch);
+extern void heapam_index_scan_rescan(IndexScanDesc scan);
 extern void heapam_index_scan_end(IndexScanDesc scan);
+extern void heapam_index_scan_markpos(IndexScanDesc scan);
+extern void heapam_index_scan_restrpos(IndexScanDesc scan);
 extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
 								   Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
 								   bool *all_dead, bool first_call);
diff --git a/src/include/access/indexbatch.h b/src/include/access/indexbatch.h
new file mode 100644
index 000000000..9471a9db5
--- /dev/null
+++ b/src/include/access/indexbatch.h
@@ -0,0 +1,535 @@
+/*-------------------------------------------------------------------------
+ *
+ * indexbatch.h
+ *	  Batch-based index scan infrastructure for the amgetbatch interface.
+ *
+ * This header declares the inline functions, macros, and externs that table
+ * AMs and index AMs use to operate on index scan batches and the scan's batch
+ * ring buffer.  See indexbatch.c for an overview of the module.
+ *
+ * The data structures that these functions operate on are defined in
+ * relscan.h, not here.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/indexbatch.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef INDEXBATCH_H
+#define INDEXBATCH_H
+
+#include "access/amapi.h"
+#include "access/genam.h"
+#include "access/relscan.h"
+#include "storage/buf.h"
+#include "utils/rel.h"
+
+/* ----------------------------------------------------------------------------
+ * Elementary batch ring buffer operations
+ * ----------------------------------------------------------------------------
+ */
+
+StaticAssertDecl(INDEX_SCAN_MAX_BATCHES <= PG_INT8_MAX + 1,
+				 "index_scan_batch_loaded relies on int8 ring buffer arithmetic");
+StaticAssertDecl((INDEX_SCAN_MAX_BATCHES & (INDEX_SCAN_MAX_BATCHES - 1)) == 0,
+				 "INDEX_SCAN_MAX_BATCHES must be a power of 2");
+
+/*
+ * How many batches are currently loaded in the ring buffer?
+ */
+static inline uint8
+index_scan_batch_count(IndexScanDesc scan)
+{
+	return (uint8) (scan->batchringbuf.nextBatch -
+					scan->batchringbuf.headBatch);
+}
+
+/*
+ * Do we already have a batch loaded at 'idx' offset in scan's ring buffer?
+ *
+ * NOTE: a stale batch idx can alias a currently-loaded range due to
+ * wraparound, producing a false positive.  False negatives are not possible.
+ */
+static inline bool
+index_scan_batch_loaded(IndexScanDesc scan, uint8 idx)
+{
+	return (int8) (idx - scan->batchringbuf.headBatch) >= 0 &&
+		(int8) (idx - scan->batchringbuf.nextBatch) < 0;
+}
+
+/*
+ * Have we loaded the maximum number of batches?
+ */
+static inline bool
+index_scan_batch_full(IndexScanDesc scan)
+{
+	return index_scan_batch_count(scan) == INDEX_SCAN_MAX_BATCHES;
+}
+
+/*
+ * Return batch for the provided index.
+ */
+static inline IndexScanBatch
+index_scan_batch(IndexScanDesc scan, uint8 idx)
+{
+	Assert(index_scan_batch_loaded(scan, idx));
+
+	return scan->batchbuf[idx & (INDEX_SCAN_MAX_BATCHES - 1)];
+}
+
+/*
+ * Append given batch to scan's batch ring buffer.
+ */
+static inline void
+index_scan_batch_append(IndexScanDesc scan, IndexScanBatch batch)
+{
+	BatchRingBuffer *ringbuf = &scan->batchringbuf;
+	uint8		nextBatch = ringbuf->nextBatch;
+
+	Assert(!index_scan_batch_full(scan));
+
+	scan->batchbuf[nextBatch & (INDEX_SCAN_MAX_BATCHES - 1)] = batch;
+	ringbuf->nextBatch++;
+}
+
+/* ----------------------------------------------------------------------------
+ * Batch memory layout accessors
+ *
+ * Each batch allocation has the following memory layout:
+ *
+ *   [table AM opaque area]    <- table AM area (batch_table_opaque_size),
+ *                                optionally requested by table AM
+ *   [index AM static opaque]  <- index AM area (batch_index_opaque_static),
+ *                                mandatory fixed-size index AM area
+ *   [IndexScanBatchData]      <- batch pointer, returned by amgetbatch
+ *   [items[maxitemsbatch]]
+ *   [currTuples workspace]    <- index tuple area (batch_tuples_workspace),
+ *                                only used during index-only scans
+ *
+ * batch_base_offset combines the table AM opaque area, the optional dynamic
+ * index AM opaque area, and the static index AM opaque area into a single
+ * offset from the batch pointer to the true allocation base.  We pfree a
+ * batch by passing pfree a pointer returned by index_scan_batch_base.  We
+ * rely on the assumption that batches have a fixed layout for the duration of
+ * an index scan (batches are cached for reuse to avoid palloc churn).
+ *
+ * The table AM accesses its opaque area (sized batch_table_opaque_size) using
+ * the index_scan_batch_table_area shim accessor.  The table AM's
+ * table_index_scan_begin callback is permitted to vary the layout of its
+ * opaque area as it sees fit (or to request no area), often based on the
+ * requirements of one particular scan.  Bitmap scans never get a table AM
+ * opaque area (the table AM isn't involved, even when an amgetbitmap routine
+ * reuses the batch infrastructure internally).
+ *
+ * An index AM is required to provide a fixed-size opaque area.  This area is
+ * sized MAXALIGN(sizeof(the AM's struct)), and is always known at compile
+ * time.  Index AMs use index_scan_batch_index_opaque_static to access the
+ * area.  Access to the area is cheap (a compile-time-constant subtraction),
+ * but its size cannot vary from scan to scan.  Index AMs typically use this
+ * area to store things like index page sibling link block numbers.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Return the true allocation base of a batch (used to pfree batches)
+ */
+static inline void *
+index_scan_batch_base(IndexScanDesc scan, IndexScanBatch batch)
+{
+	Assert(scan->batch_base_offset > 0);
+
+	return (char *) batch - scan->batch_base_offset;
+}
+
+/*
+ * Return a pointer to the table AM opaque area
+ */
+static inline void *
+index_scan_batch_table_area(IndexScanDesc scan, IndexScanBatch batch)
+{
+	/*
+	 * The table AM opaque area is always at the beginning of the batch's
+	 * allocated space
+	 */
+	return index_scan_batch_base(scan, batch);
+}
+
+/*
+ * Return a typed pointer to the index AM's static (compile-time sized) opaque
+ * area, which sits immediately before the batch pointer.  Index AMs use their
+ * own wrapper function-style macro, built on top of this.
+ */
+#define index_scan_batch_index_opaque_static(scan, batch, type) \
+	(AssertMacro((scan)->batch_index_opaque_static == MAXALIGN(sizeof(type))), \
+	 ((type *) ((char *) (batch) - MAXALIGN(sizeof(type)))))
+
+/* ----------------------------------------------------------------------------
+ * Elementary batch position operations
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Advance pos to the next item in the batch, in the given scan direction.
+ *
+ * That is the item after pos->item when scanning forwards, and the item
+ * before it when scanning backwards.
+ *
+ * Returns true if the position could be advanced.  Returns false when there
+ * are no more items from the batch remaining in the given scan direction.
+ */
+static inline bool
+index_scan_pos_advance(ScanDirection direction,
+					   IndexScanBatch batch, BatchRingItemPos *pos)
+{
+	/*
+	 * On entry, pos->item must be valid, and must actually point to a valid
+	 * item for this batch.  There is exactly one exception: pos->item may
+	 * initially sit one step outside the batch when caller just flipped its
+	 * scan direction.  pos->item will point to a valid item once we return
+	 * (we _must_ return true when passed a just-stepped-off-batch position).
+	 *
+	 * This precondition ensures that callers actually step to the next batch
+	 * when indicated (or flip the scan direction instead, which can happen
+	 * right after a cursor tries to step off the final batch in the given
+	 * scan direction).  Table AMs must avoid ambiguous positional states.
+	 */
+	Assert(pos->valid);
+
+	if (ScanDirectionIsForward(direction))
+	{
+		/* Precondition: valid-or-just-before-start item position */
+		Assert(pos->item >= batch->firstItem - 1);
+		Assert(pos->item <= batch->lastItem);
+
+		if (++pos->item > batch->lastItem)
+			return false;
+	}
+	else						/* ScanDirectionIsBackward */
+	{
+		/* Precondition: valid-or-just-past-end item position */
+		Assert(pos->item >= batch->firstItem);
+		Assert(pos->item <= batch->lastItem + 1);
+
+		if (--pos->item < batch->firstItem)
+			return false;
+	}
+
+	/* Advanced within batch */
+	return true;
+}
+
+/*
+ * Position pos at the start of newBatch (in the given scan direction).
+ *
+ * When we're called, pos should point to a batch that caller just finished
+ * consuming from (or be invalid, when no batch has been loaded for caller's
+ * scan yet).  When we return, pos will point to newBatch, the next batch from
+ * the ring buffer.  We'll have also set pos's item offset to newBatch's
+ * initial item in the given direction (the first item when scanning forwards,
+ * the last item when scanning backwards).
+ *
+ * newBatch doesn't have to be (and often isn't) the most recently appended
+ * batch in the scan's ring buffer.  It is merely the next batch in line to be
+ * consumed from the point of view of our caller.
+ */
+static inline void
+index_scan_pos_startbatch(ScanDirection direction,
+						  IndexScanBatch newBatch, BatchRingItemPos *pos)
+{
+	Assert(newBatch->dir == direction);
+	Assert(newBatch->firstItem >= 0 && newBatch->firstItem <= newBatch->lastItem);
+
+	/* Increment batch (might wrap), or initialize it to zero */
+	if (pos->valid)
+		pos->batch++;
+	else
+		pos->batch = 0;
+
+	pos->valid = true;
+
+	if (ScanDirectionIsForward(direction))
+		pos->item = newBatch->firstItem;
+	else
+		pos->item = newBatch->lastItem;
+}
+
+/* ----------------------------------------------------------------------------
+ * Utilities called by table AMs
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Sets up the batch ring buffer structure for use by an index scan.
+ *
+ * Called from table AM's index_scan_begin callback during amgetbatch scans.
+ */
+static inline void
+tableam_util_batchscan_init(IndexScanDesc scan)
+{
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+
+	scan->batchringbuf.scanPos.valid = false;
+	scan->batchringbuf.markPos.valid = false;
+
+	scan->batchringbuf.markBatch = NULL;
+	scan->batchringbuf.headBatch = 0;
+	scan->batchringbuf.nextBatch = 0;
+
+	scan->usebatchring = true;
+}
+
+extern void tableam_util_batchscan_reset(IndexScanDesc scan, bool endscan);
+extern void tableam_util_batchscan_end(IndexScanDesc scan);
+extern void tableam_util_batchscan_mark_pos(IndexScanDesc scan);
+extern void tableam_util_batchscan_restore_pos(IndexScanDesc scan);
+extern void tableam_util_scanbatch_dirchange(IndexScanDesc scan);
+extern void tableam_util_scanpos_killitem(IndexScanDesc scan);
+extern void tableam_util_release_batch(IndexScanDesc scan, IndexScanBatch batch);
+extern void tableam_util_unguard_batch(IndexScanDesc scan, IndexScanBatch batch);
+
+/*
+ * Try to advance the scan's scanPos to the next matching item from the
+ * scan's existing scanBatch, moving in the given scan direction.
+ *
+ * Sets *scanBatch to the ring buffer's existing scanBatch, or to NULL when no
+ * batch has been loaded yet (the first call here for the entire scan).
+ *
+ * Returns true when scanPos was advanced, in which case the scan should
+ * process the item that scanPos now points to.  Returns false when there are
+ * no more matching items remaining in scanBatch (or when no scanBatch has
+ * been loaded yet).  Caller responds to a false return by passing *scanBatch
+ * to tableam_util_fetch_next_batch as its priorBatch argument, advancing the
+ * scan to its next batch.
+ */
+static pg_attribute_always_inline bool
+tableam_util_scanpos_advance(IndexScanDesc scan, ScanDirection direction,
+							 IndexScanBatch *scanBatch, BatchRingItemPos *scanPos)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+
+	if (!scanPos->valid)
+	{
+		/* First call here for the entire scan */
+		Assert(index_scan_batch_count(scan) == 0);
+
+		*scanBatch = NULL;
+		return false;
+	}
+
+	/*
+	 * scanPos is valid, so scanBatch must already be loaded in batch ring
+	 * buffer.  We rely on that here.
+	 */
+	pg_assume(batchringbuf->headBatch == scanPos->batch);
+
+	*scanBatch = index_scan_batch(scan, scanPos->batch);
+
+	return index_scan_pos_advance(direction, *scanBatch, scanPos);
+}
+
+/*
+ * Fetch the next batch of matching items for the scan (or the first).
+ *
+ * Called when caller's current batch (passed to us as priorBatch) has no more
+ * matching items in the given scan direction.  Caller passes a NULL
+ * priorBatch on the first call here for the scan.
+ *
+ * Returns the next batch to be processed by caller in the given scan
+ * direction, or NULL when there are no more matches in that direction.
+ * Returned batch will have already been appended to the scan's ring buffer
+ * (though not necessarily during this call).
+ *
+ * We don't free any batches here; that is a separate step performed by
+ * tableam_util_scanpos_nextbatch.  Caller also needs to advance their
+ * position to the start of the returned batch.
+ */
+static pg_attribute_always_inline IndexScanBatch
+tableam_util_fetch_next_batch(IndexScanDesc scan, ScanDirection direction,
+							  IndexScanBatch priorBatch, BatchRingItemPos *pos)
+{
+	IndexScanBatch batch = NULL;
+	BatchRingBuffer *batchringbuf PG_USED_FOR_ASSERTS_ONLY = &scan->batchringbuf;
+
+	Assert(scan->usebatchring);
+
+	if (!priorBatch)
+	{
+		/* First call for the scan */
+		Assert(pos == &batchringbuf->scanPos);
+	}
+	else if (unlikely(priorBatch->dir != direction))
+	{
+		/*
+		 * We detected a change in scan direction across batches.  Prepare
+		 * scan's batchringbuf state for us to get the next batch for the
+		 * opposite scan direction to the one used when priorBatch was
+		 * returned by amgetbatch.
+		 */
+		tableam_util_scanbatch_dirchange(scan);
+
+		/* priorBatch is now batchringbuf's only batch */
+		Assert(pos->batch == batchringbuf->headBatch);
+		Assert(index_scan_batch_count(scan) == 1);
+	}
+	else if (index_scan_batch_loaded(scan, pos->batch + 1))
+	{
+		/* Next batch already loaded for us */
+		batch = index_scan_batch(scan, pos->batch + 1);
+
+		Assert(priorBatch->dir == direction);
+		Assert(batch->dir == direction);
+		Assert(batch->firstItem >= 0 && batch->firstItem <= batch->lastItem);
+		return batch;
+	}
+
+	/*
+	 * Assert preconditions for calling amgetbatch.
+	 *
+	 * priorBatch had better be the last valid batch currently in the ring
+	 * buffer (batches must stay in scan order).  If it isn't then we should
+	 * have already returned some existing loaded batch earlier.
+	 */
+	Assert(!index_scan_batch_full(scan));
+	Assert(!priorBatch ||
+		   (index_scan_batch_count(scan) > 0 && priorBatch->dir == direction &&
+			index_scan_batch(scan, batchringbuf->nextBatch - 1) == priorBatch));
+
+	/*
+	 * Before we call amgetbatch again, check if priorBatch is already known
+	 * to be the last batch with matching items in this scan direction
+	 */
+	if (priorBatch &&
+		(ScanDirectionIsForward(direction) ?
+		 priorBatch->knownEndForward :
+		 priorBatch->knownEndBackward))
+		return NULL;
+
+	batch = scan->indexRelation->rd_indam->amgetbatch(scan, priorBatch,
+													  direction);
+	if (batch)
+	{
+		/* We got the batch from the index AM */
+		Assert(batch->dir == direction);
+		Assert(batch->firstItem >= 0 && batch->firstItem <= batch->lastItem);
+
+		/* Append batch to the end of the scan's batch ring buffer */
+		index_scan_batch_append(scan, batch);
+
+		/*
+		 * Theoretically we should set knownEndForward/knownEndBackward to
+		 * false (whichever is used when moving in the opposite direction)
+		 * when this is the scan's first returned batch.  We don't bother
+		 * because the index AM should always record that fact in its own
+		 * opaque area.  (These fields only exist because we don't want index
+		 * AMs setting _any_ field from any priorbatch that we pass to them.
+		 * Besides, it would be cumbersome for index AMs to keep track of
+		 * which batch is the current amgetbatch call's original priorbatch.)
+		 */
+	}
+	else
+	{
+		/* amgetbatch returned NULL */
+		if (priorBatch)
+		{
+			/*
+			 * There are no further matches to be found in the current scan
+			 * direction, following priorBatch.  Remember that priorBatch is
+			 * the last batch with matching items.
+			 */
+			if (ScanDirectionIsForward(direction))
+				priorBatch->knownEndForward = true;
+			else
+				priorBatch->knownEndBackward = true;
+		}
+	}
+
+	return batch;
+}
+
+/*
+ * Position scanPos at the start of newScanBatch (in the given scan
+ * direction), and remove the scan's old scanBatch from the ring buffer.
+ *
+ * Called after tableam_util_fetch_next_batch returns newScanBatch, the next
+ * batch that scanPos will consume matching items from.  We release the
+ * now-obsolescent old scanBatch (the ring buffer's head batch), freeing up
+ * its ring buffer slot.  (When newScanBatch is the scan's first batch, there
+ * is no old scanBatch for us to release.)
+ */
+static pg_attribute_always_inline void
+tableam_util_scanpos_nextbatch(IndexScanDesc scan, ScanDirection direction,
+							   IndexScanBatch newScanBatch)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &batchringbuf->scanPos;
+	bool		releaseOldHeadBatch = scanPos->valid;
+	IndexScanBatch headBatch;
+
+	/* Position scanPos at the start of the new scanBatch */
+	index_scan_pos_startbatch(direction, newScanBatch, scanPos);
+	Assert(index_scan_batch(scan, scanPos->batch) == newScanBatch);
+
+	if (!releaseOldHeadBatch)
+	{
+		/* newScanBatch is the scan's first and only batch */
+		Assert(batchringbuf->headBatch == scanPos->batch);
+		return;
+	}
+
+	headBatch = index_scan_batch(scan, batchringbuf->headBatch);
+
+	Assert(headBatch != newScanBatch);
+	Assert(batchringbuf->headBatch != scanPos->batch);
+
+	/* free obsolescent head batch (unless it is scan's markBatch) */
+	tableam_util_release_batch(scan, headBatch);
+
+	/* Remove the batch from the ring buffer (even if it's markBatch) */
+	batchringbuf->headBatch++;
+
+	/* Postconditions for having freed up a ring buffer slot */
+	Assert(!index_scan_batch_full(scan));
+	Assert(batchringbuf->headBatch == scanPos->batch);
+}
+
+/*
+ * Fetch the next matching TID for the scan (or the first).
+ *
+ * This is the amgettuple equivalent of tableam_util_fetch_next_batch.
+ *
+ * There is no batch-like state for us to manage (typically that's up to the
+ * index AM when it implements amgettuple).
+ */
+static pg_attribute_always_inline ItemPointer
+tableam_util_fetch_next_tuple_tid(IndexScanDesc scan, ScanDirection direction)
+{
+	bool		found;
+
+	Assert(!scan->usebatchring);
+
+	found = scan->indexRelation->rd_indam->amgettuple(scan, direction);
+
+	/* Reset kill flag immediately for safety */
+	scan->kill_prior_tuple = false;
+	Assert(!scan->xs_heap_continue);
+
+	/* If we're out of index entries, we're done */
+	if (!found)
+		return NULL;
+
+	/* Return the TID of the tuple we found */
+	return &scan->xs_heaptid;
+}
+
+/* ----------------------------------------------------------------------------
+ * Utilities called by index AMs
+ * ----------------------------------------------------------------------------
+ */
+extern void indexam_util_unlock_batch(IndexScanDesc scan, IndexScanBatch batch,
+									  Buffer buf);
+extern IndexScanBatch indexam_util_alloc_batch(IndexScanDesc scan);
+extern void indexam_util_release_batch(IndexScanDesc scan, IndexScanBatch batch);
+
+#endif							/* INDEXBATCH_H */
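
An aside for reviewers, not part of the patch: the ring buffer helpers in
indexbatch.h above (index_scan_batch_count, index_scan_batch_loaded,
index_scan_batch_append) depend on free-running uint8 counters that are only
masked when a slot is addressed. The standalone sketch below tries to show
that arithmetic in isolation. Everything in it is assumed for illustration:
the Ring struct, the ring_* helper names, the MAX_BATCHES value of 64, and
the int payload standing in for IndexScanBatch pointers. It also assumes the
usual two's complement narrowing when converting to int8_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for INDEX_SCAN_MAX_BATCHES (value is assumed) */
#define MAX_BATCHES 64

_Static_assert(MAX_BATCHES <= INT8_MAX + 1,
               "ring_loaded relies on int8 arithmetic");
_Static_assert((MAX_BATCHES & (MAX_BATCHES - 1)) == 0,
               "MAX_BATCHES must be a power of 2");

/* Hypothetical ring: counters run freely and wrap at 256 */
typedef struct Ring
{
    uint8_t head;               /* oldest loaded entry */
    uint8_t next;               /* index the next append will use */
    int     slots[MAX_BATCHES]; /* payload (the patch stores batches) */
} Ring;

/* Number of loaded entries; the uint8 subtraction absorbs wraparound */
static uint8_t
ring_count(const Ring *r)
{
    return (uint8_t) (r->next - r->head);
}

/* Is idx within [head, next), treating the counters as circular? */
static bool
ring_loaded(const Ring *r, uint8_t idx)
{
    return (int8_t) (uint8_t) (idx - r->head) >= 0 &&
        (int8_t) (uint8_t) (idx - r->next) < 0;
}

/* Store value in the next slot; only the slot lookup is masked */
static void
ring_append(Ring *r, int value)
{
    assert(ring_count(r) < MAX_BATCHES);
    r->slots[r->next & (MAX_BATCHES - 1)] = value;
    r->next++;
}

/* Fetch the entry at a loaded index */
static int
ring_get(const Ring *r, uint8_t idx)
{
    assert(ring_loaded(r, idx));
    return r->slots[idx & (MAX_BATCHES - 1)];
}

/* Drop the oldest entry, freeing up its slot */
static void
ring_release_head(Ring *r)
{
    assert(ring_count(r) > 0);
    r->head++;
}
```

Starting both counters at 250 and appending ten entries wraps next around to
4, yet ring_count still reports 10, and ring_loaded accepts 250 through 3
while rejecting 249 and 4. The sign test is only unambiguous because at most
128 entries can be loaded at once, which is the bound the first static
assertion enforces.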
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 3097e9bb1..0254a223e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -15,6 +15,7 @@
 #define NBTREE_H
 
 #include "access/amapi.h"
+#include "access/indexbatch.h"
 #include "access/itup.h"
 #include "access/sdir.h"
 #include "catalog/pg_am_d.h"
@@ -924,112 +925,6 @@ typedef struct BTVacuumPostingData
 
 typedef BTVacuumPostingData *BTVacuumPosting;
 
-/*
- * BTScanOpaqueData is the btree-private state needed for an indexscan.
- * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
- * details of the preprocessing), information about the current location
- * of the scan, and information about the marked location, if any.  (We use
- * BTScanPosData to represent the data needed for each of current and marked
- * locations.)	In addition we can remember some known-killed index entries
- * that must be marked before we can move off the current page.
- *
- * Index scans work a page at a time: we pin and read-lock the page, identify
- * all the matching items on the page and save them in BTScanPosData, then
- * release the read-lock while returning the items to the caller for
- * processing.  This approach minimizes lock/unlock traffic.  We must always
- * drop the lock to make it okay for caller to process the returned items.
- * Whether or not we can also release the pin during this window will vary.
- * We drop the pin (when so->dropPin) to avoid blocking progress by VACUUM
- * (see nbtree/README section about making concurrent TID recycling safe).
- * We'll always release both the lock and the pin on the current page before
- * moving on to its sibling page.
- *
- * If we are doing an index-only scan, we save the entire IndexTuple for each
- * matched item, otherwise only its heap TID and offset.  The IndexTuples go
- * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.  Posting list tuples store a "base" tuple once,
- * allowing the same key to be returned for each TID in the posting list
- * tuple.
- */
-
-typedef struct BTScanPosItem	/* what we remember about each match */
-{
-	ItemPointerData heapTid;	/* TID of referenced heap item */
-	OffsetNumber indexOffset;	/* index item's location within page */
-	LocationIndex tupleOffset;	/* IndexTuple's offset in workspace, if any */
-} BTScanPosItem;
-
-typedef struct BTScanPosData
-{
-	Buffer		buf;			/* currPage buf (invalid means unpinned) */
-
-	/* page details as of the saved position's call to _bt_readpage */
-	BlockNumber currPage;		/* page referenced by items array */
-	BlockNumber prevPage;		/* currPage's left link */
-	BlockNumber nextPage;		/* currPage's right link */
-	XLogRecPtr	lsn;			/* currPage's LSN (when so->dropPin) */
-
-	/* scan direction for the saved position's call to _bt_readpage */
-	ScanDirection dir;
-
-	/*
-	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
-	 */
-	int			nextTupleOffset;
-
-	/*
-	 * moreLeft and moreRight track whether we think there may be matching
-	 * index entries to the left and right of the current page, respectively.
-	 */
-	bool		moreLeft;
-	bool		moreRight;
-
-	/*
-	 * The items array is always ordered in index order (ie, increasing
-	 * indexoffset).  When scanning backwards it is convenient to fill the
-	 * array back-to-front, so we start at the last slot and fill downwards.
-	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
-	 * itemIndex is a cursor showing which entry was last returned to caller.
-	 */
-	int			firstItem;		/* first valid index in items[] */
-	int			lastItem;		/* last valid index in items[] */
-	int			itemIndex;		/* current index in items[] */
-
-	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
-} BTScanPosData;
-
-typedef BTScanPosData *BTScanPos;
-
-#define BTScanPosIsPinned(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BufferIsValid((scanpos).buf) \
-)
-#define BTScanPosUnpin(scanpos) \
-	do { \
-		ReleaseBuffer((scanpos).buf); \
-		(scanpos).buf = InvalidBuffer; \
-	} while (0)
-#define BTScanPosUnpinIfPinned(scanpos) \
-	do { \
-		if (BTScanPosIsPinned(scanpos)) \
-			BTScanPosUnpin(scanpos); \
-	} while (0)
-
-#define BTScanPosIsValid(scanpos) \
-( \
-	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
-				!BufferIsValid((scanpos).buf)), \
-	BlockNumberIsValid((scanpos).currPage) \
-)
-#define BTScanPosInvalidate(scanpos) \
-	do { \
-		(scanpos).buf = InvalidBuffer; \
-		(scanpos).currPage = InvalidBlockNumber; \
-	} while (0)
-
 /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
 typedef struct BTArrayKeyInfo
 {
@@ -1050,6 +945,43 @@ typedef struct BTArrayKeyInfo
 	ScanKey		high_compare;	/* array's < or <= upper bound */
 } BTArrayKeyInfo;
 
+/* Per-batch data private to the btree index AM */
+typedef struct BTBatchData
+{
+	Buffer		buf;			/* leaf page's buffer pin */
+	BlockNumber currPage;		/* leaf page's block number */
+	BlockNumber prevPage;		/* leaf page's left sibling */
+	BlockNumber nextPage;		/* leaf page's right sibling */
+	bool		moreLeft;		/* more pages of interest to the left? */
+	bool		moreRight;		/* more pages of interest to the right? */
+} BTBatchData;
+
+/* Access the btree-private per-batch data from an IndexScanBatch pointer */
+#define BTBatchGetData(scan, batch) \
+	index_scan_batch_index_opaque_static(scan, batch, BTBatchData)
+
+/*
+ * BTScanOpaqueData is the btree-private state needed for an indexscan.
+ * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
+ * details of the preprocessing), and information about the current array
+ * keys.  There are assumptions about how the current array keys track the
+ * progress of the index scan through the index's key space (see _bt_readpage,
+ * btposreset, and _bt_advance_array_keys), but we don't track anything about
+ * the current scan position/batch in this opaque struct.
+ *
+ * Index scans work a page at a time, as required by the amgetbatch contract:
+ * we pin and read-lock the page, identify all the matching items on the page
+ * and return them in a newly allocated batch.  We then release the read-lock
+ * using amgetbatch utility routines.  This approach minimizes lock/unlock
+ * traffic.  _bt_next is passed priorbatch, which has a BTBatchData area that
+ * tells us which page is next in line to be read in the given scan direction
+ * (this is often the same priorbatch passed to btgetbatch by core code).
+ *
+ * If we are doing an index-only scan, we save the entire IndexTuple for each
+ * matched item, otherwise only its table TID and offset.  Posting list tuples
+ * store a "base" tuple once, allowing the same key to be used for each TID in
+ * the posting list.
+ */
 typedef struct BTScanOpaqueData
 {
 	/* these fields are set by _bt_preprocess_keys(): */
@@ -1066,32 +998,6 @@ typedef struct BTScanOpaqueData
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	FmgrInfo   *orderProcs;		/* ORDER procs for required equality keys */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
-
-	/* info about killed items if any (killedItems is NULL if never used) */
-	int		   *killedItems;	/* currPos.items indexes of killed items */
-	int			numKilled;		/* number of currently stored items */
-	bool		dropPin;		/* drop leaf pin before btgettuple returns? */
-
-	/*
-	 * If we are doing an index-only scan, these are the tuple storage
-	 * workspaces for the currPos and markPos respectively.  Each is of size
-	 * BLCKSZ, so it can hold as much as a full page's worth of tuples.
-	 */
-	char	   *currTuples;		/* tuple storage for currPos */
-	char	   *markTuples;		/* tuple storage for markPos */
-
-	/*
-	 * If the marked position is on the same page as current position, we
-	 * don't use markPos, but just keep the marked itemIndex in markItemIndex
-	 * (all the rest of currPos is valid for the mark position). Hence, to
-	 * determine if there is a mark, first look at markItemIndex, then at
-	 * markPos.
-	 */
-	int			markItemIndex;	/* itemIndex, or -1 if not valid */
-
-	/* keep these last in struct for efficiency */
-	BTScanPosData currPos;		/* current position data */
-	BTScanPosData markPos;		/* marked position, if any */
 } BTScanOpaqueData;
 
 typedef BTScanOpaqueData *BTScanOpaque;
@@ -1160,14 +1066,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
 extern Size btestimateparallelscan(Relation rel, int nkeys, int norderbys);
 extern void btinitparallelscan(void *target);
-extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch btgetbatch(IndexScanDesc scan,
+								 IndexScanBatch priorbatch,
+								 ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					 ScanKey orderbys, int norderbys);
+extern void btunguardbatch(IndexScanDesc scan, IndexScanBatch batch);
+extern void btkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch);
 extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
-extern void btmarkpos(IndexScanDesc scan);
-extern void btrestrpos(IndexScanDesc scan);
+extern void btposreset(IndexScanDesc scan, IndexScanBatch batch);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
@@ -1271,8 +1180,9 @@ extern void _bt_preprocess_keys(IndexScanDesc scan);
 /*
  * prototypes for functions in nbtreadpage.c
  */
-extern bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
-						 OffsetNumber offnum, bool firstpage);
+extern bool _bt_readpage(IndexScanDesc scan, IndexScanBatch newbatch,
+						 ScanDirection dir, OffsetNumber offnum,
+						 bool firstpage);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
 extern int	_bt_binsrch_array_skey(FmgrInfo *orderproc,
 								   bool cur_elem_trig, ScanDirection dir,
@@ -1287,15 +1197,15 @@ extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access, bool returnstack);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
-extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _bt_first(IndexScanDesc scan, ScanDirection dir);
+extern IndexScanBatch _bt_next(IndexScanDesc scan, ScanDirection dir,
+							   IndexScanBatch priorbatch);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
 
 /*
  * prototypes for functions in nbtutils.c
  */
 extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
-extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
 extern void _bt_end_vacuum(Relation rel);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e2e2150da..2f2314843 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -122,13 +122,167 @@ typedef struct ParallelBlockTableScanWorkerData
 } ParallelBlockTableScanWorkerData;
 typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 
+/*
+ * Data structures used by amgetbatch index scans.
+ *
+ * These structs are defined here, rather than in access/indexbatch.h, only
+ * because IndexScanDescData embeds them by value.  relscan.h defines data
+ * structures only; all of the functions that operate on them appear in
+ * access/indexbatch.h.
+ */
+
+/*
+ * Location of a BatchMatchingItem within the scan's ring buffer
+ */
+typedef struct BatchRingItemPos
+{
+	/* Position references a valid IndexScanDescData.batchbuf[] entry? */
+	bool		valid;
+
+	/* IndexScanDescData.batchbuf[]-wise index to relevant IndexScanBatch */
+	uint8		batch;
+
+	/* IndexScanBatch.items[]-wise index to relevant BatchMatchingItem */
+	int			item;
+
+} BatchRingItemPos;
+
+/*
+ * Matching item returned by amgetbatch (in returned IndexScanBatch) during an
+ * index scan.  Used by table AM to locate relevant matching table tuple.
+ */
+typedef struct BatchMatchingItem
+{
+	ItemPointerData tableTid;	/* TID of referenced table item */
+	OffsetNumber indexOffset;	/* index item's location within page */
+	LocationIndex tupleOffset;	/* index tuple's currTuples offset, if any */
+} BatchMatchingItem;
+
+/*
+ * Data about one batch of items returned by (and passed to) amgetbatch during
+ * index scans.
+ *
+ * The batch pointer returned by amgetbatch points into the interior of a
+ * larger allocation, which also carries opaque areas for the table AM and
+ * the index AM.  See access/indexbatch.h for the layout of batch allocations
+ * and for the accessors used to reach each constituent area.
+ */
+typedef struct IndexScanBatchData
+{
+	/* Index page's LSN, optionally used by amkillitemsbatch routines */
+	XLogRecPtr	lsn;
+
+	/* scan direction when the index page was read */
+	ScanDirection dir;
+
+	/*
+	 * knownEndBackward and knownEndForward indicate that this batch is the
+	 * last one with matching items in the relevant scan direction.  When
+	 * amgetbatch returns NULL for a given direction, the corresponding flag
+	 * is set on the priorbatch that was passed to that call.  We cannot know
+	 * this when a batch is first returned by amgetbatch; it only becomes
+	 * apparent when we try and fail to continue the scan past it.
+	 *
+	 * This allows table AMs to avoid redundant amgetbatch calls with the same
+	 * priorbatch -- the index AM might need to read additional index pages to
+	 * determine there are no more matching items beyond caller's priorbatch.
+	 */
+	bool		knownEndBackward;
+	bool		knownEndForward;
+
+	/*
+	 * Batch still holds TID recycling interlock?
+	 */
+	bool		isGuarded;
+
+	/*
+	 * Matching items state for this batch.  Output by index AM for table AM.
+	 *
+	 * The items array is always ordered in index order (ie, by increasing
+	 * indexOffset).  When scanning backwards it is convenient for index AMs
+	 * to fill the array back-to-front, starting at the last item slot and
+	 * filling downwards.  This is why we need both a first-valid-entry and a
+	 * last-valid-entry counter.
+	 *
+	 * Note: these are signed because it's sometimes convenient to use -1 to
+	 * represent an out-of-bounds space just before firstItem (when it's 0).
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+
+	/* info about dead items, if any (palloc'd separately, NULL if unused) */
+	int			numDead;		/* number of currently stored items */
+	int		   *deadItems;		/* items[]-wise indexes of dead items */
+
+	/*
+	 * If we are doing an index-only scan, this is the tuple storage workspace
+	 * for the matching tuples (tuples referenced by items[]).  The workspace
+	 * size is determined by the index AM (batch_tuples_workspace).
+	 *
+	 * currTuples points into the trailing portion of this allocation,
+	 * directly past items[].  It is NULL for plain index scans.
+	 */
+	char	   *currTuples;		/* tuple storage for items[] */
+	BatchMatchingItem items[FLEXIBLE_ARRAY_MEMBER]; /* matching items */
+} IndexScanBatchData;
+
+typedef struct IndexScanBatchData *IndexScanBatch;
+
+/*
+ * State used by table AMs to manage an index scan that uses the amgetbatch
+ * interface.  Scans use a ring buffer of batches returned by amgetbatch.
+ *
+ * This data structure provides table AMs with a way to read ahead of the
+ * current read position by _multiple_ batches/index pages.  The further out
+ * the table AM reads ahead like this, the further it can see into the future.
+ * That way the table AM is able to reorder work as aggressively as desired.
+ */
+typedef struct BatchRingBuffer
+{
+	/* current positions in IndexScanDescData.batchbuf[] for scan */
+	BatchRingItemPos scanPos;	/* scan's read position */
+	BatchRingItemPos markPos;	/* mark/restore position */
+
+	/* markPos's batch (not in ring buffer when markBatch != scanBatch) */
+	IndexScanBatch markBatch;
+
+	/*
+	 * headBatch is an index to the earliest still-valid ring buffer batch
+	 * slot in batchbuf[].  The actual array position for its IndexScanBatch
+	 * is headBatch & (INDEX_SCAN_MAX_BATCHES - 1), since these indexes use
+	 * unsigned wrapping arithmetic.  headBatch must be the scan's current
+	 * scanBatch (i.e. the current scanPos batch).
+	 */
+	uint8		headBatch;
+
+	/*
+	 * nextBatch is an index to the next _empty_ ring buffer batch slot in
+	 * batchbuf[] (i.e. it's the tail entry of our ring buffer).  The actual
+	 * batchbuf[] array position is nextBatch & (INDEX_SCAN_MAX_BATCHES - 1).
+	 * New batches can only be safely appended to this tail position when
+	 * !index_scan_batch_full() (see access/indexbatch.h).
+	 *
+	 * Note: the scan's most recently appended batch is always located at
+	 * (nextBatch - 1) & (INDEX_SCAN_MAX_BATCHES - 1).
+	 */
+	uint8		nextBatch;
+} BatchRingBuffer;
+
 struct IndexScanInstrumentation;
 
 /*
  * We use the same IndexScanDescData structure for both amgettuple-based
  * and amgetbitmap-based index scans.  Some fields are only relevant in
- * amgettuple-based scans.
+ * amgettuple-based scans.  Others are only used in amgetbatch-based scans.
+ *
+ * The ring buffer used by amgetbatch scans is stored here as a fixed array of
+ * pointers to batches.  We need a minimum of two ring buffer batches (but use
+ * INDEX_SCAN_MAX_BATCHES), since table AMs only remove a batch after they've
+ * already called amgetbatch again and appended the returned batch.
  */
+#define INDEX_SCAN_CACHE_BATCHES	2
+#define INDEX_SCAN_MAX_BATCHES		64
+
 typedef struct IndexScanDescData
 {
 	/* scan parameters */
@@ -139,6 +293,26 @@ typedef struct IndexScanDescData
 	int			numberOfOrderBys;	/* number of ordering operators */
 	struct ScanKeyData *keyData;	/* array of index qualifier descriptors */
 	struct ScanKeyData *orderByData;	/* array of ordering op descriptors */
+
+	/* index access method's private state */
+	void	   *opaque;			/* access-method-specific info */
+
+	/* scan's amgetbatch state (only used by amgetbatch/usebatchring scans) */
+	BatchRingBuffer batchringbuf;
+
+	/*
+	 * Array of pointers to recyclable batches, used by all amgetbatch scans
+	 * and by amgetbitmap scans of an index AM that supports amgetbatch
+	 */
+	IndexScanBatch batchcache[INDEX_SCAN_CACHE_BATCHES];
+
+	/* Array of pointers to batches, referenced within batchringbuf */
+	IndexScanBatch batchbuf[INDEX_SCAN_MAX_BATCHES];
+
+	bool		usebatchring;	/* scan uses amgetbatch/batchringbuf? */
+	bool		batchImmediateUnguard;	/* eagerly drop TID recycling
+										 * interlock? */
+
 	bool		xs_want_itup;	/* caller requests index tuples */
 	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
 
@@ -147,13 +321,12 @@ typedef struct IndexScanDescData
 	bool		ignore_killed_tuples;	/* do not return killed entries */
 	bool		xactStartedInRecovery;	/* prevents killing/seeing killed
 										 * tuples */
-
-	/* index access method's private state */
-	void	   *opaque;			/* access-method-specific info */
+	/* xs_snapshot uses an MVCC snapshot? */
+	bool		MVCCScan;
 
 	/*
-	 * Instrumentation counters maintained by all index AMs during both
-	 * amgettuple calls and amgetbitmap calls (unless field remains NULL)
+	 * Instrumentation counters maintained during amgetbatch, amgetbitmap, and
+	 * amgettuple scans (unless field remains NULL)
 	 */
 	struct IndexScanInstrumentation *instrument;
 
@@ -185,14 +358,31 @@ typedef struct IndexScanDescData
 
 	/*
 	 * Resolved table_index_getnext_slot callback, which is set by
-	 * table_index_scan_begin at the start of amgettuple scans.  Reports via
-	 * *recheck whether the scan keys must be rechecked.
+	 * table_index_scan_begin at the start of amgetbatch/amgettuple scans.
+	 * Reports via *recheck whether the scan keys must be rechecked.
 	 */
 	bool		(*xs_getnext_slot) (struct IndexScanDescData *scan,
 									ScanDirection direction,
 									struct TupleTableSlot *slot,
 									bool *recheck);
 
+	/* batch size information, set once by index AM in ambeginscan */
+	uint16		maxitemsbatch;	/* size of each batch's items[] array */
+	uint16		batch_index_opaque_static;	/* compile-time opaque size */
+	uint16		batch_tuples_workspace; /* currTuples workspace size */
+
+	/*
+	 * Optional table AM per-batch opaque area size, set once by
+	 * index_scan_begin (except during bitmap scans)
+	 */
+	uint32		batch_table_opaque_size;	/* table AM opaque area size */
+
+	/*
+	 * Offset used by index_scan_batch_base (set on first batch alloc).  See
+	 * access/indexbatch.h.
+	 */
+	size_t		batch_base_offset;
+
 	/*
 	 * When fetching with an ordering operator, the values of the ORDER BY
 	 * expressions of the last returned tuple, according to the index.  If
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8f268e4d8..97a73132a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -454,10 +454,12 @@ typedef struct TableAmRoutine
 	 * flags is a bitmask of ScanOptions affecting underlying table scan
 	 * behavior. See scan_begin() for more information on passing these.
 	 *
-	 * Callback is responsible for setting IndexScanDesc.xs_getnext_slot to
-	 * the appropriate slot-based callback.  Tuples are then returned through
-	 * the caller's slot, via table_index_getnext_slot().  No separate
-	 * slot-based callback exists in this struct!
+	 * Callback is responsible for initializing the scan's batch ring buffer
+	 * (when the scan's index AM supports the amgetbatch interface), and for
+	 * setting IndexScanDesc.xs_getnext_slot to the appropriate slot-based
+	 * callback.  Tuples are then returned through the caller's slot, via
+	 * table_index_getnext_slot().  No separate slot-based callback exists in
+	 * this struct!
 	 *
 	 * In principle a single general-purpose callback (stored here) would
 	 * suffice, but using specialized variants allows the table AM to provide
@@ -475,20 +477,44 @@ typedef struct TableAmRoutine
 	 * columns do not change, need to return the current/correct version of
 	 * the tuple that is visible to the snapshot, even if the tid points to an
 	 * older version of the tuple.
+	 *
+	 * Callback also initializes the scan descriptor's batch_table_opaque_size
+	 * field, to let the core code know how much memory will be required in
+	 * the table AM portion of each batch allocation (though only during
+	 * amgetbatch index scans).  See relscan.h for full details.
 	 */
 	void		(*index_scan_begin) (IndexScanDesc scan, uint32 flags);
 
 	/*
-	 * Inform the table AM that there's to be either a rescan or a restore of
-	 * a marked position, or that the scan has run out of index entries.
+	 * Initialize table AM's per-batch opaque area within a batch allocation.
+	 *
+	 * Called by indexam_util_alloc_batch for each new or recycled batch, but
+	 * only when the table AM reserved an opaque area for the scan (by setting
+	 * batch_table_opaque_size to a value > 0).
 	 */
-	void		(*index_scan_reset) (IndexScanDesc scan);
+	void		(*index_scan_batch_init) (IndexScanDesc scan,
+										  IndexScanBatch batch);
+
+	/*
+	 * Inform the table AM that there's to be a rescan.
+	 */
+	void		(*index_scan_rescan) (IndexScanDesc scan);
 
 	/*
 	 * Release resources and deallocate index scan state.
 	 */
 	void		(*index_scan_end) (IndexScanDesc scan);
 
+	/*
+	 * Mark the current scan position so it can be restored later.
+	 */
+	void		(*index_scan_markpos) (IndexScanDesc scan);
+
+	/*
+	 * Restore a previously marked scan position.
+	 */
+	void		(*index_scan_restrpos) (IndexScanDesc scan);
+
 	/* ------------------------------------------------------------------------
 	 * Callbacks for non-modifying operations on individual tuples
 	 * ------------------------------------------------------------------------
@@ -1271,15 +1297,53 @@ table_index_scan_begin(IndexScanDesc scan, uint32 flags)
 }
 
 /*
- * Inform the table AM that there's to be either a rescan or a restore of a
- * marked position, or that the scan has run out of index entries.
+ * Inform the table AM that there's to be a rescan.
  */
 static inline void
-table_index_scan_reset(IndexScanDesc scan)
+table_index_scan_rescan(IndexScanDesc scan)
 {
 	Assert(scan->xs_table_opaque);
 
-	scan->heapRelation->rd_tableam->index_scan_reset(scan);
+	scan->heapRelation->rd_tableam->index_scan_rescan(scan);
+}
+
+/*
+ * Mark the current scan position so it can be restored later
+ */
+static inline void
+table_index_scan_markpos(IndexScanDesc scan)
+{
+	Assert(scan->xs_table_opaque && scan->usebatchring);
+
+	scan->heapRelation->rd_tableam->index_scan_markpos(scan);
+}
+
+/*
+ * Restore a previously marked scan position
+ *
+ * NOTE: this only restores the batch positional state of the table AM.  See
+ * comments for ExecRestrPos().
+ */
+static inline void
+table_index_scan_restrpos(IndexScanDesc scan)
+{
+	Assert(scan->xs_table_opaque && scan->usebatchring);
+	Assert(!scan->kill_prior_tuple);	/* not used with amgetbatch */
+
+	/*
+	 * Mark/restore only works correctly when there's at most one returnable
+	 * tuple per scan item, so that restoring the prior state at the scan item
+	 * granularity is sufficient.  Table AMs that can reach multiple row
+	 * versions through a single TID can generally only guarantee that under
+	 * MVCC snapshots (for heap, an MVCC-safe snapshot ensures that there's at
+	 * most one returnable tuple in each HOT chain).  Since the only current
+	 * user of mark/restore functionality is nodeMergejoin.c, this effectively
+	 * means that merge-join plans only work for MVCC snapshots.
+	 */
+	Assert(scan->MVCCScan);
+	scan->xs_heap_continue = false;
+
+	scan->heapRelation->rd_tableam->index_scan_restrpos(scan);
 }
 
 /*
@@ -1294,6 +1358,22 @@ table_index_scan_end(IndexScanDesc scan)
 	scan->heapRelation->rd_tableam->index_scan_end(scan);
 }
 
+/*
+ * Initialize table AM's per-batch opaque area within a batch allocation.
+ *
+ * Called by indexam_util_alloc_batch for each new or recycled batch, but only
+ * when the table AM reserved an opaque area for the scan (see the callback's
+ * documentation).
+ */
+static inline void
+table_index_scan_batch_init(IndexScanDesc scan, IndexScanBatch batch)
+{
+	Assert(scan->xs_table_opaque && scan->usebatchring);
+	Assert(scan->batch_table_opaque_size > 0);
+
+	scan->heapRelation->rd_tableam->index_scan_batch_init(scan, batch);
+}
+
 /*
  * Return the next tuple from an index scan through `slot`, scanning in the
  * specified direction.  Returns true if a tuple satisfying the scan keys and
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 27a2c6815..0ac16b931 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1437,12 +1437,12 @@ typedef struct IndexOptInfo
 	bool		amoptionalkey;
 	bool		amsearcharray;
 	bool		amsearchnulls;
-	/* does AM have amgettuple interface? */
-	bool		amhasgettuple;
+	/* does AM have amgetbatch (or amgettuple) interface? */
+	bool		amcanplainscan;
 	/* does AM have amgetbitmap interface? */
 	bool		amhasgetbitmap;
 	bool		amcanparallel;
-	/* does AM have ammarkpos interface? */
+	/* is AM prepared for us to restore a mark? */
 	bool		amcanmarkpos;
 	/* AM's cost estimator */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index bdb30752e..2d9d04aa3 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -264,6 +264,7 @@ brinhandler(PG_FUNCTION_ARGS)
 		.amconsistentequality = false,
 		.amconsistentordering = false,
 		.amcanbackward = false,
+		.amcanmarkpos = false,
 		.amcanunique = false,
 		.amcanmulticol = true,
 		.amoptionalkey = true,
@@ -298,10 +299,12 @@ brinhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = brinbeginscan,
 		.amrescan = brinrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amunguardbatch = NULL,
+		.amkillitemsbatch = NULL,
 		.amgetbitmap = bringetbitmap,
 		.amendscan = brinendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 6b148e69a..8f7033d62 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -1953,9 +1953,9 @@ gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	 * into the main index, and so we might visit it a second time during the
 	 * main scan.  This is okay because we'll just re-set the same bit in the
 	 * bitmap.  (The possibility of duplicate visits is a major reason why GIN
-	 * can't support the amgettuple API, however.) Note that it would not do
-	 * to scan the main index before the pending list, since concurrent
-	 * cleanup could then make us miss entries entirely.
+	 * can't support either the amgettuple or amgetbatch API.) Note that it
+	 * would not do to scan the main index before the pending list, since
+	 * concurrent cleanup could then make us miss entries entirely.
 	 */
 	scanPendingInsert(scan, tbm, &ntids);
 
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index e7cba81d4..0e8b6a549 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 		.amconsistentequality = false,
 		.amconsistentordering = false,
 		.amcanbackward = false,
+		.amcanmarkpos = false,
 		.amcanunique = false,
 		.amcanmulticol = true,
 		.amoptionalkey = true,
@@ -83,10 +84,12 @@ ginhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = ginbeginscan,
 		.amrescan = ginrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amunguardbatch = NULL,
+		.amkillitemsbatch = NULL,
 		.amgetbitmap = gingetbitmap,
 		.amendscan = ginendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8565e225b..67b16053a 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -69,6 +69,7 @@ gisthandler(PG_FUNCTION_ARGS)
 		.amconsistentequality = false,
 		.amconsistentordering = false,
 		.amcanbackward = false,
+		.amcanmarkpos = false,
 		.amcanunique = false,
 		.amcanmulticol = true,
 		.amoptionalkey = true,
@@ -103,10 +104,12 @@ gisthandler(PG_FUNCTION_ARGS)
 		.ambeginscan = gistbeginscan,
 		.amrescan = gistrescan,
 		.amgettuple = gistgettuple,
+		.amgetbatch = NULL,
+		.amunguardbatch = NULL,
+		.amkillitemsbatch = NULL,
 		.amgetbitmap = gistgetbitmap,
 		.amendscan = gistendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 8d8cd30dc..540f2bcd4 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -80,6 +80,7 @@ hashhandler(PG_FUNCTION_ARGS)
 		.amconsistentequality = true,
 		.amconsistentordering = false,
 		.amcanbackward = true,
+		.amcanmarkpos = false,
 		.amcanunique = false,
 		.amcanmulticol = false,
 		.amoptionalkey = false,
@@ -114,10 +115,12 @@ hashhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = hashbeginscan,
 		.amrescan = hashrescan,
 		.amgettuple = hashgettuple,
+		.amgetbatch = NULL,
+		.amunguardbatch = NULL,
+		.amkillitemsbatch = NULL,
 		.amgetbitmap = hashgetbitmap,
 		.amendscan = hashendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f0e8d091a..361df73ad 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2683,8 +2683,11 @@ static const TableAmRoutine heapam_methods = {
 	.parallelscan_reinitialize = table_block_parallelscan_reinitialize,
 
 	.index_scan_begin = heapam_index_scan_begin,
-	.index_scan_reset = heapam_index_scan_reset,
+	.index_scan_batch_init = heapam_index_scan_batch_init,
+	.index_scan_rescan = heapam_index_scan_rescan,
 	.index_scan_end = heapam_index_scan_end,
+	.index_scan_markpos = heapam_index_scan_markpos,
+	.index_scan_restrpos = heapam_index_scan_restrpos,
 
 	.tuple_insert = heapam_tuple_insert,
 	.tuple_insert_speculative = heapam_tuple_insert_speculative,
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index 32d6aff1d..323c245cd 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/heapam.h"
+#include "access/indexbatch.h"
 #include "access/relscan.h"
 #include "access/visibilitymap.h"
 #include "storage/predicate.h"
@@ -24,6 +25,38 @@
 #include "utils/pgstat_internal.h"
 
 
+/*
+ * heapam's per-batch private opaque area (only used during index-only scans).
+ *
+ * Maintains a per-batch visibility information cache, populated on demand
+ * using the visibility map.  It is important that we set each batch item's
+ * batchvis[] entry exactly once (or not at all).
+ */
+typedef struct HeapBatchData
+{
+	/*
+	 * Range of batchvis[] entries with valid visibility info, in items[]-wise
+	 * terms.  An item's visibility is cached iff firstVisSet <= item <=
+	 * lastVisSet.
+	 */
+	int			firstVisSet;	/* first valid batchvis[] entry, or > last */
+	int			lastVisSet;		/* last valid batchvis[] entry, or < first */
+
+	/* maxitemsbatch-many all-visible states, valid only within range */
+	bool		batchvis[FLEXIBLE_ARRAY_MEMBER];
+} HeapBatchData;
+
+#define HEAP_BATCH_VIS_CACHED(hbatch, item) \
+	((item) >= (hbatch)->firstVisSet && (item) <= (hbatch)->lastVisSet)
+
+static bool heapam_index_plain_batch_getnext_slot(IndexScanDesc scan,
+												  ScanDirection direction,
+												  TupleTableSlot *slot,
+												  bool *recheck);
+static bool heapam_index_only_batch_getnext_slot(IndexScanDesc scan,
+												 ScanDirection direction,
+												 TupleTableSlot *slot,
+												 bool *recheck);
 static bool heapam_index_plain_tuple_getnext_slot(IndexScanDesc scan,
 												  ScanDirection direction,
 												  TupleTableSlot *slot,
@@ -36,11 +69,28 @@ static pg_attribute_always_inline bool heapam_index_getnext_slot(IndexScanDesc s
 																 ScanDirection direction,
 																 TupleTableSlot *slot,
 																 bool index_only,
+																 bool amgetbatch,
 																 bool *recheck);
 static pg_attribute_always_inline bool heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan,
 															   TupleTableSlot *slot, bool index_only,
 															   bool *heap_continue, bool *all_dead);
-static pg_attribute_always_inline void heapam_index_kill_item(IndexScanDesc scan);
+static pg_attribute_always_inline void heapam_index_kill_item(IndexScanDesc scan,
+															  bool amgetbatch);
+static pg_attribute_always_inline ItemPointer heapam_index_getnext_scanbatch_pos(IndexScanDesc scan,
+																				 IndexScanHeapData *hscan,
+																				 ScanDirection direction,
+																				 bool *all_visible);
+static inline ItemPointer heapam_index_return_scanpos_tid(IndexScanDesc scan,
+														  IndexScanHeapData *hscan,
+														  ScanDirection direction,
+														  IndexScanBatch scanBatch,
+														  BatchRingItemPos *scanPos,
+														  bool *all_visible);
+static void heapam_index_batch_pos_visibility(IndexScanDesc scan,
+											  ScanDirection direction,
+											  IndexScanBatch batch,
+											  HeapBatchData *hbatch,
+											  BatchRingItemPos *pos);
 
 /*
  * Simple, single-shot TID lookup for constraint enforcement code (unique
@@ -90,18 +140,52 @@ heapam_index_scan_begin(IndexScanDesc scan, uint32 flags)
 {
 	IndexScanHeapData *hscan = palloc0_object(IndexScanHeapData);
 
-	hscan->xs_cbuf = InvalidBuffer;
+	/* table AM opaque area size is set below for index-only scans */
+	scan->batch_table_opaque_size = 0;
+
+	/* Current heap block state */
+	Assert(hscan->xs_cbuf == InvalidBuffer);
 	hscan->xs_blk = InvalidBlockNumber;
-	hscan->xs_vmbuffer = InvalidBuffer;
+
+	/* VM related state */
+	Assert(hscan->xs_vmbuffer == InvalidBuffer);
+	hscan->xs_vm_items = 1;
 
 	/* Remember if scan is read-only */
 	hscan->xs_readonly = (flags & SO_HINT_REL_READ_ONLY) != 0;
 
 	/* Resolve which xs_getnext_slot implementation to use for this scan */
-	if (scan->xs_want_itup)
-		scan->xs_getnext_slot = heapam_index_only_tuple_getnext_slot;
+	if (scan->indexRelation->rd_indam->amgetbatch != NULL)
+	{
+		/* amgetbatch index AM */
+		Assert(scan->maxitemsbatch > 0);
+
+		if (scan->xs_want_itup)
+		{
+			scan->xs_getnext_slot = heapam_index_only_batch_getnext_slot;
+
+			/*
+			 * Index-only scans cache visibility info in a HeapBatchData per
+			 * batch: a fixed-size header followed by a per-item batchvis[]
+			 * array (one bool per batch item)
+			 */
+			scan->batch_table_opaque_size = offsetof(HeapBatchData, batchvis) +
+				sizeof(bool) * scan->maxitemsbatch;
+		}
+		else
+			scan->xs_getnext_slot = heapam_index_plain_batch_getnext_slot;
+
+		/* Set up scan's batch ring buffer */
+		tableam_util_batchscan_init(scan);
+	}
 	else
-		scan->xs_getnext_slot = heapam_index_plain_tuple_getnext_slot;
+	{
+		/* amgettuple index AM */
+		if (scan->xs_want_itup)
+			scan->xs_getnext_slot = heapam_index_only_tuple_getnext_slot;
+		else
+			scan->xs_getnext_slot = heapam_index_plain_tuple_getnext_slot;
+	}
 
 	/*
 	 * Index-only scans that return "name" columns stored as cstrings need a
@@ -118,14 +202,44 @@ heapam_index_scan_begin(IndexScanDesc scan, uint32 flags)
 	scan->xs_table_opaque = hscan;
 }
 
+/*
+ * Initialize the heap table AM's per-batch opaque area
+ */
 void
-heapam_index_scan_reset(IndexScanDesc scan)
+heapam_index_scan_batch_init(IndexScanDesc scan, IndexScanBatch batch)
+{
+	HeapBatchData *hbatch;
+
+	/*
+	 * The core code should only call here during index-only scans.  Plain
+	 * scans don't request a HeapBatchData area at all.
+	 */
+	Assert(scan->xs_want_itup && scan->usebatchring);
+
+	/* Resetting the valid range makes it safe to use/recycle the batch */
+	hbatch = index_scan_batch_table_area(scan, batch);
+	hbatch->firstVisSet = INT_MAX;
+	hbatch->lastVisSet = -1;
+}
+
+/*
+ * Reset table AM index scan state in preparation for a rescan
+ */
+void
+heapam_index_scan_rescan(IndexScanDesc scan)
 {
 	IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque;
 
+	/* Rescans should avoid an excessive number of VM lookups */
+	hscan->xs_vm_items = 1;
+
 	/* Heap fetches from the last rescan don't count towards this limit  */
 	hscan->xs_blkswitch_count = 0;
 
+	/* Reset batch ring buffer state */
+	if (scan->usebatchring)
+		tableam_util_batchscan_reset(scan, false);
+
 	/*
 	 * Deliberately avoid dropping pins now held in xs_cbuf and xs_vmbuffer.
 	 * This saves cycles during certain tight nested loop joins (it can avoid
@@ -150,9 +264,37 @@ heapam_index_scan_end(IndexScanDesc scan)
 	if (hscan->xs_itup_cxt)
 		MemoryContextDelete(hscan->xs_itup_cxt);
 
+	/* Free all batch related resources */
+	if (scan->usebatchring)
+		tableam_util_batchscan_end(scan);
+
 	pfree(hscan);
 }
 
+/*
+ * Save batch ring buffer's current scanPos as its markPos
+ */
+void
+heapam_index_scan_markpos(IndexScanDesc scan)
+{
+	Assert(scan->usebatchring);
+	Assert(scan->indexRelation->rd_indam->amcanmarkpos);
+
+	tableam_util_batchscan_mark_pos(scan);
+}
+
+/*
+ * Restore batch ring buffer's markPos into its scanPos
+ */
+void
+heapam_index_scan_restrpos(IndexScanDesc scan)
+{
+	Assert(scan->usebatchring);
+	Assert(scan->indexRelation->rd_indam->amcanmarkpos);
+
+	tableam_util_batchscan_restore_pos(scan);
+}
+
 /*
  *	heap_hot_search_buffer	- search HOT chain for tuple satisfying snapshot
  *
@@ -400,6 +542,34 @@ heap_fill_ios_slot(IndexScanDesc scan, TupleTableSlot *slot)
 		elog(ERROR, "no data returned for index-only scan");
 }
 
+/* xs_getnext_slot callback: amgetbatch, plain index scan */
+static pg_attribute_hot bool
+heapam_index_plain_batch_getnext_slot(IndexScanDesc scan,
+									  ScanDirection direction,
+									  TupleTableSlot *slot,
+									  bool *recheck)
+{
+	Assert(!scan->xs_want_itup && scan->usebatchring);
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+
+	return heapam_index_getnext_slot(scan, direction, slot, false, true,
+									 recheck);
+}
+
+/* xs_getnext_slot callback: amgetbatch, index-only scan */
+static pg_attribute_hot bool
+heapam_index_only_batch_getnext_slot(IndexScanDesc scan,
+									 ScanDirection direction,
+									 TupleTableSlot *slot,
+									 bool *recheck)
+{
+	Assert(scan->xs_want_itup && scan->usebatchring);
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+
+	return heapam_index_getnext_slot(scan, direction, slot, true, true,
+									 recheck);
+}
+
 /* xs_getnext_slot callback: amgettuple, plain index scan */
 static pg_attribute_hot bool
 heapam_index_plain_tuple_getnext_slot(IndexScanDesc scan,
@@ -407,10 +577,11 @@ heapam_index_plain_tuple_getnext_slot(IndexScanDesc scan,
 									  TupleTableSlot *slot,
 									  bool *recheck)
 {
-	Assert(!scan->xs_want_itup);
+	Assert(!scan->xs_want_itup && !scan->usebatchring);
 	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
 
-	return heapam_index_getnext_slot(scan, direction, slot, false, recheck);
+	return heapam_index_getnext_slot(scan, direction, slot, false, false,
+									 recheck);
 }
 
 /* xs_getnext_slot callback: amgettuple, index-only scan */
@@ -420,14 +591,15 @@ heapam_index_only_tuple_getnext_slot(IndexScanDesc scan,
 									 TupleTableSlot *slot,
 									 bool *recheck)
 {
-	Assert(scan->xs_want_itup);
+	Assert(scan->xs_want_itup && !scan->usebatchring);
 	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
 
-	return heapam_index_getnext_slot(scan, direction, slot, true, recheck);
+	return heapam_index_getnext_slot(scan, direction, slot, true, false,
+									 recheck);
 }
 
 /*
- * Common implementation for both heapam_index_*_getnext_slot variants.
+ * Common implementation for all four heapam_index_*_getnext_slot variants.
  *
  * The result is true if a tuple satisfying the scan keys and the snapshot was
  * found, false otherwise.  On success the slot is filled: for plain index
@@ -439,13 +611,13 @@ heapam_index_only_tuple_getnext_slot(IndexScanDesc scan,
  * dropped by a future call here (or by a later call to heapam_index_scan_end
  * through index_endscan).
  *
- * The index_only parameter is a compile-time constant at each call site,
- * allowing the compiler to specialize the code for each variant.
+ * The index_only and amgetbatch parameters are compile-time constants at each
+ * call site, allowing the compiler to specialize the code for each variant.
  */
 static pg_attribute_always_inline bool
 heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 						  TupleTableSlot *slot, bool index_only,
-						  bool *recheck)
+						  bool amgetbatch, bool *recheck)
 {
 	IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque;
 	bool	   *heap_continue = &scan->xs_heap_continue;
@@ -460,14 +632,22 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 		if (!*heap_continue)
 		{
 			/* Get the next TID from the index */
-			tid = index_getnext_tid(scan, direction);
+			if (amgetbatch)
+				tid = heapam_index_getnext_scanbatch_pos(scan, hscan,
+														 direction,
+														 index_only ?
+														 &all_visible : NULL);
+			else
+				tid = tableam_util_fetch_next_tuple_tid(scan, direction);
 
 			/* If we're out of index entries, we're done */
 			if (tid == NULL)
 				break;
 
-			/* For index-only scans, check the visibility map */
-			if (index_only)
+			pgstat_count_index_tuples(scan->indexRelation, 1);
+
+			/* For non-batch index-only scans, check the visibility map */
+			if (index_only && !amgetbatch)
 				all_visible = VM_ALL_VISIBLE(scan->heapRelation,
 											 ItemPointerGetBlockNumber(tid),
 											 &hscan->xs_vmbuffer);
@@ -493,7 +673,7 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 			{
 				/* No visible tuple */
 				if (all_dead)
-					heapam_index_kill_item(scan);
+					heapam_index_kill_item(scan, amgetbatch);
 
 				/*
 				 * If caller set a visited-pages limit (only selfuncs.c's
@@ -518,8 +698,7 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 			 * us to assume that just having one visible tuple in the hot
 			 * chain is always good enough.
 			 */
-			Assert(!index_only ||
-				   !(*heap_continue && IsMVCCSnapshot(scan->xs_snapshot)));
+			Assert(!index_only || !(*heap_continue && scan->MVCCScan));
 		}
 		else
 		{
@@ -644,7 +823,7 @@ heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan,
 			 * Only in a non-MVCC snapshot plain scan can more than one member
 			 * of the HOT chain be visible
 			 */
-			*heap_continue = !IsMVCCLikeSnapshot(snapshot);
+			*heap_continue = !scan->MVCCScan;
 
 			slot->tts_tableOid = RelationGetRelid(rel);
 			ExecStoreBufferHeapTuple(heapTuple, slot, hscan->xs_cbuf);
@@ -679,15 +858,303 @@ heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan,
  * RelationGetIndexScan().
  */
 static pg_attribute_always_inline void
-heapam_index_kill_item(IndexScanDesc scan)
+heapam_index_kill_item(IndexScanDesc scan, bool amgetbatch)
 {
 	if (scan->xactStartedInRecovery)
 		return;
 
-	/*
-	 * Tell amgettuple-based index AM to kill its entry for that TID.  The
-	 * next index_getnext_tid call will pass that along to the index AM,
-	 * before unsetting the flag again.
-	 */
-	scan->kill_prior_tuple = true;
+	if (amgetbatch)
+		tableam_util_scanpos_killitem(scan);
+	else
+	{
+		/*
+		 * Tell amgettuple-based index AM to kill its entry for that TID.  The
+		 * next tableam_util_fetch_next_tuple_tid call will pass that along to
+		 * the index AM, before unsetting the flag again.
+		 */
+		scan->kill_prior_tuple = true;
+	}
+}
+
+/*
+ * Get next TID from batch ring buffer, moving in the given scan direction.
+ * Also sets *all_visible for item when caller passes a non-NULL arg.
+ */
+static pg_attribute_always_inline ItemPointer
+heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexScanHeapData *hscan,
+								   ScanDirection direction, bool *all_visible)
+{
+	BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
+	IndexScanBatch scanBatch;
+
+	Assert(all_visible == NULL || scan->xs_want_itup);
+
+	/*
+	 * Attempt to increment the position of any existing loaded scanBatch
+	 * (always fails on first call here for the scan)
+	 */
+	if (tableam_util_scanpos_advance(scan, direction, &scanBatch, scanPos))
+	{
+		/*
+		 * Incremented scanPos within existing loaded scanBatch; return the
+		 * new position's TID to caller
+		 */
+		return heapam_index_return_scanpos_tid(scan, hscan, direction,
+											   scanBatch, scanPos,
+											   all_visible);
+	}
+
+	/* Try to advance scanBatch to the next batch (or get the first batch) */
+	scanBatch = tableam_util_fetch_next_batch(scan, direction,
+											  scanBatch, scanPos);
+
+	if (!scanBatch)
+	{
+		/*
+		 * We're done; no more batches in the current scan direction.
+		 *
+		 * Note: if scanPos was ever valid (if amgetbatch ever returned a
+		 * batch), it remains valid now.  The current scanPos.item is now
+		 * scanBatch.lastItem + 1 (or scanBatch.firstItem - 1, when scanning
+		 * backwards).  We are therefore prepared to move the scan in the
+		 * opposite direction, within our still-loaded scanBatch.  Similarly,
+		 * we can still restore a saved mark in the usual way.
+		 */
+		return NULL;
+	}
+
+	/*
+	 * We have a new scanBatch, but scanPos hasn't been advanced to it just
+	 * yet.  Update batchringbuf and scanPos such that the scan can continue
+	 * with our new scanBatch.
+	 *
+	 * This will position scanPos to the start of our new scanBatch.  It will
+	 * also remove the old head batch/scanBatch from the batch ring buffer,
+	 * and release the underlying batch storage.
+	 */
+	tableam_util_scanpos_nextbatch(scan, direction, scanBatch);
+
+	/*
+	 * Set scanPos to first item for newly loaded scanBatch; return the new
+	 * position's TID to caller
+	 */
+	return heapam_index_return_scanpos_tid(scan, hscan, direction,
+										   scanBatch, scanPos, all_visible);
+}
+
+/*
+ * Save the current scanPos/scanBatch item's TID in scan's xs_heaptid, and
+ * return a pointer to that TID.  When all_visible isn't NULL (during an
+ * index-only scan), also sets item's visibility status in *all_visible.
+ *
+ * heapam_index_getnext_scanbatch_pos helper function.
+ */
+static inline ItemPointer
+heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexScanHeapData *hscan,
+								ScanDirection direction,
+								IndexScanBatch scanBatch,
+								BatchRingItemPos *scanPos,
+								bool *all_visible)
+{
+	HeapBatchData *hbatch;
+
+	/* Set xs_heaptid, which caller (and core executor) will need */
+	scan->xs_heaptid = scanBatch->items[scanPos->item].tableTid;
+
+	if (all_visible == NULL)
+	{
+		/*
+		 * Plain index scan.
+		 */
+		Assert(!scan->xs_want_itup);
+		return &scan->xs_heaptid;
+	}
+
+	/* Index-only scan */
+	Assert(scan->xs_want_itup);
+
+	scan->xs_itup = (IndexTuple) (scanBatch->currTuples +
+								  scanBatch->items[scanPos->item].tupleOffset);
+
+	/*
+	 * Set visibility info for the current scanPos item (plus possibly some
+	 * additional items in the current scan direction) as needed
+	 */
+	hbatch = index_scan_batch_table_area(scan, scanBatch);
+	if (!HEAP_BATCH_VIS_CACHED(hbatch, scanPos->item))
+		heapam_index_batch_pos_visibility(scan, direction,
+										  scanBatch, hbatch, scanPos);
+
+	/* Finally, set all_visible for caller */
+	*all_visible = hbatch->batchvis[scanPos->item];
+
+	return &scan->xs_heaptid;
+}
+
+/*
+ * Obtain visibility information for a TID from caller's batch.
+ *
+ * Called during amgetbatch index-only scans.  We always make sure that the
+ * visibility of caller's item (an offset into caller's batch->items[] array)
+ * has been set in its batch's batchvis[].  We might also set visibility info
+ * for other items from caller's batch more proactively when that makes sense.
+ *
+ * Every item has its batchvis[] entry set exactly once (or never).  We make
+ * sure that the scan has a fixed picture of which blocks it'll need to fetch
+ * in the near future.  If caller's position's item (or other nearby items)
+ * already have a valid batchvis[] entry, we must avoid clobbering that entry.
+ *
+ * We keep two competing considerations in balance when determining whether to
+ * check additional items: the need to keep the cost of visibility map access
+ * under control when most items will never be returned by the scan anyway
+ * (important for inner index scans of anti-joins and semi-joins), and the
+ * need to unguard batches promptly.
+ *
+ * Once we've resolved visibility for all items in a batch, we can safely
+ * unguard it by calling amunguardbatch.  This is safe with respect to
+ * concurrent VACUUM because the batch's guard (typically a buffer pin on the
+ * originating index page) blocks VACUUM from acquiring a conflicting cleanup
+ * lock on that page.  Copying the relevant visibility map data into our local
+ * cache suffices to prevent unsafe concurrent TID recycling: if any of these
+ * TIDs point to dead heap tuples, VACUUM cannot possibly return from
+ * ambulkdelete and mark the pointed-to heap pages as all-visible.  VACUUM
+ * _can_ do so once the batch is unguarded, but that's okay; we'll be working
+ * off of cached visibility info that indicates that the dead TIDs are NOT
+ * all-visible.
+ *
+ * What about the opposite case, where a page was all-visible when we cached
+ * the VM bits but tuples on it are deleted afterwards?  That is safe too: any
+ * tuple that was visible to all when we read the VM must also be visible to
+ * our MVCC snapshot, so it is correct to skip the heap fetch for those TIDs.
+ */
+static void
+heapam_index_batch_pos_visibility(IndexScanDesc scan, ScanDirection direction,
+								  IndexScanBatch batch, HeapBatchData *hbatch,
+								  BatchRingItemPos *pos)
+{
+	IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque;
+	int			posItem = pos->item;
+	int			loItem,
+				hiItem;
+	BlockNumber curvmheapblkno = InvalidBlockNumber;
+	bool		curvmheapallvis = false;
+
+	Assert(hbatch == index_scan_batch_table_area(scan, batch));
+
+	/*
+	 * The batch must still be guarded whenever we're called.
+	 *
+	 * amunguardbatch can't be called until we've already set _every_ batch
+	 * item's batchvis[] status, but if we've already done so for this batch
+	 * then it shouldn't ever get passed to us again by some subsequent call.
+	 * (This relies on index-only scans always being !batchImmediateUnguard.)
+	 */
+	Assert(batch->isGuarded && !scan->batchImmediateUnguard);
+
+	/*
+	 * Set visibility info for a range of items, in scan order.
+	 *
+	 * Note: visibilitymap_get_status does not lock the visibility map buffer,
+	 * so the result could be slightly stale.  See the "Memory ordering
+	 * effects" discussion above visibilitymap_get_status for an explanation
+	 * of why this is okay.
+	 */
+	if (ScanDirectionIsForward(direction))
+	{
+		int			lastSetItem = Min(batch->lastItem,
+									  posItem + hscan->xs_vm_items - 1);
+
+		for (int setItem = posItem; setItem <= lastSetItem; setItem++)
+		{
+			ItemPointer tid = &batch->items[setItem].tableTid;
+			BlockNumber heapblkno = ItemPointerGetBlockNumber(tid);
+
+			/* Must never overwrite any batch item's cached visibility info */
+			if (HEAP_BATCH_VIS_CACHED(hbatch, setItem))
+				continue;
+
+			if (heapblkno != curvmheapblkno)
+			{
+				curvmheapallvis = VM_ALL_VISIBLE(scan->heapRelation, heapblkno,
+												 &hscan->xs_vmbuffer);
+				curvmheapblkno = heapblkno;
+			}
+
+			hbatch->batchvis[setItem] = curvmheapallvis;
+		}
+
+		/* We just cached visibility for items [posItem, lastSetItem] */
+		loItem = posItem;
+		hiItem = lastSetItem;
+	}
+	else
+	{
+		int			lastSetItem = Max(batch->firstItem,
+									  posItem - hscan->xs_vm_items + 1);
+
+		for (int setItem = posItem; setItem >= lastSetItem; setItem--)
+		{
+			ItemPointer tid = &batch->items[setItem].tableTid;
+			BlockNumber heapblkno = ItemPointerGetBlockNumber(tid);
+
+			/* Must never overwrite any batch item's cached visibility info */
+			if (HEAP_BATCH_VIS_CACHED(hbatch, setItem))
+				continue;
+
+			if (heapblkno != curvmheapblkno)
+			{
+				curvmheapallvis = VM_ALL_VISIBLE(scan->heapRelation, heapblkno,
+												 &hscan->xs_vmbuffer);
+				curvmheapblkno = heapblkno;
+			}
+
+			hbatch->batchvis[setItem] = curvmheapallvis;
+		}
+
+		/* We just cached visibility for items [lastSetItem, posItem] */
+		loItem = lastSetItem;
+		hiItem = posItem;
+	}
+
+	/*
+	 * Extend the batch's valid range to cover the items we just cached.  The
+	 * set of cached items is always contiguous, because the scan visits items
+	 * in order and only ever extends the range from firstItem forwards or
+	 * from lastItem backwards.
+	 */
+	Assert(hbatch->firstVisSet > hbatch->lastVisSet ||	/* still contiguous? */
+		   (loItem <= hbatch->lastVisSet + 1 &&
+			hiItem >= hbatch->firstVisSet - 1));
+	hbatch->firstVisSet = Min(hbatch->firstVisSet, loItem);
+	hbatch->lastVisSet = Max(hbatch->lastVisSet, hiItem);
+
+	/*
+	 * It's safe to unguard the batch (via amunguardbatch) as soon as we've
+	 * resolved the visibility status of all of its items (unless this is a
+	 * non-MVCC scan)
+	 */
+	if (hbatch->firstVisSet <= batch->firstItem &&
+		hbatch->lastVisSet >= batch->lastItem)
+	{
+		Assert(hbatch->firstVisSet == batch->firstItem &&
+			   hbatch->lastVisSet == batch->lastItem);
+
+		/*
+		 * Note: nodeIndexonlyscan.c only supports MVCC snapshots, but we
+		 * still cope with index-only scan callers with other snapshot types.
+		 * This is certainly not unexpected; selfuncs.c performs index-only
+		 * scans that use SnapshotNonVacuumable.
+		 */
+		if (scan->MVCCScan)
+			tableam_util_unguard_batch(scan, batch);
+	}
+
+	/*
+	 * Else check visibility for twice as many items next time, or all items.
+	 * We check all items in one go once we're past the scan's first batch.
+	 */
+	else if (hscan->xs_vm_items < (batch->lastItem - batch->firstItem))
+		hscan->xs_vm_items *= 2;
+	else
+		hscan->xs_vm_items = scan->maxitemsbatch;
 }
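
The lookahead heuristic at the end of heapam_index_batch_pos_visibility
can be easier to follow in isolation.  What follows is only a minimal
standalone sketch of the forward-scan case, not code from the patch: the
SketchBatch struct, SKETCH_MAX_ITEMS, and both sketch_* functions are
hypothetical stand-ins (for the batch/HeapBatchData pair,
scan->maxitemsbatch, and the real function respectively), and the actual
VM lookup is reduced to setting a flag.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SKETCH_MAX_ITEMS 64		/* hypothetical stand-in for maxitemsbatch */

typedef struct SketchBatch
{
	int			firstItem;
	int			lastItem;
	int			firstVisSet;	/* INT_MAX while nothing is cached */
	int			lastVisSet;		/* -1 while nothing is cached */
	bool		cached[SKETCH_MAX_ITEMS];
} SketchBatch;

/*
 * Cache "visibility" for up to *vm_items items starting at posItem (forward
 * scan only).  Returns true once every item in the batch is resolved, which
 * is the point where the real code may unguard the batch.
 */
bool
sketch_cache_range(SketchBatch *b, int posItem, int *vm_items)
{
	int			lastSetItem = posItem + *vm_items - 1;

	if (lastSetItem > b->lastItem)
		lastSetItem = b->lastItem;

	for (int i = posItem; i <= lastSetItem; i++)
	{
		/* never overwrite an entry that was already cached */
		if (!b->cached[i])
			b->cached[i] = true;
	}

	/* the cached range must stay contiguous */
	assert(b->firstVisSet > b->lastVisSet ||
		   (posItem <= b->lastVisSet + 1 &&
			lastSetItem >= b->firstVisSet - 1));
	if (posItem < b->firstVisSet)
		b->firstVisSet = posItem;
	if (lastSetItem > b->lastVisSet)
		b->lastVisSet = lastSetItem;

	if (b->firstVisSet <= b->firstItem && b->lastVisSet >= b->lastItem)
		return true;

	/* not resolved yet: look at twice as many items next time, or all */
	if (*vm_items < b->lastItem - b->firstItem)
		*vm_items *= 2;
	else
		*vm_items = SKETCH_MAX_ITEMS;
	return false;
}

/* Walk a batch of nitems forward; returns the number of lookup passes */
int
sketch_scan_forward(int nitems, int *vm_items)
{
	SketchBatch b = {.firstItem = 0, .lastItem = nitems - 1,
					 .firstVisSet = INT_MAX, .lastVisSet = -1};
	int			passes = 0;

	assert(nitems > 0 && nitems <= SKETCH_MAX_ITEMS);
	for (int pos = 0; pos < nitems; pos++)
	{
		if (b.cached[pos])
			continue;			/* already resolved by an earlier pass */
		sketch_cache_range(&b, pos, vm_items);
		passes++;
	}
	return passes;
}
```

Under these assumptions, a scan that starts with a window of 1 and reads
a 10 item batch to the end resolves it in four passes (windows of 1, 2,
4, and 8 items), whereas a scan that stops after its first item pays for
only a single lookup.
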
diff --git a/src/backend/access/index/Makefile b/src/backend/access/index/Makefile
index 6f2e3061a..e6d681b40 100644
--- a/src/backend/access/index/Makefile
+++ b/src/backend/access/index/Makefile
@@ -16,6 +16,7 @@ OBJS = \
 	amapi.o \
 	amvalidate.o \
 	genam.o \
-	indexam.o
+	indexam.o \
+	indexbatch.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/index/amapi.c b/src/backend/access/index/amapi.c
index efa007030..d4adbbeb2 100644
--- a/src/backend/access/index/amapi.c
+++ b/src/backend/access/index/amapi.c
@@ -55,6 +55,11 @@ GetIndexAmRoutine(Oid amhandler)
 	Assert(routine->amrescan != NULL);
 	Assert(routine->amendscan != NULL);
 
+	/* Assert that AM doesn't have an invalid combination of callbacks */
+	Assert((routine->amgetbatch != NULL) == (routine->amunguardbatch != NULL));
+	Assert(routine->amkillitemsbatch == NULL || routine->amgetbatch != NULL);
+	Assert(routine->amgetbatch != NULL || routine->amposreset == NULL);
+
 	return routine;
 }
 
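
The three new assertions in GetIndexAmRoutine amount to a small set of
implications between callbacks.  The sketch below restates them as a
plain predicate, purely for illustration: SketchAmRoutine, sketch_cb,
sketch_noop, and the sketch_* functions are hypothetical names, and only
the four callback field names are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*sketch_cb) (void);

/* hypothetical cut-down routine struct; field names mirror the patch */
typedef struct SketchAmRoutine
{
	sketch_cb	amgetbatch;
	sketch_cb	amunguardbatch;
	sketch_cb	amkillitemsbatch;
	sketch_cb	amposreset;
} SketchAmRoutine;

/* Returns true when the combination of callbacks is a permitted one */
bool
sketch_callbacks_valid(const SketchAmRoutine *r)
{
	/* amgetbatch and amunguardbatch are provided together, or not at all */
	if ((r->amgetbatch != NULL) != (r->amunguardbatch != NULL))
		return false;
	/* amkillitemsbatch is optional, but only makes sense with amgetbatch */
	if (r->amkillitemsbatch != NULL && r->amgetbatch == NULL)
		return false;
	/* likewise for amposreset */
	if (r->amposreset != NULL && r->amgetbatch == NULL)
		return false;
	return true;
}

/* Assertion-style wrapper, in the spirit of the checks in the patch */
void
sketch_check_routine(const SketchAmRoutine *r)
{
	assert(sketch_callbacks_valid(r));
}

/* placeholder callback used to populate a routine */
void
sketch_noop(void)
{
}
```

Read this way, an AM with none of the four callbacks is fine (an
amgettuple-only AM), as is one with all four, while an AM that sets
amgetbatch without amunguardbatch, or amkillitemsbatch without
amgetbatch, is rejected.
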
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 1512438d6..54042f6f5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -89,6 +89,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_snapshot = InvalidSnapshot;	/* caller must initialize this */
 	scan->numberOfKeys = nkeys;
 	scan->numberOfOrderBys = norderbys;
+	scan->usebatchring = false; /* set later for amgetbatch callers */
+	memset(&scan->batchcache, 0, sizeof(scan->batchcache));
 
 	/*
 	 * We allocate key workspace here, but it won't get filled until amrescan.
@@ -128,6 +130,11 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 
 	scan->xs_getnext_slot = NULL;
 
+	scan->batch_index_opaque_static = 0;
+	scan->batch_tuples_workspace = 0;
+	scan->batch_table_opaque_size = 0;
+	scan->batch_base_offset = 0;
+
 	scan->xs_name_cstring_attnums = NULL;
 	scan->xs_name_cstring_count = 0;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aa0d4b143..c3e8e54fb 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -13,18 +13,15 @@
  * INTERFACE ROUTINES
  *		index_open		- open an index relation by relation OID
  *		index_close		- close an index relation
- *		index_beginscan - start a scan of an index with amgettuple
+ *		index_beginscan - start a scan of an index with amgetbatch/amgettuple
  *		index_beginscan_bitmap - start a scan of an index with amgetbitmap
  *		index_rescan	- restart a scan of an index
  *		index_endscan	- end a scan
  *		index_insert	- insert an index tuple into a relation
- *		index_markpos	- mark a scan position
- *		index_restrpos	- restore a scan position
  *		index_parallelscan_estimate - estimate shared memory for parallel scan
  *		index_parallelscan_initialize - initialize parallel scan
  *		index_parallelrescan  - (re)start a parallel scan of an index
  *		index_beginscan_parallel - join parallel index scan
- *		index_getnext_tid	- amgettuple table AM helper routine
  *		index_getbitmap - get all tuples from a scan
  *		index_bulk_delete	- bulk deletion of index tuples
  *		index_vacuum_cleanup	- post-deletion cleanup of an index
@@ -42,6 +39,7 @@
 #include "postgres.h"
 
 #include "access/amapi.h"
+#include "access/indexbatch.h"
 #include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
@@ -254,7 +252,7 @@ index_insert_cleanup(Relation indexRelation,
 }
 
 /*
- * index_beginscan - start a scan of an index with amgettuple
+ * index_beginscan - start a scan of an index with amgetbatch/amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
  */
@@ -339,6 +337,7 @@ index_beginscan_internal(Relation indexRelation, Relation heapRelation,
 	scan->xs_temp_snap = temp_snap;
 
 	scan->xs_snapshot = snapshot;
+	scan->MVCCScan = IsMVCCLikeSnapshot(snapshot);
 	scan->instrument = instrument;
 
 	/*
@@ -350,6 +349,7 @@ index_beginscan_internal(Relation indexRelation, Relation heapRelation,
 		scan->heapRelation = heapRelation;
 		scan->xs_want_itup = index_only_scan;
 		scan->xs_heap_continue = false;
+		scan->batchImmediateUnguard = (scan->MVCCScan && !index_only_scan);
 
 		/*
 		 * For index-only scans, find any "name" columns stored as cstrings
@@ -394,6 +394,14 @@ index_beginscan_internal(Relation indexRelation, Relation heapRelation,
 		Assert(scan->xs_getnext_slot != NULL && scan->xs_table_opaque != NULL);
 	}
 
+	/*
+	 * Bitmap index scans should never use a batch ring buffer (though they can
+	 * use the scan's batch cache).  Plain index scans (and index-only scans)
+	 * should only use a batch ring buffer with an amgetbatch index AM.
+	 */
+	Assert(!scan->xs_table_opaque ? !scan->usebatchring :
+		   (indexRelation->rd_indam->amgetbatch != NULL) == scan->usebatchring);
+
 	return scan;
 }
 
@@ -420,9 +428,9 @@ index_rescan(IndexScanDesc scan,
 	Assert(nkeys == scan->numberOfKeys);
 	Assert(norderbys == scan->numberOfOrderBys);
 
-	/* reset table AM state for rescan */
+	/* tell the table AM that there's to be a rescan */
 	if (scan->xs_table_opaque)
-		table_index_scan_reset(scan);
+		table_index_scan_rescan(scan);
 
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
@@ -441,7 +449,21 @@ index_endscan(IndexScanDesc scan)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amendscan);
 
-	/* Release resources (like buffer pins) from table accesses */
+	/*
+	 * amgetbitmap scans of an index AM that supports amgetbatch make limited
+	 * use of the scan's batch cache.  Check for that.
+	 */
+	if (!scan->usebatchring && scan->batchcache[0] != NULL)
+	{
+		Assert(scan->xs_table_opaque == NULL);
+		Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+		pfree(index_scan_batch_base(scan, scan->batchcache[0]));
+	}
+
+	/*
+	 * Release resources (like buffer pins and batch ring buffer) held by
+	 * table AM for index scan
+	 */
 	if (scan->xs_table_opaque)
 	{
 		table_index_scan_end(scan);
@@ -461,52 +483,6 @@ index_endscan(IndexScanDesc scan)
 	IndexScanEnd(scan);
 }
 
-/* ----------------
- *		index_markpos  - mark a scan position
- * ----------------
- */
-void
-index_markpos(IndexScanDesc scan)
-{
-	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(ammarkpos);
-
-	scan->indexRelation->rd_indam->ammarkpos(scan);
-}
-
-/* ----------------
- *		index_restrpos	- restore a scan position
- *
- * NOTE: this only restores the internal scan state of the index AM.  See
- * comments for ExecRestrPos().
- *
- * NOTE: For heap, in the presence of HOT chains, mark/restore only works
- * correctly if the scan's snapshot is MVCC-safe; that ensures that there's at
- * most one returnable tuple in each HOT chain, and so restoring the prior
- * state at the granularity of the index AM is sufficient.  Since the only
- * current user of mark/restore functionality is nodeMergejoin.c, this
- * effectively means that merge-join plans only work for MVCC snapshots.  This
- * could be fixed if necessary, but for now it seems unimportant.
- * ----------------
- */
-void
-index_restrpos(IndexScanDesc scan)
-{
-	Assert(IsMVCCLikeSnapshot(scan->xs_snapshot));
-
-	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(amrestrpos);
-
-	/* reset table AM state for restoring the marked position */
-	if (scan->xs_table_opaque)
-		table_index_scan_reset(scan);
-
-	scan->kill_prior_tuple = false; /* for safety */
-	scan->xs_heap_continue = false;
-
-	scan->indexRelation->rd_indam->amrestrpos(scan);
-}
-
 /*
  * Estimates the shared memory needed for parallel scan, including any
  * AM-specific parallel scan state.
@@ -584,9 +560,9 @@ index_parallelrescan(IndexScanDesc scan)
 {
 	SCAN_CHECKS;
 
-	/* reset table AM state for rescan */
+	/* tell the table AM that there's to be a rescan */
 	if (scan->xs_table_opaque)
-		table_index_scan_reset(scan);
+		table_index_scan_rescan(scan);
 
 	/* amparallelrescan is optional; assume no-op if not provided by AM */
 	if (scan->indexRelation->rd_indam->amparallelrescan != NULL)
@@ -622,56 +598,6 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 									index_only_scan, true, flags);
 }
 
-/* ----------------
- * index_getnext_tid - amgettuple interface
- *
- * The result is the next TID satisfying the scan keys,
- * or NULL if no more matching tuples exist.
- *
- * This should only be called by table AM amgettuple-based index scan
- * callbacks.
- * ----------------
- */
-ItemPointer
-index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
-{
-	bool		found;
-
-	SCAN_CHECKS;
-	CHECK_SCAN_PROCEDURE(amgettuple);
-
-	/* XXX: we should assert that a snapshot is pushed or registered */
-	Assert(TransactionIdIsValid(RecentXmin));
-
-	/*
-	 * The AM's amgettuple proc finds the next index entry matching the scan
-	 * keys, and puts the TID into scan->xs_heaptid.  It should also set
-	 * scan->xs_recheck and possibly scan->xs_itup/scan->xs_hitup, though we
-	 * pay no attention to those fields here.
-	 */
-	found = scan->indexRelation->rd_indam->amgettuple(scan, direction);
-
-	/* Reset kill flag immediately for safety */
-	scan->kill_prior_tuple = false;
-	scan->xs_heap_continue = false;
-
-	/* If we're out of index entries, we're done */
-	if (!found)
-	{
-		/* reset table AM state */
-		if (scan->xs_table_opaque)
-			table_index_scan_reset(scan);
-
-		return NULL;
-	}
-	Assert(ItemPointerIsValid(&scan->xs_heaptid));
-
-	pgstat_count_index_tuples(scan->indexRelation, 1);
-
-	/* Return the TID of the tuple we found. */
-	return &scan->xs_heaptid;
-}
-
 /* ----------------
  *		index_getbitmap - get all tuples at once from an index scan
  *
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
new file mode 100644
index 000000000..e58e09897
--- /dev/null
+++ b/src/backend/access/index/indexbatch.c
@@ -0,0 +1,798 @@
+/*-------------------------------------------------------------------------
+ *
+ * indexbatch.c
+ *	  Batch-based index scan infrastructure for the amgetbatch interface.
+ *
+ * This module provides the core infrastructure for batch-based index scans,
+ * which allow index AMs to return multiple matching TIDs per page in a single
+ * call.  The batch ring buffer is owned by the table AM.
+ *
+ * The ring buffer loads batches in index key space/index scan order.
+ *
+ * Most functions here are table AM utilities (tableam_util_*), called by
+ * table AMs during amgetbatch index scans.  These manage the batch ring
+ * buffer's lifecycle and positional state, and help with certain aspects of
+ * resource management.  The table AM uses scanPos to return items from
+ * batches returned by amgetbatch.
+ *
+ * There are also some index AM utilities (indexam_util_*), called by index
+ * AMs that implement the amgetbatch interface, to help manage resources like
+ * memory, locks, and buffer pins.  Index AMs free and unlock batches as
+ * described in indexam.sgml.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/index/indexbatch.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/amapi.h"
+#include "access/indexbatch.h"
+#include "access/tableam.h"
+#include "common/int.h"
+#include "lib/qunique.h"
+#include "utils/memdebug.h"
+
+static void batch_cache_mark_undefined(IndexScanDesc scan, IndexScanBatch batch);
+static void release_and_unguard_batch(IndexScanDesc scan, IndexScanBatch batch,
+									  bool allow_cache);
+static inline bool batch_cache_store(IndexScanDesc scan, IndexScanBatch batch);
+static int	batch_compare_int(const void *va, const void *vb);
+
+/*
+ * Return the size of the single allocation backing one of this scan's batches
+ * for assertions/custom Valgrind batch instrumentation
+ */
+#if defined(USE_VALGRIND) || defined(USE_ASSERT_CHECKING)
+static size_t
+batch_alloc_size(IndexScanDesc scan)
+{
+	size_t		allocsz;
+
+	Assert(scan->batch_base_offset > 0);
+
+	allocsz = scan->batch_base_offset +
+		MAXALIGN(offsetof(IndexScanBatchData, items) +
+				 sizeof(BatchMatchingItem) * scan->maxitemsbatch);
+	if (scan->xs_want_itup)
+		allocsz += scan->batch_tuples_workspace;
+
+	return allocsz;
+}
+#endif
+
+/*
+ * Make Valgrind treat a batch's entire allocation as undefined memory
+ */
+static void
+batch_cache_mark_undefined(IndexScanDesc scan, IndexScanBatch batch)
+{
+#ifdef USE_VALGRIND
+	char	   *currTuples = batch->currTuples;
+	int		   *deadItems = batch->deadItems;
+
+	VALGRIND_MAKE_MEM_UNDEFINED(index_scan_batch_base(scan, batch),
+								batch_alloc_size(scan));
+	if (deadItems)
+		VALGRIND_MAKE_MEM_UNDEFINED(deadItems,
+									sizeof(int) * scan->maxitemsbatch);
+
+	/* preserve pointers to now-undefined currTuples and deadItems buffers */
+	batch->currTuples = currTuples;
+	batch->deadItems = deadItems;
+#endif
+}
+
+/*
+ * Reset ring buffer and related positional state used during an amgetbatch
+ * index scan.
+ *
+ * Table AM caller should pass endscan=false, which makes us cache any freed
+ * batches for reuse on rescan.  We release scan's markBatch here either way.
+ */
+void
+tableam_util_batchscan_reset(IndexScanDesc scan, bool endscan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+	bool		markBatchFreed = false;
+
+	batchringbuf->scanPos.valid = false;
+	batchringbuf->markPos.valid = false;
+
+	for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
+	{
+		IndexScanBatch batch = index_scan_batch(scan, i);
+
+		if (batch == markBatch)
+			markBatchFreed = true;
+
+		release_and_unguard_batch(scan, batch, !endscan);
+	}
+
+	if (!markBatchFreed && unlikely(markBatch))
+		release_and_unguard_batch(scan, markBatch, !endscan);
+
+	batchringbuf->headBatch = 0;
+	batchringbuf->nextBatch = 0;
+	batchringbuf->markBatch = NULL;
+}
+
+/*
+ * Free resources at end of a batch index scan.
+ *
+ * Called by table AM when an index scan is ending, right before the owning
+ * scan descriptor goes away.  Cleans up all batch related resources.
+ */
+void
+tableam_util_batchscan_end(IndexScanDesc scan)
+{
+	/* Free all remaining loaded batches (even markBatch), bypassing cache */
+	tableam_util_batchscan_reset(scan, true);
+
+	for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+	{
+		IndexScanBatch cached = scan->batchcache[i];
+
+		if (cached == NULL)
+			continue;
+
+		if (cached->deadItems)
+			pfree(cached->deadItems);
+		pfree(index_scan_batch_base(scan, cached));
+	}
+}
+
+/*
+ * Set a mark from scanPos position
+ *
+ * Called from the table AM's index_scan_markpos callback.  Saves the current
+ * scan position and associated batch so that the scan can be restored to this
+ * point later, via tableam_util_batchscan_restore_pos.  The marked batch is
+ * retained and not freed until a new mark is set or the scan ends (or until
+ * the mark is restored).
+ */
+void
+tableam_util_batchscan_mark_pos(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
+	BatchRingItemPos *markPos PG_USED_FOR_ASSERTS_ONLY = &batchringbuf->markPos;
+	IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch);
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+
+	Assert(scan->indexRelation->rd_indam->amcanmarkpos);
+	Assert(scan->MVCCScan);
+	Assert(batchringbuf->headBatch == scanPos->batch);	/* see below */
+
+	/*
+	 * A mark must point at a real matching item.  We require the core
+	 * executor to only take a mark just after a successful tuple fetch.
+	 */
+	Assert(scanPos->valid);
+	Assert(scanPos->item >= scanBatch->firstItem &&
+		   scanPos->item <= scanBatch->lastItem);
+
+	/* Free the previous mark batch? */
+	if (!markBatch || markBatch == scanBatch)
+	{
+		/* No older markBatch that needs to be freed now */
+	}
+	else
+	{
+		/*
+		 * Have a markBatch that isn't in batchringbuf; it was saved when
+		 * tableam_util_release_batch was asked to release it earlier on.
+		 *
+		 * Note: this assumes that "batchringbuf->headBatch == scanPos->batch"
+		 * is invariant.  In other words, it assumes that table AMs always
+		 * remove an obsolescent scanBatch from the ring buffer at the point
+		 * where they step off its underlying batch.
+		 */
+		Assert(!markBatch->isGuarded);
+		Assert(!index_scan_batch_loaded(scan, markPos->batch) ||
+			   index_scan_batch(scan, markPos->batch) != markBatch);
+
+		release_and_unguard_batch(scan, markBatch, true);
+	}
+
+	batchringbuf->markPos = *scanPos;
+	batchringbuf->markBatch = scanBatch;
+}
+
+/*
+ * Restore scanPos to the previously saved markPos position.
+ *
+ * Called from the table AM's index_scan_restrpos callback.  Restores the
+ * scan to a position saved using tableam_util_batchscan_mark_pos earlier.
+ * The scan's markPos becomes its scanPos.  The marked batch is restored as
+ * the current scanBatch when needed.
+ *
+ * We just discard all batches (other than markBatch/restored scanBatch),
+ * except when markBatch is already the scan's current scanBatch.
+ *
+ * Note: This relies on the assumption that we already have a valid scanPos.
+ * Table AMs should only call tableam_util_batchscan_reset from within their
+ * scan's index_scan_restrpos callback to avoid breaking this assumption.
+ */
+void
+tableam_util_batchscan_restore_pos(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
+	BatchRingItemPos *markPos = &batchringbuf->markPos;
+	IndexScanBatch markBatch = batchringbuf->markBatch;
+	IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch);
+
+	Assert(scan->indexRelation->rd_indam->amcanmarkpos);
+	Assert(scan->MVCCScan);
+	Assert(scan->xs_table_opaque);
+
+	/*
+	 * The core executor must only ask us to restore a mark when it already
+	 * had us take one on its behalf at some point during the ongoing scan
+	 */
+	Assert(markPos->valid);
+	Assert(markPos->item >= markBatch->firstItem &&
+		   markPos->item <= markBatch->lastItem);
+
+	if (scanBatch == markBatch)
+	{
+		/* markBatch is already scanBatch; needn't change batchringbuf */
+		Assert(scanPos->batch == markPos->batch);
+
+		scanPos->item = markPos->item;
+		return;
+	}
+
+	/*
+	 * A batch is always unguarded by the time the scan moves on to a later
+	 * batch, so markBatch (now behind scanBatch) cannot still be guarded
+	 */
+	Assert(!markBatch->isGuarded);
+
+	/*
+	 * markBatch is behind scanBatch, and so must not be saved in ring buffer
+	 * anymore.  We have to deal with restoring the mark the hard way: by
+	 * invalidating all other loaded batches.  This is similar to the case
+	 * where the scan direction changes and the scan actually crosses
+	 * batch/index page boundaries (see tableam_util_scanbatch_dirchange).
+	 *
+	 * First, free all batches that are still in the ring buffer.
+	 */
+	for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
+	{
+		IndexScanBatch batch = index_scan_batch(scan, i);
+
+		Assert(batch != markBatch);
+
+		tableam_util_release_batch(scan, batch);
+	}
+
+	/*
+	 * Next "append" standalone markBatch, which will become scanBatch
+	 * (scanBatch is always the ring buffer's headBatch)
+	 */
+	markPos->batch = 0;
+	batchringbuf->scanPos = *markPos;
+	batchringbuf->nextBatch = batchringbuf->headBatch = markPos->batch;
+	index_scan_batch_append(scan, markBatch);
+	Assert(index_scan_batch(scan, batchringbuf->scanPos.batch) == markBatch);
+
+	/*
+	 * Finally, call amposreset to let index AM know to invalidate any private
+	 * state that independently tracks the scan's progress
+	 */
+	if (scan->indexRelation->rd_indam->amposreset)
+		scan->indexRelation->rd_indam->amposreset(scan, markBatch);
+
+	/*
+	 * Note: markBatch.deadItems[] might already contain dead items, and might
+	 * yet have more dead items saved.  tableam_util_release_batch is prepared
+	 * for that.
+	 */
+}
+
+/*
+ * Handle cross-batch change in scan direction
+ *
+ * Called by table AM when its scan changes direction in a way that
+ * necessitates backing the scan up to an index page originally associated
+ * with a now-freed batch.
+ *
+ * When we return, batchringbuf will only contain one batch (the current
+ * headBatch/scanBatch) and will look as if the new scan direction had been
+ * used from the start.  Caller can then safely pass this batch to amgetbatch
+ * to determine which batch comes next in the new scan direction.  This
+ * approach isn't particularly efficient, but it works well enough for what
+ * ought to be a relatively rare occurrence.
+ */
+void
+tableam_util_scanbatch_dirchange(IndexScanDesc scan)
+{
+	BatchRingBuffer *batchringbuf = &scan->batchringbuf;
+	IndexScanBatch scanBatch;
+
+	Assert(scan->indexRelation->rd_indam->amcanbackward);
+
+	/*
+	 * Release batches starting from the current "final" batch, working
+	 * backwards until the current head batch (which is also the current
+	 * scanBatch) is the only batch that hasn't been freed
+	 */
+	while (index_scan_batch_count(scan) > 1)
+	{
+		uint8		finalidx = batchringbuf->nextBatch - 1;
+		IndexScanBatch final = index_scan_batch(scan, finalidx);
+
+		Assert(finalidx != batchringbuf->scanPos.batch);
+
+		tableam_util_release_batch(scan, final);
+		batchringbuf->nextBatch--;
+	}
+
+	/* scanBatch is now the only batch still loaded */
+	Assert(batchringbuf->headBatch == batchringbuf->scanPos.batch);
+	scanBatch = index_scan_batch(scan, batchringbuf->headBatch);
+
+	/*
+	 * Flip scanBatch's scan direction to reflect the reversal.  Also reset
+	 * any index AM state that independently tracks scan progress.
+	 */
+	scanBatch->dir = -scanBatch->dir;
+	if (scan->indexRelation->rd_indam->amposreset)
+		scan->indexRelation->rd_indam->amposreset(scan, scanBatch);
+}
+
+/*
+ * Record that scanPos item is dead
+ *
+ * Records an offset to the current scanBatch/scanPos item, saving it in
+ * scanBatch's deadItems array.  The items' index tuples will later be
+ * marked LP_DEAD when current scanBatch is freed.
+ */
+void
+tableam_util_scanpos_killitem(IndexScanDesc scan)
+{
+	BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos;
+	IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch);
+
+	if (scanBatch->deadItems == NULL)
+		scanBatch->deadItems = palloc_array(int, scan->maxitemsbatch);
+	if (scanBatch->numDead < scan->maxitemsbatch)
+		scanBatch->deadItems[scanBatch->numDead++] = scanPos->item;
+}
+
+/*
+ * Release resources associated with a batch
+ *
+ * Called by table AM's amgetbatch index scan implementation when it is
+ * finished with a batch and wishes to release its resources.
+ *
+ * Calling here when 'batch' is also batchringbuf.markBatch is a no-op.  Table
+ * AM callers generally won't need to worry about this because it is handled
+ * as a special case by the functions in this module (besides, the scan can
+ * only have one markBatch at a time).
+ *
+ * We call amunguardbatch to drop the TID recycling interlock (e.g. buffer
+ * pin) when it hasn't been dropped yet.  For plain MVCC scans (where
+ * batchImmediateUnguard is set), the interlock was already dropped eagerly
+ * in indexam_util_unlock_batch, so we skip the amunguardbatch call here.
+ * Index-only scans must delay dropping the interlock until visibility is
+ * resolved for all items in the batch, so amunguardbatch may still need to
+ * act here.  For non-MVCC snapshot scans, the interlock is always held
+ * until amunguardbatch drops it here -- this is the only place willing to
+ * unguard a non-MVCC scan's batch.
+ *
+ * When the batch has dead items (numDead > 0) and the index AM provides an
+ * amkillitemsbatch callback, we call it to set LP_DEAD bits in the index
+ * page.  This is the natural place to kill index items because it's the
+ * point when we know for sure that no further table accesses will take
+ * place for that batch's items.
+ */
+void
+tableam_util_release_batch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	/* don't free caller's batch if it is scan's current markBatch */
+	if (batch == scan->batchringbuf.markBatch)
+		return;
+
+	/* Pass through to implementation function, with allow_cache=true */
+	release_and_unguard_batch(scan, batch, true);
+}
+
+/*
+ * Free a batch, optionally caching it for reuse.
+ *
+ * When allow_cache is true, we try to store the batch in the scan's batch
+ * cache for later reuse.  When allow_cache is false (typically because the
+ * scan is shutting down), we pfree the caller's batch unconditionally.
+ */
+static void
+release_and_unguard_batch(IndexScanDesc scan, IndexScanBatch batch,
+						  bool allow_cache)
+{
+	Assert(!(scan->batchImmediateUnguard && batch->isGuarded));
+	Assert(batch->isGuarded || scan->MVCCScan);
+
+	/* Drop TID recycling interlock via amunguardbatch as needed */
+	if (!scan->batchImmediateUnguard && batch->isGuarded)
+		tableam_util_unguard_batch(scan, batch);
+
+	/*
+	 * Let the index AM set LP_DEAD bits in the index page, if applicable.
+	 *
+	 * batch.deadItems[] is now in whatever order the scan returned items in.
+	 * We might have even saved the same item/TID twice.
+	 *
+	 * Sort and unique-ify deadItems[].  That way the index AM can safely
+	 * assume that items will always be in their original index page order.
+	 */
+	if (batch->numDead > 0 &&
+		scan->indexRelation->rd_indam->amkillitemsbatch != NULL)
+	{
+		if (batch->numDead > 1)
+		{
+			qsort(batch->deadItems, batch->numDead, sizeof(int),
+				  batch_compare_int);
+			batch->numDead = qunique(batch->deadItems, batch->numDead,
+									 sizeof(int), batch_compare_int);
+		}
+
+		scan->indexRelation->rd_indam->amkillitemsbatch(scan, batch);
+	}
+
+	/*
+	 * Try to store caller's batch in this amgetbatch scan's cache of
+	 * previously released batches first (when caller requests it)
+	 */
+	if (allow_cache && batch_cache_store(scan, batch))
+		return;
+
+	/* just pfree the caller's batch (plus batch's deadItems, if any) */
+	if (batch->deadItems)
+		pfree(batch->deadItems);
+	pfree(index_scan_batch_base(scan, batch));
+}
+
+/*
+ * Drop the batch's TID recycling interlock via amunguardbatch
+ *
+ * Called by the table AM when it's safe to drop whatever interlock the index
+ * AM holds to prevent unsafe concurrent TID recycling by VACUUM (typically a
+ * buffer pin on the batch's index page in batch's opaque area).
+ */
+void
+tableam_util_unguard_batch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	/* Should be called exactly once iff !batchImmediateUnguard */
+	Assert(!scan->batchImmediateUnguard);
+	Assert(batch->isGuarded);
+
+	scan->indexRelation->rd_indam->amunguardbatch(scan, batch);
+
+	batch->isGuarded = false;
+}
+
+/*
+ * Unlock batch's index page buffer lock
+ *
+ * Unlocks the given buffer in preparation for amgetbatch returning items
+ * saved in that batch.  Performs extra steps required by amgetbatch callers
+ * in passing.
+ *
+ * Only call here when a batch has one or more matching items to return using
+ * amgetbatch (or for amgetbitmap to load into its bitmap of matching TIDs).
+ * When an index page has no matches, it's always safe for index AMs to drop
+ * both the lock and the pin for themselves.
+ *
+ * Note: It is convenient for index AMs that implement both amgetbatch and
+ * amgetbitmap to consistently use the same batch management approach, since
+ * that avoids introducing special cases to lower-level code.  We drop both
+ * the lock and the pin on batch's page on behalf of amgetbitmap callers.
+ *
+ * For amgetbatch callers, when batchImmediateUnguard is set (plain MVCC
+ * scans), we also release the pin here (the TID recycling interlock).  The
+ * batch will be marked "unguarded", preventing the table AM from spuriously
+ * calling amunguardbatch later on.
+ *
+ * Index AMs whose TID recycling interlock is not just a buffer pin, or whose
+ * amunguardbatch does not simply release a pin, are not obligated to use this
+ * function.  They can implement their own equivalent.  Such index AMs are also
+ * free to use the batch LSN field themselves; their amkillitemsbatch routine
+ * can use that LSN in the usual way, or in whatever way the AM deems necessary
+ * (core code will not use it for any other purpose).
+ */
+pg_attribute_hot void
+indexam_util_unlock_batch(IndexScanDesc scan, IndexScanBatch batch, Buffer buf)
+{
+	/* batch must have one or more matching items returned by index AM */
+	Assert(batch->firstItem >= 0 && batch->firstItem <= batch->lastItem);
+
+	if (scan->usebatchring)
+	{
+		/* amgetbatch (not amgetbitmap) caller */
+		Assert(scan->heapRelation != NULL);
+
+		/*
+		 * Have to set batch->lsn so that amkillitemsbatch callback has a way
+		 * to detect when concurrent table TID recycling by VACUUM might have
+		 * taken place.  It'll only be safe for amkillitemsbatch to set index
+		 * tuple LP_DEAD bits when the page LSN hasn't advanced between then
+		 * and now.
+		 */
+		batch->lsn = BufferGetLSNAtomic(buf);
+
+		/*
+		 * Drop the pin here during scans that don't require an explicit TID
+		 * recycling interlock (a pin will block cleanup lock acquisition by
+		 * index vacuuming)
+		 */
+		if (scan->batchImmediateUnguard)
+		{
+			/* drop both the lock and the pin */
+			UnlockReleaseBuffer(buf);
+			batch->isGuarded = false;	/* won't call amunguardbatch */
+		}
+		else
+		{
+			/*
+			 * just drop the lock; index AM's amunguardbatch callback will be
+			 * called to drop the pin later on, when the table AM determines
+			 * that it is safe to do so
+			 */
+			UnlockBuffer(buf);
+			batch->isGuarded = true;
+		}
+	}
+	else
+	{
+		/* amgetbitmap (not amgetbatch) caller */
+		Assert(scan->heapRelation == NULL);
+
+		/*
+		 * drop both the lock and the pin (amunguardbatch is never called
+		 * during bitmap index scans)
+		 */
+		UnlockReleaseBuffer(buf);
+	}
+}
+
+/*
+ * Allocate a new batch
+ *
+ * Used by index AMs that support amgetbatch interface (both during amgetbatch
+ * and amgetbitmap scans).
+ *
+ * Returns IndexScanBatch with space to fit scan->maxitemsbatch-many
+ * BatchMatchingItem entries.  This will either be a newly allocated batch, or
+ * a batch recycled from the cache managed by indexam_util_release_batch.  See
+ * comments above indexam_util_release_batch.
+ *
+ * Housekeeping fields (dir, knownEndBackward/Forward, isGuarded, firstItem,
+ * lastItem, numDead, deadItems, currTuples) are initialized here.  The table
+ * AM's batch_init callback is invoked here to initialize the table AM opaque
+ * area.  The index AM caller is responsible for filling in its per-batch
+ * opaque fields and the matching items[] array.
+ *
+ * Once the batch has the required matching items, caller should generally
+ * pass it to indexam_util_unlock_batch, ahead of it being returned through
+ * index AM's amgetbatch routine.  If it turns out that the batch won't need
+ * to be returned like this (e.g., due to the scan having no more matches),
+ * caller should pass its empty/unused batch to indexam_util_release_batch.
+ */
+pg_attribute_hot IndexScanBatch
+indexam_util_alloc_batch(IndexScanDesc scan)
+{
+	IndexScanBatch batch = NULL;
+
+	/* Index AM must have set its opaque space to something already */
+	Assert(scan->indexRelation->rd_indam->amgetbatch != NULL);
+	Assert(scan->batch_index_opaque_static > 0);
+
+	/* First look for an existing batch from the cache */
+	if (scan->usebatchring)
+	{
+		for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+		{
+			if (scan->batchcache[i] != NULL)
+			{
+				/* Return cached unreferenced batch */
+				batch = scan->batchcache[i];
+				scan->batchcache[i] = NULL;
+				break;
+			}
+		}
+	}
+	else if (scan->batchcache[0] != NULL)
+	{
+		/*
+		 * Reuse cached batch from prior amgetbitmap iteration.  This path is
+		 * hit on every amgetbitmap call here after the scan's first.
+		 */
+		batch = scan->batchcache[0];
+		scan->batchcache[0] = NULL;
+	}
+
+	if (!batch)
+	{
+		size_t		opaque_areas_prefix_sz,
+					base_sz,
+					ios_total_trailing_sz,
+					allocsz;
+		char	   *raw_batch_alloc;
+
+		if (scan->batch_base_offset == 0)
+		{
+			/* We lazily compute batch_base_offset on scan's first call */
+			size_t		table_area = 0;
+
+			if (scan->usebatchring)
+			{
+				/*
+				 * Handle table AM's dynamically-sized area.  It isn't used
+				 * during batch-based bitmap scans...
+				 */
+				table_area = MAXALIGN(scan->batch_table_opaque_size);
+			}
+
+			/* ...though we always need an index AM area */
+			scan->batch_base_offset = table_area +
+				scan->batch_index_opaque_static;
+		}
+
+		/* Subtotal #1: the size of all AM opaque areas */
+		opaque_areas_prefix_sz = scan->batch_base_offset;
+		Assert(opaque_areas_prefix_sz == MAXALIGN(opaque_areas_prefix_sz));
+
+		/* Subtotal #2: IndexScanBatchData and its items[maxitemsbatch] */
+		base_sz = MAXALIGN(offsetof(IndexScanBatchData, items) +
+						   sizeof(BatchMatchingItem) * scan->maxitemsbatch);
+
+		/*
+		 * Subtotal #3: the currTuples workspace that comes after items[],
+		 * where the index AM stores index tuples during index-only scans
+		 */
+		ios_total_trailing_sz = 0;
+		if (scan->xs_want_itup)
+		{
+			ios_total_trailing_sz = scan->batch_tuples_workspace;
+			pg_assume(ios_total_trailing_sz > 0);
+			Assert(ios_total_trailing_sz == MAXALIGN(ios_total_trailing_sz));
+		}
+
+		/* Total batch allocation size is the sum of our three subtotals */
+		allocsz = opaque_areas_prefix_sz + base_sz + ios_total_trailing_sz;
+		Assert(allocsz == batch_alloc_size(scan));
+		raw_batch_alloc = palloc(allocsz);
+		batch = (IndexScanBatch) (raw_batch_alloc + opaque_areas_prefix_sz);
+		Assert(index_scan_batch_base(scan, batch) == raw_batch_alloc);
+
+		/* currTuples (if any) is directly after items[] */
+		batch->currTuples = NULL;
+		if (ios_total_trailing_sz)
+			batch->currTuples = (char *) batch + base_sz;
+		batch->deadItems = NULL;
+	}
+
+	Assert(scan->batch_base_offset > 0);
+
+	/*
+	 * Let the table AM initialize its per-batch opaque area iff it requested
+	 * one (which can't happen during batch-based bitmap index scans)
+	 */
+	if (scan->usebatchring && scan->batch_table_opaque_size > 0)
+		table_index_scan_batch_init(scan, batch);
+
+	/* initialize shared batch fields */
+	batch->dir = NoMovementScanDirection;
+	batch->knownEndBackward = false;
+	batch->knownEndForward = false;
+	batch->isGuarded = false;
+
+	/* "firstItem <= lastItem" tests will fail at first (defensive) */
+	batch->firstItem = 0;
+	batch->lastItem = -1;
+
+	/*
+	 * deadItems[] might already be allocated iff this is a recycled batch.
+	 * Either way, it starts out with zero valid killable items.
+	 */
+	batch->numDead = 0;
+
+	return batch;
+}
+
+/*
+ * Release allocated batch
+ *
+ * This function is called by index AMs to release a batch allocated by
+ * indexam_util_alloc_batch.  Batches are cached here for reuse to reduce
+ * palloc/pfree overhead.
+ *
+ * It's safe to release a batch immediately when it was used to read a page
+ * that returned no matches to the scan.  Batches actually returned by index
+ * AM's amgetbatch routine (i.e. batches for pages with one or more matches)
+ * must be released by tableam_util_release_batch, which caches the batch in
+ * the same way after calling the index AM's amkillitemsbatch routine (if
+ * any).  Index AMs that use batches should call here to release a batch
+ * from their amgetbatch or amgetbitmap routines.
+ *
+ * The rules for batch ownership differ slightly for amgetbitmap scans; see
+ * the amgetbitmap documentation in doc/src/sgml/indexam.sgml for details.
+ */
+void
+indexam_util_release_batch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	if (!scan->usebatchring)
+	{
+		/*
+		 * amgetbitmap scan caller.
+		 *
+		 * amgetbitmap routines are required to allocate no more than one
+		 * batch at a time, so we'll always have a free slot.
+		 */
+		Assert(scan->batchcache[0] == NULL);
+		Assert(scan->heapRelation == NULL);
+		Assert(batch->deadItems == NULL);
+		Assert(batch->currTuples == NULL);
+
+		batch_cache_mark_undefined(scan, batch);
+		scan->batchcache[0] = batch;
+		return;
+	}
+
+	/* amgetbatch scan caller */
+	Assert(scan->heapRelation != NULL);
+
+	/*
+	 * Try to store caller's batch in this amgetbatch scan's cache of
+	 * previously released batches first
+	 */
+	if (batch_cache_store(scan, batch))
+		return;
+
+	/* Cache full; just free the caller's batch */
+	if (batch->deadItems)
+		pfree(batch->deadItems);
+	pfree(index_scan_batch_base(scan, batch));
+}
+
+/*
+ * Try to store a batch in the scan's batch cache.
+ *
+ * Returns true if a free slot was found, false if the cache is full.
+ */
+static inline bool
+batch_cache_store(IndexScanDesc scan, IndexScanBatch batch)
+{
+	for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++)
+	{
+		if (scan->batchcache[i] == NULL)
+		{
+			batch_cache_mark_undefined(scan, batch);
+			scan->batchcache[i] = batch;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+/*
+ * qsort comparison function for int arrays
+ */
+static int
+batch_compare_int(const void *va, const void *vb)
+{
+	int			a = *((const int *) va);
+	int			b = *((const int *) vb);
+
+	return pg_cmp_s32(a, b);
+}
diff --git a/src/backend/access/index/meson.build b/src/backend/access/index/meson.build
index da64cb595..83dfa3f2b 100644
--- a/src/backend/access/index/meson.build
+++ b/src/backend/access/index/meson.build
@@ -5,4 +5,5 @@ backend_sources += files(
   'amvalidate.c',
   'genam.c',
   'indexam.c',
+  'indexbatch.c',
 )
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index cb921ca2e..a37869b71 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -179,18 +179,15 @@ hold on to the pin (used when reading from the leaf page) until _after_
 they're done visiting the heap (for TIDs from pinned leaf page) prevents
 concurrent TID recycling.  VACUUM cannot get a conflicting cleanup lock
 until the index scan is totally finished processing its leaf page.
+This is required by any index AM that implements the amgetbatch
+interface.  (See also doc/src/sgml/indexam.sgml.)
 
-This approach is fairly coarse, so we avoid it whenever possible.  In
-practice most index scans won't hold onto their pin, and so won't block
-VACUUM.  These index scans must deal with TID recycling directly, which is
-more complicated and not always possible.  See later section on making
-concurrent TID recycling safe.
-
-Opportunistic index tuple deletion performs almost the same page-level
-modifications while only holding an exclusive lock.  This is safe because
-there is no question of TID recycling taking place later on -- only VACUUM
-can make TIDs recyclable.  See also simple deletion and bottom-up
-deletion, below.
+Opportunistic index tuple deletion performs the same page-level
+modifications as VACUUM, while only holding an exclusive lock.  This is
+safe because there is no question of TID recycling taking place -- only
+VACUUM can make TIDs recyclable.  In other words, VACUUM's cleanup lock
+serves to protect non-MVCC snapshot scans from concurrent TID recycling
+hazards; it doesn't protect the B-Tree structure itself.
 
 Because a pin is not always held, and a page can be split even while
 someone does hold a pin on it, it is possible that an indexscan will
@@ -440,54 +437,6 @@ whenever it is subsequently taken from the FSM for reuse.  The deleted
 page's contents will be overwritten by the split operation (it will become
 the new right sibling page).
 
-Making concurrent TID recycling safe
-------------------------------------
-
-As explained in the earlier section about deleting index tuples during
-VACUUM, we implement a locking protocol that allows individual index scans
-to avoid concurrent TID recycling.  Index scans opt-out (and so drop their
-leaf page pin when visiting the heap) whenever it's safe to do so, though.
-Dropping the pin early is useful because it avoids blocking progress by
-VACUUM.  This is particularly important with index scans used by cursors,
-since idle cursors sometimes stop for relatively long periods of time.  In
-extreme cases, a client application may hold on to an idle cursors for
-hours or even days.  Blocking VACUUM for that long could be disastrous.
-
-Index scans that don't hold on to a buffer pin are protected by holding an
-MVCC snapshot instead.  This more limited interlock prevents wrong answers
-to queries, but it does not prevent concurrent TID recycling itself (only
-holding onto the leaf page pin while accessing the heap ensures that).
-
-Index-only scans can never drop their buffer pin, since they are unable to
-tolerate having a referenced TID become recyclable.  Index-only scans
-typically just visit the visibility map (not the heap proper), and so will
-not reliably notice that any stale TID reference (for a TID that pointed
-to a dead-to-all heap item at first) was concurrently marked LP_UNUSED in
-the heap by VACUUM.  This could easily allow VACUUM to set the whole heap
-page to all-visible in the visibility map immediately afterwards.  An MVCC
-snapshot is only sufficient to avoid problems during plain index scans
-because they must access granular visibility information from the heap
-proper.  A plain index scan will even recognize LP_UNUSED items in the
-heap (items that could be recycled but haven't been just yet) as "not
-visible" -- even when the heap page is generally considered all-visible.
-
-LP_DEAD setting of index tuples by the kill_prior_tuple optimization
-(described in full in simple deletion, below) is also more complicated for
-index scans that drop their leaf page pins.  We must be careful to avoid
-LP_DEAD-marking any new index tuple that looks like a known-dead index
-tuple because it happens to share the same TID, following concurrent TID
-recycling.  It's just about possible that some other session inserted a
-new, unrelated index tuple, on the same leaf page, which has the same
-original TID.  It would be totally wrong to LP_DEAD-set this new,
-unrelated index tuple.
-
-We handle this kill_prior_tuple race condition by having affected index
-scans conservatively assume that any change to the leaf page at all
-implies that it was reached by btbulkdelete in the interim period when no
-buffer pin was held.  This is implemented by not setting any LP_DEAD bits
-on the leaf page at all when the page's LSN has changed.  (This is why we
-implement "fake" LSNs for unlogged index relations.)
-
 Fastpath For Index Insertion
 ----------------------------
 
@@ -734,7 +683,7 @@ of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
 During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
+and we never LP_DEAD-mark tuples. We do this because the oldest xmin
 on the standby server can be older than the oldest xmin on the primary
 server, which means tuples can be marked LP_DEAD even when they are
 still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
@@ -756,9 +705,8 @@ non-MVCC scans is not required on standby nodes. We still get a full
 cleanup lock when replaying VACUUM records during recovery, but recovery
 does not need to lock every leaf page (only those leaf pages that have
 items to delete) -- that's sufficient to avoid breaking index-only scans
-during recovery (see section above about making TID recycling safe). That
-leaves concern only for plain index scans. (XXX: Not actually clear why
-this is totally unnecessary during recovery.)
+during recovery. That leaves concern only for plain index scans.
+(XXX: Not actually clear why this is totally unnecessary during recovery.)
 
 MVCC snapshot plain index scans are always safe, for the same reasons that
 they're safe during original execution.  HeapTupleSatisfiesToast() doesn't
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 054703861..b9c279508 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1060,6 +1060,9 @@ _bt_relbuf(Relation rel, Buffer buf)
  * Lock is acquired without acquiring another pin.  This is like a raw
  * LockBuffer() call, but performs extra steps needed by Valgrind.
  *
+ * Note: _bt_batch_unlock in nbtsearch.c (indexam_util_unlock_batch wrapper
+ * function) has matching Valgrind buffer lock instrumentation.
+ *
  * Note: Caller may need to call _bt_checkpage() with buf when pin on buf
  * wasn't originally acquired in _bt_getbuf() or _bt_relandgetbuf().
  */
@@ -1101,13 +1104,19 @@ _bt_unlockbuf(Relation rel, Buffer buf)
 	 * Buffer is pinned and locked, which means that it is expected to be
 	 * defined and addressable.  Check that proactively.
 	 */
-	VALGRIND_CHECK_MEM_IS_DEFINED(BufferGetPage(buf), BLCKSZ);
+#if defined(USE_VALGRIND)
+	Page		page = BufferGetPage(buf);
+
+	VALGRIND_CHECK_MEM_IS_DEFINED(page, BLCKSZ);
+#endif
 
 	/* LockBuffer() asserts that pin is held by this backend */
 	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
+#if defined(USE_VALGRIND)
 	if (!RelationUsesLocalBuffers(rel))
-		VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(buf), BLCKSZ);
+		VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ);
+#endif
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtreadpage.c b/src/backend/access/nbtree/nbtreadpage.c
index 448a51412..2c6a3d26b 100644
--- a/src/backend/access/nbtree/nbtreadpage.c
+++ b/src/backend/access/nbtree/nbtreadpage.c
@@ -32,6 +32,7 @@ typedef struct BTReadPageState
 {
 	/* Input parameters, set by _bt_readpage for _bt_checkkeys */
 	ScanDirection dir;			/* current scan direction */
+	BlockNumber currpage;		/* current page being read */
 	OffsetNumber minoff;		/* Lowest non-pivot tuple's offset */
 	OffsetNumber maxoff;		/* Highest non-pivot tuple's offset */
 	IndexTuple	finaltup;		/* Needed by scans with array keys */
@@ -63,14 +64,13 @@ static bool _bt_scanbehind_checkkeys(IndexScanDesc scan, ScanDirection dir,
 									 IndexTuple finaltup);
 static bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
 								  IndexTuple finaltup);
-static void _bt_saveitem(BTScanOpaque so, int itemIndex,
-						 OffsetNumber offnum, IndexTuple itup);
-static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
-								  OffsetNumber offnum, const ItemPointerData *heapTid,
-								  IndexTuple itup);
-static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
-									   OffsetNumber offnum,
-									   ItemPointer heapTid, int tupleOffset);
+static void _bt_saveitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+						 IndexTuple itup, int *tupleOffset);
+static int	_bt_setuppostingitems(IndexScanBatch newbatch, int itemIndex,
+								  OffsetNumber offnum, const ItemPointerData *tableTid,
+								  IndexTuple itup, int *tupleOffset);
+static inline void _bt_savepostingitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+									   ItemPointer tableTid, int baseOffset);
 static bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
 						  IndexTuple tuple, int tupnatts);
 static bool _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
@@ -111,15 +111,15 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
 
 
 /*
- *	_bt_readpage() -- Load data from current index page into so->currPos
+ *	_bt_readpage() -- Load data from current index page into newbatch.
  *
- * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
- * is not changed here.  Also, currPos.moreLeft and moreRight must be valid;
- * they are updated as appropriate.  All other fields of so->currPos are
+ * Caller must have pinned and read-locked newbatch.buf; the buffer's state is
+ * not changed here.  Also, newbatch's moreLeft and moreRight must be valid;
+ * they are updated as appropriate.  All other fields of newbatch are
  * initialized from scratch here.
  *
  * We scan the current page starting at offnum and moving in the indicated
- * direction.  All items matching the scan keys are loaded into currPos.items.
+ * direction.  All items matching the scan keys are saved in newbatch.items.
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction
  * (could just be for the current primitive index scan when scan has arrays).
@@ -131,11 +131,12 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
  * Returns true if any matching items found on the page, false if none.
  */
 bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
-			 bool firstpage)
+_bt_readpage(IndexScanDesc scan, IndexScanBatch newbatch, ScanDirection dir,
+			 OffsetNumber offnum, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btnewbatch = BTBatchGetData(scan, newbatch);
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber minoff;
@@ -144,23 +145,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	bool		arrayKeys,
 				ignore_killed_tuples = scan->ignore_killed_tuples;
 	int			itemIndex,
+				tupleOffset = 0,
 				indnatts;
 
 	/* save the page/buffer block number, along with its sibling links */
-	page = BufferGetPage(so->currPos.buf);
+	page = BufferGetPage(btnewbatch->buf);
 	opaque = BTPageGetOpaque(page);
-	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
-	so->currPos.prevPage = opaque->btpo_prev;
-	so->currPos.nextPage = opaque->btpo_next;
-	/* delay setting so->currPos.lsn until _bt_drop_lock_and_maybe_pin */
-	pstate.dir = so->currPos.dir = dir;
-	so->currPos.nextTupleOffset = 0;
+	pstate.currpage = btnewbatch->currPage = BufferGetBlockNumber(btnewbatch->buf);
+	btnewbatch->prevPage = opaque->btpo_prev;
+	btnewbatch->nextPage = opaque->btpo_next;
+	pstate.dir = newbatch->dir = dir;
 
 	/* either moreRight or moreLeft should be set now (may be unset later) */
-	Assert(ScanDirectionIsForward(dir) ? so->currPos.moreRight :
-		   so->currPos.moreLeft);
+	Assert(ScanDirectionIsForward(dir) ? btnewbatch->moreRight : btnewbatch->moreLeft);
 	Assert(!P_IGNORE(opaque));
-	Assert(BTScanPosIsPinned(so->currPos));
 	Assert(!so->needPrimScan);
 
 	/* initialize local variables */
@@ -188,14 +186,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	{
 		/* allow next/prev page to be read by other worker without delay */
 		if (ScanDirectionIsForward(dir))
-			_bt_parallel_release(scan, so->currPos.nextPage,
-								 so->currPos.currPage);
+			_bt_parallel_release(scan, btnewbatch->nextPage,
+								 btnewbatch->currPage);
 		else
-			_bt_parallel_release(scan, so->currPos.prevPage,
-								 so->currPos.currPage);
+			_bt_parallel_release(scan, btnewbatch->prevPage,
+								 btnewbatch->currPage);
 	}
 
-	PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);
+	PredicateLockPage(rel, pstate.currpage, scan->xs_snapshot);
 
 	if (ScanDirectionIsForward(dir))
 	{
@@ -212,11 +210,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 					!_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup))
 				{
 					/* Schedule another primitive index scan after all */
-					so->currPos.moreRight = false;
+					btnewbatch->moreRight = false;
 					so->needPrimScan = true;
 					if (scan->parallel_scan)
 						_bt_parallel_primscan_schedule(scan,
-													   so->currPos.currPage);
+													   btnewbatch->currPage);
 					return false;
 				}
 			}
@@ -280,26 +278,26 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				if (!BTreeTupleIsPosting(itup))
 				{
 					/* Remember it */
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset);
 					itemIndex++;
 				}
 				else
 				{
-					int			tupleOffset;
+					int			baseOffset;
 
 					/* Set up posting list state (and remember first TID) */
-					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+					baseOffset =
+						_bt_setuppostingitems(newbatch, itemIndex, offnum,
 											  BTreeTupleGetPostingN(itup, 0),
-											  itup);
+											  itup, &tupleOffset);
 					itemIndex++;
 
 					/* Remember all later TIDs (must be at least one) */
 					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
 					{
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(newbatch, itemIndex, offnum,
 											BTreeTupleGetPostingN(itup, i),
-											tupleOffset);
+											baseOffset);
 						itemIndex++;
 					}
 				}
@@ -339,12 +337,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		}
 
 		if (!pstate.continuescan)
-			so->currPos.moreRight = false;
+			btnewbatch->moreRight = false;
 
 		Assert(itemIndex <= MaxTIDsPerBTreePage);
-		so->currPos.firstItem = 0;
-		so->currPos.lastItem = itemIndex - 1;
-		so->currPos.itemIndex = 0;
+		newbatch->firstItem = 0;
+		newbatch->lastItem = itemIndex - 1;
 	}
 	else
 	{
@@ -361,11 +358,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 					!_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup))
 				{
 					/* Schedule another primitive index scan after all */
-					so->currPos.moreLeft = false;
+					btnewbatch->moreLeft = false;
 					so->needPrimScan = true;
 					if (scan->parallel_scan)
 						_bt_parallel_primscan_schedule(scan,
-													   so->currPos.currPage);
+													   btnewbatch->currPage);
 					return false;
 				}
 			}
@@ -466,27 +463,27 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				{
 					/* Remember it */
 					itemIndex--;
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset);
 				}
 				else
 				{
 					uint16		nitems = BTreeTupleGetNPosting(itup);
-					int			tupleOffset;
+					int			baseOffset;
 
 					/* Set up posting list state (and remember last TID) */
 					itemIndex--;
-					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+					baseOffset =
+						_bt_setuppostingitems(newbatch, itemIndex, offnum,
 											  BTreeTupleGetPostingN(itup, nitems - 1),
-											  itup);
+											  itup, &tupleOffset);
 
 					/* Remember all prior TIDs (must be at least one) */
 					for (int i = nitems - 2; i >= 0; i--)
 					{
 						itemIndex--;
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(newbatch, itemIndex, offnum,
 											BTreeTupleGetPostingN(itup, i),
-											tupleOffset);
+											baseOffset);
 					}
 				}
 			}
@@ -502,12 +499,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		 * be found there
 		 */
 		if (!pstate.continuescan)
-			so->currPos.moreLeft = false;
+			btnewbatch->moreLeft = false;
 
 		Assert(itemIndex >= 0);
-		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
-		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
+		newbatch->firstItem = itemIndex;
+		newbatch->lastItem = MaxTIDsPerBTreePage - 1;
 	}
 
 	/*
@@ -524,7 +520,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	 */
 	Assert(!pstate.forcenonrequired);
 
-	return (so->currPos.firstItem <= so->currPos.lastItem);
+	return (newbatch->firstItem <= newbatch->lastItem);
 }
 
 /*
@@ -1027,90 +1023,91 @@ _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
 	return true;
 }
 
-/* Save an index item into so->currPos.items[itemIndex] */
+/* Save an index item into newbatch.items[itemIndex] */
 static void
-_bt_saveitem(BTScanOpaque so, int itemIndex,
-			 OffsetNumber offnum, IndexTuple itup)
+_bt_saveitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+			 IndexTuple itup, int *tupleOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
-
 	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
 
-	currItem->heapTid = itup->t_tid;
-	currItem->indexOffset = offnum;
-	if (so->currTuples)
+	newbatch->items[itemIndex].tableTid = itup->t_tid;
+	newbatch->items[itemIndex].indexOffset = offnum;
+
+	if (newbatch->currTuples)
 	{
 		Size		itupsz = IndexTupleSize(itup);
 
-		currItem->tupleOffset = so->currPos.nextTupleOffset;
-		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
-		so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+		newbatch->items[itemIndex].tupleOffset = *tupleOffset;
+		memcpy(newbatch->currTuples + *tupleOffset, itup, itupsz);
+		*tupleOffset += MAXALIGN(itupsz);
 	}
 }
 
 /*
  * Setup state to save TIDs/items from a single posting list tuple.
  *
- * Saves an index item into so->currPos.items[itemIndex] for TID that is
- * returned to scan first.  Second or subsequent TIDs for posting list should
- * be saved by calling _bt_savepostingitem().
+ * Saves an index item into newbatch.items[itemIndex] for TID that is returned
+ * to scan first.  Second or subsequent TIDs for posting list should be saved
+ * by calling _bt_savepostingitem().
  *
- * Returns an offset into tuple storage space that main tuple is stored at if
- * needed.
+ * Returns baseOffset, the offset into tuple storage space at which the base
+ * tuple was stored, or 0 when the batch has no tuple storage (currTuples).
  */
 static int
-_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
-					  const ItemPointerData *heapTid, IndexTuple itup)
+_bt_setuppostingitems(IndexScanBatch newbatch, int itemIndex,
+					  OffsetNumber offnum, const ItemPointerData *tableTid,
+					  IndexTuple itup, int *tupleOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *item = &newbatch->items[itemIndex];
 
 	Assert(BTreeTupleIsPosting(itup));
 
-	currItem->heapTid = *heapTid;
-	currItem->indexOffset = offnum;
-	if (so->currTuples)
+	item->tableTid = *tableTid;
+	item->indexOffset = offnum;
+
+	if (newbatch->currTuples)
 	{
 		/* Save base IndexTuple (truncate posting list) */
 		IndexTuple	base;
 		Size		itupsz = BTreeTupleGetPostingOffset(itup);
 
 		itupsz = MAXALIGN(itupsz);
-		currItem->tupleOffset = so->currPos.nextTupleOffset;
-		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		item->tupleOffset = *tupleOffset;
+		base = (IndexTuple) (newbatch->currTuples + *tupleOffset);
 		memcpy(base, itup, itupsz);
 		/* Defensively reduce work area index tuple header size */
 		base->t_info &= ~INDEX_SIZE_MASK;
 		base->t_info |= itupsz;
-		so->currPos.nextTupleOffset += itupsz;
+		*tupleOffset += itupsz;
 
-		return currItem->tupleOffset;
+		return item->tupleOffset;
 	}
 
 	return 0;
 }
 
 /*
- * Save an index item into so->currPos.items[itemIndex] for current posting
+ * Save an index item into newbatch.items[itemIndex] for current posting
  * tuple.
  *
  * Assumes that _bt_setuppostingitems() has already been called for current
- * posting list tuple.  Caller passes its return value as tupleOffset.
+ * posting list tuple.  Caller passes its return value as baseOffset.
  */
 static inline void
-_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
-					ItemPointer heapTid, int tupleOffset)
+_bt_savepostingitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum,
+					ItemPointer tableTid, int baseOffset)
 {
-	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+	BatchMatchingItem *item = &newbatch->items[itemIndex];
 
-	currItem->heapTid = *heapTid;
-	currItem->indexOffset = offnum;
+	item->tableTid = *tableTid;
+	item->indexOffset = offnum;
 
 	/*
 	 * Have index-only scans return the same base IndexTuple for every TID
 	 * that originates from the same posting list
 	 */
-	if (so->currTuples)
-		currItem->tupleOffset = tupleOffset;
+	if (newbatch->currTuples)
+		item->tupleOffset = baseOffset;
 }
 
 #define LOOK_AHEAD_REQUIRED_RECHECKS 	3
@@ -2821,14 +2818,15 @@ new_prim_scan:
 	 *
 	 * Note: We make a soft assumption that the current scan direction will
 	 * also be used within _bt_next, when it is asked to step off this page.
-	 * It is up to _bt_next to cancel this scheduled primitive index scan
-	 * whenever it steps to a page in the direction opposite currPos.dir.
+	 * The scan direction might be reversed during the next amgetbatch call,
+	 * but not before a call to btposreset that resets the array keys to the
+	 * first positions/elements used when scanning in this other direction.
 	 */
 	pstate->continuescan = false;	/* Tell _bt_readpage we're done... */
 	so->needPrimScan = true;	/* ...but call _bt_first again */
 
 	if (scan->parallel_scan)
-		_bt_parallel_primscan_schedule(scan, so->currPos.currPage);
+		_bt_parallel_primscan_schedule(scan, pstate->currpage);
 
 	/* Caller's tuple doesn't match the new qual */
 	return false;
@@ -2841,9 +2839,8 @@ end_toplevel_scan:
 	 * This ends the entire top-level scan in the current scan direction.
 	 *
 	 * Note: The scan's arrays (including any non-required arrays) are now in
-	 * their final positions for the current scan direction.  If the scan
-	 * direction happens to change, then the arrays will already be in their
-	 * first positions for what will then be the current scan direction.
+	 * their final positions for the current scan direction.  This is just
+	 * defensive.
 	 */
 	pstate->continuescan = false;	/* Tell _bt_readpage we're done... */
 	so->needPrimScan = false;	/* ...and don't call _bt_first again */
@@ -2910,17 +2907,9 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
 	/*
 	 * The array keys are now exhausted.
 	 *
-	 * Restore the array keys to the state they were in immediately before we
-	 * were called.  This ensures that the arrays only ever ratchet in the
-	 * current scan direction.
-	 *
-	 * Without this, scans could overlook matching tuples when the scan
-	 * direction gets reversed just before btgettuple runs out of items to
-	 * return, but just after _bt_readpage prepares all the items from the
-	 * scan's final page in so->currPos.  When we're on the final page it is
-	 * typical for so->currPos to get invalidated once btgettuple finally
-	 * returns false, which'll effectively invalidate the scan's array keys.
-	 * That hasn't happened yet, though -- and in general it may never happen.
+	 * Defensively restore the array keys to the positions they were in
+	 * immediately before we were called (i.e. to their final positions for
+	 * the current scan direction).
 	 */
 	_bt_start_array_keys(scan, -dir);
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3df2c752e..5939e728f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -128,6 +128,7 @@ bthandler(PG_FUNCTION_ARGS)
 		.amconsistentequality = true,
 		.amconsistentordering = true,
 		.amcanbackward = true,
+		.amcanmarkpos = true,
 		.amcanunique = true,
 		.amcanmulticol = true,
 		.amoptionalkey = true,
@@ -161,11 +162,13 @@ bthandler(PG_FUNCTION_ARGS)
 		.amadjustmembers = btadjustmembers,
 		.ambeginscan = btbeginscan,
 		.amrescan = btrescan,
-		.amgettuple = btgettuple,
+		.amgettuple = NULL,
+		.amgetbatch = btgetbatch,
+		.amunguardbatch = btunguardbatch,
+		.amkillitemsbatch = btkillitemsbatch,
 		.amgetbitmap = btgetbitmap,
 		.amendscan = btendscan,
-		.ammarkpos = btmarkpos,
-		.amrestrpos = btrestrpos,
+		.amposreset = btposreset,
 		.amestimateparallelscan = btestimateparallelscan,
 		.aminitparallelscan = btinitparallelscan,
 		.amparallelrescan = btparallelrescan,
@@ -224,18 +227,18 @@ btinsert(Relation rel, Datum *values, bool *isnull,
 }
 
 /*
- *	btgettuple() -- Get the next tuple in the scan.
+ *	btgetbatch() -- Get the first or next batch of tuples in the scan
  */
-bool
-btgettuple(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+btgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	bool		res;
+	IndexScanBatch batch = priorbatch;
 
 	Assert(scan->heapRelation != NULL);
 
 	/* btree indexes are never lossy */
-	scan->xs_recheck = false;
+	Assert(!scan->xs_recheck);
 
 	/* Each loop iteration performs another primitive index scan */
 	do
@@ -243,45 +246,20 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 		/*
 		 * If we've already initialized this scan, we can just advance it in
 		 * the appropriate direction.  If we haven't done so yet, we call
-		 * _bt_first() to get the first item in the scan.
+		 * _bt_first() to get the first batch in the scan.
 		 */
-		if (!BTScanPosIsValid(so->currPos))
-			res = _bt_first(scan, dir);
+		if (batch == NULL)
+			batch = _bt_first(scan, dir);
 		else
-		{
-			/*
-			 * Check to see if we should kill the previously-fetched tuple.
-			 */
-			if (scan->kill_prior_tuple)
-			{
-				/*
-				 * Yes, remember it for later. (We'll deal with all such
-				 * tuples at once right before leaving the index page.)  The
-				 * test for numKilled overrun is not just paranoia: if the
-				 * caller reverses direction in the indexscan then the same
-				 * item might get entered multiple times. It's not worth
-				 * trying to optimize that, so we don't detect it, but instead
-				 * just forget any excess entries.
-				 */
-				if (so->killedItems == NULL)
-					so->killedItems = palloc_array(int, MaxTIDsPerBTreePage);
-				if (so->numKilled < MaxTIDsPerBTreePage)
-					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
-			}
+			batch = _bt_next(scan, dir, batch);
 
-			/*
-			 * Now continue the scan.
-			 */
-			res = _bt_next(scan, dir);
-		}
-
-		/* If we have a tuple, return it ... */
-		if (res)
+		/* If we have a batch, return it ... */
+		if (batch)
 			break;
 		/* ... otherwise see if we need another primitive index scan */
 	} while (so->numArrayKeys && _bt_start_prim_scan(scan));
 
-	return res;
+	return batch;
 }
 
 /*
@@ -291,38 +269,43 @@ int64
 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	IndexScanBatch batch;
 	int64		ntids = 0;
-	ItemPointer heapTid;
+	ItemPointer tableTid;
 
 	Assert(scan->heapRelation == NULL);
 
 	/* Each loop iteration performs another primitive index scan */
 	do
 	{
-		/* Fetch the first page & tuple */
-		if (_bt_first(scan, ForwardScanDirection))
+		/* Fetch the first batch */
+		if ((batch = _bt_first(scan, ForwardScanDirection)))
 		{
-			/* Save tuple ID, and continue scanning */
-			heapTid = &scan->xs_heaptid;
-			tbm_add_tuples(tbm, heapTid, 1, false);
+			int			itemIndex = 0;
+
+			/* Save first tuple's TID */
+			tableTid = &batch->items[itemIndex].tableTid;
+			tbm_add_tuples(tbm, tableTid, 1, false);
 			ntids++;
 
 			for (;;)
 			{
-				/*
-				 * Advance to next tuple within page.  This is the same as the
-				 * easy case in _bt_next().
-				 */
-				if (++so->currPos.itemIndex > so->currPos.lastItem)
+				/* Advance to next TID within page-sized batch */
+				if (++itemIndex > batch->lastItem)
 				{
-					/* let _bt_next do the heavy lifting */
-					if (!_bt_next(scan, ForwardScanDirection))
+					/*
+					 * _bt_next releases the prior batch for bitmap callers
+					 * before allocating the next one, so only one batch is
+					 * ever used at a time
+					 */
+					itemIndex = 0;
+					batch = _bt_next(scan, ForwardScanDirection, batch);
+					if (!batch)
 						break;
 				}
 
-				/* Save tuple ID, and continue scanning */
-				heapTid = &so->currPos.items[so->currPos.itemIndex].heapTid;
-				tbm_add_tuples(tbm, heapTid, 1, false);
+				tableTid = &batch->items[itemIndex].tableTid;
+				tbm_add_tuples(tbm, tableTid, 1, false);
 				ntids++;
 			}
 		}
@@ -349,8 +332,6 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 
 	/* allocate private workspace */
 	so = palloc_object(BTScanOpaqueData);
-	BTScanPosInvalidate(so->currPos);
-	BTScanPosInvalidate(so->markPos);
 	if (scan->numberOfKeys > 0)
 		so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
 	else
@@ -364,19 +345,12 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->orderProcs = NULL;
 	so->arrayContext = NULL;
 
-	so->killedItems = NULL;		/* until needed */
-	so->numKilled = 0;
-
-	/*
-	 * We don't know yet whether the scan will be index-only, so we do not
-	 * allocate the tuple workspace arrays until btrescan.  However, we set up
-	 * scan->xs_itupdesc whether we'll need it or not, since that's so cheap.
-	 */
-	so->currTuples = so->markTuples = NULL;
-
-	scan->xs_itupdesc = RelationGetDescr(rel);
-
 	scan->opaque = so;
+	scan->xs_recheck = false;
+	scan->xs_itupdesc = RelationGetDescr(rel);
+	scan->maxitemsbatch = MaxTIDsPerBTreePage;
+	scan->batch_index_opaque_static = MAXALIGN(sizeof(BTBatchData));
+	scan->batch_tuples_workspace = BLCKSZ;
 
 	return scan;
 }
@@ -390,64 +364,177 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pins */
-	if (BTScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_bt_killitems(scan);
-		BTScanPosUnpinIfPinned(so->currPos);
-		BTScanPosInvalidate(so->currPos);
-	}
-
-	/*
-	 * We prefer to eagerly drop leaf page pins before btgettuple returns.
-	 * This avoids making VACUUM wait to acquire a cleanup lock on the page.
-	 *
-	 * We cannot safely drop leaf page pins during index-only scans due to a
-	 * race condition involving VACUUM setting pages all-visible in the VM.
-	 * It's also unsafe for plain index scans that use a non-MVCC snapshot.
-	 *
-	 * Also opt out of dropping leaf page pins eagerly during bitmap scans.
-	 * Pins cannot be held for more than an instant during bitmap scans either
-	 * way, so we might as well avoid wasting cycles on acquiring page LSNs.
-	 *
-	 * See nbtree/README section on making concurrent TID recycling safe.
-	 *
-	 * Note: so->dropPin should never change across rescans.
-	 */
-	so->dropPin = (!scan->xs_want_itup &&
-				   IsMVCCLikeSnapshot(scan->xs_snapshot) &&
-				   scan->heapRelation != NULL);
-
-	so->markItemIndex = -1;
-	so->needPrimScan = false;
-	so->scanBehind = false;
-	so->oppositeDirCheck = false;
-	BTScanPosUnpinIfPinned(so->markPos);
-	BTScanPosInvalidate(so->markPos);
-
-	/*
-	 * Allocate tuple workspace arrays, if needed for an index-only scan and
-	 * not already done in a previous rescan call.  To save on palloc
-	 * overhead, both workspaces are allocated as one palloc block; only this
-	 * function and btendscan know that.
-	 */
-	if (scan->xs_want_itup && so->currTuples == NULL)
-	{
-		so->currTuples = (char *) palloc(BLCKSZ * 2);
-		so->markTuples = so->currTuples + BLCKSZ;
-	}
-
 	/*
 	 * Reset the scan keys
 	 */
 	if (scankey && scan->numberOfKeys > 0)
 		memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData));
+	so->needPrimScan = false;
+	so->scanBehind = false;
+	so->oppositeDirCheck = false;
 	so->numberOfKeys = 0;		/* until _bt_preprocess_keys sets it */
 	so->numArrayKeys = 0;		/* ditto */
 }
 
+/*
+ *	btunguardbatch() -- Drop batch's TID recycling interlock (buffer pin)
+ *
+ * Called by the table AM when it's safe to drop the buffer pin held to
+ * prevent concurrent TID recycling by VACUUM.
+ */
+void
+btunguardbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	BTBatchData *btbatch = BTBatchGetData(scan, batch);
+
+	/* Should be called exactly once iff !batchImmediateUnguard */
+	Assert(!scan->batchImmediateUnguard);
+	Assert(batch->isGuarded);
+
+	ReleaseBuffer(btbatch->buf);
+}
+
+/*
+ *	btkillitemsbatch() -- Mark dead items' index tuples LP_DEAD
+ */
+void
+btkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch)
+{
+	Relation	rel = scan->indexRelation;
+	BTBatchData *btbatch = BTBatchGetData(scan, batch);
+	Page		page;
+	BTPageOpaque opaque;
+	bool		killedsomething = false;
+	Buffer		buf;
+	XLogRecPtr	latestlsn;
+
+	Assert(batch->numDead > 0);
+
+	buf = _bt_getbuf(rel, btbatch->currPage, BT_READ);
+
+	latestlsn = BufferGetLSNAtomic(buf);
+	Assert(batch->lsn <= latestlsn);
+	if (batch->lsn != latestlsn)
+	{
+		/* Modified, give up on hinting */
+		_bt_relbuf(rel, buf);
+		return;
+	}
+
+	page = BufferGetPage(buf);
+	opaque = BTPageGetOpaque(page);
+
+	/* Iterate through batch->deadItems[] in leaf page order */
+	for (int i = 0; i < batch->numDead; i++)
+	{
+		int			itemIndex = batch->deadItems[i];
+		BatchMatchingItem *kitem = &batch->items[itemIndex];
+		OffsetNumber offnum = kitem->indexOffset;
+		ItemId		iid = PageGetItemId(page, offnum);
+		IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+		bool		killtuple = false;
+
+		Assert(itemIndex >= batch->firstItem && itemIndex <= batch->lastItem);
+		Assert(i == 0 ||
+			   offnum >= batch->items[batch->deadItems[i - 1]].indexOffset);
+		Assert(offnum >= P_FIRSTDATAKEY(opaque) &&
+			   offnum <= PageGetMaxOffsetNumber(page));
+
+		if (BTreeTupleIsPosting(ituple))
+		{
+			int			pi = i + 1;
+			int			nposting = BTreeTupleGetNPosting(ituple);
+			int			j;
+
+			/*
+			 * A posting list tuple can only be marked LP_DEAD once every one
+			 * of its heap TIDs is known dead.  Match each heap TID against
+			 * successive dead items, which are in heap TID order just like
+			 * the posting list itself.
+			 */
+			for (j = 0; j < nposting; j++)
+			{
+				ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+				if (!ItemPointerEquals(item, &kitem->tableTid))
+					break;		/* out of posting list loop */
+
+				Assert(kitem->indexOffset == offnum);
+
+				/*
+				 * Read-ahead to later kitems here.
+				 *
+				 * We rely on the assumption that not advancing kitem here
+				 * will prevent us from considering the posting list tuple
+				 * fully dead by not matching its next heap TID in next loop
+				 * iteration.
+				 *
+				 * If, on the other hand, this is the final heap TID in the
+				 * posting list tuple, then tuple gets killed regardless (i.e.
+				 * we handle the case where the last kitem is also the last
+				 * heap TID in the last index tuple correctly -- posting tuple
+				 * still gets killed).
+				 */
+				if (pi < batch->numDead)
+					kitem = &batch->items[batch->deadItems[pi++]];
+			}
+
+			/*
+			 * Don't bother advancing the outermost loop's int iterator to
+			 * avoid processing dead items that relate to the same offnum/
+			 * posting list tuple.  This micro-optimization hardly seems worth
+			 * it.  (Further iterations of the outermost loop will fail to
+			 * match on this same posting list's first heap TID instead, so
+			 * we'll advance to the next offnum/index tuple pretty quickly.)
+			 */
+			if (j == nposting)
+				killtuple = true;
+		}
+		else
+		{
+			Assert(ItemPointerEquals(&ituple->t_tid, &kitem->tableTid));
+			killtuple = true;
+		}
+
+		/* Mark index item as dead, if it isn't already */
+		if (killtuple && !ItemIdIsDead(iid))
+		{
+			if (!killedsomething)
+			{
+				/*
+				 * Use the hint bit infrastructure to check if we can update
+				 * the page while just holding a share lock. If we are not
+				 * allowed, there's no point continuing.
+				 */
+				if (!BufferBeginSetHintBits(buf))
+				{
+					_bt_relbuf(rel, buf);
+					return;
+				}
+			}
+
+			/* found the item/all posting list items */
+			ItemIdMarkDead(iid);
+			killedsomething = true;
+		}
+	}
+
+	/*
+	 * Since this can be redone later if needed, mark as dirty hint.
+	 *
+	 * Whenever we mark anything LP_DEAD, we also set the page's
+	 * BTP_HAS_GARBAGE flag, which is likewise just a hint.  (Note that we
+	 * only rely on the page-level flag in !heapkeyspace indexes.)
+	 */
+	if (killedsomething)
+	{
+		opaque->btpo_flags |= BTP_HAS_GARBAGE;
+		BufferFinishSetHintBits(buf, true, true);
+	}
+
+	_bt_relbuf(rel, buf);
+}
+
 /*
  *	btendscan() -- close down a scan
  */
@@ -456,116 +543,63 @@ btendscan(IndexScanDesc scan)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-	/* we aren't holding any read locks, but gotta drop the pins */
-	if (BTScanPosIsValid(so->currPos))
-	{
-		/* Before leaving current page, deal with any killed items */
-		if (so->numKilled > 0)
-			_bt_killitems(scan);
-		BTScanPosUnpinIfPinned(so->currPos);
-	}
-
-	so->markItemIndex = -1;
-	BTScanPosUnpinIfPinned(so->markPos);
-
-	/* No need to invalidate positions, the RAM is about to be freed. */
-
 	/* Release storage */
 	if (so->keyData != NULL)
 		pfree(so->keyData);
 	/* so->arrayKeys and so->orderProcs are in arrayContext */
 	if (so->arrayContext != NULL)
 		MemoryContextDelete(so->arrayContext);
-	if (so->killedItems != NULL)
-		pfree(so->killedItems);
-	if (so->currTuples != NULL)
-		pfree(so->currTuples);
-	/* so->markTuples should not be pfree'd, see btrescan */
 	pfree(so);
 }
 
 /*
- *	btmarkpos() -- save current scan position
+ *	btposreset() -- reset array key state for scan position change
+ *
+ * Called by the core system when the scan's logical position is about to
+ * change in a way that invalidates our array key state.  This happens when
+ * restoring a marked position, or when the scan crosses a batch boundary
+ * while moving in the opposite direction to the one originally used.
+ *
+ * For direction changes, the core system will have already flipped the
+ * batch's dir field before calling here; we use this updated direction when
+ * resetting our array keys.  For mark restoration, the batch's dir will
+ * retain its original value (from when btgetbatch returned it).
  */
 void
-btmarkpos(IndexScanDesc scan)
+btposreset(IndexScanDesc scan, IndexScanBatch batch)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btbatch = BTBatchGetData(scan, batch);
 
-	/* There may be an old mark with a pin (but no lock). */
-	BTScanPosUnpinIfPinned(so->markPos);
+	if (!so->numArrayKeys)
+		return;
 
 	/*
-	 * Just record the current itemIndex.  If we later step to next page
-	 * before releasing the marked position, _bt_steppage makes a full copy of
-	 * the currPos struct in markPos.  If (as often happens) the mark is moved
-	 * before we leave the page, we don't have to do that work.
+	 * Reset array keys to initial state for the batch's scan direction.  Also
+	 * clear needPrimScan and related flags.  These were set based on the soft
+	 * assumption that the scan would always proceed in the same direction.
+	 *
+	 * These steps work around the soft assumption being violated: they force
+	 * the scan to step to the next/previous page, making the arrays recover.
+	 * When we go to read that page, _bt_readpage will reliably determine if a
+	 * primitive scan really is needed based on the page's tuples.  If there's
+	 * a primitive scan, it will reposition the scan using new array values
+	 * (based on the tuples from the neighboring page we'll step on to).
+	 *
+	 * We need to reset the array key state in the correct direction so that
+	 * we won't get confused.  When the array keys are behind the key space
+	 * for the page we're stepping on to (behind in terms of the scan dir),
+	 * they will catch up automatically.  But when they're ahead of that
+	 * page's key space, the scan could miss matching tuples.
 	 */
-	if (BTScanPosIsValid(so->currPos))
-		so->markItemIndex = so->currPos.itemIndex;
+	_bt_start_array_keys(scan, batch->dir);
+	if (ScanDirectionIsForward(batch->dir))
+		btbatch->moreRight = true;
 	else
-	{
-		BTScanPosInvalidate(so->markPos);
-		so->markItemIndex = -1;
-	}
-}
-
-/*
- *	btrestrpos() -- restore scan to last saved position
- */
-void
-btrestrpos(IndexScanDesc scan)
-{
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
-	if (so->markItemIndex >= 0)
-	{
-		/*
-		 * The scan has never moved to a new page since the last mark.  Just
-		 * restore the itemIndex.
-		 *
-		 * NB: In this case we can't count on anything in so->markPos to be
-		 * accurate.
-		 */
-		so->currPos.itemIndex = so->markItemIndex;
-	}
-	else
-	{
-		/*
-		 * The scan moved to a new page after last mark or restore, and we are
-		 * now restoring to the marked page.  We aren't holding any read
-		 * locks, but if we're still holding the pin for the current position,
-		 * we must drop it.
-		 */
-		if (BTScanPosIsValid(so->currPos))
-		{
-			/* Before leaving current page, deal with any killed items */
-			if (so->numKilled > 0)
-				_bt_killitems(scan);
-			BTScanPosUnpinIfPinned(so->currPos);
-		}
-
-		if (BTScanPosIsValid(so->markPos))
-		{
-			/* bump pin on mark buffer for assignment to current buffer */
-			if (BTScanPosIsPinned(so->markPos))
-				IncrBufferRefCount(so->markPos.buf);
-			memcpy(&so->currPos, &so->markPos,
-				   offsetof(BTScanPosData, items[1]) +
-				   so->markPos.lastItem * sizeof(BTScanPosItem));
-			if (so->currTuples)
-				memcpy(so->currTuples, so->markTuples,
-					   so->markPos.nextTupleOffset);
-			/* Reset the scan's array keys (see _bt_steppage for why) */
-			if (so->numArrayKeys)
-			{
-				_bt_start_array_keys(scan, so->currPos.dir);
-				so->needPrimScan = false;
-			}
-		}
-		else
-			BTScanPosInvalidate(so->currPos);
-	}
+		btbatch->moreLeft = true;
+	so->needPrimScan = false;
+	so->scanBehind = false;
+	so->oppositeDirCheck = false;
 }
 
 /*
@@ -884,15 +918,6 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
 	*next_scan_page = InvalidBlockNumber;
 	*last_curr_page = InvalidBlockNumber;
 
-	/*
-	 * Reset so->currPos, and initialize moreLeft/moreRight such that the next
-	 * call to _bt_readnextpage treats this backend similarly to a serial
-	 * backend that steps from *last_curr_page to *next_scan_page (unless this
-	 * backend's so->currPos is initialized by _bt_readfirstpage before then).
-	 */
-	BTScanPosInvalidate(so->currPos);
-	so->currPos.moreLeft = so->currPos.moreRight = true;
-
 	if (first)
 	{
 		/*
@@ -1042,8 +1067,6 @@ _bt_parallel_done(IndexScanDesc scan)
 	BTParallelScanDesc btscan;
 	bool		status_changed = false;
 
-	Assert(!BTScanPosIsValid(so->currPos));
-
 	/* Do nothing, for non-parallel scans */
 	if (parallel_scan == NULL)
 		return;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index aae6acb7f..4c94b9e59 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -23,53 +23,48 @@
 #include "pgstat.h"
 #include "storage/predicate.h"
 #include "utils/lsyscache.h"
+#include "utils/memdebug.h"
 #include "utils/rel.h"
 
 
-static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so);
+static inline void _bt_batch_unlock(IndexScanDesc scan, IndexScanBatch batch,
+									Buffer buf);
 static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
 							Buffer buf, bool forupdate, BTStack stack,
 							int access);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
-static inline void _bt_returnitem(IndexScanDesc scan, BTScanOpaque so);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum,
-							  ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
-							 BlockNumber lastcurrblkno, ScanDirection dir,
-							 bool seized);
+static IndexScanBatch _bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch,
+										OffsetNumber offnum, ScanDirection dir);
+static IndexScanBatch _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+									   BlockNumber lastcurrblkno,
+									   ScanDirection dir, bool firstpage);
 static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
 										 BlockNumber lastcurrblkno);
-static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
+static IndexScanBatch _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 
 
 /*
- *	_bt_drop_lock_and_maybe_pin()
+ * _bt_batch_unlock() -- nbtree wrapper for indexam_util_unlock_batch.
  *
- * Unlock so->currPos.buf.  If scan is so->dropPin, drop the pin, too.
- * Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
+ * Performs the same Valgrind instrumentation as _bt_unlockbuf.
  */
 static inline void
-_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
+_bt_batch_unlock(IndexScanDesc scan, IndexScanBatch batch, Buffer buf)
 {
-	if (!so->dropPin)
-	{
-		/* Just drop the lock (not the pin) */
-		_bt_unlockbuf(rel, so->currPos.buf);
-		return;
-	}
+#if defined(USE_VALGRIND)
+	Page		page = BufferGetPage(buf);
 
-	/*
-	 * Drop both the lock and the pin.
-	 *
-	 * Have to set so->currPos.lsn so that _bt_killitems has a way to detect
-	 * when concurrent heap TID recycling by VACUUM might have taken place.
-	 */
-	so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
-	_bt_relbuf(rel, so->currPos.buf);
-	so->currPos.buf = InvalidBuffer;
+	VALGRIND_CHECK_MEM_IS_DEFINED(page, BLCKSZ);
+#endif
+
+	indexam_util_unlock_batch(scan, batch, buf);
+
+#if defined(USE_VALGRIND)
+	if (!RelationUsesLocalBuffers(scan->indexRelation))
+		VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ);
+#endif
 }
 
 /*
@@ -860,26 +855,25 @@ _bt_compare(Relation rel,
 }
 
 /*
- *	_bt_first() -- Find the first item in a scan.
+ *	_bt_first() -- Find the first batch in a scan.
  *
  *		We need to be clever about the direction of scan, the search
- *		conditions, and the tree ordering.  We find the first item (or,
- *		if backwards scan, the last item) in the tree that satisfies the
- *		qualifications in the scan key.  On success exit, data about the
- *		matching tuple(s) on the page has been loaded into so->currPos.  We'll
- *		drop all locks and hold onto a pin on page's buffer, except during
- *		so->dropPin scans, when we drop both the lock and the pin.
- *		_bt_returnitem sets the next item to return to scan on success exit.
+ *		conditions, and the tree ordering.  We find the first leaf page (or
+ *		the last leaf page, when scanning backwards) in the tree with at least
+ *		one tuple that satisfies the qualifications in the scan key.  On
+ *		success exit, we return a new batch with that page's matching items.
  *
- * If there are no matching items in the index, we return false, with no
- * pins or locks held.  so->currPos will remain invalid.
+ * If there are no matching items in the index (in the given scan direction),
+ * we just return NULL.  Note that returning NULL doesn't necessarily mean the
+ * end of the top-level scan; caller should check so->needPrimScan to
+ * determine if another primitive index scan is required.
  *
  * Note that scan->keyData[], and the so->keyData[] scankey built from it,
  * are both search-type scankeys (see nbtree/README for more about this).
  * Within this routine, we build a temporary insertion-type scankey to use
  * in locating the scan start position.
  */
-bool
+IndexScanBatch
 _bt_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
@@ -892,8 +886,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat_total = InvalidStrategy;
 	BlockNumber blkno = InvalidBlockNumber,
 				lastcurrblkno;
-
-	Assert(!BTScanPosIsValid(so->currPos));
+	IndexScanBatch firstbatch;
+	BTBatchData *btfirstbatch;
 
 	/*
 	 * Examine the scan keys and eliminate any redundant keys; also mark the
@@ -909,7 +903,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		Assert(!so->needPrimScan);
 		_bt_parallel_done(scan);
-		return false;
+		return NULL;
 	}
 
 	/*
@@ -918,7 +912,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 */
 	if (scan->parallel_scan != NULL &&
 		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, true))
-		return false;
+		return NULL;			/* definitely done (so->needPrimScan is unset) */
 
 	/*
 	 * Initialize the scan's arrays (if any) for the current scan direction
@@ -938,11 +932,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		Assert(!so->needPrimScan);
 		Assert(blkno != P_NONE);
 
-		if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir, true))
-			return false;
-
-		_bt_returnitem(scan, so);
-		return true;
+		return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, true);
 	}
 
 	/*
@@ -1502,17 +1492,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		default:
 			/* can't get here, but keep compiler quiet */
 			elog(ERROR, "unrecognized strat_total: %d", (int) strat_total);
-			return false;
+			return NULL;
 	}
 
+	/* Allocate space for first batch before locking anything */
+	firstbatch = indexam_util_alloc_batch(scan);
+	btfirstbatch = BTBatchGetData(scan, firstbatch);
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
 	Assert(ScanDirectionIsBackward(dir) == inskey.backward);
-	_bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ, false);
+	_bt_search(rel, NULL, &inskey, &btfirstbatch->buf, BT_READ, false);
 
-	if (!BufferIsValid(so->currPos.buf))
+	if (unlikely(!BufferIsValid(btfirstbatch->buf)))
 	{
 		Assert(!so->needPrimScan);
 
@@ -1528,22 +1522,23 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		if (IsolationIsSerializable())
 		{
 			PredicateLockRelation(rel, scan->xs_snapshot);
-			_bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ, false);
+			_bt_search(rel, NULL, &inskey, &btfirstbatch->buf, BT_READ, false);
 		}
 
-		if (!BufferIsValid(so->currPos.buf))
+		if (!BufferIsValid(btfirstbatch->buf))
 		{
 			_bt_parallel_done(scan);
-			return false;
+			indexam_util_release_batch(scan, firstbatch);
+			return NULL;
 		}
 	}
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, &inskey, so->currPos.buf);
+	offnum = _bt_binsrch(rel, &inskey, btfirstbatch->buf);
 
 	/*
 	 * Now load data from the first page of the scan (usually the page
-	 * currently in so->currPos.buf).
+	 * currently in btfirstbatch->buf).
 	 *
 	 * If inskey.nextkey = false and inskey.backward = false, offnum is
 	 * positioned at the first non-pivot tuple >= inskey.scankeys.
@@ -1561,164 +1556,72 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * for the page.  For example, when inskey is both < the leaf page's high
 	 * key and > all of its non-pivot tuples, offnum will be "maxoff + 1".
 	 */
-	if (!_bt_readfirstpage(scan, offnum, dir))
-		return false;
-
-	_bt_returnitem(scan, so);
-	return true;
+	return _bt_readfirstpage(scan, firstbatch, offnum, dir);
 }
 
 /*
- *	_bt_next() -- Get the next item in a scan.
+ *	_bt_next() -- Get the next batch in a scan.
  *
- *		On entry, so->currPos describes the current page, which may be pinned
- *		but is not locked, and so->currPos.itemIndex identifies which item was
- *		previously returned.
+ *		On entry, priorbatch describes the batch that was last returned by
+ *		btgetbatch.  We'll use the prior batch's positioning information to
+ *		decide which leaf page to read next.
  *
- *		On success exit, so->currPos is updated as needed, and _bt_returnitem
- *		sets the next item to return to the scan.  so->currPos remains valid.
- *
- *		On failure exit (no more tuples), we invalidate so->currPos.  It'll
- *		still be possible for the scan to return tuples by changing direction,
- *		though we'll need to call _bt_first anew in that other direction.
+ *		On success exit, returns the next batch.  There must be at least one
+ *		matching tuple on any returned batch (else we'd just return NULL).
+ *		Note that returning NULL doesn't necessarily mean the end of the
+ *		top-level scan; caller should check so->needPrimScan to determine
+ *		if another primitive index scan is required.
  */
-bool
-_bt_next(IndexScanDesc scan, ScanDirection dir)
+IndexScanBatch
+_bt_next(IndexScanDesc scan, ScanDirection dir, IndexScanBatch priorbatch)
 {
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
-	Assert(BTScanPosIsValid(so->currPos));
-
-	/*
-	 * Advance to next tuple on current page; or if there's no more, try to
-	 * step to the next page with data.
-	 */
-	if (ScanDirectionIsForward(dir))
-	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
-		{
-			if (!_bt_steppage(scan, dir))
-				return false;
-		}
-	}
-	else
-	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
-		{
-			if (!_bt_steppage(scan, dir))
-				return false;
-		}
-	}
-
-	_bt_returnitem(scan, so);
-	return true;
-}
-
-/*
- * Return the index item from so->currPos.items[so->currPos.itemIndex] to the
- * index scan by setting the relevant fields in caller's index scan descriptor
- */
-static inline void
-_bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
-{
-	BTScanPosItem *currItem = &so->currPos.items[so->currPos.itemIndex];
-
-	/* Most recent _bt_readpage must have succeeded */
-	Assert(BTScanPosIsValid(so->currPos));
-	Assert(so->currPos.itemIndex >= so->currPos.firstItem);
-	Assert(so->currPos.itemIndex <= so->currPos.lastItem);
-
-	/* Return next item, per amgettuple contract */
-	scan->xs_heaptid = currItem->heapTid;
-	if (so->currTuples)
-		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
-}
-
-/*
- *	_bt_steppage() -- Step to next page containing valid data for scan
- *
- * Wrapper on _bt_readnextpage that performs final steps for the current page.
- *
- * On entry, so->currPos must be valid.  Its buffer will be pinned, though
- * never locked. (Actually, when so->dropPin there won't even be a pin held,
- * though so->currPos.currPage must still be set to a valid block number.)
- */
-static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
-{
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btpriorbatch = BTBatchGetData(scan, priorbatch);
 	BlockNumber blkno,
 				lastcurrblkno;
-
-	Assert(BTScanPosIsValid(so->currPos));
-
-	/* Before leaving current page, deal with any killed items */
-	if (so->numKilled > 0)
-		_bt_killitems(scan);
+	bool		moreInDir;
 
 	/*
-	 * Before we modify currPos, make a copy of the page data if there was a
-	 * mark position that needs it.
+	 * The core code must deal with cross-batch scan direction changes for us.
+	 * A batch management routine that flips priorbatch's scan direction (and
+	 * calls btposreset to deal with the scan's array keys) is used for this.
 	 */
-	if (so->markItemIndex >= 0)
-	{
-		/* bump pin on current buffer for assignment to mark buffer */
-		if (BTScanPosIsPinned(so->currPos))
-			IncrBufferRefCount(so->currPos.buf);
-		memcpy(&so->markPos, &so->currPos,
-			   offsetof(BTScanPosData, items[1]) +
-			   so->currPos.lastItem * sizeof(BTScanPosItem));
-		if (so->markTuples)
-			memcpy(so->markTuples, so->currTuples,
-				   so->currPos.nextTupleOffset);
-		so->markPos.itemIndex = so->markItemIndex;
-		so->markItemIndex = -1;
-
-		/*
-		 * If we're just about to start the next primitive index scan
-		 * (possible with a scan that has arrays keys, and needs to skip to
-		 * continue in the current scan direction), moreLeft/moreRight only
-		 * indicate the end of the current primitive index scan.  They must
-		 * never be taken to indicate that the top-level index scan has ended
-		 * (that would be wrong).
-		 *
-		 * We could handle this case by treating the current array keys as
-		 * markPos state.  But depending on the current array state like this
-		 * would add complexity.  Instead, we just unset markPos's copy of
-		 * moreRight or moreLeft (whichever might be affected), while making
-		 * btrestrpos reset the scan's arrays to their initial scan positions.
-		 * In effect, btrestrpos leaves advancing the arrays up to the first
-		 * _bt_readpage call (that takes place after it has restored markPos).
-		 */
-		if (so->needPrimScan)
-		{
-			if (ScanDirectionIsForward(so->currPos.dir))
-				so->markPos.moreRight = true;
-			else
-				so->markPos.moreLeft = true;
-		}
-
-		/* mark/restore not supported by parallel scans */
-		Assert(!scan->parallel_scan);
-	}
-
-	BTScanPosUnpinIfPinned(so->currPos);
+	Assert(priorbatch->dir == dir);
 
 	/* Walk to the next page with data */
 	if (ScanDirectionIsForward(dir))
-		blkno = so->currPos.nextPage;
+		blkno = btpriorbatch->nextPage;
 	else
-		blkno = so->currPos.prevPage;
-	lastcurrblkno = so->currPos.currPage;
+		blkno = btpriorbatch->prevPage;
+	lastcurrblkno = btpriorbatch->currPage;
+	moreInDir = ScanDirectionIsForward(dir) ?
+		btpriorbatch->moreRight : btpriorbatch->moreLeft;
 
 	/*
-	 * Cancel primitive index scans that were scheduled when the call to
-	 * _bt_readpage for currPos happened to use the opposite direction to the
-	 * one that we're stepping in now.  (It's okay to leave the scan's array
-	 * keys as-is, since the next _bt_readpage will advance them.)
+	 * For bitmap scan callers, release the prior batch now so that
+	 * _bt_readnextpage can reuse its memory.  That way bitmap scans never
+	 * need more than one batch allocation.
 	 */
-	if (so->currPos.dir != dir)
-		so->needPrimScan = false;
+	if (!scan->usebatchring)
+		indexam_util_release_batch(scan, priorbatch);
+
+	if (blkno == P_NONE || !moreInDir)
+	{
+		/*
+		 * priorbatch's page is known to be the final leaf page with matches
+		 * in this scan direction (its _bt_readpage call figured that out).
+		 *
+		 * Note: if so->needPrimScan is set, then priorbatch's leaf page is
+		 * actually just the final page for the current primitive index scan
+		 * in this scan direction (the scan will continue in _bt_first).
+		 */
+		_bt_parallel_done(scan);
+		return NULL;
+	}
+
+	/* parallel scan must seize the scan to get next blkno */
+	if (scan->parallel_scan != NULL &&
+		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		return NULL;			/* done iff so->needPrimScan wasn't set */
 
 	return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
 }
@@ -1732,178 +1635,169 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  * to stop the scan on this page by calling _bt_checkkeys against the high
  * key.  See _bt_readpage for full details.
  *
- * On entry, so->currPos must be pinned and locked (so offnum stays valid).
+ * On entry, firstbatch must be pinned and locked (so offnum stays valid).
  * Parallel scan callers must have seized the scan before calling here.
  *
- * On exit, we'll have updated so->currPos and retained locks and pins
- * according to the same rules as those laid out for _bt_readnextpage exit.
- * Like _bt_readnextpage, our return value indicates if there are any matching
- * records in the given direction.
+ * On success exit, returns unlocked batch containing data from the next page
+ * that has at least one matching item.  If there are no matching items in the
+ * given scan direction, we just return NULL.  Note that returning NULL
+ * doesn't necessarily mean the end of the top-level scan; btgetbatch and
+ * btgetbitmap check so->needPrimScan to determine if another primitive index
+ * scan is required.
  *
  * We always release the scan for a parallel scan caller, regardless of
  * success or failure; we'll call _bt_parallel_release as soon as possible.
  */
-static bool
-_bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
+static IndexScanBatch
+_bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch,
+				  OffsetNumber offnum, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTBatchData *btfirstbatch = BTBatchGetData(scan, firstbatch);
+	BlockNumber blkno,
+				lastcurrblkno;
+	bool		moreInDir;
 
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
-
-	/* Initialize so->currPos for the first page (page in so->currPos.buf) */
+	/* Initialize firstbatch's position for the first page */
 	if (so->needPrimScan)
 	{
 		Assert(so->numArrayKeys);
 
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = true;
+		btfirstbatch->moreLeft = true;
+		btfirstbatch->moreRight = true;
 		so->needPrimScan = false;
 	}
 	else if (ScanDirectionIsForward(dir))
 	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
+		btfirstbatch->moreLeft = false;
+		btfirstbatch->moreRight = true;
 	}
 	else
 	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
+		btfirstbatch->moreLeft = true;
+		btfirstbatch->moreRight = false;
 	}
 
 	/*
 	 * Attempt to load matching tuples from the first page.
 	 *
-	 * Note that _bt_readpage will finish initializing the so->currPos fields.
+	 * Note that _bt_readpage will finish initializing the firstbatch fields.
 	 * _bt_readpage also releases parallel scan (even when it returns false).
 	 */
-	if (_bt_readpage(scan, dir, offnum, true))
+	if (_bt_readpage(scan, firstbatch, dir, offnum, true))
 	{
-		Relation	rel = scan->indexRelation;
-
-		/*
-		 * _bt_readpage succeeded.  Drop the lock (and maybe the pin) on
-		 * so->currPos.buf in preparation for btgettuple returning tuples.
-		 */
-		Assert(BTScanPosIsPinned(so->currPos));
-		_bt_drop_lock_and_maybe_pin(rel, so);
-		return true;
+		/* _bt_readpage saved one or more matches in firstbatch.items[] */
+		_bt_batch_unlock(scan, firstbatch, btfirstbatch->buf);
+		return firstbatch;
 	}
 
-	/* There's no actually-matching data on the page in so->currPos.buf */
-	_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+	/* There's no actually-matching data on the page returned by _bt_search */
+	_bt_relbuf(scan->indexRelation, btfirstbatch->buf);
 
-	/* Call _bt_readnextpage using its _bt_steppage wrapper function */
-	if (!_bt_steppage(scan, dir))
-		return false;
+	/* Walk to the next page with data */
+	if (ScanDirectionIsForward(dir))
+		blkno = btfirstbatch->nextPage;
+	else
+		blkno = btfirstbatch->prevPage;
+	lastcurrblkno = btfirstbatch->currPage;
+	moreInDir = ScanDirectionIsForward(dir) ?
+		btfirstbatch->moreRight : btfirstbatch->moreLeft;
 
-	/* _bt_readpage for a later page (now in so->currPos) succeeded */
-	return true;
+	/* Release firstbatch (will be recycled if we reach _bt_readnextpage) */
+	indexam_util_release_batch(scan, firstbatch);
+
+	if (blkno == P_NONE || !moreInDir)
+	{
+		/*
+		 * firstbatch's _bt_readpage call ended scan in this direction (though
+		 * if so->needPrimScan was set the scan will continue in _bt_first)
+		 */
+		_bt_parallel_done(scan);
+		return NULL;
+	}
+
+	/* parallel scan must seize the scan to get next blkno */
+	if (scan->parallel_scan != NULL &&
+		!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		return NULL;			/* done iff so->needPrimScan wasn't set */
+
+	return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
 }
 
 /*
  *	_bt_readnextpage() -- Read next page containing valid data for _bt_next
  *
- * Caller's blkno is the next interesting page's link, taken from either the
- * previously-saved right link or left link.  lastcurrblkno is the page that
- * was current at the point where the blkno link was saved, which we use to
- * reason about concurrent page splits/page deletions during backwards scans.
- * In the common case where seized=false, blkno is either so->currPos.nextPage
- * or so->currPos.prevPage, and lastcurrblkno is so->currPos.currPage.
+ * Caller's blkno is the prior batch's nextPage or prevPage (depending on the
+ * current scan direction), and lastcurrblkno is the prior batch's currPage.
+ * We use lastcurrblkno to reason about concurrent page splits/page deletions
+ * during backwards scans.
  *
- * On entry, so->currPos shouldn't be locked by caller.  so->currPos.buf must
- * be InvalidBuffer/unpinned as needed by caller (note that lastcurrblkno
- * won't need to be read again in almost all cases).  Parallel scan callers
- * that seized the scan before calling here should pass seized=true; such a
- * caller's blkno and lastcurrblkno arguments come from the seized scan.
- * seized=false callers just pass us the blkno/lastcurrblkno taken from their
- * so->currPos, which (along with so->currPos itself) can be used to end the
- * scan.  A seized=false caller's blkno can never be assumed to be the page
- * that must be read next during a parallel scan, though.  We must figure that
- * part out for ourselves by seizing the scan (the correct page to read might
- * already be beyond the seized=false caller's blkno during a parallel scan,
- * unless blkno/so->currPos.nextPage/so->currPos.prevPage is already P_NONE,
- * or unless so->currPos.moreRight/so->currPos.moreLeft is already unset).
+ * On entry, no page should be locked by caller.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page, and we return true.  We hold a pin on the buffer on
- * success exit (except during so->dropPin index scans, when we drop the pin
- * eagerly to avoid blocking VACUUM).
+ * On success exit, returns unlocked batch containing data from the next page
+ * that has at least one matching item.  If there are no more matching items
+ * in the given scan direction, we just return NULL.  Note that returning NULL
+ * doesn't necessarily mean the end of the top-level scan; btgetbatch and
+ * btgetbitmap check so->needPrimScan to determine if another primitive index
+ * scan is required.
  *
- * If there are no more matching records in the given direction, we invalidate
- * so->currPos (while ensuring it retains no locks or pins), and return false.
- *
- * We always release the scan for a parallel scan caller, regardless of
- * success or failure; we'll call _bt_parallel_release as soon as possible.
+ * Parallel scan callers must seize the scan before calling here.  blkno and
+ * lastcurrblkno should come from the seized scan.  We'll release the scan as
+ * soon as possible.
  */
-static bool
+static IndexScanBatch
 _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
-				 BlockNumber lastcurrblkno, ScanDirection dir, bool seized)
+				 BlockNumber lastcurrblkno, ScanDirection dir, bool firstpage)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	IndexScanBatch newbatch;
+	BTBatchData *btnewbatch;
 
-	Assert(so->currPos.currPage == lastcurrblkno || seized);
-	Assert(!(blkno == P_NONE && seized));
-	Assert(!BTScanPosIsPinned(so->currPos));
+	/* Allocate space for new batch before locking anything */
+	newbatch = indexam_util_alloc_batch(scan);
+	btnewbatch = BTBatchGetData(scan, newbatch);
 
 	/*
-	 * Remember that the scan already read lastcurrblkno, a page to the left
-	 * of blkno (or remember reading a page to the right, for backwards scans)
+	 * newbatch will be the batch for blkno, a page to the right of
+	 * lastcurrblkno (or to the left, when the scan is moving backwards).
+	 *
+	 * Note: caller's blkno is tentative.  newbatch actually stores matches
+	 * from the next leaf page in this scan direction that has at least one
+	 * matching item.  This is usually caller's blkno page, but might be some
+	 * other page to its right (or to its left) instead.
 	 */
-	if (ScanDirectionIsForward(dir))
-		so->currPos.moreLeft = true;
-	else
-		so->currPos.moreRight = true;
+	btnewbatch->moreLeft = true;	/* for lastcurrblkno (or tentative) */
+	btnewbatch->moreRight = true;	/* tentative (or for lastcurrblkno) */
 
 	for (;;)
 	{
 		Page		page;
 		BTPageOpaque opaque;
 
-		if (blkno == P_NONE ||
-			(ScanDirectionIsForward(dir) ?
-			 !so->currPos.moreRight : !so->currPos.moreLeft))
-		{
-			/* most recent _bt_readpage call (for lastcurrblkno) ended scan */
-			Assert(so->currPos.currPage == lastcurrblkno && !seized);
-			BTScanPosInvalidate(so->currPos);
-			_bt_parallel_done(scan);	/* iff !so->needPrimScan */
-			return false;
-		}
-
-		Assert(!so->needPrimScan);
-
-		/* parallel scan must never actually visit so->currPos blkno */
-		if (!seized && scan->parallel_scan != NULL &&
-			!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
-		{
-			/* whole scan is now done (or another primitive scan required) */
-			BTScanPosInvalidate(so->currPos);
-			return false;
-		}
+		Assert(!((BTScanOpaque) scan->opaque)->needPrimScan);
+		Assert(blkno != P_NONE && lastcurrblkno != P_NONE);
 
 		if (ScanDirectionIsForward(dir))
 		{
 			/* read blkno, but check for interrupts first */
 			CHECK_FOR_INTERRUPTS();
-			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			btnewbatch->buf = _bt_getbuf(rel, blkno, BT_READ);
 		}
 		else
 		{
 			/* read blkno, avoiding race (also checks for interrupts) */
-			so->currPos.buf = _bt_lock_and_validate_left(rel, &blkno,
+			btnewbatch->buf = _bt_lock_and_validate_left(rel, &blkno,
 														 lastcurrblkno);
-			if (so->currPos.buf == InvalidBuffer)
+			if (btnewbatch->buf == InvalidBuffer)
 			{
 				/* must have been a concurrent deletion of leftmost page */
-				BTScanPosInvalidate(so->currPos);
 				_bt_parallel_done(scan);
-				return false;
+				indexam_util_release_batch(scan, newbatch);
+				return NULL;
 			}
 		}
 
-		page = BufferGetPage(so->currPos.buf);
+		page = BufferGetPage(btnewbatch->buf);
 		opaque = BTPageGetOpaque(page);
 		lastcurrblkno = blkno;
 		if (likely(!P_IGNORE(opaque)))
@@ -1911,17 +1805,17 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 			/* see if there are any matches on this page */
 			if (ScanDirectionIsForward(dir))
 			{
-				/* note that this will clear moreRight if we can stop */
-				if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque), seized))
+				if (_bt_readpage(scan, newbatch, dir,
+								 P_FIRSTDATAKEY(opaque), firstpage))
 					break;
-				blkno = so->currPos.nextPage;
+				blkno = btnewbatch->nextPage;
 			}
 			else
 			{
-				/* note that this will clear moreLeft if we can stop */
-				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), seized))
+				if (_bt_readpage(scan, newbatch, dir,
+								 PageGetMaxOffsetNumber(page), firstpage))
 					break;
-				blkno = so->currPos.prevPage;
+				blkno = btnewbatch->prevPage;
 			}
 		}
 		else
@@ -1936,19 +1830,38 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 		}
 
 		/* no matching tuples on this page */
-		_bt_relbuf(rel, so->currPos.buf);
-		seized = false;			/* released by _bt_readpage (or by us) */
+		_bt_relbuf(rel, btnewbatch->buf);
+
+		/* Continue the scan in this direction? */
+		if (blkno == P_NONE ||
+			(ScanDirectionIsForward(dir) ?
+			 !btnewbatch->moreRight : !btnewbatch->moreLeft))
+		{
+			/*
+			 * blkno's _bt_readpage call ended scan in this direction (though
+			 * if so->needPrimScan was set the scan will continue in _bt_first)
+			 */
+			_bt_parallel_done(scan);
+			indexam_util_release_batch(scan, newbatch);
+			return NULL;
+		}
+
+		/* parallel scan must seize the scan to get next blkno */
+		if (scan->parallel_scan != NULL &&
+			!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+		{
+			indexam_util_release_batch(scan, newbatch);
+			return NULL;		/* done iff so->needPrimScan wasn't set */
+		}
+
+		firstpage = false;		/* next page cannot be first */
 	}
 
-	/*
-	 * _bt_readpage succeeded.  Drop the lock (and maybe the pin) on
-	 * so->currPos.buf in preparation for btgettuple returning tuples.
-	 */
-	Assert(so->currPos.currPage == blkno);
-	Assert(BTScanPosIsPinned(so->currPos));
-	_bt_drop_lock_and_maybe_pin(rel, so);
+	/* _bt_readpage saved one or more matches in newbatch.items[] */
+	Assert(btnewbatch->currPage == blkno);
+	_bt_batch_unlock(scan, newbatch, btnewbatch->buf);
 
-	return true;
+	return newbatch;
 }
 
 /*
@@ -2174,25 +2087,29 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost)
  * Parallel scan callers must have seized the scan before calling here.
  * Exit conditions are the same as for _bt_first().
  */
-static bool
+static IndexScanBatch
 _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	IndexScanBatch firstbatch;
+	BTBatchData *btfirstbatch;
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber start;
 
-	Assert(!BTScanPosIsValid(so->currPos));
-	Assert(!so->needPrimScan);
+	Assert(!((BTScanOpaque) scan->opaque)->needPrimScan);
+
+	/* Allocate space for first batch before locking anything */
+	firstbatch = indexam_util_alloc_batch(scan);
+	btfirstbatch = BTBatchGetData(scan, firstbatch);
 
 	/*
 	 * Scan down to the leftmost or rightmost leaf page.  This is a simplified
 	 * version of _bt_search().
 	 */
-	so->currPos.buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
+	btfirstbatch->buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
 
-	if (!BufferIsValid(so->currPos.buf))
+	if (!BufferIsValid(btfirstbatch->buf))
 	{
 		/*
 		 * Empty index. Lock the whole relation, as nothing finer to lock
@@ -2200,10 +2117,11 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
 		_bt_parallel_done(scan);
-		return false;
+		indexam_util_release_batch(scan, firstbatch);
+		return NULL;
 	}
 
-	page = BufferGetPage(so->currPos.buf);
+	page = BufferGetPage(btfirstbatch->buf);
 	opaque = BTPageGetOpaque(page);
 	Assert(P_ISLEAF(opaque));
 
@@ -2229,9 +2147,5 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * Now load data from the first page of the scan.
 	 */
-	if (!_bt_readfirstpage(scan, start, dir))
-		return false;
-
-	_bt_returnitem(scan, so);
-	return true;
+	return _bt_readfirstpage(scan, firstbatch, start, dir);
 }
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 014faa162..b6c977b96 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -19,10 +19,7 @@
 
 #include "access/nbtree.h"
 #include "access/reloptions.h"
-#include "access/relscan.h"
 #include "commands/progress.h"
-#include "common/int.h"
-#include "lib/qunique.h"
 #include "miscadmin.h"
 #include "storage/lwlock.h"
 #include "storage/subsystems.h"
@@ -31,7 +28,6 @@
 #include "utils/rel.h"
 
 
-static int	_bt_compare_int(const void *va, const void *vb);
 static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
 						   IndexTuple firstright, BTScanInsert itup_key);
 
@@ -146,247 +142,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	return key;
 }
 
-/*
- * qsort comparison function for int arrays
- */
-static int
-_bt_compare_int(const void *va, const void *vb)
-{
-	int			a = *((const int *) va);
-	int			b = *((const int *) vb);
-
-	return pg_cmp_s32(a, b);
-}
-
-/*
- * _bt_killitems - set LP_DEAD state for items an indexscan caller has
- * told us were killed
- *
- * scan->opaque, referenced locally through so, contains information about the
- * current page and killed tuples thereon (generally, this should only be
- * called if so->numKilled > 0).
- *
- * Caller should not have a lock on the so->currPos page, but must hold a
- * buffer pin when !so->dropPin.  When we return, it still won't be locked.
- * It'll continue to hold whatever pins were held before calling here.
- *
- * We match items by heap TID before assuming they are the right ones to set
- * LP_DEAD.  If the scan is one that holds a buffer pin on the target page
- * continuously from initially reading the items until applying this function
- * (if it is a !so->dropPin scan), VACUUM cannot have deleted any items on the
- * page, so the page's TIDs can't have been recycled by now.  There's no risk
- * that we'll confuse a new index tuple that happens to use a recycled TID
- * with a now-removed tuple with the same TID (that used to be on this same
- * page).  We can't rely on that during scans that drop buffer pins eagerly
- * (so->dropPin scans), though, so we must condition setting LP_DEAD bits on
- * the page LSN having not changed since back when _bt_readpage saw the page.
- * We totally give up on setting LP_DEAD bits when the page LSN changed.
- *
- * We give up much less often during !so->dropPin scans, but it still happens.
- * We cope with cases where items have moved right due to insertions.  If an
- * item has moved off the current page due to a split, we'll fail to find it
- * and just give up on it.
- */
-void
-_bt_killitems(IndexScanDesc scan)
-{
-	Relation	rel = scan->indexRelation;
-	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Page		page;
-	BTPageOpaque opaque;
-	OffsetNumber minoff;
-	OffsetNumber maxoff;
-	int			numKilled = so->numKilled;
-	bool		killedsomething = false;
-	Buffer		buf;
-
-	Assert(numKilled > 0);
-	Assert(BTScanPosIsValid(so->currPos));
-	Assert(scan->heapRelation != NULL); /* can't be a bitmap index scan */
-
-	/* Always invalidate so->killedItems[] before leaving so->currPos */
-	so->numKilled = 0;
-
-	/*
-	 * We need to iterate through so->killedItems[] in leaf page order; the
-	 * loop below expects this (when marking posting list tuples, at least).
-	 * so->killedItems[] is now in whatever order the scan returned items in.
-	 * Scrollable cursor scans might have even saved the same item/TID twice.
-	 *
-	 * Sort and unique-ify so->killedItems[] to deal with all this.
-	 */
-	if (numKilled > 1)
-	{
-		qsort(so->killedItems, numKilled, sizeof(int), _bt_compare_int);
-		numKilled = qunique(so->killedItems, numKilled, sizeof(int),
-							_bt_compare_int);
-	}
-
-	if (!so->dropPin)
-	{
-		/*
-		 * We have held the pin on this page since we read the index tuples,
-		 * so all we need to do is lock it.  The pin will have prevented
-		 * concurrent VACUUMs from recycling any of the TIDs on the page.
-		 */
-		Assert(BTScanPosIsPinned(so->currPos));
-		buf = so->currPos.buf;
-		_bt_lockbuf(rel, buf, BT_READ);
-	}
-	else
-	{
-		XLogRecPtr	latestlsn;
-
-		Assert(!BTScanPosIsPinned(so->currPos));
-		buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
-
-		latestlsn = BufferGetLSNAtomic(buf);
-		Assert(so->currPos.lsn <= latestlsn);
-		if (so->currPos.lsn != latestlsn)
-		{
-			/* Modified, give up on hinting */
-			_bt_relbuf(rel, buf);
-			return;
-		}
-
-		/* Unmodified, hinting is safe */
-	}
-
-	page = BufferGetPage(buf);
-	opaque = BTPageGetOpaque(page);
-	minoff = P_FIRSTDATAKEY(opaque);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	/* Iterate through so->killedItems[] in leaf page order */
-	for (int i = 0; i < numKilled; i++)
-	{
-		int			itemIndex = so->killedItems[i];
-		BTScanPosItem *kitem = &so->currPos.items[itemIndex];
-		OffsetNumber offnum = kitem->indexOffset;
-
-		Assert(itemIndex >= so->currPos.firstItem &&
-			   itemIndex <= so->currPos.lastItem);
-		Assert(i == 0 ||
-			   offnum >= so->currPos.items[so->killedItems[i - 1]].indexOffset);
-
-		if (offnum < minoff)
-			continue;			/* pure paranoia */
-		while (offnum <= maxoff)
-		{
-			ItemId		iid = PageGetItemId(page, offnum);
-			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
-			bool		killtuple = false;
-
-			if (BTreeTupleIsPosting(ituple))
-			{
-				int			pi = i + 1;
-				int			nposting = BTreeTupleGetNPosting(ituple);
-				int			j;
-
-				/*
-				 * Note that the page may have been modified in almost any way
-				 * since we first read it (in the !so->dropPin case), so it's
-				 * possible that this posting list tuple wasn't a posting list
-				 * tuple when we first encountered its heap TIDs.
-				 */
-				for (j = 0; j < nposting; j++)
-				{
-					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
-
-					if (!ItemPointerEquals(item, &kitem->heapTid))
-						break;	/* out of posting list loop */
-
-					/*
-					 * kitem must have matching offnum when heap TIDs match,
-					 * though only in the common case where the page can't
-					 * have been concurrently modified
-					 */
-					Assert(kitem->indexOffset == offnum || !so->dropPin);
-
-					/*
-					 * Read-ahead to later kitems here.
-					 *
-					 * We rely on the assumption that not advancing kitem here
-					 * will prevent us from considering the posting list tuple
-					 * fully dead by not matching its next heap TID in next
-					 * loop iteration.
-					 *
-					 * If, on the other hand, this is the final heap TID in
-					 * the posting list tuple, then tuple gets killed
-					 * regardless (i.e. we handle the case where the last
-					 * kitem is also the last heap TID in the last index tuple
-					 * correctly -- posting tuple still gets killed).
-					 */
-					if (pi < numKilled)
-						kitem = &so->currPos.items[so->killedItems[pi++]];
-				}
-
-				/*
-				 * Don't bother advancing the outermost loop's int iterator to
-				 * avoid processing killed items that relate to the same
-				 * offnum/posting list tuple.  This micro-optimization hardly
-				 * seems worth it.  (Further iterations of the outermost loop
-				 * will fail to match on this same posting list's first heap
-				 * TID instead, so we'll advance to the next offnum/index
-				 * tuple pretty quickly.)
-				 */
-				if (j == nposting)
-					killtuple = true;
-			}
-			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
-				killtuple = true;
-
-			/*
-			 * Mark index item as dead, if it isn't already.  Since this
-			 * happens while holding a buffer lock possibly in shared mode,
-			 * it's possible that multiple processes attempt to do this
-			 * simultaneously, leading to multiple full-page images being sent
-			 * to WAL (if wal_log_hints or data checksums are enabled), which
-			 * is undesirable.
-			 */
-			if (killtuple && !ItemIdIsDead(iid))
-			{
-				if (!killedsomething)
-				{
-					/*
-					 * Use the hint bit infrastructure to check if we can
-					 * update the page while just holding a share lock. If we
-					 * are not allowed, there's no point continuing.
-					 */
-					if (!BufferBeginSetHintBits(buf))
-						goto unlock_page;
-				}
-
-				/* found the item/all posting list items */
-				ItemIdMarkDead(iid);
-				killedsomething = true;
-				break;			/* out of inner search loop */
-			}
-			offnum = OffsetNumberNext(offnum);
-		}
-	}
-
-	/*
-	 * Since this can be redone later if needed, mark as dirty hint.
-	 *
-	 * Whenever we mark anything LP_DEAD, we also set the page's
-	 * BTP_HAS_GARBAGE flag, which is likewise just a hint.  (Note that we
-	 * only rely on the page-level flag in !heapkeyspace indexes.)
-	 */
-	if (killedsomething)
-	{
-		opaque->btpo_flags |= BTP_HAS_GARBAGE;
-		BufferFinishSetHintBits(buf, true, true);
-	}
-
-unlock_page:
-	if (!so->dropPin)
-		_bt_unlockbuf(rel, buf);
-	else
-		_bt_relbuf(rel, buf);
-}
-
-
 /*
  * The following routines manage a shared-memory area in which we track
  * assignment of "vacuum cycle IDs" to currently-active btree vacuuming
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dff7d286f..3bc5e5ccd 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -1095,15 +1095,15 @@ btree_mask(char *pagedata, BlockNumber blkno)
 		/*
 		 * In btree leaf pages, it is possible to modify the LP_FLAGS without
 		 * emitting any WAL record. Hence, mask the line pointer flags. See
-		 * _bt_killitems(), _bt_check_unique() for details.
+		 * btkillitemsbatch(), _bt_check_unique() for details.
 		 */
 		mask_lp_flags(page);
 	}
 
 	/*
 	 * BTP_HAS_GARBAGE is just an un-logged hint bit. So, mask it. See
-	 * _bt_delete_or_dedup_one_page(), _bt_killitems(), and _bt_check_unique()
-	 * for details.
+	 * _bt_delete_or_dedup_one_page(), btkillitemsbatch(), and
+	 * _bt_check_unique() for details.
 	 */
 	maskopaq->btpo_flags &= ~BTP_HAS_GARBAGE;
 
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index f2ee333f6..745435da3 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -54,6 +54,7 @@ spghandler(PG_FUNCTION_ARGS)
 		.amconsistentequality = false,
 		.amconsistentordering = false,
 		.amcanbackward = false,
+		.amcanmarkpos = false,
 		.amcanunique = false,
 		.amcanmulticol = false,
 		.amoptionalkey = true,
@@ -88,10 +89,12 @@ spghandler(PG_FUNCTION_ARGS)
 		.ambeginscan = spgbeginscan,
 		.amrescan = spgrescan,
 		.amgettuple = spggettuple,
+		.amgetbatch = NULL,
+		.amunguardbatch = NULL,
+		.amkillitemsbatch = NULL,
 		.amgetbitmap = spggetbitmap,
 		.amendscan = spgendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 72d2c662b..42dc36df5 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -51,8 +51,11 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->parallelscan_reinitialize != NULL);
 
 	Assert(routine->index_scan_begin != NULL);
-	Assert(routine->index_scan_reset != NULL);
+	Assert(routine->index_scan_rescan != NULL);
 	Assert(routine->index_scan_end != NULL);
+	Assert(routine->index_scan_batch_init != NULL);
+	Assert(routine->index_scan_markpos != NULL);
+	Assert(routine->index_scan_restrpos != NULL);
 
 	Assert(routine->fetch_tid != NULL);
 	Assert(routine->tuple_fetch_row_version != NULL);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 9ab74c8df..1caffeb89 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -885,7 +885,7 @@ DefineIndex(ParseState *pstate,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("access method \"%s\" does not support multicolumn indexes",
 						accessMethodName)));
-	if (exclusion && amRoutine->amgettuple == NULL)
+	if (exclusion && amRoutine->amgettuple == NULL && amRoutine->amgetbatch == NULL)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("access method \"%s\" does not support exclusion constraints",
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 28a23db0b..84a97b71d 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -315,7 +315,7 @@ ExecIndexOnlyMarkPos(IndexOnlyScanState *node)
 		}
 	}
 
-	index_markpos(node->ioss_ScanDesc);
+	table_index_scan_markpos(node->ioss_ScanDesc);
 }
 
 /* ----------------------------------------------------------------
@@ -344,7 +344,7 @@ ExecIndexOnlyRestrPos(IndexOnlyScanState *node)
 		}
 	}
 
-	index_restrpos(node->ioss_ScanDesc);
+	table_index_scan_restrpos(node->ioss_ScanDesc);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 457fbdb07..ba3d4996c 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -872,7 +872,7 @@ ExecIndexMarkPos(IndexScanState *node)
 		}
 	}
 
-	index_markpos(node->iss_ScanDesc);
+	table_index_scan_markpos(node->iss_ScanDesc);
 }
 
 /* ----------------------------------------------------------------
@@ -901,7 +901,7 @@ ExecIndexRestrPos(IndexScanState *node)
 		}
 	}
 
-	index_restrpos(node->iss_ScanDesc);
+	table_index_scan_restrpos(node->iss_ScanDesc);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index f8421a74c..6e8fe07f6 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -54,8 +54,8 @@
  *		the inner "5's". This requires repositioning the inner "cursor"
  *		to point at the first inner "5". This is done by "marking" the
  *		first inner 5 so we can restore the "cursor" to it before joining
- *		with the second outer 5. The access method interface provides
- *		routines to mark and restore to a tuple.
+ *		with the second outer 5. The table AM interface provides
+ *		routines to mark and restore to a tuple during index scans.
  *
  *
  *		Essential operation of the merge join algorithm is as follows:
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 3f5d4fa31..94fedf32c 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -44,7 +44,7 @@
 /* Whether we are looking for plain indexscan, bitmap scan, or either */
 typedef enum
 {
-	ST_INDEXSCAN,				/* must support amgettuple */
+	ST_INDEXSCAN,				/* must support amgettuple or amgetbatch */
 	ST_BITMAPSCAN,				/* must support amgetbitmap */
 	ST_ANYSCAN,					/* either is okay */
 } ScanTypeControl;
@@ -746,7 +746,7 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexPath  *ipath = (IndexPath *) lfirst(lc);
 
-		if (index->amhasgettuple)
+		if (index->amcanplainscan)
 			add_path(rel, (Path *) ipath);
 
 		if (index->amhasgetbitmap &&
@@ -834,7 +834,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	switch (scantype)
 	{
 		case ST_INDEXSCAN:
-			if (!index->amhasgettuple)
+			if (!index->amcanplainscan)
 				return NIL;
 			break;
 		case ST_BITMAPSCAN:
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 7c4be1748..54bbc6692 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -310,11 +310,11 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 				info->amsearcharray = amroutine->amsearcharray;
 				info->amsearchnulls = amroutine->amsearchnulls;
 				info->amcanparallel = amroutine->amcanparallel;
-				info->amhasgettuple = (amroutine->amgettuple != NULL);
+				info->amcanplainscan = (amroutine->amgetbatch != NULL ||
+										amroutine->amgettuple != NULL);
 				info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
 					relation->rd_tableam->scan_bitmap_next_tuple != NULL;
-				info->amcanmarkpos = (amroutine->ammarkpos != NULL &&
-									  amroutine->amrestrpos != NULL);
+				info->amcanmarkpos = amroutine->amcanmarkpos;
 				info->amcostestimate = amroutine->amcostestimate;
 				Assert(info->amcostestimate != NULL);
 
@@ -411,7 +411,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 				info->amsearcharray = false;
 				info->amsearchnulls = false;
 				info->amcanparallel = false;
-				info->amhasgettuple = false;
+				info->amcanplainscan = false;
 				info->amhasgetbitmap = false;
 				info->amcanmarkpos = false;
 				info->amcostestimate = NULL;
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 296cbaede..a328fdb81 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -835,6 +835,7 @@ IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap)
 {
 	AttrNumber	keycol;
 	oidvector  *indclass;
+	const IndexAmRoutine *amroutine;
 
 	/* The index must not be a partial index */
 	if (!heap_attisnull(idxrel->rd_indextuple, Anum_pg_index_indpred, NULL))
@@ -886,10 +887,12 @@ IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap)
 		return false;
 
 	/*
-	 * The given index access method must implement "amgettuple", which will
-	 * be used later to fetch the tuples.  See RelationFindReplTupleByIndex().
+	 * The given index access method must implement "amgettuple" or
+	 * "amgetbatch", which will be used later to fetch the tuples.  See
+	 * RelationFindReplTupleByIndex().
 	 */
-	if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL)
+	amroutine = GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false);
+	if (amroutine->amgettuple == NULL && amroutine->amgetbatch == NULL)
 		return false;
 
 	return true;
diff --git a/src/backend/utils/adt/amutils.c b/src/backend/utils/adt/amutils.c
index c81fb61a0..ddfd1b55c 100644
--- a/src/backend/utils/adt/amutils.c
+++ b/src/backend/utils/adt/amutils.c
@@ -363,10 +363,11 @@ indexam_property(FunctionCallInfo fcinfo,
 				PG_RETURN_BOOL(routine->amclusterable);
 
 			case AMPROP_INDEX_SCAN:
-				PG_RETURN_BOOL(routine->amgettuple ? true : false);
+				PG_RETURN_BOOL(routine->amgettuple != NULL ||
+							   routine->amgetbatch != NULL);
 
 			case AMPROP_BITMAP_SCAN:
-				PG_RETURN_BOOL(routine->amgetbitmap ? true : false);
+				PG_RETURN_BOOL(routine->amgetbitmap != NULL);
 
 			case AMPROP_BACKWARD_SCAN:
 				PG_RETURN_BOOL(routine->amcanbackward);
@@ -392,7 +393,8 @@ indexam_property(FunctionCallInfo fcinfo,
 			PG_RETURN_BOOL(routine->amcanmulticol);
 
 		case AMPROP_CAN_EXCLUDE:
-			PG_RETURN_BOOL(routine->amgettuple ? true : false);
+			PG_RETURN_BOOL(routine->amgettuple != NULL ||
+						   routine->amgetbatch != NULL);
 
 		case AMPROP_CAN_INCLUDE:
 			PG_RETURN_BOOL(routine->amcaninclude);
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3ef2d66f8..c5e1f9010 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -395,7 +395,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 			 RelationGetRelationName(rel));
 
 	/*
-	 * This assertion matches the one in index_getnext_tid().  See page
+	 * This assertion matches the one in heapam_indexscan.c.  See page
 	 * recycling/"visible to everyone" notes in nbtree README.
 	 */
 	Assert(TransactionIdIsValid(RecentXmin));
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 5111cdc6d..249af48e6 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -113,6 +113,7 @@ blhandler(PG_FUNCTION_ARGS)
 		.amconsistentequality = false,
 		.amconsistentordering = false,
 		.amcanbackward = false,
+		.amcanmarkpos = false,
 		.amcanunique = false,
 		.amcanmulticol = true,
 		.amoptionalkey = true,
@@ -146,10 +147,12 @@ blhandler(PG_FUNCTION_ARGS)
 		.ambeginscan = blbeginscan,
 		.amrescan = blrescan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amunguardbatch = NULL,
+		.amkillitemsbatch = NULL,
 		.amgetbitmap = blgetbitmap,
 		.amendscan = blendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index f48da3185..8725fa36f 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -117,6 +117,8 @@ typedef struct IndexAmRoutine
     bool        amconsistentordering;
     /* does AM support backward scanning? */
     bool        amcanbackward;
+    /* does AM support mark/restore of a scan position? */
+    bool        amcanmarkpos;
     /* does AM support UNIQUE indexes? */
     bool        amcanunique;
     /* does AM support multi-column indexes? */
@@ -167,10 +169,12 @@ typedef struct IndexAmRoutine
     ambeginscan_function ambeginscan;
     amrescan_function amrescan;
     amgettuple_function amgettuple;     /* can be NULL */
+    amgetbatch_function amgetbatch; /* can be NULL */
+    amunguardbatch_function amunguardbatch; /* can be NULL */
+    amkillitemsbatch_function amkillitemsbatch;	/* can be NULL */
     amgetbitmap_function amgetbitmap;   /* can be NULL */
     amendscan_function amendscan;
-    ammarkpos_function ammarkpos;       /* can be NULL */
-    amrestrpos_function amrestrpos;     /* can be NULL */
+    amposreset_function amposreset; /* can be NULL */
 
     /* interface functions to support parallel index scans */
     amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
@@ -280,6 +284,19 @@ typedef struct IndexAmRoutine
    predicates, an update of such an attribute always disables <acronym>HOT</acronym>.
   </para>
 
+  <para>
+   The <structfield>amcanmarkpos</structfield> flag indicates whether the
+   index access method supports marking a scan position and later restoring
+   the scan to it.  The planner uses this to decide whether an index scan can
+   be used on the inner side of a merge join directly, or whether a
+   <literal>Materialize</literal> node must be interposed.  Mark and restore
+   build on the <function>amgetbatch</function> batch interface (see <xref
+    linkend="index-scanning"/>), so an access method that provides
+   <function>amgettuple</function> (not <function>amgetbatch</function>) must
+   leave this false.  An <function>amgetbatch</function> access method may
+   also leave it false if its scans cannot be rewound to an earlier position.
+  </para>
+
  </sect1>
 
  <sect1 id="index-functions">
@@ -676,8 +693,49 @@ ambeginscan (Relation indexRelation,
    <emphasis>must</emphasis> create this struct by calling
    <function>RelationGetIndexScan()</function>.  In most cases
    <function>ambeginscan</function> does little beyond making that call and perhaps
-   acquiring locks;
+   acquiring locks and initializing standard <structname>IndexScanDesc</structname> fields;
    the interesting parts of index-scan startup are in <function>amrescan</function>.
+   Index access methods that use the <function>amgetbatch</function> interface
+   must also set the following fields in the scan descriptor:
+   <itemizedlist>
+    <listitem>
+     <para>
+      <literal>scan-&gt;maxitemsbatch</literal>: the maximum number of items
+      that can appear in a single batch (typically derived from the index page
+      size, e.g., <literal>MaxIndexTuplesPerPage</literal>).
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      <literal>scan-&gt;batch_index_opaque_static</literal>: the
+      <literal>MAXALIGN</literal>'d size of the index AM's mandatory per-batch
+      opaque area, whose size is fixed at compile time.  Each batch allocation
+      reserves this much space immediately before the
+      <structname>IndexScanBatchData</structname> pointer, for use by the index
+      AM to store per-page navigation state (e.g., batch index page's buffer pin
+      and sibling page links).
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      <literal>scan-&gt;batch_tuples_workspace</literal>: the size in bytes
+      of the per-batch tuple storage workspace used for index-only scans
+      (typically <literal>BLCKSZ</literal>), or 0 if the index AM does not
+      support index-only scans.  The workspace is accessible via
+      <structfield>batch-&gt;currTuples</structfield>.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <para>
+   An <function>amgetbatch</function> access method whose recheck requirement is
+   a fixed property of the whole scan (rather than something that varies from
+   one matching item to the next) should also set
+   <literal>scan-&gt;xs_recheck</literal> here, in
+   <function>ambeginscan</function>, since the value applies to every item the
+   scan returns.  The value set here persists across any subsequent
+   <function>amrescan</function> calls.  B-tree (always false) works this way.
   </para>
 
   <para>
@@ -749,6 +807,259 @@ amgettuple (IndexScanDesc scan,
    <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
    struct must be set to NULL.
   </para>
+  <note>
+   <para>
+    As of <productname>PostgreSQL</productname> version 20, mark/restore of
+    scan positions is built on the <function>amgetbatch</function> interface;
+    <function>amgettuple</function> scans do not support mark/restore.  The
+    implementation lives in the table AM, through its
+    <function>table_index_scan_markpos</function> and
+    <function>table_index_scan_restrpos</function> implementations.  An
+    <function>amgetbatch</function> access method must additionally advertise
+    support by setting <structfield>amcanmarkpos</structfield>; when it does
+    not, the planner materializes the scan's output rather than relying on
+    mark/restore.
+   </para>
+  </note>
+
+  <para>
+<programlisting>
+IndexScanBatch
+amgetbatch (IndexScanDesc scan,
+            IndexScanBatch priorbatch,
+            ScanDirection direction);
+</programlisting>
+   Return the next batch of index tuples in the given scan, moving in the
+   given direction (forward or backward in the index).  Returns an instance of
+   <type>IndexScanBatch</type> with index tuples loaded, or
+   <literal>NULL</literal> if there are no more index tuples in the given
+   scan direction.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> interface is an alternative to
+   <function>amgettuple</function> for <firstterm>incremental index
+   scans</firstterm> &mdash; scans whose matches are returned one position at a
+   time (so they can drive a cursor), as opposed to a bitmap scan
+   (<function>amgetbitmap</function>), which returns all matches at once.
+   Where <function>amgettuple</function> returns one matching entry per call,
+   <function>amgetbatch</function> returns them in batches.  Because all
+   matching index entries from a single index page are returned together, the
+   table AM gains visibility into which table blocks will be needed soon.
+  </para>
+
+  <para>
+   The table AM passes <literal>priorbatch</literal> to indicate where the
+   index AM should continue scanning from (or <literal>NULL</literal> on the
+   first call for the scan).  The index AM uses information from
+   <literal>priorbatch</literal> to determine which index page to read next.
+   Unlike <function>amgettuple</function>, where the index AM maintains its
+   own scan position, with <function>amgetbatch</function> it is the caller
+   that controls the progress of the scan through the index.  The caller
+   will typically pass the most recently returned batch, but this is not
+   guaranteed &mdash; for example, following the restoration of a marked
+   position, an earlier batch may be passed instead.
+  </para>
+
+  <para>
+   A batch returned by <function>amgetbatch</function> is associated with an
+   index page containing at least one matching item.  Until the table AM is
+   done following the batch's items to the heap, the index AM holds an
+   interlock against concurrent TID recycling by <command>VACUUM</command>
+   &mdash; almost always a buffer pin on that index page, though an index AM
+   may use some other interlock instead (see
+   <function>amunguardbatch</function>).  The table AM controls when this
+   interlock is dropped, by calling <function>amunguardbatch</function> when
+   it is safe to do so.  See <xref linkend="index-locking"/> for details on
+   buffer pin management during index scans.
+  </para>
+
+  <para>
+   An <type>IndexScanBatch</type> that is returned by
+   <function>amgetbatch</function> is no longer managed by the index access
+   method.  It is up to the table AM caller to decide when it should be
+   released.  Note also that <function>amgetbatch</function> functions must
+   never modify the <literal>priorbatch</literal> parameter.  The core
+   <filename>src/backend/access/nbtree/</filename> implementation provides a
+   reference example of the <function>amgetbatch</function> interface.
+  </para>
+
+  <para>
+   The same caveats described for <function>amgettuple</function> apply here
+   too: an entry in the returned batch means only that the index contains
+   an entry that matches the scan keys, not that the tuple necessarily still
+   exists in the heap or will pass the caller's snapshot test.
+  </para>
+
+  <para>
+   Index access methods using <function>amgetbatch</function> must set
+   <literal>scan-&gt;xs_recheck</literal> to indicate whether rechecking of
+   scan keys is required, in the same way as <function>amgettuple</function>
+   does. However, <literal>scan-&gt;xs_recheck</literal> must be set consistently
+   for an entire scan rather than varying on a per-tuple basis. This is a key
+   difference from <function>amgettuple</function>, which can set
+   <literal>scan-&gt;xs_recheck</literal> independently for each tuple it returns.
+   Index access methods that require granular control over
+   <literal>scan-&gt;xs_recheck</literal> must use the <function>amgettuple</function>
+   interface instead of <function>amgetbatch</function>.
+  </para>
+
+  <para>
+   Similarly, the <function>amgetbatch</function> interface does not currently
+   support index-only scans that return data in the form of a
+   <structname>HeapTuple</structname> pointer stored in
+   <literal>scan-&gt;xs_hitup</literal>.
+  </para>
+
+  <para>
+   The index access method must provide either <function>amgetbatch</function>
+   or <function>amgettuple</function>, but not both.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> function need only be provided if the
+   access method supports <quote>plain</quote> index scans.  If it doesn't,
+   the <function>amgetbatch</function> field in its
+   <structname>IndexAmRoutine</structname> struct must be set to NULL.
+  </para>
+
+  <para>
+<programlisting>
+void
+amunguardbatch (IndexScanDesc scan,
+                IndexScanBatch batch);
+</programlisting>
+   Called by the table AM (via
+   <function>tableam_util_unguard_batch</function>) when it is safe to drop
+   the TID recycling interlock that the index AM holds for the batch, which
+   prevents concurrent TID recycling by <command>VACUUM</command>.  This
+   interlock is opaque to core code: formally, an index AM may hold any kind
+   of interlock (or several) in its per-batch opaque area, and for that reason
+   is not even required to use the standard helper
+   <function>indexam_util_unlock_batch</function> to manage it.  In practice,
+   though, most or all index AMs will use that helper and hold the simplest
+   possible interlock.  For example, each guarded B-tree batch keeps a single
+   buffer pin on the one index page the batch came from.  See <xref
+    linkend="index-locking"/> for details on buffer pin management during
+   index scans.  This function will be called at most once for each guarded
+   batch; it is not called when the index AM has already unguarded the batch
+   itself (as it does when <structfield>batchImmediateUnguard</structfield> is
+   true, which is the common case).
+  </para>
+
+  <note>
+   <para>
+    The index AM may choose to retain its own buffer pins when this serves an
+    internal purpose (for example, maintaining a descent stack of pinned index
+    pages for reuse across <function>amgetbatch</function> calls).  However,
+    any scheme that retains buffer pins managed by the index AM must be sure
+    to free the pins at an opportune point (at a minimum whenever
+    <function>amendscan</function> is called, and typically when
+    <function>amrescan</function> is called).  It must also keep the number of
+    retained pins fixed and small.
+   </para>
+  </note>
+
+  <para>
+   The <function>amunguardbatch</function> function is required for any index
+   access method that provides <function>amgetbatch</function>.
+  </para>
+
+  <para>
+<programlisting>
+void
+amkillitemsbatch (IndexScanDesc scan,
+                  IndexScanBatch batch);
+</programlisting>
+   Called by the table AM when it has finished processing a batch that
+   contains dead items, to set <literal>LP_DEAD</literal> bits in the batch's
+   index page.  The batch's index page will not be locked by the caller; the
+   index AM must acquire and release its own lock (and pin) on the index page.
+  </para>
+
+  <para>
+   Implementing <function>amkillitemsbatch</function> is optional for
+   <function>amgetbatch</function> index AMs (those that don't can leave
+   the field set to <literal>NULL</literal>), but doing so is recommended for
+   performance, as it allows future scans to skip known-dead index entries.
+   The core index access method that currently supports
+   <function>amgetbatch</function> (B-tree) implements
+   <literal>LP_DEAD</literal> marking, though third-party index access methods
+   are free to choose whether to implement this feature.  The table AM may
+   call <function>tableam_util_scanpos_killitem</function> to mark dead items as
+   the scan progresses.  If the batch contains any such dead items, the batch's
+   <structfield>deadItems</structfield> array will have been sorted and
+   deduplicated before <function>amkillitemsbatch</function> is called: its
+   entries are indexes into the batch's <structfield>items</structfield>
+   array, sorted into ascending order with no index appearing more than once.
+   Because the <structfield>items</structfield> array is itself ordered by
+   increasing index page offset number, these dead-item indexes are likewise
+   in ascending page-offset order, so the index AM can walk
+   <structfield>deadItems</structfield> in lockstep with the index page's item
+   pointers (resolving each entry through <structfield>items</structfield> to
+   its page offset).  This also means the table AM need not call
+   <function>tableam_util_scanpos_killitem</function> in any particular order.
+  </para>
+
+  <note>
+   <para>
+    Index access methods using <function>amgettuple</function> rely on the
+    <structfield>kill_prior_tuple</structfield> mechanism instead to mark dead
+    tuples.
+   </para>
+  </note>
+
+  <para>
+   When implementing <function>amkillitemsbatch</function>, the index AM
+   must verify that the index page has not been modified since the batch was
+   originally read.  The standard way to do this is to call
+   <function>indexam_util_unlock_batch</function> during
+   <function>amgetbatch</function>, which releases the index page lock and
+   saves the page LSN in
+   the batch's <structfield>lsn</structfield> field.  Later, within
+   <function>amkillitemsbatch</function>, the index AM re-reads the page,
+   compares the current page LSN against
+   <structfield>batch-&gt;lsn</structfield>, and gives up on setting
+   <literal>LP_DEAD</literal> bits if the LSN has advanced.  An advanced LSN
+   indicates that the page was modified &mdash; possibly by
+   <command>VACUUM</command> recycling table TIDs &mdash; so it would be
+   unsafe to assume that index entries still point to the same heap/table
+   tuples.  Since <literal>LP_DEAD</literal> marking is only an optimization
+   hint, it is always safe to skip it.  B-tree uses this approach.
+  </para>
+
+  <warning>
+   <para>
+    This LSN comparison technique requires the index AM to use fake
+    (monotonically increasing) LSNs on its pages for relations where WAL is
+    not generated, since real LSNs are not available in that case.  See the
+    B-tree index implementation for a reference example of this
+    technique.  An index AM that does not implement fake LSNs can still
+    provide <function>amkillitemsbatch</function>, but should simply do
+    nothing when the relation does not generate WAL (i.e., when
+    <function>RelationNeedsWAL()</function> is false), since the LSN
+    comparison would be unreliable.
+   </para>
+  </warning>
+
+  <tip>
+   <para>
+    Index AMs are not obligated to use
+    <function>indexam_util_unlock_batch</function> &mdash; they can implement
+    their own equivalent, and are free to use the batch
+    <structfield>lsn</structfield> field in whatever way they deem necessary.
+   </para>
+  </tip>
+
+  <para>
+   This LSN-based verification means that the table AM need not consider
+   whether unguarding a batch could introduce TID recycling hazards for a
+   subsequent <function>amkillitemsbatch</function> call.  The hazards are the
+   same in both cases, but since <function>amkillitemsbatch</function>
+   independently verifies the page LSN and can always safely give up on
+   setting <literal>LP_DEAD</literal> bits, correctness is obvious without any
+   coupling between the two.
+  </para>
 
   <para>
 <programlisting>
@@ -768,8 +1079,8 @@ amgetbitmap (IndexScanDesc scan,
    itself, and therefore callers recheck both the scan conditions and the
    partial index predicate (if any) for recheckable tuples.  That might not
    always be true, however.
-   <function>amgetbitmap</function> and
-   <function>amgettuple</function> cannot be used in the same index scan; there
+   Only one of <function>amgetbitmap</function>, <function>amgetbatch</function>,
+   or <function>amgettuple</function> can be used in any given index scan; there
    are other restrictions too when using <function>amgetbitmap</function>, as explained
    in <xref linkend="index-scanning"/>.
   </para>
@@ -781,6 +1092,39 @@ amgetbitmap (IndexScanDesc scan,
    struct must be set to NULL.
   </para>
 
+  <para>
+   Index access methods that support <function>amgetbatch</function> will
+   typically also support <function>amgetbitmap</function>, and almost all of
+   the index AM's internal scanning code is shared between the two paths.  The
+   main difference is that during <function>amgetbitmap</function> only one
+   batch is allocated at a time (via <function>indexam_util_alloc_batch</function>),
+   unlike <function>amgetbatch</function> where the table AM manages several
+   batches in a dedicated batch ring buffer data structure.
+  </para>
+
+  <para>
+   The only change needed to maintain this one-batch-at-a-time invariant is
+   a single call to <function>indexam_util_release_batch</function> at the
+   point where the scan moves between index pages, conditional on the scan's
+   <structfield>usebatchring</structfield> field being false (indicating a
+   bitmap index scan).  The index AM releases its prior batch
+   just as it is about to generate the next batch &mdash; the same point
+   where it extracts navigation state (such as sibling-page links) from
+   <literal>priorbatch</literal>.  No other changes to the index AM's
+   scanning logic are needed.  This early release is specific to
+   <function>amgetbitmap</function> scans; during
+   <function>amgetbatch</function> scans the <literal>priorbatch</literal>
+   is strictly owned by the caller (the table AM), and the index AM must
+   never release it.  See <function>_bt_next</function> for a reference
+   example.
+  </para>
+
+  <para>
+   The released batch is cached internally and reused by the next
+   <function>indexam_util_alloc_batch</function> call, avoiding repeated
+   memory allocation during the bitmap scan.
+  </para>
+
   <para>
 <programlisting>
 void
@@ -795,32 +1139,52 @@ amendscan (IndexScanDesc scan);
   <para>
 <programlisting>
 void
-ammarkpos (IndexScanDesc scan);
+amposreset (IndexScanDesc scan,
+            IndexScanBatch batch);
 </programlisting>
-   Mark current scan position.  The access method need only support one
-   remembered scan position per scan.
+   Notify the index AM that the table AM is about to change the scan's
+   logical position in a way that may invalidate index AM state that
+   independently tracks the scan's progress.  This callback is invoked when
+   the table AM is about to process a batch in a different direction than
+   was used when the batch was originally returned by
+   <function>amgetbatch</function>, and also when a marked scan position is
+   about to be restored.  Some index AMs maintain internal state that
+   advances in lockstep with the scan under the soft assumption that the scan
+   direction will not change.  Such state may fall behind the scan's true
+   position without harm (simply reading the next index page will allow the
+   state to <quote>catch up</quote>), but must never get ahead of it.  When
+   the scan direction changes or a marked position is restored, the assumption
+   is violated, so the index AM must reset the state to a safe starting point
+   for the batch's scan direction (as given by the batch's
+   <structfield>dir</structfield> field, discussed below).  For example,
+   B-tree uses this callback to reset its <literal>ScalarArrayOpExpr</literal>
+   array keys to their initial positions.
   </para>
 
   <para>
-   The <function>ammarkpos</function> function need only be provided if the access
-   method supports ordered scans.  If it doesn't,
-   the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
-   struct may be set to NULL.
+   When <function>amposreset</function> is called due to a cross-batch
+   direction change, the core system will have already flipped the batch's
+   <structfield>dir</structfield> field to reflect the new scan direction
+   before making the call.  The index AM should use this updated direction
+   when resetting any state that depends on knowing which way the scan is
+   proceeding.  When called to restore a marked position, the batch's
+   <structfield>dir</structfield> is not modified; it retains the direction
+   from when the batch was originally returned.  In both cases, the batch
+   passed to <function>amposreset</function> is the batch that will next be
+   passed to <function>amgetbatch</function> as its
+   <literal>priorbatch</literal>.  Note in particular that the
+   <literal>priorbatch</literal>.<structfield>dir</structfield> field is
+   guaranteed to have the same scan direction as when
+   <function>amposreset</function> was called.
   </para>
 
   <para>
-<programlisting>
-void
-amrestrpos (IndexScanDesc scan);
-</programlisting>
-   Restore the scan to the most recently marked position.
-  </para>
-
-  <para>
-   The <function>amrestrpos</function> function need only be provided if the access
-   method supports ordered scans.  If it doesn't,
-   the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
-   struct may be set to NULL.
+   Index access methods that have private state which must be reset when the
+   scan position changes must provide an <function>amposreset</function>
+   implementation.  Index AMs with no such state may set
+   <structfield>amposreset</structfield> to NULL.  The
+   <function>amposreset</function> function can only be provided by an access
+   method that implements the <function>amgetbatch</function> interface.
   </para>
 
   <para>
@@ -975,6 +1339,8 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
        Access methods that always return entries in the natural ordering
        of their data (such as btree) should set
        <structfield>amcanorder</structfield> to true.
+       Both <function>amgetbatch</function> and <function>amgettuple</function>
+       scans support this capability.
        Currently, such access methods must use btree-compatible strategy
        numbers for their equality and ordering operators.
       </para>
@@ -994,34 +1360,40 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
   </para>
 
   <para>
-   The <function>amgettuple</function> function has a <literal>direction</literal> argument,
+   Note that <function>amgetbatch</function> scans do not currently support
+   ordering operators.
+  </para>
+
+  <para>
+   The <function>amgetbatch</function> function has a <literal>direction</literal> argument,
    which can be either <literal>ForwardScanDirection</literal> (the normal case)
    or  <literal>BackwardScanDirection</literal>.  If the first call after
    <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
-   set of matching index entries is to be scanned back-to-front rather than in
-   the normal front-to-back direction, so <function>amgettuple</function> must return
-   the last matching tuple in the index, rather than the first one as it
-   normally would.  (This will only occur for access
-   methods that set <structfield>amcanorder</structfield> to true.)  After the
-   first call, <function>amgettuple</function> must be prepared to advance the scan in
+   returned batch must be the batch containing the last matching item(s),
+   rather than the batch containing the first matching item(s).
+   <function>amgetbatch</function> must be prepared to advance the scan in
    either direction from the most recently returned entry.  (But if
    <structfield>amcanbackward</structfield> is false, all subsequent
    calls will have the same direction as the first one.)
   </para>
 
   <para>
-   Access methods that support ordered scans must support <quote>marking</quote> a
-   position in a scan and later returning to the marked position.  The same
-   position might be restored multiple times.  However, only one position need
-   be remembered per scan; a new <function>ammarkpos</function> call overrides the
-   previously marked position.  An access method that does not support ordered
-   scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
-   functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
-   instead.
+   An <function>amgetbatch</function> index AM can support mark/restore of
+   scan positions by setting <structfield>amcanmarkpos</structfield>
+   (<function>amgettuple</function> index AMs do not support mark/restore).
+   The mark/restore implementation lives in the table AM and works with any
+   <function>amgetbatch</function> implementation; however, an index access
+   method whose scans cannot be rewound to an earlier position should leave
+   <structfield>amcanmarkpos</structfield> false, in which case the planner
+   materializes the scan's output where mark/restore would be needed.
+   Separately, index AMs that maintain internal state which tracks the scan's
+   progress must provide an <function>amposreset</function> callback to be
+   notified when the scan's logical position changes unexpectedly, such as
+   during mark/restore; see <xref linkend="index-functions"/> for details.
   </para>
 
   <para>
-   Both the scan position and the mark position (if any) must be maintained
+   The scan position must be maintained by the table AM and index AM
    consistently in the face of concurrent insertions or deletions in the
    index.  It is OK if a freshly-inserted entry is not returned by a scan that
    would have found the entry if it had existed when the scan started, or for
@@ -1040,15 +1412,17 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    which the index returns the actual data not just the TID of the heap tuple.
    This will only avoid I/O if the visibility map shows that the TID is on an
    all-visible page; else the heap tuple must be visited anyway to check
-   MVCC visibility.  But that is no concern of the access method's.
+   MVCC visibility.  But that is no concern of the index access method's.
   </para>
 
   <para>
-   Instead of using <function>amgettuple</function>, an index scan can be done with
-   <function>amgetbitmap</function> to fetch all tuples in one call.  This can be
-   noticeably more efficient than <function>amgettuple</function> because it allows
-   avoiding lock/unlock cycles within the access method.  In principle
-   <function>amgetbitmap</function> should have the same effects as repeated
+   Instead of using <function>amgetbatch</function> or
+   <function>amgettuple</function>, an index scan can be done with
+   <function>amgetbitmap</function> to fetch all tuples in one call.  This can
+   be noticeably more efficient than an incremental index scan because the
+   table AM visits each table block at most once, in physical block order.
+   In principle <function>amgetbitmap</function> should have the
+   same effects as repeated <function>amgetbatch</function> or
    <function>amgettuple</function> calls, but we impose several restrictions to
    simplify matters.  First of all, <function>amgetbitmap</function> returns all
    tuples at once and marking or restoring scan positions isn't
@@ -1059,15 +1433,16 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    Also, there is no provision for index-only scans with
    <function>amgetbitmap</function>, since there is no way to return the contents of
    index tuples.
-   Finally, <function>amgetbitmap</function>
-   does not guarantee any locking of the returned tuples, with implications
-   spelled out in <xref linkend="index-locking"/>.
+   Finally, <function>amgetbitmap</function> does not hold any index page pins
+   after it returns (similarly to plain, non-index-only
+   <function>amgetbatch</function> scans that use an MVCC snapshot), as
+   described in <xref linkend="index-locking"/>.
   </para>
 
   <para>
    Note that it is permitted for an access method to implement only
-   <function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa,
-   if its internal implementation is unsuited to one API or the other.
+   <function>amgetbitmap</function> and not <function>amgetbatch</function>/<function>amgettuple</function>,
+   or vice versa, if its internal implementation is unsuited to one API or the other.
   </para>
 
  </sect1>
@@ -1123,11 +1498,22 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
      </listitem>
      <listitem>
       <para>
-       An index scan must maintain a pin
-       on the index page holding the item last returned by
-       <function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete
-       entries from pages that are pinned by other backends.  The need
-       for this rule is explained below.
+       A pin must be held on any index page whose items might still need to
+       be followed, and <function>ambulkdelete</function> must acquire a
+       cleanup lock on each index page, which will block if any other
+       backend holds a pin on that page.
+       For <function>amgettuple</function> scans, the index access method
+       manages this pin directly, holding a pin on the current index page
+       until the scan moves to a different page or ends.
+       For <function>amgetbatch</function> scans, the index AM holds the
+       batch's interlock (typically a buffer pin on the batch's index page) in
+       its per-batch opaque area.
+       Depending on a scan-level policy described below, the interlock is then
+       either dropped eagerly (inside the index AM's call to
+       <function>indexam_util_unlock_batch</function>) before the batch is
+       returned to the table AM, or retained until the table AM releases it
+       (via the index AM's <function>amunguardbatch</function> callback).
+       The need for this rule is explained below.
       </para>
      </listitem>
     </itemizedlist>
@@ -1138,39 +1524,85 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype);
    <command>VACUUM</command>.
    This creates no serious problems if that item
    number is still unused when the reader reaches it, since an empty
-   item slot will be ignored by <function>heap_fetch()</function>.  But what if a
+   item slot will simply be treated as not-visible.  But what if a
    third backend has already re-used the item slot for something else?
    When using an MVCC-compliant snapshot, there is no problem because
    the new occupant of the slot is certain to be too new to pass the
    snapshot test.  However, with a non-MVCC-compliant snapshot (such as
    <literal>SnapshotAny</literal>), it would be possible to accept and return
-   a row that does not in fact match the scan keys.  We could defend
-   against this scenario by requiring the scan keys to be rechecked
-   against the heap row in all cases, but that is too expensive.  Instead,
-   we use a pin on an index page as a proxy to indicate that the reader
-   might still be <quote>in flight</quote> from the index entry to the matching
-   heap entry.  Making <function>ambulkdelete</function> block on such a pin ensures
-   that <command>VACUUM</command> cannot delete the heap entry before the reader
-   is done with it.  This solution costs little in run time, and adds blocking
-   overhead only in the rare cases where there actually is a conflict.
+   a wholly unrelated row (one that does not necessarily satisfy the scan
+   keys).  We can optionally use a pin on an index page as a proxy to indicate
+   that the reader might still be <quote>in flight</quote> from the index
+   entry to the matching heap entry.  Making <function>ambulkdelete</function>
+   block on such a pin ensures that <command>VACUUM</command> cannot delete
+   the heap entry before the reader is done with it.  This solution costs
+   little in run time, and adds blocking overhead only in the rare cases where
+   there actually is a conflict.  For plain index scans that use an
+   MVCC-compliant snapshot, holding the pin is unnecessary because the scan
+   will always visit the heap page, where the snapshot itself will reject any
+   recycled TID's new occupant.  (Index-only scans are a special case, as
+   discussed below.)
   </para>
 
   <para>
-   This solution requires that index scans be <quote>synchronous</quote>: we have
-   to fetch each heap tuple immediately after scanning the corresponding index
-   entry.  This is expensive for a number of reasons.  An
-   <quote>asynchronous</quote> scan in which we collect many TIDs from the index,
-   and only visit the heap tuples sometime later, requires much less index
-   locking overhead and can allow a more efficient heap access pattern.
-   Per the above analysis, we must use the synchronous approach for
-   non-MVCC-compliant snapshots, but an asynchronous scan is workable
-   for a query using an MVCC snapshot.
+   This solution requires that <function>amgettuple</function> index scans be
+   <quote>synchronous</quote>: the table AM must fetch each heap tuple
+   immediately after scanning the corresponding index entry.  This is
+   expensive for a number of reasons.  The
+   <function>amgetbatch</function> interface, by contrast, was designed to
+   allow scans to be <quote>asynchronous</quote>.
   </para>
 
   <para>
-   In an <function>amgetbitmap</function> index scan, the access method does not
-   keep an index pin on any of the returned tuples.  Therefore
-   it is only safe to use such scans with MVCC-compliant snapshots.
+   Whether a batch's TID recycling interlock (typically an index page buffer
+   pin) is dropped immediately or deferred is controlled by a generic,
+   scan-level policy that is determined when the scan is opened &mdash; it is
+   not under the control of either the index AM or the table AM.  The scan's
+   <structfield>batchImmediateUnguard</structfield> flag encodes this policy.
+   It is set based on two criteria that are known to the core scan machinery:
+   whether the scan uses an MVCC-compliant snapshot, and whether it is an
+   index-only scan.  Specifically,
+   <structfield>batchImmediateUnguard</structfield> is true when the scan uses
+   an MVCC snapshot and is <emphasis>not</emphasis> an index-only scan.
+  </para>
+
+  <para>
+   When <structfield>batchImmediateUnguard</structfield> is true, the
+   interlock is dropped inside
+   <function>indexam_util_unlock_batch</function> (before the batch is even
+   returned to the table AM), because a plain index scan with an MVCC
+   snapshot will always visit the heap page, where the MVCC visibility check
+   is authoritative &mdash; even if <command>VACUUM</command> recycles a TID,
+   the new occupant cannot pass the snapshot test.
+  </para>
+
+  <para>
+   When <structfield>batchImmediateUnguard</structfield> is false, the
+   interlock is retained until the table AM explicitly releases it by calling
+   <function>tableam_util_unguard_batch</function> (which invokes the index
+   AM's <function>amunguardbatch</function> callback), because the scan cannot
+   rely on that heap page MVCC backstop.  For non-MVCC scans, there is no MVCC
+   snapshot to reject a recycled TID's new occupant at all.  For index-only
+   scans, even with an MVCC snapshot, the scan typically avoids visiting the
+   heap page altogether (using the visibility map instead), so the MVCC check
+   that would catch a recycled TID usually never runs.  In both cases the
+   interlock on the index page is what prevents <command>VACUUM</command> from
+   recycling TIDs while the scan is still in flight.
+  </para>
+
+  <para>
+   In the deferred (false) case, the table AM decides
+   <emphasis>when</emphasis> to call
+   <function>tableam_util_unguard_batch</function>, while the index AM's
+   <function>amunguardbatch</function> callback decides
+   <emphasis>what</emphasis> to release.
+  </para>
+
+  <para>
+   For the same reason, an <function>amgetbitmap</function> index scan &mdash;
+   which is inherently asynchronous, collecting all matching TIDs into a bitmap
+   before any heap access begins &mdash; requires an MVCC-compliant snapshot,
+   and has no need for the access method to hold index page pins.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e342585c7..b65038a55 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1173,12 +1173,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      </para>
 
      <para>
-      The access method must support <literal>amgettuple</literal> (see <xref
-      linkend="indexam"/>); at present this means <acronym>GIN</acronym>
-      cannot be used.  Although it's allowed, there is little point in using
-      B-tree or hash indexes with an exclusion constraint, because this
-      does nothing that an ordinary unique constraint doesn't do better.
-      So in practice the access method will always be <acronym>GiST</acronym> or
+      The access method must support either <function>amgetbatch</function>
+      or <function>amgettuple</function> (see <xref linkend="indexam"/>); at
+      present this means <acronym>GIN</acronym> cannot be used.  Although
+      it's allowed, there is little point in using B-tree or hash indexes
+      with an exclusion constraint, because this does nothing that an
+      ordinary unique constraint doesn't do better.  So in practice the
+      access method will always be <acronym>GiST</acronym> or
       <acronym>SP-GiST</acronym>.
      </para>
 
diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c
index 31f8d2b81..3f5be6082 100644
--- a/src/test/modules/dummy_index_am/dummy_index_am.c
+++ b/src/test/modules/dummy_index_am/dummy_index_am.c
@@ -303,6 +303,7 @@ dihandler(PG_FUNCTION_ARGS)
 		.amconsistentequality = false,
 		.amconsistentordering = false,
 		.amcanbackward = false,
+		.amcanmarkpos = false,
 		.amcanunique = false,
 		.amcanmulticol = false,
 		.amoptionalkey = false,
@@ -334,10 +335,12 @@ dihandler(PG_FUNCTION_ARGS)
 		.ambeginscan = dibeginscan,
 		.amrescan = direscan,
 		.amgettuple = NULL,
+		.amgetbatch = NULL,
+		.amunguardbatch = NULL,
+		.amkillitemsbatch = NULL,
 		.amgetbitmap = NULL,
 		.amendscan = diendscan,
-		.ammarkpos = NULL,
-		.amrestrpos = NULL,
+		.amposreset = NULL,
 		.amestimateparallelscan = NULL,
 		.aminitparallelscan = NULL,
 		.amparallelrescan = NULL,
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index ed946abed..89707e94c 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -9901,29 +9901,6 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1;
    1 |   2 |   1 |   2
 (2 rows)
 
--- Exercise array keys mark/restore B-Tree code
-explain (costs off) select * from j1
-inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
-where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 = any (array[1]);
-                     QUERY PLAN                     
-----------------------------------------------------
- Merge Join
-   Merge Cond: (j1.id1 = j2.id1)
-   Join Filter: (j2.id2 = j1.id2)
-   ->  Index Scan using j1_id1_idx on j1
-   ->  Index Scan using j2_id1_idx on j2
-         Index Cond: (id1 = ANY ('{1}'::integer[]))
-(6 rows)
-
-select * from j1
-inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
-where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 = any (array[1]);
- id1 | id2 | id1 | id2 
------+-----+-----+-----
-   1 |   1 |   1 |   1
-   1 |   2 |   1 |   2
-(2 rows)
-
 -- Exercise array keys "find extreme element" B-Tree code
 explain (costs off) select * from j1
 inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
@@ -9953,6 +9930,46 @@ reset enable_sort;
 drop table j1;
 drop table j2;
 drop table j3;
+-- Exercise btposreset when a merge join's mark/restore crosses a leaf-page/batch
+-- boundary with SAOP array keys set
+create table btposreset_outer (h int);
+insert into btposreset_outer values (0), (0), (99);
+create index on btposreset_outer (h);
+analyze btposreset_outer;
+set enable_hashjoin to 0;
+set enable_nestloop to 0;
+set enable_material to 0;
+set enable_sort to 0;
+-- tenk1_hundred is a low-cardinality deduplicated index, so hundred = 0 and
+-- hundred = 99 land on different leaf pages.  With tenk1 as the inner input,
+-- the scan marks in the hundred = 0 group and advances its array key to 99
+-- before the restore; btposreset must reset that array-key state, or the
+-- second outer "0" row loses all of its matches.  (The reset isn't required
+-- in the common case where we restore a mark within the same batch.)
+explain (costs off) select count(*) from btposreset_outer o
+inner join tenk1 i on o.h = i.hundred where i.hundred = any (array[0,99]);
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
+ Aggregate
+   ->  Merge Join
+         Merge Cond: (o.h = i.hundred)
+         ->  Index Only Scan using btposreset_outer_h_idx on btposreset_outer o
+         ->  Index Only Scan using tenk1_hundred on tenk1 i
+               Index Cond: (hundred = ANY ('{0,99}'::integer[]))
+(6 rows)
+
+select count(*) from btposreset_outer o
+inner join tenk1 i on o.h = i.hundred where i.hundred = any (array[0,99]);
+ count 
+-------
+   300
+(1 row)
+
+reset enable_hashjoin;
+reset enable_nestloop;
+reset enable_material;
+reset enable_sort;
+drop table btposreset_outer;
 -- check that semijoin inner is not seen as unique for a portion of the outerrel
 explain (verbose, costs off)
 select t1.unique1, t2.hundred
diff --git a/src/test/regress/expected/portals.out b/src/test/regress/expected/portals.out
index a66f27e36..8bcf469e4 100644
--- a/src/test/regress/expected/portals.out
+++ b/src/test/regress/expected/portals.out
@@ -700,6 +700,63 @@ SELECT name, statement, is_holdable, is_binary, is_scrollable FROM pg_cursors;
 ------+-----------+-------------+-----------+---------------
 (0 rows)
 
+--
+-- Cursor over a btree array-key (ScalarArrayOp) index scan that reverses
+-- direction exactly at a leaf-page boundary on the first page read
+--
+CREATE TEMP TABLE array_cursor_test (a int4, b int4, c int4, d int4);
+CREATE INDEX ON array_cursor_test (a, b, c, d) WITH (fillfactor = 30);
+INSERT INTO array_cursor_test
+  SELECT a, b, c, d
+  FROM generate_series(1, 2) a, generate_series(1, 2) b,
+       generate_series(1, 10) c, generate_series(1, 10) d
+  ORDER BY a, b, c, d;
+ANALYZE array_cursor_test;
+-- The index keeps (a=1, b=1, *) on a single leaf page.  Backing up at this
+-- leaf page boundary relies on a slow-path call to btposreset to avoid
+-- spuriously returning extra rows.
+BEGIN;
+SET LOCAL enable_seqscan = off;
+SET LOCAL enable_bitmapscan = off;
+SET LOCAL enable_indexonlyscan = off;
+DECLARE array_cursor CURSOR FOR
+  SELECT * FROM array_cursor_test
+  WHERE a IN (1, 2) AND b = 1 AND c = 9
+  ORDER BY a, b, c, d;
+-- read the 10 matching rows on the first leaf page, leaving the scan at the
+-- page boundary
+FETCH FORWARD 10 FROM array_cursor;
+ a | b | c | d  
+---+---+---+----
+ 1 | 1 | 9 |  1
+ 1 | 1 | 9 |  2
+ 1 | 1 | 9 |  3
+ 1 | 1 | 9 |  4
+ 1 | 1 | 9 |  5
+ 1 | 1 | 9 |  6
+ 1 | 1 | 9 |  7
+ 1 | 1 | 9 |  8
+ 1 | 1 | 9 |  9
+ 1 | 1 | 9 | 10
+(10 rows)
+
+-- reverse direction at the boundary; should return the prior 9 rows only
+FETCH BACKWARD 10 FROM array_cursor;
+ a | b | c | d 
+---+---+---+---
+ 1 | 1 | 9 | 9
+ 1 | 1 | 9 | 8
+ 1 | 1 | 9 | 7
+ 1 | 1 | 9 | 6
+ 1 | 1 | 9 | 5
+ 1 | 1 | 9 | 4
+ 1 | 1 | 9 | 3
+ 1 | 1 | 9 | 2
+ 1 | 1 | 9 | 1
+(9 rows)
+
+END;
+DROP TABLE array_cursor_test;
 --
 -- NO SCROLL disallows backward fetching
 --
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 78f7b4f54..e47e012cf 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -3776,15 +3776,6 @@ select * from j1
 inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
 where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1;
 
--- Exercise array keys mark/restore B-Tree code
-explain (costs off) select * from j1
-inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
-where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 = any (array[1]);
-
-select * from j1
-inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
-where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 = any (array[1]);
-
 -- Exercise array keys "find extreme element" B-Tree code
 explain (costs off) select * from j1
 inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
@@ -3802,6 +3793,35 @@ drop table j1;
 drop table j2;
 drop table j3;
 
+-- Exercise btposreset when a merge join's mark/restore crosses a leaf-page/batch
+-- boundary with SAOP array keys set
+create table btposreset_outer (h int);
+insert into btposreset_outer values (0), (0), (99);
+create index on btposreset_outer (h);
+analyze btposreset_outer;
+set enable_hashjoin to 0;
+set enable_nestloop to 0;
+set enable_material to 0;
+set enable_sort to 0;
+
+-- tenk1_hundred is a low-cardinality deduplicated index, so hundred = 0 and
+-- hundred = 99 land on different leaf pages.  With tenk1 as the inner input,
+-- the scan marks in the hundred = 0 group and advances its array key to 99
+-- before the restore; btposreset must reset that array-key state, or the
+-- second outer "0" row loses all of its matches.  (The reset isn't required
+-- in the common case where we restore a mark within the same batch.)
+explain (costs off) select count(*) from btposreset_outer o
+inner join tenk1 i on o.h = i.hundred where i.hundred = any (array[0,99]);
+
+select count(*) from btposreset_outer o
+inner join tenk1 i on o.h = i.hundred where i.hundred = any (array[0,99]);
+
+reset enable_hashjoin;
+reset enable_nestloop;
+reset enable_material;
+reset enable_sort;
+drop table btposreset_outer;
+
 -- check that semijoin inner is not seen as unique for a portion of the outerrel
 explain (verbose, costs off)
 select t1.unique1, t2.hundred
diff --git a/src/test/regress/sql/portals.sql b/src/test/regress/sql/portals.sql
index 196b862c7..ba42451bb 100644
--- a/src/test/regress/sql/portals.sql
+++ b/src/test/regress/sql/portals.sql
@@ -176,6 +176,40 @@ END;
 
 SELECT name, statement, is_holdable, is_binary, is_scrollable FROM pg_cursors;
 
+--
+-- Cursor over a btree array-key (ScalarArrayOp) index scan that reverses
+-- direction exactly at a leaf-page boundary on the first page read
+--
+
+CREATE TEMP TABLE array_cursor_test (a int4, b int4, c int4, d int4);
+CREATE INDEX ON array_cursor_test (a, b, c, d) WITH (fillfactor = 30);
+INSERT INTO array_cursor_test
+  SELECT a, b, c, d
+  FROM generate_series(1, 2) a, generate_series(1, 2) b,
+       generate_series(1, 10) c, generate_series(1, 10) d
+  ORDER BY a, b, c, d;
+ANALYZE array_cursor_test;
+
+-- The index keeps (a=1, b=1, *) on a single leaf page.  Backing up at this
+-- leaf page boundary relies on a slow-path call to btposreset to avoid
+-- spuriously returning extra rows.
+BEGIN;
+SET LOCAL enable_seqscan = off;
+SET LOCAL enable_bitmapscan = off;
+SET LOCAL enable_indexonlyscan = off;
+DECLARE array_cursor CURSOR FOR
+  SELECT * FROM array_cursor_test
+  WHERE a IN (1, 2) AND b = 1 AND c = 9
+  ORDER BY a, b, c, d;
+-- read the 10 matching rows on the first leaf page, leaving the scan at the
+-- page boundary
+FETCH FORWARD 10 FROM array_cursor;
+-- reverse direction at the boundary; should return the prior 9 rows only
+FETCH BACKWARD 10 FROM array_cursor;
+END;
+
+DROP TABLE array_cursor_test;
+
 --
 -- NO SCROLL disallows backward fetching
 --
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6801894d7..84659b17a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -209,6 +209,7 @@ BOOL
 BOOLEAN
 BOX
 BTArrayKeyInfo
+BTBatchData
 BTBuildState
 BTCallbackState
 BTCycleId
@@ -236,8 +237,6 @@ BTScanInsertData
 BTScanKeyPreproc
 BTScanOpaque
 BTScanOpaqueData
-BTScanPosData
-BTScanPosItem
 BTShared
 BTSortArrayContext
 BTSpool
@@ -266,6 +265,9 @@ BaseBackupCmd
 BaseBackupTargetHandle
 BaseBackupTargetType
 BatchMVCCState
+BatchMatchingItem
+BatchRingBuffer
+BatchRingItemPos
 BeginDirectModify_function
 BeginForeignInsert_function
 BeginForeignModify_function
@@ -1261,6 +1263,7 @@ HbaLine
 HeadlineJsonState
 HeadlineParsedText
 HeadlineWordEntry
+HeapBatchData
 HeapCheckContext
 HeapCheckReadStreamData
 HeapPageFreeze
@@ -1341,6 +1344,8 @@ IndexOrderByDistance
 IndexPath
 IndexRuntimeKeyInfo
 IndexScan
+IndexScanBatch
+IndexScanBatchData
 IndexScanDesc
 IndexScanDescData
 IndexScanHeapData
@@ -3585,20 +3590,22 @@ amcanreturn_function
 amcostestimate_function
 amendscan_function
 amestimateparallelscan_function
+amgetbatch_function
 amgetbitmap_function
 amgettreeheight_function
 amgettuple_function
 aminitparallelscan_function
 aminsert_function
 aminsertcleanup_function
-ammarkpos_function
+amkillitemsbatch_function
 amoptions_function
 amparallelrescan_function
+amposreset_function
 amproperty_function
 amrescan_function
-amrestrpos_function
 amtranslate_cmptype_function
 amtranslate_strategy_function
+amunguardbatch_function
 amvacuumcleanup_function
 amvalidate_function
 array_iter
-- 
2.53.0
