Re: index prefetching - Tomas Vondra

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Tomas Vondra <[email protected]>
To: Andres Freund <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Subject: Re: index prefetching
Date: Mon, 16 Oct 2023 17:34:44 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>

Hi,

Attached is a v6 of the patch, which rebases v5 (just some minor
bitrot), and also does a couple changes which I kept in separate patches
to make it obvious what changed.


0001-v5-20231016.patch
----------------------

Rebase to current master.


0002-comments-and-minor-cleanup-20231012.patch
----------------------------------------------

Various comment improvements (remove obsolete ones clarify a bunch of
other comments, etc.). I tried to explain the reasoning why some places
disable prefetching (e.g. in catalogs, replication, ...), explain how
the caching / LRU works etc.


0003-remove-prefetch_reset-20231016.patch
-----------------------------------------

I decided to remove the separate prefetch_reset parameter, so that all
the index_beginscan() methods only take a parameter specifying the
maximum prefetch target. The reset was added early when the prefetch
happened much lower in the AM code, at the index page level, and the
reset was when moving to the next index page. But now after the prefetch
moved to the executor, this doesn't make much sense - the resets happen
on rescans, and it seems right to just reset to 0 (just like for bitmap
heap scans).


0004-PoC-prefetch-for-IOS-20231016.patch
----------------------------------------

This is a PoC adding the prefetch to index-only scans too. At first that
may seem rather strange, considering eliminating the heap fetches is the
whole point of IOS. But if the pages are not marked as all-visible (say,
the most recent part of the table), we may still have to fetch them. In
which case it'd be easy to see cases that IOS is slower than a regular
index scan (with prefetching).

The code is quite rough. It adds a separate index_getnext_tid_prefetch()
function, adding prefetching on top of index_getnext_tid(). I'm not sure
it's the right pattern, but it's pretty much what index_getnext_slot()
does too, except that it also does the fetch + store to the slot.

Note: There's a second patch adding index-only filters, which requires
the regular index scans from index_getnext_slot() to _tid() too.

The prefetching then happens only after checking the visibility map (if
requested). This part definitely needs improvements - for example
there's no attempt to reuse the VM buffer, which I guess might be expensive.


index-prefetch.pdf
------------------

Attached is also a PDF with results of the same benchmark I did before,
comparing master vs. patched with various data patterns and scan types.
It's not 100% comparable to earlier results as I only ran it on a
laptop, and it's a bit noisier too. The overall behavior and conclusions
are however the same.

I was specifically interested in the IOS behavior, so I added two more
cases to test - indexonlyscan and indexonlyscan-clean. The first is the
worst-case scenario, with no pages marked as all-visible in VM (the test
simply deletes the VM), while indexonlyscan-clean is the good-case (no
heap fetches needed).

The results mostly match the expected behavior, particularly for the
uncached runs (when the data is expected to not be in memory):

* indexonlyscan (i.e. bad case) - About the same results as
  "indexscans", with the same speedups etc. Which is a good thing
  (i.e. IOS is not unexpectedly slower than regular indexscans).

* indexonlyscan-clean (i.e. good case) - Seems to have mostly the same
  performance as without the prefetching, except for the low-cardinality
  runs with many rows per key. I haven't checked what's causing this,
  but I'd bet it's the extra buffer lookups/management I mentioned.


I noticed there's another prefetching-related patch [1] from Thomas
Munro. I haven't looked at it yet, so hard to say how much it interferes
with this patch. But the idea looks interesting.


[1]
https://www.postgresql.org/message-id/flat/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail....

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

  [application/pdf] index-prefetch.pdf (109.4K, 2-index-prefetch.pdf)
  download

  [text/x-patch] 0001-v5-20231012.patch (40.8K, 3-0001-v5-20231012.patch)
  download | inline diff:
From 2faea9a5ef9b584853341489c9e30f11129638c0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Fri, 13 Oct 2023 22:33:51 +0200
Subject: [PATCH 1/4] v5

---
 src/backend/access/gist/gistget.c        |   1 -
 src/backend/access/heap/heapam_handler.c |   7 +-
 src/backend/access/index/genam.c         |   9 +-
 src/backend/access/index/indexam.c       | 411 ++++++++++++++++++++++-
 src/backend/commands/explain.c           |  18 +
 src/backend/executor/execIndexing.c      |   6 +-
 src/backend/executor/execReplication.c   |   9 +-
 src/backend/executor/instrument.c        |   4 +
 src/backend/executor/nodeIndexonlyscan.c |  16 +-
 src/backend/executor/nodeIndexscan.c     |  74 +++-
 src/backend/replication/walsender.c      |   2 +
 src/backend/utils/adt/selfuncs.c         |   2 +-
 src/include/access/genam.h               | 113 ++++++-
 src/include/access/relscan.h             |   9 +
 src/include/executor/instrument.h        |   2 +
 15 files changed, 650 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 31349174280..3acfa762e7f 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -677,7 +677,6 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
 					scan->xs_hitup = so->pageData[so->curPageData].recontup;
 
 				so->curPageData++;
-
 				return true;
 			}
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7c28dafb728..46c85751cf2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/spccache.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -747,6 +748,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			PROGRESS_CLUSTER_INDEX_RELID
 		};
 		int64		ci_val[2];
+		int			prefetch_target;
+
+		prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
 
 		/* Set phase and OIDOldIndex to columns */
 		ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -755,7 +759,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+									prefetch_target, prefetch_target);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4ca12006843..230667f888b 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
 
+	/* set in each AM when applicable */
+	scan->xs_prefetch = NULL;
+
 	return scan;
 }
 
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
+		/* no index prefetch for system catalogs */
 		sysscan->iscan = index_beginscan(heapRelation, irel,
-										 snapshot, nkeys, 0);
+										 snapshot, nkeys, 0, 0, 0);
 		index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 		sysscan->scan = NULL;
 	}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
 			elog(ERROR, "column is not in index");
 	}
 
+	/* no index prefetch for system catalogs */
 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
-									 snapshot, nkeys, 0);
+									 snapshot, nkeys, 0, 0, 0);
 	index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 	sysscan->scan = NULL;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b25b03f7abc..0b8f136f042 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -54,11 +54,13 @@
 #include "catalog/pg_amproc.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
+#include "common/hashfn.h"
 #include "nodes/makefuncs.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "utils/lsyscache.h"
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
@@ -106,7 +108,10 @@ do { \
 
 static IndexScanDesc index_beginscan_internal(Relation indexRelation,
 											  int nkeys, int norderbys, Snapshot snapshot,
-											  ParallelIndexScanDesc pscan, bool temp_snap);
+											  ParallelIndexScanDesc pscan, bool temp_snap,
+											  int prefetch_target, int prefetch_reset);
+
+static void index_prefetch(IndexScanDesc scan, ItemPointer tid);
 
 
 /* ----------------------------------------------------------------
@@ -200,18 +205,36 @@ index_insert(Relation indexRelation,
  * index_beginscan - start a scan of an index with amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * really needed. Secondly, we must prevent infinite loop when determining
+ * prefetch value for the tablespace - the get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in infinite loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply after effective_io_concurrency
+ * lookup moved to caller of index_beginscan.
  */
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
-				int nkeys, int norderbys)
+				int nkeys, int norderbys,
+				int prefetch_target, int prefetch_reset)
 {
 	IndexScanDesc scan;
 
 	Assert(snapshot != InvalidSnapshot);
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+									prefetch_target, prefetch_reset);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -241,7 +264,8 @@ index_beginscan_bitmap(Relation indexRelation,
 
 	Assert(snapshot != InvalidSnapshot);
 
-	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+									0, 0); /* no prefetch */
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -258,7 +282,8 @@ index_beginscan_bitmap(Relation indexRelation,
 static IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
 						 int nkeys, int norderbys, Snapshot snapshot,
-						 ParallelIndexScanDesc pscan, bool temp_snap)
+						 ParallelIndexScanDesc pscan, bool temp_snap,
+						 int prefetch_target, int prefetch_reset)
 {
 	IndexScanDesc scan;
 
@@ -276,12 +301,27 @@ index_beginscan_internal(Relation indexRelation,
 	/*
 	 * Tell the AM to open a scan.
 	 */
-	scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
-												norderbys);
+	scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys);
 	/* Initialize information for parallel scan. */
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
 
+	/* with prefetching enabled, initialize the necessary state */
+	if (prefetch_target > 0)
+	{
+		IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+		prefetcher->queueIndex = 0;
+		prefetcher->queueStart = 0;
+		prefetcher->queueEnd = 0;
+
+		prefetcher->prefetchTarget = 0;
+		prefetcher->prefetchMaxTarget = prefetch_target;
+		prefetcher->prefetchReset = prefetch_reset;
+
+		scan->xs_prefetch = prefetcher;
+	}
+
 	return scan;
 }
 
@@ -317,6 +357,20 @@ index_rescan(IndexScanDesc scan,
 
 	scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
 											orderbys, norderbys);
+
+	/* If we're prefetching for this index, maybe reset some of the state. */
+	if (scan->xs_prefetch != NULL)
+	{
+		IndexPrefetch prefetcher = scan->xs_prefetch;
+
+		prefetcher->queueStart = 0;
+		prefetcher->queueEnd = 0;
+		prefetcher->queueIndex = 0;
+		prefetcher->prefetchDone = false;
+	
+		prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+										 prefetcher->prefetchReset);
+	}
 }
 
 /* ----------------
@@ -345,6 +399,19 @@ index_endscan(IndexScanDesc scan)
 	if (scan->xs_temp_snap)
 		UnregisterSnapshot(scan->xs_snapshot);
 
+	/* If prefetching enabled, log prefetch stats. */
+	if (scan->xs_prefetch)
+	{
+		IndexPrefetch prefetch = scan->xs_prefetch;
+
+		elog(LOG, "index prefetch stats: requests " UINT64_FORMAT " prefetches " UINT64_FORMAT " (%f) skip cached " UINT64_FORMAT " sequential " UINT64_FORMAT,
+			 prefetch->countAll,
+			 prefetch->countPrefetch,
+			 prefetch->countPrefetch * 100.0 / prefetch->countAll,
+			 prefetch->countSkipCached,
+			 prefetch->countSkipSequential);
+	}
+
 	/* Release the scan data structure itself */
 	IndexScanEnd(scan);
 }
@@ -487,10 +554,13 @@ index_parallelrescan(IndexScanDesc scan)
  * index_beginscan_parallel - join parallel index scan
  *
  * Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
  */
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
-						 int norderbys, ParallelIndexScanDesc pscan)
+						 int norderbys, ParallelIndexScanDesc pscan,
+						 int prefetch_target, int prefetch_reset)
 {
 	Snapshot	snapshot;
 	IndexScanDesc scan;
@@ -499,7 +569,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
 	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
 	RegisterSnapshot(snapshot);
 	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
-									pscan, true);
+									pscan, true, prefetch_target, prefetch_reset);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -623,20 +693,74 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
 bool
 index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
 {
+	IndexPrefetch	prefetch = scan->xs_prefetch;
+
 	for (;;)
 	{
+		/* with prefetching enabled, accumulate enough TIDs into the prefetch */
+		if (PREFETCH_ACTIVE(prefetch))
+		{
+			/* 
+			 * incrementally ramp up prefetch distance
+			 *
+			 * XXX Intentionally done as first, so that with prefetching there's
+			 * always at least one item in the queue.
+			 */
+			prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+										prefetch->prefetchMaxTarget);
+
+			/*
+			 * get more TID while there is empty space in the queue (considering
+			 * current prefetch target
+			 */
+			while (!PREFETCH_FULL(prefetch))
+			{
+				ItemPointer tid;
+
+				/* Time to fetch the next TID from the index */
+				tid = index_getnext_tid(scan, direction);
+
+				/* If we're out of index entries, we're done */
+				if (tid == NULL)
+				{
+					prefetch->prefetchDone = true;
+					break;
+				}
+
+				Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+				prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)] = *tid;
+				prefetch->queueEnd++;
+
+				index_prefetch(scan, tid);
+			}
+		}
+
 		if (!scan->xs_heap_continue)
 		{
-			ItemPointer tid;
+			if (PREFETCH_ENABLED(prefetch))
+			{
+				/* prefetching enabled, but reached the end and queue empty */
+				if (PREFETCH_DONE(prefetch))
+					break;
+
+				scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)];
+				prefetch->queueIndex++;
+			}
+			else	/* not prefetching, just do the regular work  */
+			{
+				ItemPointer tid;
 
-			/* Time to fetch the next TID from the index */
-			tid = index_getnext_tid(scan, direction);
+				/* Time to fetch the next TID from the index */
+				tid = index_getnext_tid(scan, direction);
 
-			/* If we're out of index entries, we're done */
-			if (tid == NULL)
-				break;
+				/* If we're out of index entries, we're done */
+				if (tid == NULL)
+					break;
+
+				Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+			}
 
-			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
 		}
 
 		/*
@@ -988,3 +1112,258 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
 
 	return build_local_reloptions(&relopts, attoptions, validate);
 }
+
+/*
+ * Add the block to the tiny top-level queue (LRU), and check if the block
+ * is in a sequential pattern.
+ */
+static bool
+index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
+{
+	int		idx;
+
+	/* If the queue is empty, just store the block and we're done. */
+	if (prefetch->blockIndex == 0)
+	{
+		prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+		prefetch->blockIndex++;
+		return false;
+	}
+
+	/*
+	 * Otherwise, check if it's the same as the immediately preceding block (we
+	 * don't want to prefetch the same block over and over.)
+	 */
+	if (prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex - 1)] == block)
+		return true;
+
+	/* Not the same block, so add it to the queue. */
+	prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+	prefetch->blockIndex++;
+
+	/* check sequential patter a couple requests back */
+	for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
+	{
+		/* not enough requests to confirm a sequential pattern */
+		if (prefetch->blockIndex < i)
+			return false;
+
+		/*
+		 * index of the already requested buffer (-1 because we already
+		 * incremented the index when adding the block to the queue)
+		 */
+		idx = PREFETCH_BLOCK_INDEX(prefetch->blockIndex - i - 1);
+
+		/*  */
+		if (prefetch->blockItems[idx] != (block - i))
+			return false;
+	}
+
+	return true;
+}
+
+/*
+ * index_prefetch_add_cache
+ *		Add a block to the cache, return true if it was recently prefetched.
+ *
+ * When checking a block, we need to check if it was recently prefetched,
+ * where recently means within PREFETCH_CACHE_SIZE requests. This check
+ * needs to be very cheap, even with fairly large caches (hundreds of
+ * entries). The cache does not need to be perfect, we can accept false
+ * positives/negatives, as long as the rate is reasonably low. We also
+ * need to expire entries, so that only "recent" requests are remembered.
+ *
+ * A queue would allow expiring the requests, but checking if a block was
+ * prefetched would be expensive (linear search for longer queues). Another
+ * option would be a hash table, but that has issues with expiring entries
+ * cheaply (which usually degrades the hash table).
+ *
+ * So we use a cache that is organized as multiple small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table), and each LRU is tiny (e.g. 8 entries). The LRU only keeps
+ * the most recent requests (for that particular LRU).
+ *
+ * This allows quick searches and expiration, with false negatives (when
+ * a particular LRU has too many collisions).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * prefetch request in total.
+ *
+ * The recency is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint64, so it should
+ * not wrap (125 zebibytes, would take ~4 million years at 1GB/s).
+ *
+ * To check if a block was prefetched recently, we calculate hash(block),
+ * and then linearly search if the tiny LRU has entry for the same block
+ * and request less than PREFETCH_CACHE_SIZE ago.
+ *
+ * At the same time, we either update the entry (for the same block) if
+ * found, or replace the oldest/empty entry.
+ *
+ * If the block was not recently prefetched (i.e. we want to prefetch it),
+ * we increment the counter.
+ */
+static bool
+index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
+{
+	PrefetchCacheEntry *entry;
+
+	/* calculate which LRU to use */
+	int			lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+
+	/* entry to (maybe) use for this block request */
+	uint64		oldestRequest = PG_UINT64_MAX;
+	int			oldestIndex = -1;
+
+	/*
+	 * First add the block to the (tiny) top-level LRU cache and see if it's
+	 * part of a sequential pattern. In this case we just ignore the block
+	 * and don't prefetch it - we expect read-ahead to do a better job.
+	 *
+	 * XXX Maybe we should still add the block to the later cache, in case
+	 * we happen to access it later? That might help if we first scan a lot
+	 * of the table sequentially, and then randomly. Not sure that's very
+	 * likely with index access, though.
+	 */
+	if (index_prefetch_is_sequential(prefetch, block))
+	{
+		prefetch->countSkipSequential++;
+		return true;
+	}
+
+	/* see if we already have prefetched this block (linear search of LRU) */
+	for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+	{
+		entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+		/* Is this the oldest prefetch request in this LRU? */
+		if (entry->request < oldestRequest)
+		{
+			oldestRequest = entry->request;
+			oldestIndex = i;
+		}
+
+		/* Request numbers are positive, so 0 means "unused". */
+		if (entry->request == 0)
+			continue;
+
+		/* Is this entry for the same block as the current request? */
+		if (entry->block == block)
+		{
+			bool	prefetched;
+
+			/*
+			 * Is the old request sufficiently recent? If yes, we treat the
+			 * block as already prefetched.
+			 *
+			 * XXX We do add the cache size to the request in order not to
+			 * have issues with uint64 underflows.
+			 */
+			prefetched = (entry->request + PREFETCH_CACHE_SIZE >= prefetch->prefetchReqNumber);
+
+			/* Update the request number. */
+			entry->request = ++prefetch->prefetchReqNumber;
+
+			prefetch->countSkipCached += (prefetched) ? 1 : 0;
+
+			return prefetched;
+		}
+	}
+
+	/*
+	 * We didn't find the block in the LRU, so store it either in an empty
+	 * entry, or in the "oldest" prefetch request in this LRU.
+	 */
+	Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+	entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + oldestIndex];
+
+	entry->block = block;
+	entry->request = ++prefetch->prefetchReqNumber;
+
+	/* not in the prefetch cache */
+	return false;
+}
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items from). But index tuples are typically very small, so there
+ * should be quite a bit of stuff to prefetch (especially with deduplicated
+ * indexes, etc.). Does not seem worth reworking the index access to allow
+ * more aggressive prefetching, it's best effort.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of index pages visited
+ * and index tuples returned, to calculate avg tuples / page, and then
+ * use that to limit prefetching after switching to a new page (instead
+ * of just using prefetchMaxTarget, which can get much larger).
+ *
+ * XXX Obviously, another option is to use the planner estimates - we know
+ * how many rows we're expected to fetch (on average, assuming the estimates
+ * are reasonably accurate), so why not to use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+static void
+index_prefetch(IndexScanDesc scan, ItemPointer tid)
+{
+	IndexPrefetch	prefetch = scan->xs_prefetch;
+	BlockNumber	block;
+
+	/*
+	 * No heap relation means bitmap index scan, which does prefetching at
+	 * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+	 * without the heap)
+	 *
+	 * XXX But in this case we should have prefetchMaxTarget=0, because in
+	 * index_bebinscan_bitmap() we disable prefetching. So maybe we should
+	 * just check that.
+	 */
+	if (!prefetch)
+		return;
+
+	/* was it initialized correctly? */
+	// Assert(prefetch->prefetchIndex != -1);
+
+	/*
+	 * If we got here, prefetching is enabled and it's a node that supports
+	 * prefetching (i.e. it can't be a bitmap index scan).
+	 */
+	Assert(scan->heapRelation);
+
+	prefetch->countAll++;
+
+	block = ItemPointerGetBlockNumber(tid);
+
+	/*
+	 * Do not prefetch the same block over and over again,
+	 *
+	 * This happens e.g. for clustered or naturally correlated indexes
+	 * (fkey to a sequence ID). It's not expensive (the block is in page
+	 * cache already, so no I/O), but it's not free either.
+	 */
+	if (!index_prefetch_add_cache(prefetch, block))
+	{
+		prefetch->countPrefetch++;
+
+		PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+		pgBufferUsage.blks_prefetches++;
+	}
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 13217807eed..8837da91857 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3566,6 +3566,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 								  !INSTR_TIME_IS_ZERO(usage->blk_write_time));
 		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
 									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+		bool		has_prefetches = (usage->blks_prefetches > 0);
 		bool		show_planning = (planning && (has_shared ||
 												  has_local || has_temp || has_timing ||
 												  has_temp_timing));
@@ -3663,6 +3664,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			appendStringInfoChar(es->str, '\n');
 		}
 
+		/* As above, show only positive counter values. */
+		if (has_prefetches)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Prefetches:");
+
+			if (usage->blks_prefetches > 0)
+				appendStringInfo(es->str, " blocks=%lld",
+								 (long long) usage->blks_prefetches);
+
+			if (usage->blks_prefetch_rounds > 0)
+				appendStringInfo(es->str, " rounds=%lld",
+								 (long long) usage->blks_prefetch_rounds);
+
+			appendStringInfoChar(es->str, '\n');
+		}
+
 		if (show_planning)
 			es->indent--;
 	}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 3c6730632de..09418f715fa 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 	/*
 	 * May have to restart scan from this point if a potential conflict is
 	 * found.
+	 *
+	 * XXX Should this do index prefetch? Probably not worth it for unique
+	 * constraints, I guess? Otherwise we should calculate prefetch_target
+	 * just like in nodeIndexscan etc.
 	 */
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
 	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 81f27042bc4..f3e1a8d22a4 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -204,8 +204,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	/* Build scan key. */
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
-	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+	/* Start an index scan.
+	 *
+	 * XXX Should this do index prefetching? We're looking for a single tuple,
+	 * probably using a PK / UNIQUE index, so does not seem worth it. If we
+	 * reconsider this, calclate prefetch_target like in nodeIndexscan.
+	 */
+	scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
 
 retry:
 	found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d2..434be59fca0 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->local_blks_written += add->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+	dst->blks_prefetches += add->blks_prefetches;
 	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
 	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
 	INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+	dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
 	INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
 						  add->blk_read_time, sub->blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index f1db35665c8..75b44db33c6 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * We reach here if the index only scan is not parallel, or if we're
 		 * serially executing an index only scan that was planned to be
 		 * parallel.
+		 *
+		 * XXX Maybe we should enable prefetching, but prefetch only pages that
+		 * are not all-visible (but checking that from the index code seems like
+		 * a violation of layering etc).
+		 *
+		 * XXX This might lead to IOS being slower than plain index scan, if the
+		 * table has a lot of pages that need recheck.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->ioss_RelationDesc,
 								   estate->es_snapshot,
 								   node->ioss_NumScanKeys,
-								   node->ioss_NumOrderByKeys);
+								   node->ioss_NumOrderByKeys,
+								   0, 0);	/* no index prefetch for IOS */
 
 		node->ioss_ScanDesc = scandesc;
 
@@ -658,7 +666,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 								 node->ioss_RelationDesc,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 0, 0);	/* no index prefetch for IOS */
 	node->ioss_ScanDesc->xs_want_itup = true;
 	node->ioss_VMBuffer = InvalidBuffer;
 
@@ -703,7 +712,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 								 node->ioss_RelationDesc,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 0, 0);	/* no index prefetch for IOS */
 	node->ioss_ScanDesc->xs_want_itup = true;
 
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 14b9c00217a..185ff0f1449 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/spccache.h"
 
 /*
  * When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
+	Relation heapRel = node->ss.ss_currentRelation;
 
 	/*
 	 * extract necessary information from index scan node
@@ -103,6 +105,22 @@ IndexNext(IndexScanState *node)
 
 	if (scandesc == NULL)
 	{
+		int	prefetch_target;
+		int	prefetch_reset;
+
+		/*
+		 * Determine number of heap pages to prefetch for this index. This is
+		 * essentially just effective_io_concurrency for the table (or the
+		 * tablespace it's in).
+		 *
+		 * XXX Should this also look at plan.plan_rows and maybe cap the target
+		 * to that? Pointless to prefetch more than we expect to use. Or maybe
+		 * just reset to that value during prefetching, after reading the next
+		 * index page (or rather after rescan)?
+		 */
+		prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+		prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
@@ -111,7 +129,9 @@ IndexNext(IndexScanState *node)
 								   node->iss_RelationDesc,
 								   estate->es_snapshot,
 								   node->iss_NumScanKeys,
-								   node->iss_NumOrderByKeys);
+								   node->iss_NumOrderByKeys,
+								   prefetch_target,
+								   prefetch_reset);
 
 		node->iss_ScanDesc = scandesc;
 
@@ -198,6 +218,23 @@ IndexNextWithReorder(IndexScanState *node)
 
 	if (scandesc == NULL)
 	{
+		Relation heapRel = node->ss.ss_currentRelation;
+		int	prefetch_target;
+		int	prefetch_reset;
+
+		/*
+		 * Determine number of heap pages to prefetch for this index. This is
+		 * essentially just effective_io_concurrency for the table (or the
+		 * tablespace it's in).
+		 *
+		 * XXX Should this also look at plan.plan_rows and maybe cap the target
+		 * to that? Pointless to prefetch more than we expect to use. Or maybe
+		 * just reset to that value during prefetching, after reading the next
+		 * index page (or rather after rescan)?
+		 */
+		prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+		prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
@@ -206,7 +243,9 @@ IndexNextWithReorder(IndexScanState *node)
 								   node->iss_RelationDesc,
 								   estate->es_snapshot,
 								   node->iss_NumScanKeys,
-								   node->iss_NumOrderByKeys);
+								   node->iss_NumOrderByKeys,
+								   prefetch_target,
+								   prefetch_reset);
 
 		node->iss_ScanDesc = scandesc;
 
@@ -1662,6 +1701,21 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 {
 	EState	   *estate = node->ss.ps.state;
 	ParallelIndexScanDesc piscan;
+	Relation	heapRel;
+	int			prefetch_target;
+	int			prefetch_reset;
+
+	/*
+	 * Determine number of heap pages to prefetch for this index. This is
+	 * essentially just effective_io_concurrency for the table (or the
+	 * tablespace it's in).
+	 *
+	 * XXX Maybe reduce the value with parallel workers?
+	 */
+	heapRel = node->ss.ss_currentRelation;
+
+	prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+	prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
 
 	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
 	index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1674,7 +1728,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 								 node->iss_RelationDesc,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 prefetch_target,
+								 prefetch_reset);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1710,6 +1766,14 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 							  ParallelWorkerContext *pwcxt)
 {
 	ParallelIndexScanDesc piscan;
+	Relation	heapRel;
+	int			prefetch_target;
+	int			prefetch_reset;
+
+	heapRel = node->ss.ss_currentRelation;
+
+	prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+	prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
 
 	piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->iss_ScanDesc =
@@ -1717,7 +1781,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 								 node->iss_RelationDesc,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 prefetch_target,
+								 prefetch_reset);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567eb..47093cc9cf1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1131,6 +1131,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 			need_full_snapshot = true;
 		}
 
+		elog(LOG, "slot = %s  need_full_snapshot = %d", cmd->slotname, need_full_snapshot);
+
 		ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
 										InvalidXLogRecPtr,
 										XL_ROUTINE(.page_read = logical_read_xlog_page,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076ea..0b02b6265d0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6218,7 +6218,7 @@ get_actual_variable_endpoint(Relation heapRel,
 
 	index_scan = index_beginscan(heapRel, indexRel,
 								 &SnapshotNonVacuumable,
-								 1, 0);
+								 1, 0, 0, 0);	/* XXX maybe do prefetch? */
 	/* Set it up for index-only scan */
 	index_scan->xs_want_itup = true;
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 4e626c615e7..b814af4b2f6 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -17,6 +17,7 @@
 #include "access/sdir.h"
 #include "access/skey.h"
 #include "nodes/tidbitmap.h"
+#include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -152,7 +153,9 @@ extern bool index_insert(Relation indexRelation,
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
 									 Snapshot snapshot,
-									 int nkeys, int norderbys);
+									 int nkeys, int norderbys,
+									 int prefetch_target,
+									 int prefetch_reset);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 											Snapshot snapshot,
 											int nkeys);
@@ -169,7 +172,9 @@ extern void index_parallelscan_initialize(Relation heapRelation,
 extern void index_parallelrescan(IndexScanDesc scan);
 extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel, int nkeys, int norderbys,
-											  ParallelIndexScanDesc pscan);
+											  ParallelIndexScanDesc pscan,
+											  int prefetch_target,
+											  int prefetch_reset);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
 struct TupleTableSlot;
@@ -230,4 +235,108 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
 										  ScanDirection direction);
 extern void systable_endscan_ordered(SysScanDesc sysscan);
 
+/*
+ * XXX not sure it's the right place to define these callbacks etc.
+ */
+typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
+											  ScanDirection direction,
+											  int *start, int *end,
+											  bool *reset);
+
+typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
+													 ScanDirection direction,
+													 int index);
+
+/*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches. Doesn't need to be perfectly accurate, but we
+ * aim to make false positives/negatives reasonably low.
+ */
+typedef struct PrefetchCacheEntry {
+	BlockNumber		block;
+	uint64			request;
+} PrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - shouldn't be too
+ * small or too large. 1024 seems about right, it covers ~8MB of data.
+ * It's somewhat arbitrary, there's no particular formula saying it
+ * should not be higher/lower.
+ *
+ * The cache is structured as an array of small LRU caches, so the total
+ * size needs to be a multiple of LRU size. The LRU should be tiny to
+ * keep linear search cheap enough.
+ *
+ * XXX Maybe we could consider effective_cache_size or something?
+ */
+#define		PREFETCH_LRU_SIZE		8
+#define		PREFETCH_LRU_COUNT		128
+#define		PREFETCH_CACHE_SIZE		(PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
+/*
+ * Used to detect sequential patterns (and disable prefetching).
+ */
+#define		PREFETCH_QUEUE_HISTORY			8
+#define		PREFETCH_SEQ_PATTERN_BLOCKS		4
+
+
+typedef struct IndexPrefetchData
+{
+	/*
+	 * XXX We need to disable this in some cases (e.g. when using index-only
+	 * scans, we don't want to prefetch pages). Or maybe we should prefetch
+	 * only pages that are not all-visible, that'd be even better.
+	 */
+	int			prefetchTarget;	/* how far we should be prefetching */
+	int			prefetchMaxTarget;	/* maximum prefetching distance */
+	int			prefetchReset;	/* reset to this distance on rescan */
+	bool		prefetchDone;	/* did we get all TIDs from the index? */
+
+	/* runtime statistics */
+	uint64		countAll;		/* all prefetch requests */
+	uint64		countPrefetch;	/* actual prefetches */
+	uint64		countSkipSequential;
+	uint64		countSkipCached;
+
+	/*
+	 * Queue of TIDs to prefetch.
+	 *
+	 * XXX Sizing for MAX_IO_CONCURRENCY may be overkill, but it seems simpler
+	 * than dynamically adjusting for custom values.
+	 */
+	ItemPointerData	queueItems[MAX_IO_CONCURRENCY];
+	uint64			queueIndex;	/* next TID to prefetch */
+	uint64			queueStart;	/* first valid TID in queue */
+	uint64			queueEnd;	/* first invalid (empty) TID in queue */
+
+	/*
+	 * A couple of last prefetched blocks, used to check for certain access
+	 * pattern and skip prefetching - e.g. for sequential access).
+	 *
+	 * XXX Separate from the main queue, because we only want to compare the
+	 * block numbers, not the whole TID. In sequential access it's likely we
+	 * read many items from each page, and we don't want to check many items
+	 * (as that is much more expensive).
+	 */
+	BlockNumber		blockItems[PREFETCH_QUEUE_HISTORY];
+	uint64			blockIndex;	/* index in the block (points to the first
+								 * empty entry)*/
+
+	/*
+	 * Cache of recently prefetched blocks, organized as a hash table of
+	 * small LRU caches.
+	 */
+	uint64				prefetchReqNumber;
+	PrefetchCacheEntry	prefetchCache[PREFETCH_CACHE_SIZE];
+
+} IndexPrefetchData;
+
+#define PREFETCH_QUEUE_INDEX(a)	((a) % (MAX_IO_CONCURRENCY))
+#define PREFETCH_QUEUE_EMPTY(p)	((p)->queueEnd == (p)->queueIndex)
+#define PREFETCH_ENABLED(p)		((p) && ((p)->prefetchMaxTarget > 0))
+#define PREFETCH_FULL(p)		((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
+#define PREFETCH_DONE(p)		((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+#define PREFETCH_ACTIVE(p)		(PREFETCH_ENABLED(p) && !(p)->prefetchDone)
+#define PREFETCH_BLOCK_INDEX(v)	((v) % PREFETCH_QUEUE_HISTORY)
+
 #endif							/* GENAM_H */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac04..c119fe597d8 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -106,6 +106,12 @@ typedef struct IndexFetchTableData
 	Relation	rel;
 } IndexFetchTableData;
 
+/*
+ * Forward declaration, defined in genam.h.
+ */
+typedef struct IndexPrefetchData IndexPrefetchData;
+typedef struct IndexPrefetchData *IndexPrefetch;
+
 /*
  * We use the same IndexScanDescData structure for both amgettuple-based
  * and amgetbitmap-based index scans.  Some fields are only relevant in
@@ -162,6 +168,9 @@ typedef struct IndexScanDescData
 	bool	   *xs_orderbynulls;
 	bool		xs_recheckorderby;
 
+	/* prefetching state (or NULL if disabled) */
+	IndexPrefetchData *xs_prefetch;
+
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
 }			IndexScanDescData;
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183bd..97dd3c2c421 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
+	int64		blks_prefetch_rounds;	/* # of prefetch rounds */
+	int64		blks_prefetches;	/* # of buffers prefetched */
 	instr_time	blk_read_time;	/* time spent reading blocks */
 	instr_time	blk_write_time; /* time spent writing blocks */
 	instr_time	temp_blk_read_time; /* time spent reading temp blocks */
-- 
2.41.0



  [text/x-patch] 0002-comments-and-minor-cleanup-20231012.patch (31.2K, 4-0002-comments-and-minor-cleanup-20231012.patch)
  download | inline diff:
From 61b7123c6b3dbd716c6882716ce17239d38e0604 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Fri, 13 Oct 2023 22:34:40 +0200
Subject: [PATCH 2/4] comments and minor cleanup

---
 src/backend/access/gist/gistget.c        |   1 +
 src/backend/access/heap/heapam_handler.c |   5 +
 src/backend/access/index/genam.c         |  28 +-
 src/backend/access/index/indexam.c       | 328 ++++++++++++++++-------
 src/backend/executor/nodeIndexscan.c     |  17 ++
 src/backend/replication/walsender.c      |   2 -
 src/include/access/genam.h               |  12 -
 7 files changed, 273 insertions(+), 120 deletions(-)

diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 3acfa762e7f..31349174280 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -677,6 +677,7 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
 					scan->xs_hitup = so->pageData[so->curPageData].recontup;
 
 				so->curPageData++;
+
 				return true;
 			}
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 46c85751cf2..ca91bc5e878 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -750,6 +750,11 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		int64		ci_val[2];
 		int			prefetch_target;
 
+		/*
+		 * Get the prefetch target for the old tablespace (which is what we'll
+		 * read using the index). We'll use it as a reset value too, although
+		 * there should be no rescans for CLUSTER etc.
+		 */
 		prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
 
 		/* Set phase and OIDOldIndex to columns */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 230667f888b..6e3aa6bb1fd 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,7 +126,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
 
-	/* set in each AM when applicable */
+	/* Information used for asynchronous prefetching during index scans. */
 	scan->xs_prefetch = NULL;
 
 	return scan;
@@ -443,7 +443,18 @@ systable_beginscan(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
-		/* no index prefetch for system catalogs */
+		/*
+		 * We don't do any prefetching on system catalogs, for two main reasons.
+		 *
+		 * Firstly, we usually do PK lookups, which makes prefetching pointles,
+		 * or we often don't know how many rows to expect (and the numbers tend
+		 * to be fairly low). So it's not clear it'd help. Furthermore, places
+		 * that are sensitive tend to use syscache anyway.
+		 *
+		 * Secondly, we can't call get_tablespace_io_concurrency() because that
+		 * does a sysscan internally, so it might lead to a cycle. We could use
+		 * use effective_io_concurrency, but it doesn't seem worth it.
+		 */
 		sysscan->iscan = index_beginscan(heapRelation, irel,
 										 snapshot, nkeys, 0, 0, 0);
 		index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
@@ -700,7 +711,18 @@ systable_beginscan_ordered(Relation heapRelation,
 			elog(ERROR, "column is not in index");
 	}
 
-	/* no index prefetch for system catalogs */
+	/*
+	 * We don't do any prefetching on system catalogs, for two main reasons.
+	 *
+	 * Firstly, we usually do PK lookups, which makes prefetching pointles,
+	 * or we often don't know how many rows to expect (and the numbers tend
+	 * to be fairly low). So it's not clear it'd help. Furthermore, places
+	 * that are sensitive tend to use syscache anyway.
+	 *
+	 * Secondly, we can't call get_tablespace_io_concurrency() because that
+	 * does a sysscan internally, so it might lead to a cycle. We could use
+	 * use effective_io_concurrency, but it doesn't seem worth it.
+	 */
 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
 									 snapshot, nkeys, 0, 0, 0);
 	index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 0b8f136f042..e45a3a89387 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -206,21 +206,30 @@ index_insert(Relation indexRelation,
  *
  * Caller must be holding suitable locks on the heap and the index.
  *
- * prefetch_target determines if prefetching is requested for this index scan.
- * We need to be able to disable this for two reasons. Firstly, we don't want
- * to do prefetching for IOS (where we hope most of the heap pages won't be
- * really needed. Secondly, we must prevent infinite loop when determining
- * prefetch value for the tablespace - the get_tablespace_io_concurrency()
- * does an index scan internally, which would result in infinite loop. So we
- * simply disable prefetching in systable_beginscan().
- *
- * XXX Maybe we should do prefetching even for catalogs, but then disable it
- * when accessing TableSpaceRelationId. We still need the ability to disable
- * this and catalogs are expected to be tiny, so prefetching is unlikely to
- * make a difference.
- *
- * XXX The second reason doesn't really apply after effective_io_concurrency
- * lookup moved to caller of index_beginscan.
+ * prefetch_target determines if prefetching is requested for this index scan,
+ * and how far ahead we want to prefetch
+ *
+ * prefetch_reset specifies the prefetch distance to start with on rescans (so
+ * that we don't ramp-up to prefetch_target and use that forever)
+ *
+ * Setting prefetch_target to 0 disables prefetching for the index scan. We do
+ * this for two reasons - for scans on system catalogs, and/or for cases where
+ * prefetching is expected to be pointless (like IOS).
+ *
+ * For system catalogs, we usually either scan by a PK value, or we we expect
+ * only few rows (or rather we don't know how many rows to expect). Also, we
+ * need to prevent infinite in the get_tablespace_io_concurrency() call - it
+ * does an index scan internally. So we simply disable prefetching for system
+ * catalogs. We could deal with this by picking a conservative static target
+ * (e.g. effective_io_concurrency, capped to something), but places that are
+ * performance sensitive likely use syscache anyway, and catalogs tend to be
+ * very small and hot. So we don't bother.
+ *
+ * For IOS, we expect to not need most heap pages (that's the whole point of
+ * IOS, actually), and prefetching them might lead to a lot of wasted I/O.
+ *
+ * XXX Not sure the infinite loop can still happen, now that the target lookup
+ * moved to callers of index_beginscan.
  */
 IndexScanDesc
 index_beginscan(Relation heapRelation,
@@ -264,8 +273,12 @@ index_beginscan_bitmap(Relation indexRelation,
 
 	Assert(snapshot != InvalidSnapshot);
 
+	/*
+	 * No prefetch for bitmap index scans. In this case prefetching happens at
+	 * the heapscan level.
+	 */
 	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
-									0, 0); /* no prefetch */
+									0, 0);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -301,12 +314,13 @@ index_beginscan_internal(Relation indexRelation,
 	/*
 	 * Tell the AM to open a scan.
 	 */
-	scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys);
+	scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
+												norderbys);
 	/* Initialize information for parallel scan. */
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
 
-	/* with prefetching enabled, initialize the necessary state */
+	/* With prefetching requested, initialize the prefetcher state. */
 	if (prefetch_target > 0)
 	{
 		IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
@@ -367,7 +381,7 @@ index_rescan(IndexScanDesc scan,
 		prefetcher->queueEnd = 0;
 		prefetcher->queueIndex = 0;
 		prefetcher->prefetchDone = false;
-	
+
 		prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
 										 prefetcher->prefetchReset);
 	}
@@ -399,7 +413,11 @@ index_endscan(IndexScanDesc scan)
 	if (scan->xs_temp_snap)
 		UnregisterSnapshot(scan->xs_snapshot);
 
-	/* If prefetching enabled, log prefetch stats. */
+	/*
+	 * If prefetching was enabled for this scan, log prefetch stats.
+	 *
+	 * FIXME This should really go to EXPLAIN ANALYZE instead.
+	 */
 	if (scan->xs_prefetch)
 	{
 		IndexPrefetch prefetch = scan->xs_prefetch;
@@ -554,8 +572,6 @@ index_parallelrescan(IndexScanDesc scan)
  * index_beginscan_parallel - join parallel index scan
  *
  * Caller must be holding suitable locks on the heap and the index.
- *
- * XXX See index_beginscan() for more comments on prefetch_target.
  */
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
@@ -693,25 +709,31 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
 bool
 index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
 {
-	IndexPrefetch	prefetch = scan->xs_prefetch;
+	IndexPrefetch prefetch = scan->xs_prefetch; /* for convenience */
 
 	for (;;)
 	{
-		/* with prefetching enabled, accumulate enough TIDs into the prefetch */
+		/*
+		 * If the prefetching is still active (i.e. enabled and we still
+		 * haven't finished reading TIDs from the scan), read enough TIDs into
+		 * the queue until we hit the current target.
+		 */
 		if (PREFETCH_ACTIVE(prefetch))
 		{
-			/* 
-			 * incrementally ramp up prefetch distance
+			/*
+			 * Ramp up the prefetch distance incrementally.
 			 *
-			 * XXX Intentionally done as first, so that with prefetching there's
-			 * always at least one item in the queue.
+			 * Intentionally done as first, before reading the TIDs into the
+			 * queue, so that there's always at least one item. Otherwise we
+			 * might get into a situation where we start with target=0 and no
+			 * TIDs loaded.
 			 */
 			prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
-										prefetch->prefetchMaxTarget);
+										   prefetch->prefetchMaxTarget);
 
 			/*
-			 * get more TID while there is empty space in the queue (considering
-			 * current prefetch target
+			 * Now read TIDs from the index until the queue is full (with
+			 * respect to the current prefetch target).
 			 */
 			while (!PREFETCH_FULL(prefetch))
 			{
@@ -720,7 +742,10 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
 				/* Time to fetch the next TID from the index */
 				tid = index_getnext_tid(scan, direction);
 
-				/* If we're out of index entries, we're done */
+				/*
+				 * If we're out of index entries, we're done (and we mark the
+				 * the prefetcher as inactive).
+				 */
 				if (tid == NULL)
 				{
 					prefetch->prefetchDone = true;
@@ -732,22 +757,34 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
 				prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)] = *tid;
 				prefetch->queueEnd++;
 
+				/*
+				 * Issue the actuall prefetch requests for the new TID.
+				 *
+				 * FIXME For IOS, this should prefetch only pages that are not
+				 * fully visible.
+				 */
 				index_prefetch(scan, tid);
 			}
 		}
 
 		if (!scan->xs_heap_continue)
 		{
+			/*
+			 * With prefetching enabled (even if we already finished reading
+			 * all TIDs from the index scan), we need to return a TID from the
+			 * queue. Otherwise, we just get the next TID from the scan
+			 * directly.
+			 */
 			if (PREFETCH_ENABLED(prefetch))
 			{
-				/* prefetching enabled, but reached the end and queue empty */
+				/* Did we reach the end of the scan and the queue is empty? */
 				if (PREFETCH_DONE(prefetch))
 					break;
 
 				scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)];
 				prefetch->queueIndex++;
 			}
-			else	/* not prefetching, just do the regular work  */
+			else				/* not prefetching, just do the regular work  */
 			{
 				ItemPointer tid;
 
@@ -1114,15 +1151,49 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
 }
 
 /*
- * Add the block to the tiny top-level queue (LRU), and check if the block
- * is in a sequential pattern.
+ * index_prefetch_is_sequential
+ *		Track the block number and check if the I/O pattern is sequential,
+ *		or if the same block was just prefetched.
+ *
+ * Prefetching is cheap, but for some access patterns the benefits are small
+ * compared to the extra overhead. In particular, for sequential access the
+ * read-ahead performed by the OS is very effective/efficient. Doing more
+ * prefetching is just increasing the costs.
+ *
+ * This tries to identify simple sequential patterns, so that we can skip
+ * the prefetching request. This is implemented by having a small queue
+ * of block numbers, and checking it before prefetching another block.
+ *
+ * We look at the preceding PREFETCH_SEQ_PATTERN_BLOCKS blocks, and see if
+ * they are sequential. We also check if the block is the same as the last
+ * request (which is not sequential).
+ *
+ * Note that the main prefetch queue is not really useful for this, as it
+ * stores TIDs while we care about block numbers. Consider a sorted table,
+ * with a perfectly sequential pattern when accessed through an index. Each
+ * heap page may have dozens of TIDs, but we need to check block numbers.
+ * We could keep enough TIDs to cover enough blocks, but then we also need
+ * to walk those when checking the pattern (in hot path).
+ *
+ * So instead, we maintain a small separate queue of block numbers, and we use
+ * this instead.
+ *
+ * Returns true if the block is in a sequential pattern (and so should not be
+ * prefetched), or false (not sequential, should be prefetched).
+ *
+ * XXX The name is a bit misleading, as it also adds the block number to the
+ * block queue and checks if the block is the same as the last one (which
+ * does not require a sequential pattern).
  */
 static bool
 index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
 {
-	int		idx;
+	int			idx;
 
-	/* If the queue is empty, just store the block and we're done. */
+	/*
+	 * If the block queue is empty, just store the block and we're done (it's
+	 * neither a sequential pattern, neither recently prefetched block).
+	 */
 	if (prefetch->blockIndex == 0)
 	{
 		prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
@@ -1131,30 +1202,66 @@ index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
 	}
 
 	/*
-	 * Otherwise, check if it's the same as the immediately preceding block (we
-	 * don't want to prefetch the same block over and over.)
+	 * Check if it's the same as the immediately preceding block. We don't
+	 * want to prefetch the same block over and over (which would happen for
+	 * well correlated indexes).
+	 *
+	 * In principle we could rely on index_prefetch_add_cache doing this using
+	 * the full cache, but this check is much cheaper and we need to look at
+	 * the preceding block anyway, so we just do it.
+	 *
+	 * XXX Notice we haven't added the block to the block queue yet, and there
+	 * is a preceding block (i.e. blockIndex-1 is valid).
 	 */
 	if (prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex - 1)] == block)
 		return true;
 
-	/* Not the same block, so add it to the queue. */
+	/*
+	 * Add the block number to the queue.
+	 *
+	 * We do this before checking if the pattern, because we want to know
+	 * about the block even if we end up skipping the prefetch. Otherwise we'd
+	 * not be able to detect longer sequential pattens - we'd skip one block
+	 * but then fail to skip the next couple blocks even in a perfect
+	 * sequential pattern. This ocillation might even prevent the OS
+	 * read-ahead from kicking in.
+	 */
 	prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
 	prefetch->blockIndex++;
 
-	/* check sequential patter a couple requests back */
+	/*
+	 * Check if the last couple blocks are in a sequential pattern. We look
+	 * for a sequential pattern of PREFETCH_SEQ_PATTERN_BLOCKS (4 by default),
+	 * so we look for patterns of 5 pages (40kB) including the new block.
+	 *
+	 * XXX Perhaps this should be tied to effective_io_concurrency somehow?
+	 *
+	 * XXX Could it be harmful that we read the queue backwards? Maybe memory
+	 * prefetching works better for the forward direction?
+	 */
 	for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
 	{
-		/* not enough requests to confirm a sequential pattern */
+		/*
+		 * Are there enough requests to confirm a sequential pattern? We only
+		 * consider something to be sequential after finding a sequence of
+		 * PREFETCH_SEQ_PATTERN_BLOCKS blocks.
+		 *
+		 * FIXME Better to move this outside the loop.
+		 */
 		if (prefetch->blockIndex < i)
 			return false;
 
 		/*
-		 * index of the already requested buffer (-1 because we already
-		 * incremented the index when adding the block to the queue)
+		 * Calculate index of the earlier block (we need to do -1 as we
+		 * already incremented the index when adding the new block to the
+		 * queue).
 		 */
 		idx = PREFETCH_BLOCK_INDEX(prefetch->blockIndex - i - 1);
 
-		/*  */
+		/*
+		 * For a sequential pattern, blocks "k" step ago needs to have block
+		 * number by "k" smaller compared to the current block.
+		 */
 		if (prefetch->blockItems[idx] != (block - i))
 			return false;
 	}
@@ -1164,30 +1271,34 @@ index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
 
 /*
  * index_prefetch_add_cache
- *		Add a block to the cache, return true if it was recently prefetched.
+ *		Add a block to the cache, check if it was recently prefetched.
+ *
+ * We don't want to prefetch blocks that we already prefetched recently. It's
+ * cheap but not free, and the overhead may have measurable impact.
  *
- * When checking a block, we need to check if it was recently prefetched,
- * where recently means within PREFETCH_CACHE_SIZE requests. This check
- * needs to be very cheap, even with fairly large caches (hundreds of
- * entries). The cache does not need to be perfect, we can accept false
- * positives/negatives, as long as the rate is reasonably low. We also
- * need to expire entries, so that only "recent" requests are remembered.
+ * This check needs to be very cheap, even with fairly large caches (hundreds
+ * of entries, see PREFETCH_CACHE_SIZE).
  *
- * A queue would allow expiring the requests, but checking if a block was
- * prefetched would be expensive (linear search for longer queues). Another
- * option would be a hash table, but that has issues with expiring entries
- * cheaply (which usually degrades the hash table).
+ * A simple queue would allow expiring the requests, but checking if it
+ * contains a particular block prefetched would be expensive (linear search).
+ * Another option would be a simple hash table, which has fast lookup but
+ * does not allow expiring entries cheaply.
  *
- * So we use a cache that is organized as multiple small LRU caches. Each
+ * The cache does not need to be perfect, we can accept false
+ * positives/negatives, as long as the rate is reasonably low. We also need
+ * to expire entries, so that only "recent" requests are remembered.
+ *
+ * We use a hybrid cache that is organized as many small LRU caches. Each
  * block is mapped to a particular LRU by hashing (so it's a bit like a
- * hash table), and each LRU is tiny (e.g. 8 entries). The LRU only keeps
- * the most recent requests (for that particular LRU).
+ * hash table). The LRU caches are tiny (e.g. 8 entries), and the expiration
+ * happens at the level of a single LRU (by tracking only the 8 most recent requests).
  *
- * This allows quick searches and expiration, with false negatives (when
- * a particular LRU has too many collisions).
+ * This allows quick searches and expiration, but with false negatives (when a
+ * particular LRU has too many collisions, we may evict entries that are more
+ * recent than some other LRU).
  *
  * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
- * prefetch request in total.
+ * prefetch request in total (these are the default parameters.)
  *
  * The recency is determined using a prefetch counter, incremented every
  * time we end up prefetching a block. The counter is uint64, so it should
@@ -1197,33 +1308,39 @@ index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
  * and then linearly search if the tiny LRU has entry for the same block
  * and request less than PREFETCH_CACHE_SIZE ago.
  *
- * At the same time, we either update the entry (for the same block) if
+ * At the same time, we either update the entry (for the queried block) if
  * found, or replace the oldest/empty entry.
  *
  * If the block was not recently prefetched (i.e. we want to prefetch it),
  * we increment the counter.
+ *
+ * Returns true if the block was recently prefetched (and thus we don't
+ * need to prefetch it again), or false (should do a prefetch).
+ *
+ * XXX It's a bit confusing these return values are inverse compared to
+ * what index_prefetch_is_sequential does.
  */
 static bool
 index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
 {
 	PrefetchCacheEntry *entry;
 
-	/* calculate which LRU to use */
+	/* map the block number the the LRU */
 	int			lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
 
-	/* entry to (maybe) use for this block request */
+	/* age/index of the oldest entry in the LRU, to maybe use */
 	uint64		oldestRequest = PG_UINT64_MAX;
 	int			oldestIndex = -1;
 
 	/*
 	 * First add the block to the (tiny) top-level LRU cache and see if it's
-	 * part of a sequential pattern. In this case we just ignore the block
-	 * and don't prefetch it - we expect read-ahead to do a better job.
+	 * part of a sequential pattern. In this case we just ignore the block and
+	 * don't prefetch it - we expect read-ahead to do a better job.
 	 *
-	 * XXX Maybe we should still add the block to the later cache, in case
-	 * we happen to access it later? That might help if we first scan a lot
-	 * of the table sequentially, and then randomly. Not sure that's very
-	 * likely with index access, though.
+	 * XXX Maybe we should still add the block to the hybrid cache, in case we
+	 * happen to access it later? That might help if we first scan a lot of
+	 * the table sequentially, and then randomly. Not sure that's very likely
+	 * with index access, though.
 	 */
 	if (index_prefetch_is_sequential(prefetch, block))
 	{
@@ -1231,7 +1348,11 @@ index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
 		return true;
 	}
 
-	/* see if we already have prefetched this block (linear search of LRU) */
+	/*
+	 * See if we recently prefetched this block - we simply scan the LRU
+	 * linearly. While doing that, we also track the oldest entry, so that we
+	 * know where to put the block if we don't find a matching entry.
+	 */
 	for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
 	{
 		entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
@@ -1243,14 +1364,18 @@ index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
 			oldestIndex = i;
 		}
 
-		/* Request numbers are positive, so 0 means "unused". */
+		/*
+		 * If the entry is unused (identified by request being set to 0),
+		 * we're done. Notice the field is uint64, so empty entry is
+		 * guaranteed to be the oldest one.
+		 */
 		if (entry->request == 0)
 			continue;
 
 		/* Is this entry for the same block as the current request? */
 		if (entry->block == block)
 		{
-			bool	prefetched;
+			bool		prefetched;
 
 			/*
 			 * Is the old request sufficiently recent? If yes, we treat the
@@ -1259,7 +1384,7 @@ index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
 			 * XXX We do add the cache size to the request in order not to
 			 * have issues with uint64 underflows.
 			 */
-			prefetched = (entry->request + PREFETCH_CACHE_SIZE >= prefetch->prefetchReqNumber);
+			prefetched = ((entry->request + PREFETCH_CACHE_SIZE) >= prefetch->prefetchReqNumber);
 
 			/* Update the request number. */
 			entry->request = ++prefetch->prefetchReqNumber;
@@ -1276,6 +1401,7 @@ index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
 	 */
 	Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
 
+	/* FIXME do a nice macro */
 	entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + oldestIndex];
 
 	entry->block = block;
@@ -1286,32 +1412,31 @@ index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
 }
 
 /*
- * Do prefetching, and gradually increase the prefetch distance.
- *
- * XXX This is limited to a single index page (because that's where we get
- * currPos.items from). But index tuples are typically very small, so there
- * should be quite a bit of stuff to prefetch (especially with deduplicated
- * indexes, etc.). Does not seem worth reworking the index access to allow
- * more aggressive prefetching, it's best effort.
+ * index_prefetch
+ *		Prefetch the TID, unless it's sequential or recently prefetched.
  *
  * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
  * prefetching does not cause significant regressions (e.g. for nestloop
- * with inner index scan). We could track number of index pages visited
- * and index tuples returned, to calculate avg tuples / page, and then
- * use that to limit prefetching after switching to a new page (instead
- * of just using prefetchMaxTarget, which can get much larger).
+ * with inner index scan). We could track number of rescans and number of
+ * items (TIDs) actually returned from the scan. Then we could calculate
+ * rows / rescan and use that to clamp prefetch target.
  *
- * XXX Obviously, another option is to use the planner estimates - we know
- * how many rows we're expected to fetch (on average, assuming the estimates
- * are reasonably accurate), so why not to use that. And maybe combine it
- * with the auto-tuning based on runtime statistics, described above.
+ * That'd help with cases when a scan matches only very few rows, far less
+ * than the prefetchTarget, because the unnecessary prefetches are wasted
+ * I/O. Imagine a LIMIT on top of index scan, or something like that.
+ *
+ * Another option is to use the planner estimates - we know how many rows we're
+ * expecting to fetch (on average, assuming the estimates are reasonably
+ * accurate), so why not to use that?
+ *
+ * Of course, we could/should combine these two approaches.
  *
  * XXX The prefetching may interfere with the patch allowing us to evaluate
  * conditions on the index tuple, in which case we may not need the heap
  * tuple. Maybe if there's such filter, we should prefetch only pages that
  * are not all-visible (and the same idea would also work for IOS), but
  * it also makes the indexing a bit "aware" of the visibility stuff (which
- * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * seems a somewhat wrong). Also, maybe we should consider the filter selectivity
  * (if the index-only filter is expected to eliminate only few rows, then
  * the vm check is pointless). Maybe this could/should be auto-tuning too,
  * i.e. we could track how many heap tuples were needed after all, and then
@@ -1324,13 +1449,13 @@ index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
 static void
 index_prefetch(IndexScanDesc scan, ItemPointer tid)
 {
-	IndexPrefetch	prefetch = scan->xs_prefetch;
-	BlockNumber	block;
+	IndexPrefetch prefetch = scan->xs_prefetch;
+	BlockNumber block;
 
 	/*
-	 * No heap relation means bitmap index scan, which does prefetching at
-	 * the bitmap heap scan, so no prefetch here (we can't do it anyway,
-	 * without the heap)
+	 * No heap relation means bitmap index scan, which does prefetching at the
+	 * bitmap heap scan, so no prefetch here (we can't do it anyway, without
+	 * the heap)
 	 *
 	 * XXX But in this case we should have prefetchMaxTarget=0, because in
 	 * index_bebinscan_bitmap() we disable prefetching. So maybe we should
@@ -1339,9 +1464,6 @@ index_prefetch(IndexScanDesc scan, ItemPointer tid)
 	if (!prefetch)
 		return;
 
-	/* was it initialized correctly? */
-	// Assert(prefetch->prefetchIndex != -1);
-
 	/*
 	 * If we got here, prefetching is enabled and it's a node that supports
 	 * prefetching (i.e. it can't be a bitmap index scan).
@@ -1355,9 +1477,9 @@ index_prefetch(IndexScanDesc scan, ItemPointer tid)
 	/*
 	 * Do not prefetch the same block over and over again,
 	 *
-	 * This happens e.g. for clustered or naturally correlated indexes
-	 * (fkey to a sequence ID). It's not expensive (the block is in page
-	 * cache already, so no I/O), but it's not free either.
+	 * This happens e.g. for clustered or naturally correlated indexes (fkey
+	 * to a sequence ID). It's not expensive (the block is in page cache
+	 * already, so no I/O), but it's not free either.
 	 */
 	if (!index_prefetch_add_cache(prefetch, block))
 	{
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 185ff0f1449..9796f8b979c 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1710,6 +1710,11 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 	 * essentially just effective_io_concurrency for the table (or the
 	 * tablespace it's in).
 	 *
+	 * XXX Should this also look at plan.plan_rows and maybe cap the target
+	 * to that? Pointless to prefetch more than we expect to use. Or maybe
+	 * just reset to that value during prefetching, after reading the next
+	 * index page (or rather after rescan)?
+	 *
 	 * XXX Maybe reduce the value with parallel workers?
 	 */
 	heapRel = node->ss.ss_currentRelation;
@@ -1770,6 +1775,18 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 	int			prefetch_target;
 	int			prefetch_reset;
 
+	/*
+	 * Determine number of heap pages to prefetch for this index. This is
+	 * essentially just effective_io_concurrency for the table (or the
+	 * tablespace it's in).
+	 *
+	 * XXX Should this also look at plan.plan_rows and maybe cap the target
+	 * to that? Pointless to prefetch more than we expect to use. Or maybe
+	 * just reset to that value during prefetching, after reading the next
+	 * index page (or rather after rescan)?
+	 *
+	 * XXX Maybe reduce the value with parallel workers?
+	 */
 	heapRel = node->ss.ss_currentRelation;
 
 	prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 47093cc9cf1..e250b0567eb 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1131,8 +1131,6 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 			need_full_snapshot = true;
 		}
 
-		elog(LOG, "slot = %s  need_full_snapshot = %d", cmd->slotname, need_full_snapshot);
-
 		ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
 										InvalidXLogRecPtr,
 										XL_ROUTINE(.page_read = logical_read_xlog_page,
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index b814af4b2f6..907ab886d3e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -235,18 +235,6 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
 										  ScanDirection direction);
 extern void systable_endscan_ordered(SysScanDesc sysscan);
 
-/*
- * XXX not sure it's the right place to define these callbacks etc.
- */
-typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
-											  ScanDirection direction,
-											  int *start, int *end,
-											  bool *reset);
-
-typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
-													 ScanDirection direction,
-													 int index);
-
 /*
  * Cache of recently prefetched blocks, organized as a hash table of
  * small LRU caches. Doesn't need to be perfectly accurate, but we
-- 
2.41.0



  [text/x-patch] 0003-remove-prefetch_reset-20231012.patch (17.5K, 5-0003-remove-prefetch_reset-20231012.patch)
  download | inline diff:
From 61b13cc9a2a0445d6ab992520a945437cd0f275c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Fri, 13 Oct 2023 23:04:39 +0200
Subject: [PATCH 3/4] remove prefetch_reset

---
 src/backend/access/heap/heapam_handler.c |  6 +--
 src/backend/access/index/genam.c         |  4 +-
 src/backend/access/index/indexam.c       | 38 +++++++----------
 src/backend/executor/execIndexing.c      |  2 +-
 src/backend/executor/execReplication.c   |  2 +-
 src/backend/executor/nodeIndexonlyscan.c |  6 +--
 src/backend/executor/nodeIndexscan.c     | 53 ++++++++++--------------
 src/backend/utils/adt/selfuncs.c         |  3 +-
 src/include/access/genam.h               |  6 +--
 src/include/access/relscan.h             |  4 +-
 10 files changed, 52 insertions(+), 72 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ca91bc5e878..89474078951 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -748,14 +748,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			PROGRESS_CLUSTER_INDEX_RELID
 		};
 		int64		ci_val[2];
-		int			prefetch_target;
+		int			prefetch_max;
 
 		/*
 		 * Get the prefetch target for the old tablespace (which is what we'll
 		 * read using the index). We'll use it as a reset value too, although
 		 * there should be no rescans for CLUSTER etc.
 		 */
-		prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
+		prefetch_max = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
 
 		/* Set phase and OIDOldIndex to columns */
 		ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -765,7 +765,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		tableScan = NULL;
 		heapScan = NULL;
 		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
-									prefetch_target, prefetch_target);
+									prefetch_max);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 6e3aa6bb1fd..d45a209ee3a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -456,7 +456,7 @@ systable_beginscan(Relation heapRelation,
 		 * use effective_io_concurrency, but it doesn't seem worth it.
 		 */
 		sysscan->iscan = index_beginscan(heapRelation, irel,
-										 snapshot, nkeys, 0, 0, 0);
+										 snapshot, nkeys, 0, 0);
 		index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 		sysscan->scan = NULL;
 	}
@@ -724,7 +724,7 @@ systable_beginscan_ordered(Relation heapRelation,
 	 * use effective_io_concurrency, but it doesn't seem worth it.
 	 */
 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
-									 snapshot, nkeys, 0, 0, 0);
+									 snapshot, nkeys, 0, 0);
 	index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 	sysscan->scan = NULL;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index e45a3a89387..8c56acfd638 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -109,7 +109,7 @@ do { \
 static IndexScanDesc index_beginscan_internal(Relation indexRelation,
 											  int nkeys, int norderbys, Snapshot snapshot,
 											  ParallelIndexScanDesc pscan, bool temp_snap,
-											  int prefetch_target, int prefetch_reset);
+											  int prefetch_max);
 
 static void index_prefetch(IndexScanDesc scan, ItemPointer tid);
 
@@ -206,13 +206,10 @@ index_insert(Relation indexRelation,
  *
  * Caller must be holding suitable locks on the heap and the index.
  *
- * prefetch_target determines if prefetching is requested for this index scan,
+ * prefetch_max determines if prefetching is requested for this index scan,
  * and how far ahead we want to prefetch
  *
- * prefetch_reset specifies the prefetch distance to start with on rescans (so
- * that we don't ramp-up to prefetch_target and use that forever)
- *
- * Setting prefetch_target to 0 disables prefetching for the index scan. We do
+ * Setting prefetch_max to 0 disables prefetching for the index scan. We do
  * this for two reasons - for scans on system catalogs, and/or for cases where
  * prefetching is expected to be pointless (like IOS).
  *
@@ -236,14 +233,14 @@ index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
 				int nkeys, int norderbys,
-				int prefetch_target, int prefetch_reset)
+				int prefetch_max)
 {
 	IndexScanDesc scan;
 
 	Assert(snapshot != InvalidSnapshot);
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
-									prefetch_target, prefetch_reset);
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot,
+									NULL, false, prefetch_max);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -273,12 +270,8 @@ index_beginscan_bitmap(Relation indexRelation,
 
 	Assert(snapshot != InvalidSnapshot);
 
-	/*
-	 * No prefetch for bitmap index scans. In this case prefetching happens at
-	 * the heapscan level.
-	 */
-	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
-									0, 0);
+	/* No prefetch in bitmap scans, prefetch is done by the heap scan. */
+	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false, 0);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -296,7 +289,7 @@ static IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
 						 int nkeys, int norderbys, Snapshot snapshot,
 						 ParallelIndexScanDesc pscan, bool temp_snap,
-						 int prefetch_target, int prefetch_reset)
+						 int prefetch_max)
 {
 	IndexScanDesc scan;
 
@@ -321,7 +314,7 @@ index_beginscan_internal(Relation indexRelation,
 	scan->xs_temp_snap = temp_snap;
 
 	/* With prefetching requested, initialize the prefetcher state. */
-	if (prefetch_target > 0)
+	if (prefetch_max > 0)
 	{
 		IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
 
@@ -330,8 +323,7 @@ index_beginscan_internal(Relation indexRelation,
 		prefetcher->queueEnd = 0;
 
 		prefetcher->prefetchTarget = 0;
-		prefetcher->prefetchMaxTarget = prefetch_target;
-		prefetcher->prefetchReset = prefetch_reset;
+		prefetcher->prefetchMaxTarget = prefetch_max;
 
 		scan->xs_prefetch = prefetcher;
 	}
@@ -382,8 +374,8 @@ index_rescan(IndexScanDesc scan,
 		prefetcher->queueIndex = 0;
 		prefetcher->prefetchDone = false;
 
-		prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
-										 prefetcher->prefetchReset);
+		/* restart the incremental ramp-up */
+		prefetcher->prefetchTarget = 0;
 	}
 }
 
@@ -576,7 +568,7 @@ index_parallelrescan(IndexScanDesc scan)
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
 						 int norderbys, ParallelIndexScanDesc pscan,
-						 int prefetch_target, int prefetch_reset)
+						 int prefetch_max)
 {
 	Snapshot	snapshot;
 	IndexScanDesc scan;
@@ -585,7 +577,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
 	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
 	RegisterSnapshot(snapshot);
 	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
-									pscan, true, prefetch_target, prefetch_reset);
+									pscan, true, prefetch_max);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 09418f715fa..eae1d8f9233 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -773,7 +773,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
+	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
 	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index f3e1a8d22a4..91676ccff95 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -210,7 +210,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	 * probably using a PK / UNIQUE index, so does not seem worth it. If we
 	 * reconsider this, calclate prefetch_target like in nodeIndexscan.
 	 */
-	scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
+	scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0);
 
 retry:
 	found = false;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 75b44db33c6..7c27913502c 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -100,7 +100,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 								   estate->es_snapshot,
 								   node->ioss_NumScanKeys,
 								   node->ioss_NumOrderByKeys,
-								   0, 0);	/* no index prefetch for IOS */
+								   0);	/* no prefetching for IOS */
 
 		node->ioss_ScanDesc = scandesc;
 
@@ -667,7 +667,7 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan,
-								 0, 0);	/* no index prefetch for IOS */
+								 0);	/* no prefetching for IOS */
 	node->ioss_ScanDesc->xs_want_itup = true;
 	node->ioss_VMBuffer = InvalidBuffer;
 
@@ -713,7 +713,7 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan,
-								 0, 0);	/* no index prefetch for IOS */
+								 0);	/* no prefetching for IOS */
 	node->ioss_ScanDesc->xs_want_itup = true;
 
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 9796f8b979c..a5f5394ef49 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -105,12 +105,11 @@ IndexNext(IndexScanState *node)
 
 	if (scandesc == NULL)
 	{
-		int	prefetch_target;
-		int	prefetch_reset;
+		int	prefetch_max;
 
 		/*
-		 * Determine number of heap pages to prefetch for this index. This is
-		 * essentially just effective_io_concurrency for the table (or the
+		 * Determine number of heap pages to prefetch for this index scan. This
+		 * is essentially just effective_io_concurrency for the table (or the
 		 * tablespace it's in).
 		 *
 		 * XXX Should this also look at plan.plan_rows and maybe cap the target
@@ -118,8 +117,8 @@ IndexNext(IndexScanState *node)
 		 * just reset to that value during prefetching, after reading the next
 		 * index page (or rather after rescan)?
 		 */
-		prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
-		prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+						   node->ss.ps.plan->plan_rows);
 
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
@@ -130,8 +129,7 @@ IndexNext(IndexScanState *node)
 								   estate->es_snapshot,
 								   node->iss_NumScanKeys,
 								   node->iss_NumOrderByKeys,
-								   prefetch_target,
-								   prefetch_reset);
+								   prefetch_max);
 
 		node->iss_ScanDesc = scandesc;
 
@@ -197,6 +195,7 @@ IndexNextWithReorder(IndexScanState *node)
 	Datum	   *lastfetched_vals;
 	bool	   *lastfetched_nulls;
 	int			cmp;
+	Relation	heapRel = node->ss.ss_currentRelation;
 
 	estate = node->ss.ps.state;
 
@@ -218,9 +217,7 @@ IndexNextWithReorder(IndexScanState *node)
 
 	if (scandesc == NULL)
 	{
-		Relation heapRel = node->ss.ss_currentRelation;
-		int	prefetch_target;
-		int	prefetch_reset;
+		int	prefetch_max;
 
 		/*
 		 * Determine number of heap pages to prefetch for this index. This is
@@ -232,8 +229,8 @@ IndexNextWithReorder(IndexScanState *node)
 		 * just reset to that value during prefetching, after reading the next
 		 * index page (or rather after rescan)?
 		 */
-		prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
-		prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+						   node->ss.ps.plan->plan_rows);
 
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
@@ -244,8 +241,7 @@ IndexNextWithReorder(IndexScanState *node)
 								   estate->es_snapshot,
 								   node->iss_NumScanKeys,
 								   node->iss_NumOrderByKeys,
-								   prefetch_target,
-								   prefetch_reset);
+								   prefetch_max);
 
 		node->iss_ScanDesc = scandesc;
 
@@ -1701,9 +1697,8 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 {
 	EState	   *estate = node->ss.ps.state;
 	ParallelIndexScanDesc piscan;
-	Relation	heapRel;
-	int			prefetch_target;
-	int			prefetch_reset;
+	Relation	heapRel = node->ss.ss_currentRelation;
+	int			prefetch_max;
 
 	/*
 	 * Determine number of heap pages to prefetch for this index. This is
@@ -1717,10 +1712,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 	 *
 	 * XXX Maybe reduce the value with parallel workers?
 	 */
-	heapRel = node->ss.ss_currentRelation;
 
-	prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
-	prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+	prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+					   node->ss.ps.plan->plan_rows);
 
 	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
 	index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1734,8 +1728,7 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
 								 piscan,
-								 prefetch_target,
-								 prefetch_reset);
+								 prefetch_max);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1771,9 +1764,8 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 							  ParallelWorkerContext *pwcxt)
 {
 	ParallelIndexScanDesc piscan;
-	Relation	heapRel;
-	int			prefetch_target;
-	int			prefetch_reset;
+	Relation	heapRel = node->ss.ss_currentRelation;
+	int			prefetch_max;
 
 	/*
 	 * Determine number of heap pages to prefetch for this index. This is
@@ -1787,10 +1779,8 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 	 *
 	 * XXX Maybe reduce the value with parallel workers?
 	 */
-	heapRel = node->ss.ss_currentRelation;
-
-	prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
-	prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+	prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+					   node->ss.ps.plan->plan_rows);
 
 	piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->iss_ScanDesc =
@@ -1799,8 +1789,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
 								 piscan,
-								 prefetch_target,
-								 prefetch_reset);
+								 prefetch_max);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 0b02b6265d0..52e3aaaf20a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6216,9 +6216,10 @@ get_actual_variable_endpoint(Relation heapRel,
 	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
 							  GlobalVisTestFor(heapRel));
 
+	/* XXX Maybe should do prefetching using the default prefetch parameters? */
 	index_scan = index_beginscan(heapRel, indexRel,
 								 &SnapshotNonVacuumable,
-								 1, 0, 0, 0);	/* XXX maybe do prefetch? */
+								 1, 0, 0);
 	/* Set it up for index-only scan */
 	index_scan->xs_want_itup = true;
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 907ab886d3e..ceb6279b0b0 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -154,8 +154,7 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
 									 Snapshot snapshot,
 									 int nkeys, int norderbys,
-									 int prefetch_target,
-									 int prefetch_reset);
+									 int prefetch_max);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 											Snapshot snapshot,
 											int nkeys);
@@ -173,8 +172,7 @@ extern void index_parallelrescan(IndexScanDesc scan);
 extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel, int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan,
-											  int prefetch_target,
-											  int prefetch_reset);
+											  int prefetch_max);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
 struct TupleTableSlot;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index c119fe597d8..231a30ecc46 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -107,7 +107,7 @@ typedef struct IndexFetchTableData
 } IndexFetchTableData;
 
 /*
- * Forward declaration, defined in genam.h.
+ * Forward declarations, defined in genam.h.
  */
 typedef struct IndexPrefetchData IndexPrefetchData;
 typedef struct IndexPrefetchData *IndexPrefetch;
@@ -168,7 +168,7 @@ typedef struct IndexScanDescData
 	bool	   *xs_orderbynulls;
 	bool		xs_recheckorderby;
 
-	/* prefetching state (or NULL if disabled) */
+	/* prefetching state (or NULL if disabled for this scan) */
 	IndexPrefetchData *xs_prefetch;
 
 	/* parallel index scan information, in shared memory */
-- 
2.41.0



  [text/x-patch] 0004-PoC-prefetch-for-IOS-20231012.patch (11.3K, 6-0004-PoC-prefetch-for-IOS-20231012.patch)
  download | inline diff:
From 047eda4fa39e5b37b5beaaa3f24195a8d7aa6a5e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sat, 14 Oct 2023 00:13:26 +0200
Subject: [PATCH 4/4] PoC prefetch for IOS

---
 src/backend/access/index/indexam.c       | 140 ++++++++++++++++++++++-
 src/backend/executor/nodeIndexonlyscan.c |  63 +++++++++-
 src/include/access/genam.h               |   2 +
 3 files changed, 195 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 8c56acfd638..4ae4e867770 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -49,6 +49,7 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/visibilitymap.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "catalog/pg_amproc.h"
@@ -111,7 +112,7 @@ static IndexScanDesc index_beginscan_internal(Relation indexRelation,
 											  ParallelIndexScanDesc pscan, bool temp_snap,
 											  int prefetch_max);
 
-static void index_prefetch(IndexScanDesc scan, ItemPointer tid);
+static void index_prefetch(IndexScanDesc scan, ItemPointer tid, bool skip_all_visible);
 
 
 /* ----------------------------------------------------------------
@@ -755,7 +756,7 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
 				 * FIXME For IOS, this should prefetch only pages that are not
 				 * fully visible.
 				 */
-				index_prefetch(scan, tid);
+				index_prefetch(scan, tid, false);
 			}
 		}
 
@@ -1439,7 +1440,7 @@ index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
  * in BTScanPosData.nextPage.
  */
 static void
-index_prefetch(IndexScanDesc scan, ItemPointer tid)
+index_prefetch(IndexScanDesc scan, ItemPointer tid, bool skip_all_visible)
 {
 	IndexPrefetch prefetch = scan->xs_prefetch;
 	BlockNumber block;
@@ -1462,10 +1463,36 @@ index_prefetch(IndexScanDesc scan, ItemPointer tid)
 	 */
 	Assert(scan->heapRelation);
 
-	prefetch->countAll++;
-
 	block = ItemPointerGetBlockNumber(tid);
 
+	/*
+	 * When prefetching for IOS, we want to only prefetch pages that are not
+	 * marked as all-visible (because not fetching all-visible pages is the
+	 * point of IOS).
+	 *
+	 * XXX This is not great, because it releases the VM buffer for each TID
+	 * we consider to prefetch. We should reuse that somehow, similar to the
+	 * actual IOS code. Ideally, we should use the same ioss_VMBuffer (if
+	 * we can propagate it here). Or at least do it for a bulk of prefetches,
+	 * although that's not very useful - after the ramp-up we will prefetch
+	 * the pages one by one anyway.
+	 */
+	if (skip_all_visible)
+	{
+		bool	all_visible;
+		Buffer	vmbuffer = InvalidBuffer;
+
+		all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+									 block,
+									 &vmbuffer);
+
+		if (vmbuffer != InvalidBuffer)
+			ReleaseBuffer(vmbuffer);
+
+		if (all_visible)
+			return;
+	}
+
 	/*
 	 * Do not prefetch the same block over and over again,
 	 *
@@ -1480,4 +1507,107 @@ index_prefetch(IndexScanDesc scan, ItemPointer tid)
 		PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
 		pgBufferUsage.blks_prefetches++;
 	}
+
+	prefetch->countAll++;
+}
+
+/* ----------------
+ * index_getnext_tid_prefetch - get the next TID from a scan
+ *
+ * The result is the next TID satisfying the scan keys,
+ * or NULL if no more matching tuples exist.
+ *
+ * FIXME not sure this handles xs_heapfetch correctly.
+ * ----------------
+ */
+ItemPointer
+index_getnext_tid_prefetch(IndexScanDesc scan, ScanDirection direction)
+{
+	IndexPrefetch prefetch = scan->xs_prefetch; /* for convenience */
+
+	/*
+	 * If the prefetching is still active (i.e. enabled and we still
+	 * haven't finished reading TIDs from the scan), read enough TIDs into
+	 * the queue until we hit the current target.
+	 */
+	if (PREFETCH_ACTIVE(prefetch))
+	{
+		/*
+		 * Ramp up the prefetch distance incrementally.
+		 *
+		 * Intentionally done as first, before reading the TIDs into the
+		 * queue, so that there's always at least one item. Otherwise we
+		 * might get into a situation where we start with target=0 and no
+		 * TIDs loaded.
+		 */
+		prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+									   prefetch->prefetchMaxTarget);
+
+		/*
+		 * Now read TIDs from the index until the queue is full (with
+		 * respect to the current prefetch target).
+		 */
+		while (!PREFETCH_FULL(prefetch))
+		{
+			ItemPointer tid;
+
+			/* Time to fetch the next TID from the index */
+			tid = index_getnext_tid(scan, direction);
+
+			/*
+			 * If we're out of index entries, we're done (and we mark the
+			 * the prefetcher as inactive).
+			 */
+			if (tid == NULL)
+			{
+				prefetch->prefetchDone = true;
+				break;
+			}
+
+			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)] = *tid;
+			prefetch->queueEnd++;
+
+			/*
+			 * Issue the actuall prefetch requests for the new TID.
+			 *
+			 * XXX index_getnext_tid_prefetch is only called for IOS (for now),
+			 * so skip prefetching of all-visible pages.
+			 */
+			index_prefetch(scan, tid, true);
+		}
+	}
+
+	/*
+	 * With prefetching enabled (even if we already finished reading
+	 * all TIDs from the index scan), we need to return a TID from the
+	 * queue. Otherwise, we just get the next TID from the scan
+	 * directly.
+	 */
+	if (PREFETCH_ENABLED(prefetch))
+	{
+		/* Did we reach the end of the scan and the queue is empty? */
+		if (PREFETCH_DONE(prefetch))
+			return NULL;
+
+		scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)];
+		prefetch->queueIndex++;
+	}
+	else				/* not prefetching, just do the regular work  */
+	{
+		ItemPointer tid;
+
+		/* Time to fetch the next TID from the index */
+		tid = index_getnext_tid(scan, direction);
+
+		/* If we're out of index entries, we're done */
+		if (tid == NULL)
+			return NULL;
+
+		Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+	}
+
+	/* Return the TID of the tuple we found. */
+	return &scan->xs_heaptid;
 }
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 7c27913502c..855afd5ba76 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -43,7 +43,7 @@
 #include "storage/predicate.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
-
+#include "utils/spccache.h"
 
 static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
 static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
@@ -65,6 +65,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
 	ItemPointer tid;
+	Relation	heapRel = node->ss.ss_currentRelation;
 
 	/*
 	 * extract necessary information from index scan node
@@ -83,6 +84,23 @@ IndexOnlyNext(IndexOnlyScanState *node)
 
 	if (scandesc == NULL)
 	{
+		int			prefetch_max;
+
+		/*
+		 * Determine number of heap pages to prefetch for this index. This is
+		 * essentially just effective_io_concurrency for the table (or the
+		 * tablespace it's in).
+		 *
+		 * XXX Should this also look at plan.plan_rows and maybe cap the target
+		 * to that? Pointless to prefetch more than we expect to use. Or maybe
+		 * just reset to that value during prefetching, after reading the next
+		 * index page (or rather after rescan)?
+		 *
+		 * XXX Maybe reduce the value with parallel workers?
+		 */
+		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+						   node->ss.ps.plan->plan_rows);
+
 		/*
 		 * We reach here if the index only scan is not parallel, or if we're
 		 * serially executing an index only scan that was planned to be
@@ -100,7 +118,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 								   estate->es_snapshot,
 								   node->ioss_NumScanKeys,
 								   node->ioss_NumOrderByKeys,
-								   0);	/* no prefetching for IOS */
+								   prefetch_max);
 
 		node->ioss_ScanDesc = scandesc;
 
@@ -124,7 +142,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while ((tid = index_getnext_tid_prefetch(scandesc, direction)) != NULL)
 	{
 		bool		tuple_from_heap = false;
 
@@ -654,6 +672,24 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 {
 	EState	   *estate = node->ss.ps.state;
 	ParallelIndexScanDesc piscan;
+	Relation	heapRel = node->ss.ss_currentRelation;
+	int			prefetch_max;
+
+	/*
+	 * Determine number of heap pages to prefetch for this index. This is
+	 * essentially just effective_io_concurrency for the table (or the
+	 * tablespace it's in).
+	 *
+	 * XXX Should this also look at plan.plan_rows and maybe cap the target
+	 * to that? Pointless to prefetch more than we expect to use. Or maybe
+	 * just reset to that value during prefetching, after reading the next
+	 * index page (or rather after rescan)?
+	 *
+	 * XXX Maybe reduce the value with parallel workers?
+	 */
+
+	prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+					   node->ss.ps.plan->plan_rows);
 
 	piscan = shm_toc_allocate(pcxt->toc, node->ioss_PscanLen);
 	index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -667,7 +703,7 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan,
-								 0);	/* no prefetching for IOS */
+								 prefetch_max);
 	node->ioss_ScanDesc->xs_want_itup = true;
 	node->ioss_VMBuffer = InvalidBuffer;
 
@@ -705,6 +741,23 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 								  ParallelWorkerContext *pwcxt)
 {
 	ParallelIndexScanDesc piscan;
+	Relation	heapRel = node->ss.ss_currentRelation;
+	int			prefetch_max;
+
+	/*
+	 * Determine number of heap pages to prefetch for this index. This is
+	 * essentially just effective_io_concurrency for the table (or the
+	 * tablespace it's in).
+	 *
+	 * XXX Should this also look at plan.plan_rows and maybe cap the target
+	 * to that? Pointless to prefetch more than we expect to use. Or maybe
+	 * just reset to that value during prefetching, after reading the next
+	 * index page (or rather after rescan)?
+	 *
+	 * XXX Maybe reduce the value with parallel workers?
+	 */
+	prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+					   node->ss.ps.plan->plan_rows);
 
 	piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->ioss_ScanDesc =
@@ -713,7 +766,7 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
 								 piscan,
-								 0);	/* no prefetching for IOS */
+								 prefetch_max);
 	node->ioss_ScanDesc->xs_want_itup = true;
 
 	/*
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index ceb6279b0b0..6e92758ced5 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -175,6 +175,8 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  int prefetch_max);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
+extern ItemPointer index_getnext_tid_prefetch(IndexScanDesc scan,
+											  ScanDirection direction);
 struct TupleTableSlot;
 extern bool index_fetch_heap(IndexScanDesc scan, struct TupleTableSlot *slot);
 extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-- 
2.41.0

view thread (10+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: index prefetching
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox