index prefetching

public inbox for [email protected]  
help / color / mirror / Atom feed

index prefetching
4+ messages / 4 participants
[nested] [flat]

* index prefetching
@ 2023-06-08 15:40  Tomas Vondra <[email protected]>
  0 siblings, 3 replies; 4+ messages in thread

From: Tomas Vondra @ 2023-06-08 15:40 UTC (permalink / raw)
  To: PostgreSQL Hackers <[email protected]>; +Cc: Georgios <[email protected]>

Hi,

At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.


Motivation
----------

Imagine we have a huge table (much larger than RAM), with an index, and
that we're doing a regular index scan (e.g. using a btree index). We
first walk the index to the leaf page, read the item pointers from the
leaf page and then start issuing fetches from the heap.

The index access is usually pretty cheap, because non-leaf pages are
very likely cached, so we may do perhaps I/O for the leaf. But the
fetches from heap are likely very expensive - unless the page is
clustered, we'll do a random I/O for each item pointer. Easily ~200 or
more I/O requests per leaf page. The problem is index scans do these
requests synchronously at the moment - we get the next TID, fetch the
heap page, process the tuple, continue to the next TID etc.

That is slow and can't really leverage the bandwidth of modern storage,
which require longer queues. This patch aims to improve this by async
prefetching.

We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.

But there's three shortcomings in logic:

1) It's not clear the thresholds for prefetching being beneficial and
switching to bitmap index scans are the same value. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.

2) Our estimates / planning are not perfect, so we may easily pick an
index scan instead of a bitmap scan. It'd be nice to limit the damage a
bit by still prefetching.

3) There are queries that can't do a bitmap scan (at all, or because
it's hopelessly inefficient). Consider queries that require ordering, or
queries by distance with GiST/SP-GiST index.


Implementation
--------------

When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.

The easiest thing would be to just do prefetching from the btree code.
But then I realized there's no particular reason why other index types
(except for GIN, which only allows bitmap scans) couldn't do prefetching
too. We could have a copy in each AM, of course, but that seems sloppy
and also violation of layering. After all, bitmap heap scans do prefetch
from the executor, so AM seems way too low level.

So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).

So what I did is introducing a IndexPrefetch struct, which is part of
IndexScanDesc, maintaining all the info about prefetching for that
particular scan - current/maximum distance, progress, etc.

It also contains two AM-specific callbacks (get_range and get_block)
which say valid range of indexes (into the internal array), and block
number for a given index.

This mostly does the trick, although index_prefetch() is still called
from the amgettuple() functions. That seems wrong, we should call it
from indexam.c right aftter calling amgettuple.


Problems / Open questions
-------------------------

There's a couple issues I ran into, I'll try to list them in the order
of importance (most serious ones first).

1) pairing-heap in GiST / SP-GiST

For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is pretty trivial, even if the
current API is a bit cumbersome.

Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(

I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.

In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.


2) prefetching from executor

Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?

I'm also not entirely sure the way this interfaces with the AM (through
the get_range / get_block callbaces) is very elegant. It did the trick,
but it seems a bit cumbersome. I wonder if someone has a better/nicer
idea how to do this ...


3) prefetch distance

I think we can do various smart things about the prefetch distance.

The current code does about the same thing bitmap scans do - it starts
with distance 0 (no prefetching), and then simply ramps the distance up
until the maximum value from get_tablespace_io_concurrency(). Which is
either effective_io_concurrency, or per-tablespace value.

I think we could be a bit smarter, and also consider e.g. the estimated
number of matching rows (but we shouldn't be too strict, because it's
just an estimate). We could also track some statistics for each scan and
use that during a rescans (think index scan in a nested loop).

But the patch doesn't do any of that now.


4) per-leaf prefetching

The code is restricted only prefetches items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.

I think this is acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.


5) index-only scans

I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access that. And if that happens, this leads to the bizarre situation
that IOS is slower than regular index scan. But to address this, we'd
have to consider the visibility during prefetching.


Benchmarks
----------

1) OLTP

For OLTP, this tested different queries with various index types, on
data sets constructed to have certain number of matching rows, forcing
different types of query plans (bitmap, index, seqscan).

The data sets have ~34GB, which is much more than available RAM (8GB).

For example for BTREE, we have a query like this:

   SELECT * FROM btree_test WHERE a = $v

with data matching 1, 10, 100, ..., 100000 rows for each $v. The results
look like this:

   rows    bitmapscan     master    patched    seqscan
   1             19.8       20.4       18.8    31875.5
   10            24.4       23.8       23.2    30642.4
   100           27.7       40.0       26.3    31871.3
   1000          45.8      178.0       45.4    30754.1
   10000        171.8     1514.9      174.5    30743.3
   100000      1799.0    15993.3     1777.4    30937.3

This says that the query takes ~31s with a seqscan, 1.8s with a bitmap
scan and 16s index scan (on master). With the prefetching patch, it
takes about ~1.8s, i.e. about the same as the bitmap scan.

I don't know where exactly would the plan switch from index scan to
bitmap scan, but the table has ~100M rows, so all of this is tiny. I'd
bet most of the cases would do plain index scan.


For a query with ordering:

    SELECT * FROM btree_test WHERE a >= $v ORDER BY a LIMIT $n

the results look a bit different:

    rows      bitmapscan     master     patched     seqscan
    1            52703.9       19.5        19.5     31145.6
    10           51208.1       22.7        24.7     30983.5
    100          49038.6       39.0        26.3     32085.3
    1000         53760.4      193.9        48.4     31479.4
    10000        56898.4     1600.7       187.5     32064.5
    100000       50975.2    15978.7      1848.9     31587.1

This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).

Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).

The results for other index types (HASH, GiST, SP-GiST) follow roughly
the same pattern. See the attached PDF for more charts, and [1] for
complete results.


Benchmark / TPC-H
-----------------

I ran the 22 queries on 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):

    query    serial    parallel
    1          101%         99%
    2          119%        100%
    3          100%         99%
    4          101%        100%
    5          101%        100%
    6           12%         99%
    7          100%        100%
    8           52%         67%
    10         102%        101%
    11         100%         72%
    12         101%        100%
    13         100%        101%
    14          13%        100%
    15         101%        100%
    16          99%         99%
    17          95%        101%
    18         101%        106%
    19          30%         40%
    20          99%        100%
    21         101%        100%
    22         101%        107%

The percentage is (timing patched / master, so <100% means faster, >100%
means slower).

The different queries are affected depending on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.

My explanation is that either (a) parallel case used a different plan
with fewer index scans or (b) the parallel query does more concurrent
I/O simply by using parallel workers. Or maybe both.

There are a couple regressions too, I believe those are due to doing too
much prefetching in some cases, and some of the heuristics mentioned
earlier should eliminate most of this, I think.


regards


[1] https://github.com/tvondra/index-prefetch-tests
[2] https://github.com/tvondra/postgres/tree/dev/index-prefetch


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

  [text/x-patch] index-prefetch-poc.patch (57.7K, 2-index-prefetch-poc.patch)
  download | inline diff:
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index efdf9415d15..9b3625d833b 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -193,7 +193,7 @@ extern bool blinsert(Relation index, Datum *values, bool *isnull,
 					 IndexUniqueCheck checkUnique,
 					 bool indexUnchanged,
 					 struct IndexInfo *indexInfo);
-extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys, int prefetch, int prefetch_reset);
 extern int64 blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void blrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					 ScanKey orderbys, int norderbys);
diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 6cc7d07164a..0c6da1b635b 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -25,7 +25,7 @@
  * Begin scan of bloom index.
  */
 IndexScanDesc
-blbeginscan(Relation r, int nkeys, int norderbys)
+blbeginscan(Relation r, int nkeys, int norderbys, int prefetch_target, int prefetch_reset)
 {
 	IndexScanDesc scan;
 	BloomScanOpaque so;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa3..5b298c02cce 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -324,7 +324,7 @@ brininsert(Relation idxRel, Datum *values, bool *nulls,
  * holding lock on index, it's not necessary to recompute it during brinrescan.
  */
 IndexScanDesc
-brinbeginscan(Relation r, int nkeys, int norderbys)
+brinbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
 {
 	IndexScanDesc scan;
 	BrinOpaque *opaque;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index ae7b0e9bb87..3087a986bc3 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -22,7 +22,7 @@
 
 
 IndexScanDesc
-ginbeginscan(Relation rel, int nkeys, int norderbys)
+ginbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
 {
 	IndexScanDesc scan;
 	GinScanOpaque so;
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069c..7b79128f2ce 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -493,12 +493,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
 
 			if (GistPageIsLeaf(page))
 			{
+				BlockNumber		block = ItemPointerGetBlockNumber(&it->t_tid);
+
 				/* Creating heap-tuple GISTSearchItem */
 				item->blkno = InvalidBlockNumber;
 				item->data.heap.heapPtr = it->t_tid;
 				item->data.heap.recheck = recheck;
 				item->data.heap.recheckDistances = recheck_distances;
 
+				PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+
 				/*
 				 * In an index-only scan, also fetch the data from the tuple.
 				 */
@@ -529,6 +533,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
 	}
 
 	UnlockReleaseBuffer(buffer);
+
+	so->didReset = true;
 }
 
 /*
@@ -679,6 +685,8 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
 
 				so->curPageData++;
 
+				index_prefetch(scan, ForwardScanDirection);
+
 				return true;
 			}
 
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 00400583c0b..fdf978eaaad 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -22,6 +22,8 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+static void gist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber gist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
 
 /*
  * Pairing heap comparison function for the GISTSearchItem queue
@@ -71,7 +73,7 @@ pairingheap_GISTSearchItem_cmp(const pairingheap_node *a, const pairingheap_node
  */
 
 IndexScanDesc
-gistbeginscan(Relation r, int nkeys, int norderbys)
+gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
 {
 	IndexScanDesc scan;
 	GISTSTATE  *giststate;
@@ -111,6 +113,31 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
 	so->curBlkno = InvalidBlockNumber;
 	so->curPageLSN = InvalidXLogRecPtr;
 
+	/*
+	 * XXX maybe should happen in RelationGetIndexScan? But we need to define
+	 * the callacks, so that needs to happen here ...
+	 *
+	 * XXX Do we need to do something for so->markPos?
+	 */
+	if (prefetch_maximum > 0)
+	{
+		IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+		prefetcher->prefetchIndex = -1;
+		prefetcher->prefetchTarget = -3;
+		prefetcher->prefetchMaxTarget = prefetch_maximum;
+		prefetcher->prefetchReset = prefetch_reset;
+
+		prefetcher->cacheIndex = 0;
+		memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+		/* callbacks */
+		prefetcher->get_block = gist_prefetch_getblock;
+		prefetcher->get_range = gist_prefetch_getrange;
+
+		scan->xs_prefetch = prefetcher;
+	}
+
 	scan->opaque = so;
 
 	/*
@@ -356,3 +383,42 @@ gistendscan(IndexScanDesc scan)
 	 */
 	freeGISTstate(so->giststate);
 }
+
+static void
+gist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+	GISTScanOpaque	so = (GISTScanOpaque) scan->opaque;
+
+	/* did we rebuild the array of tuple pointers? */
+	*reset = so->didReset;
+	so->didReset = false;
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* Did we already process the item or is it invalid? */
+		*start = so->curPageData;
+		*end = (so->nPageData - 1);
+	}
+	else
+	{
+		*start = 0;
+		*end = so->curPageData;
+	}
+}
+
+static BlockNumber
+gist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+	GISTScanOpaque	so = (GISTScanOpaque) scan->opaque;
+	ItemPointer		tid;
+
+	if ((index < so->curPageData) || (index >= so->nPageData))
+		return InvalidBlockNumber;
+
+	/* get the tuple ID and extract the block number */
+	tid = &so->pageData[index].heapPtr;
+
+	Assert(ItemPointerIsValid(tid));
+
+	return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fc5d97f606e..01a25132bce 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -48,6 +48,9 @@ static void hashbuildCallback(Relation index,
 							  bool tupleIsAlive,
 							  void *state);
 
+static void _hash_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber _hash_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
+
 
 /*
  * Hash handler function: return IndexAmRoutine with access method parameters
@@ -362,7 +365,7 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
  *	hashbeginscan() -- start a scan on a hash index
  */
 IndexScanDesc
-hashbeginscan(Relation rel, int nkeys, int norderbys)
+hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
 {
 	IndexScanDesc scan;
 	HashScanOpaque so;
@@ -383,6 +386,31 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	so->killedItems = NULL;
 	so->numKilled = 0;
 
+	/*
+	 * XXX maybe should happen in RelationGetIndexScan? But we need to define
+	 * the callacks, so that needs to happen here ...
+	 *
+	 * XXX Do we need to do something for so->markPos?
+	 */
+	if (prefetch_maximum > 0)
+	{
+		IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+		prefetcher->prefetchIndex = -1;
+		prefetcher->prefetchTarget = -3;
+		prefetcher->prefetchMaxTarget = prefetch_maximum;
+		prefetcher->prefetchReset = prefetch_reset;
+
+		prefetcher->cacheIndex = 0;
+		memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+		/* callbacks */
+		prefetcher->get_block = _hash_prefetch_getblock;
+		prefetcher->get_range = _hash_prefetch_getrange;
+
+		scan->xs_prefetch = prefetcher;
+	}
+
 	scan->opaque = so;
 
 	return scan;
@@ -918,3 +946,42 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	else
 		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
+
+static void
+_hash_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+
+	/* did we rebuild the array of tuple pointers? */
+	*reset = so->currPos.didReset;
+	so->currPos.didReset = false;
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* Did we already process the item or is it invalid? */
+		*start = so->currPos.itemIndex;
+		*end = so->currPos.lastItem;
+	}
+	else
+	{
+		*start = so->currPos.firstItem;
+		*end = so->currPos.itemIndex;
+	}
+}
+
+static BlockNumber
+_hash_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	ItemPointer		tid;
+
+	if ((index < so->currPos.firstItem) || (index > so->currPos.lastItem))
+		return InvalidBlockNumber;
+
+	/* get the tuple ID and extract the block number */
+	tid = &so->currPos.items[index].heapTid;
+
+	Assert(ItemPointerIsValid(tid));
+
+	return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9ea2a42a07f..b5cea5e23eb 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -434,6 +434,8 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_heaptid = currItem->heapTid;
 
+	index_prefetch(scan, dir);
+
 	/* if we're here, _hash_readpage found a valid tuples */
 	return true;
 }
@@ -467,6 +469,7 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 	so->currPos.buf = buf;
 	so->currPos.currPage = BufferGetBlockNumber(buf);
+	so->currPos.didReset = true;
 
 	if (ScanDirectionIsForward(dir))
 	{
@@ -597,6 +600,7 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	}
 
 	Assert(so->currPos.firstItem <= so->currPos.lastItem);
+
 	return true;
 }
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 646135cc21c..b2f4eadc1ea 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/spccache.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -756,6 +757,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			PROGRESS_CLUSTER_INDEX_RELID
 		};
 		int64		ci_val[2];
+		int			prefetch_target;
+
+		prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
 
 		/* Set phase and OIDOldIndex to columns */
 		ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -764,7 +768,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+									prefetch_target, prefetch_target);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aebab..264ebe1d8e5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
 
+	/* set in each AM when applicable */
+	scan->xs_prefetch = NULL;
+
 	return scan;
 }
 
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
+		/* no index prefetch for system catalogs */
 		sysscan->iscan = index_beginscan(heapRelation, irel,
-										 snapshot, nkeys, 0);
+										 snapshot, nkeys, 0, 0, 0);
 		index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 		sysscan->scan = NULL;
 	}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
 			elog(ERROR, "column is not in index");
 	}
 
+	/* no index prefetch for system catalogs */
 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
-									 snapshot, nkeys, 0);
+									 snapshot, nkeys, 0, 0, 0);
 	index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 	sysscan->scan = NULL;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b25b03f7abc..aa8a14624d8 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -59,6 +59,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "utils/lsyscache.h"
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
@@ -106,7 +107,8 @@ do { \
 
 static IndexScanDesc index_beginscan_internal(Relation indexRelation,
 											  int nkeys, int norderbys, Snapshot snapshot,
-											  ParallelIndexScanDesc pscan, bool temp_snap);
+											  ParallelIndexScanDesc pscan, bool temp_snap,
+											  int prefetch_target, int prefetch_reset);
 
 
 /* ----------------------------------------------------------------
@@ -200,18 +202,36 @@ index_insert(Relation indexRelation,
  * index_beginscan - start a scan of an index with amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * really needed. Secondly, we must prevent infinite loop when determining
+ * prefetch value for the tablespace - the get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in infinite loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply after effective_io_concurrency
+ * lookup moved to caller of index_beginscan.
  */
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
-				int nkeys, int norderbys)
+				int nkeys, int norderbys,
+				int prefetch_target, int prefetch_reset)
 {
 	IndexScanDesc scan;
 
 	Assert(snapshot != InvalidSnapshot);
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+									prefetch_target, prefetch_reset);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -241,7 +261,8 @@ index_beginscan_bitmap(Relation indexRelation,
 
 	Assert(snapshot != InvalidSnapshot);
 
-	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+									0, 0); /* no prefetch */
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -258,7 +279,8 @@ index_beginscan_bitmap(Relation indexRelation,
 static IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
 						 int nkeys, int norderbys, Snapshot snapshot,
-						 ParallelIndexScanDesc pscan, bool temp_snap)
+						 ParallelIndexScanDesc pscan, bool temp_snap,
+						 int prefetch_target, int prefetch_reset)
 {
 	IndexScanDesc scan;
 
@@ -276,8 +298,8 @@ index_beginscan_internal(Relation indexRelation,
 	/*
 	 * Tell the AM to open a scan.
 	 */
-	scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
-												norderbys);
+	scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys,
+												prefetch_target, prefetch_reset);
 	/* Initialize information for parallel scan. */
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
@@ -317,6 +339,16 @@ index_rescan(IndexScanDesc scan,
 
 	scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
 											orderbys, norderbys);
+
+	/* If we're prefetching for this index, maybe reset some of the state. */
+	if (scan->xs_prefetch != NULL)
+	{
+		IndexPrefetch prefetcher = scan->xs_prefetch;
+
+		prefetcher->prefetchIndex = -1;
+		prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+										 prefetcher->prefetchReset);
+	}
 }
 
 /* ----------------
@@ -487,10 +519,13 @@ index_parallelrescan(IndexScanDesc scan)
  * index_beginscan_parallel - join parallel index scan
  *
  * Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
  */
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
-						 int norderbys, ParallelIndexScanDesc pscan)
+						 int norderbys, ParallelIndexScanDesc pscan,
+						 int prefetch_target, int prefetch_reset)
 {
 	Snapshot	snapshot;
 	IndexScanDesc scan;
@@ -499,7 +534,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
 	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
 	RegisterSnapshot(snapshot);
 	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
-									pscan, true);
+									pscan, true, prefetch_target, prefetch_reset);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -557,6 +592,9 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 
 	pgstat_count_index_tuples(scan->indexRelation, 1);
 
+	/* do index prefetching, if needed */
+	index_prefetch(scan, direction);
+
 	/* Return the TID of the tuple we found. */
 	return &scan->xs_heaptid;
 }
@@ -988,3 +1026,228 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
 
 	return build_local_reloptions(&relopts, attoptions, validate);
 }
+
+
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items from). But index tuples are typically very small, so there
+ * should be quite a bit of stuff to prefetch (especially with deduplicated
+ * indexes, etc.). Does not seem worth reworking the index access to allow
+ * more aggressive prefetching, it's best effort.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of index pages visited
+ * and index tuples returned, to calculate avg tuples / page, and then
+ * use that to limit prefetching after switching to a new page (instead
+ * of just using prefetchMaxTarget, which can get much larger).
+ *
+ * XXX Obviously, another option is to use the planner estimates - we know
+ * how many rows we're expected to fetch (on average, assuming the estimates
+ * are reasonably accurate), so why not to use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+void
+index_prefetch(IndexScanDesc scan, ScanDirection dir)
+{
+	IndexPrefetch	prefetch = scan->xs_prefetch;
+
+	/*
+	 * No heap relation means bitmap index scan, which does prefetching at
+	 * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+	 * without the heap)
+	 *
+	 * XXX But in this case we should have prefetchMaxTarget=0, because in
+	 * index_bebinscan_bitmap() we disable prefetching. So maybe we should
+	 * just check that.
+	 */
+	if (!prefetch)
+		return;
+
+	/* was it initialized correctly? */
+	// Assert(prefetch->prefetchIndex != -1);
+
+	/*
+	 * If we got here, prefetching is enabled and it's a node that supports
+	 * prefetching (i.e. it can't be a bitmap index scan).
+	 */
+	Assert(scan->heapRelation);
+
+	/* gradually increase the prefetch distance */
+	prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+								   prefetch->prefetchMaxTarget);
+
+	/*
+	 * Did we already reach the point to actually start prefetching? If not,
+	 * we're done. We'll try again for the next index tuple.
+	 */
+	if (prefetch->prefetchTarget <= 0)
+		return;
+
+	/*
+	 * XXX I think we don't need to worry about direction here, that's handled
+	 * by how the AMs build the curPos etc. (see nbtsearch.c)
+	 */
+	if (ScanDirectionIsForward(dir))
+	{
+		bool		reset;
+		int			startIndex,
+					endIndex;
+
+		/* get indexes of unprocessed index entries */
+		prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);
+
+		/*
+		 * Did we switch to a different index block? if yes, reset relevant
+		 * info so that we start prefetching from scratch.
+		 */
+		if (reset)
+		{
+			prefetch->prefetchTarget = prefetch->prefetchReset;
+			prefetch->prefetchIndex = startIndex; /* maybe -1 instead? */
+			pgBufferUsage.blks_prefetch_rounds++;
+		}
+
+		/*
+		 * Adjust the range, based on what we already prefetched, and also
+		 * based on the prefetch target.
+		 *
+		 * XXX We need to adjust the end index first, because it depends on
+		 * the actual position, before we consider how far we prefetched.
+		 */
+		endIndex = Min(endIndex, startIndex + prefetch->prefetchTarget);
+		startIndex = Max(startIndex, prefetch->prefetchIndex + 1);
+
+		for (int i = startIndex; i <= endIndex; i++)
+		{
+			bool		recently_prefetched = false;
+			BlockNumber	block;
+
+			block = prefetch->get_block(scan, dir, i);
+
+			/*
+			 * Do not prefetch the same block over and over again,
+			 *
+			 * This happens e.g. for clustered or naturally correlated indexes
+			 * (fkey to a sequence ID). It's not expensive (the block is in page
+			 * cache already, so no I/O), but it's not free either.
+			 *
+			 * XXX We can't just check blocks between startIndex and endIndex,
+			 * because at some point (after the pefetch target gets ramped up)
+			 * it's going to be just a single block.
+			 *
+			 * XXX The solution here is pretty trivial - we just check the
+			 * immediately preceding block. We could check a longer history, or
+			 * maybe maintain some "already prefetched" struct (small LRU array
+			 * of last prefetched blocks - say 8 blocks or so - would work fine,
+			 * I think).
+			 */
+			for (int j = 0; j < 8; j++)
+			{
+				/* the cached block might be InvalidBlockNumber, but that's fine */
+				if (prefetch->cacheBlocks[j] == block)
+				{
+					recently_prefetched = true;
+					break;
+				}
+			}
+
+			if (recently_prefetched)
+				continue;
+
+			PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+			pgBufferUsage.blks_prefetches++;
+
+			prefetch->cacheBlocks[prefetch->cacheIndex] = block;
+			prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
+		}
+
+		prefetch->prefetchIndex = endIndex;
+	}
+	else
+	{
+		bool	reset;
+		int		startIndex,
+				endIndex;
+
+		/* get indexes of unprocessed index entries */
+		prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);
+
+		/* FIXME handle the reset flag */
+
+		/*
+		 * Adjust the range, based on what we already prefetched, and also
+		 * based on the prefetch target.
+		 *
+		 * XXX We need to adjust the start index first, because it depends on
+		 * the actual position, before we consider how far we prefetched (which
+		 * for backwards scans is (end index).
+		 */
+		startIndex = Max(startIndex, endIndex - prefetch->prefetchTarget);
+		endIndex = Min(endIndex, prefetch->prefetchIndex - 1);
+
+		for (int i = endIndex; i >= startIndex; i--)
+		{
+			bool		recently_prefetched = false;
+			BlockNumber	block;
+
+			block = prefetch->get_block(scan, dir, i);
+
+			/*
+			 * Do not prefetch the same block over and over again,
+			 *
+			 * This happens e.g. for clustered or naturally correlated indexes
+			 * (fkey to a sequence ID). It's not expensive (the block is in page
+			 * cache already, so no I/O), but it's not free either.
+			 *
+			 * XXX We can't just check blocks between startIndex and endIndex,
+			 * because at some point (after the pefetch target gets ramped up)
+			 * it's going to be just a single block.
+			 *
+			 * XXX The solution here is pretty trivial - we just check the
+			 * immediately preceding block. We could check a longer history, or
+			 * maybe maintain some "already prefetched" struct (small LRU array
+			 * of last prefetched blocks - say 8 blocks or so - would work fine,
+			 * I think).
+			 */
+			for (int j = 0; j < 8; j++)
+			{
+				/* the cached block might be InvalidBlockNumber, but that's fine */
+				if (prefetch->cacheBlocks[j] == block)
+				{
+					recently_prefetched = true;
+					break;
+				}
+			}
+
+			if (recently_prefetched)
+				continue;
+
+			PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+			pgBufferUsage.blks_prefetches++;
+
+			prefetch->cacheBlocks[prefetch->cacheIndex] = block;
+			prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
+		}
+
+		prefetch->prefetchIndex = startIndex;
+	}
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ce5b15199a..b1a02cc9bcd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -37,6 +37,7 @@
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
+#include "utils/spccache.h"
 
 
 /*
@@ -87,6 +88,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
 										  OffsetNumber updatedoffset,
 										  int *nremaining);
 
+static void _bt_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber _bt_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
 
 /*
  * Btree handler function: return IndexAmRoutine with access method parameters
@@ -341,7 +344,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
  *	btbeginscan() -- start a scan on a btree index
  */
 IndexScanDesc
-btbeginscan(Relation rel, int nkeys, int norderbys)
+btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
 {
 	IndexScanDesc scan;
 	BTScanOpaque so;
@@ -369,6 +372,31 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->killedItems = NULL;		/* until needed */
 	so->numKilled = 0;
 
+	/*
+	 * XXX maybe should happen in RelationGetIndexScan? But we need to define
+	 * the callacks, so that needs to happen here ...
+	 *
+	 * XXX Do we need to do something for so->markPos?
+	 */
+	if (prefetch_maximum > 0)
+	{
+		IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+		prefetcher->prefetchIndex = -1;
+		prefetcher->prefetchTarget = -3;
+		prefetcher->prefetchMaxTarget = prefetch_maximum;
+		prefetcher->prefetchReset = prefetch_reset;
+
+		prefetcher->cacheIndex = 0;
+		memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+		/* callbacks */
+		prefetcher->get_block = _bt_prefetch_getblock;
+		prefetcher->get_range = _bt_prefetch_getrange;
+
+		scan->xs_prefetch = prefetcher;
+	}
+
 	/*
 	 * We don't know yet whether the scan will be index-only, so we do not
 	 * allocate the tuple workspace arrays until btrescan.  However, we set up
@@ -1423,3 +1451,42 @@ btcanreturn(Relation index, int attno)
 {
 	return true;
 }
+
+static void
+_bt_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+	BTScanOpaque	so = (BTScanOpaque) scan->opaque;
+
+	/* did we rebuild the array of tuple pointers? */
+	*reset = so->currPos.didReset;
+	so->currPos.didReset = false;
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* Did we already process the item or is it invalid? */
+		*start = so->currPos.itemIndex;
+		*end = so->currPos.lastItem;
+	}
+	else
+	{
+		*start = so->currPos.firstItem;
+		*end = so->currPos.itemIndex;
+	}
+}
+
+static BlockNumber
+_bt_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+	BTScanOpaque	so = (BTScanOpaque) scan->opaque;
+	ItemPointer		tid;
+
+	if ((index < so->currPos.firstItem) || (index > so->currPos.lastItem))
+		return InvalidBlockNumber;
+
+	/* get the tuple ID and extract the block number */
+	tid = &so->currPos.items[index].heapTid;
+
+	Assert(ItemPointerIsValid(tid));
+
+	return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 263f75fce95..762d95d09ed 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -47,7 +47,6 @@ static Buffer _bt_walk_left(Relation rel, Relation heaprel, Buffer buf,
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
-
 /*
  *	_bt_drop_lock_and_maybe_pin()
  *
@@ -1385,7 +1384,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 */
 		_bt_parallel_done(scan);
 		BTScanPosInvalidate(so->currPos);
-
 		return false;
 	}
 	else
@@ -1538,6 +1536,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	 */
 	Assert(BufferIsValid(so->currPos.buf));
 
+	/*
+	 * Mark the currPos as reset before loading the next chunk of pointers, to
+	 * restart the preretching.
+	 */
+	so->currPos.didReset = true;
+
 	page = BufferGetPage(so->currPos.buf);
 	opaque = BTPageGetOpaque(page);
 
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index cbfaf0c00ac..79015194b73 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -16,6 +16,7 @@
 #include "postgres.h"
 
 #include "access/genam.h"
+#include "access/relation.h"
 #include "access/relscan.h"
 #include "access/spgist_private.h"
 #include "miscadmin.h"
@@ -32,6 +33,10 @@ typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
 							   SpGistLeafTuple leafTuple, bool recheck,
 							   bool recheckDistances, double *distances);
 
+static void spgist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber spgist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
+
+
 /*
  * Pairing heap comparison function for the SpGistSearchItem queue.
  * KNN-searches currently only support NULLS LAST.  So, preserve this logic
@@ -191,6 +196,7 @@ resetSpGistScanOpaque(SpGistScanOpaque so)
 			pfree(so->reconTups[i]);
 	}
 	so->iPtr = so->nPtrs = 0;
+	so->didReset = true;
 }
 
 /*
@@ -301,7 +307,7 @@ spgPrepareScanKeys(IndexScanDesc scan)
 }
 
 IndexScanDesc
-spgbeginscan(Relation rel, int keysz, int orderbysz)
+spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int prefetch_reset)
 {
 	IndexScanDesc scan;
 	SpGistScanOpaque so;
@@ -316,6 +322,8 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
 		so->keyData = NULL;
 	initSpGistState(&so->state, scan->indexRelation);
 
+	so->state.heap = relation_open(scan->indexRelation->rd_index->indrelid, NoLock);
+
 	so->tempCxt = AllocSetContextCreate(CurrentMemoryContext,
 										"SP-GiST search temporary context",
 										ALLOCSET_DEFAULT_SIZES);
@@ -371,6 +379,31 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
 
 	so->indexCollation = rel->rd_indcollation[0];
 
+	/*
+	 * XXX maybe should happen in RelationGetIndexScan? But we need to define
+	 * the callacks, so that needs to happen here ...
+	 *
+	 * XXX Do we need to do something for so->markPos?
+	 */
+	if (prefetch_maximum > 0)
+	{
+		IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+		prefetcher->prefetchIndex = -1;
+		prefetcher->prefetchTarget = -3;
+		prefetcher->prefetchMaxTarget = prefetch_maximum;
+		prefetcher->prefetchReset = prefetch_reset;
+
+		prefetcher->cacheIndex = 0;
+		memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+		/* callbacks */
+		prefetcher->get_block = spgist_prefetch_getblock;
+		prefetcher->get_range = spgist_prefetch_getrange;
+
+		scan->xs_prefetch = prefetcher;
+	}
+
 	scan->opaque = so;
 
 	return scan;
@@ -453,6 +486,8 @@ spgendscan(IndexScanDesc scan)
 		pfree(scan->xs_orderbynulls);
 	}
 
+	relation_close(so->state.heap, NoLock);
+
 	pfree(so);
 }
 
@@ -584,6 +619,13 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
 														isnull,
 														distances);
 
+			// FIXME prefetch here? or in storeGettuple?
+			{
+				BlockNumber block = ItemPointerGetBlockNumber(&leafTuple->heapPtr);
+
+				PrefetchBuffer(so->state.heap, MAIN_FORKNUM, block);
+			}
+
 			spgAddSearchItemToQueue(so, heapItem);
 
 			MemoryContextSwitchTo(oldCxt);
@@ -1047,7 +1089,12 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
 				index_store_float8_orderby_distances(scan, so->orderByTypes,
 													 so->distances[so->iPtr],
 													 so->recheckDistances[so->iPtr]);
+
 			so->iPtr++;
+
+			/* prefetch additional tuples */
+			index_prefetch(scan, dir);
+
 			return true;
 		}
 
@@ -1070,6 +1117,7 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
 				pfree(so->reconTups[i]);
 		}
 		so->iPtr = so->nPtrs = 0;
+		so->didReset = true;
 
 		spgWalk(scan->indexRelation, so, false, storeGettuple,
 				scan->xs_snapshot);
@@ -1095,3 +1143,42 @@ spgcanreturn(Relation index, int attno)
 
 	return cache->config.canReturnData;
 }
+
+static void
+spgist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+	SpGistScanOpaque	so = (SpGistScanOpaque) scan->opaque;
+
+	/* did we rebuild the array of tuple pointers? */
+	*reset = so->didReset;
+	so->didReset = false;
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* Did we already process the item or is it invalid? */
+		*start = so->iPtr;
+		*end = (so->nPtrs - 1);
+	}
+	else
+	{
+		*start = 0;
+		*end = so->iPtr;
+	}
+}
+
+static BlockNumber
+spgist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+	SpGistScanOpaque	so = (SpGistScanOpaque) scan->opaque;
+	ItemPointer		tid;
+
+	if ((index < so->iPtr) || (index >= so->nPtrs))
+		return InvalidBlockNumber;
+
+	/* get the tuple ID and extract the block number */
+	tid = &so->heapPtrs[index];
+
+	Assert(ItemPointerIsValid(tid));
+
+	return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 190e4f76a9e..4aac68f0766 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -17,6 +17,7 @@
 
 #include "access/amvalidate.h"
 #include "access/htup_details.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
 #include "access/toast_compression.h"
@@ -334,6 +335,9 @@ initSpGistState(SpGistState *state, Relation index)
 
 	state->index = index;
 
+	/* we'll initialize the reference in spgbeginscan */
+	state->heap = NULL;
+
 	/* Get cached static information about index */
 	cache = spgGetCache(index);
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 15f9bddcdf3..0e41ffa8fc0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3558,6 +3558,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 								  !INSTR_TIME_IS_ZERO(usage->blk_write_time));
 		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
 									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+		bool		has_prefetches = (usage->blks_prefetches > 0);
 		bool		show_planning = (planning && (has_shared ||
 												  has_local || has_temp || has_timing ||
 												  has_temp_timing));
@@ -3655,6 +3656,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			appendStringInfoChar(es->str, '\n');
 		}
 
+		/* As above, show only positive counter values. */
+		if (has_prefetches)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Prefetches:");
+
+			if (usage->blks_prefetches > 0)
+				appendStringInfo(es->str, " blocks=%lld",
+								 (long long) usage->blks_prefetches);
+
+			if (usage->blks_prefetch_rounds > 0)
+				appendStringInfo(es->str, " rounds=%lld",
+								 (long long) usage->blks_prefetch_rounds);
+
+			appendStringInfoChar(es->str, '\n');
+		}
+
 		if (show_planning)
 			es->indent--;
 	}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b897..e5ce1dbc953 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 	/*
 	 * May have to restart scan from this point if a potential conflict is
 	 * found.
+	 *
+	 * XXX Should this do index prefetch? Probably not worth it for unique
+	 * constraints, I guess? Otherwise we should calculate prefetch_target
+	 * just like in nodeIndexscan etc.
 	 */
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
 	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 9dd71684615..a997aac828f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -157,8 +157,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	/* Build scan key. */
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
-	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+	/* Start an index scan.
+	 *
+	 * XXX Should this do index prefetching? We're looking for a single tuple,
+	 * probably using a PK / UNIQUE index, so does not seem worth it. If we
+	 * reconsider this, calclate prefetch_target like in nodeIndexscan.
+	 */
+	scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
 
 retry:
 	found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d2..434be59fca0 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->local_blks_written += add->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+	dst->blks_prefetches += add->blks_prefetches;
 	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
 	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
 	INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+	dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
 	INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
 						  add->blk_read_time, sub->blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0b43a9b9699..3ecb8470d47 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * We reach here if the index only scan is not parallel, or if we're
 		 * serially executing an index only scan that was planned to be
 		 * parallel.
+		 *
+		 * XXX Maybe we should enable prefetching, but prefetch only pages that
+		 * are not all-visible (but checking that from the index code seems like
+		 * a violation of layering etc).
+		 *
+		 * XXX This might lead to IOS being slower than plain index scan, if the
+		 * table has a lot of pages that need recheck.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
 								   node->ioss_RelationDesc,
 								   estate->es_snapshot,
 								   node->ioss_NumScanKeys,
-								   node->ioss_NumOrderByKeys);
+								   node->ioss_NumOrderByKeys,
+								   0, 0);	/* no index prefetch for IOS */
 
 		node->ioss_ScanDesc = scandesc;
 
@@ -674,7 +682,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 								 node->ioss_RelationDesc,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 0, 0);	/* no index prefetch for IOS */
 	node->ioss_ScanDesc->xs_want_itup = true;
 	node->ioss_VMBuffer = InvalidBuffer;
 
@@ -719,7 +728,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 								 node->ioss_RelationDesc,
 								 node->ioss_NumScanKeys,
 								 node->ioss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 0, 0);	/* no index prefetch for IOS */
 	node->ioss_ScanDesc->xs_want_itup = true;
 
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d2..71ae6a47ce5 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/spccache.h"
 
 /*
  * When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
+	Relation heapRel = node->ss.ss_currentRelation;
 
 	/*
 	 * extract necessary information from index scan node
@@ -103,6 +105,22 @@ IndexNext(IndexScanState *node)
 
 	if (scandesc == NULL)
 	{
+		int	prefetch_target;
+		int	prefetch_reset;
+
+		/*
+		 * Determine number of heap pages to prefetch for this index. This is
+		 * essentially just effective_io_concurrency for the table (or the
+		 * tablespace it's in).
+		 *
+		 * XXX Should this also look at plan.plan_rows and maybe cap the target
+		 * to that? Pointless to prefetch more than we expect to use. Or maybe
+		 * just reset to that value during prefetching, after reading the next
+		 * index page (or rather after rescan)?
+		 */
+		prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+		prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
@@ -111,7 +129,9 @@ IndexNext(IndexScanState *node)
 								   node->iss_RelationDesc,
 								   estate->es_snapshot,
 								   node->iss_NumScanKeys,
-								   node->iss_NumOrderByKeys);
+								   node->iss_NumOrderByKeys,
+								   prefetch_target,
+								   prefetch_reset);
 
 		node->iss_ScanDesc = scandesc;
 
@@ -198,6 +218,23 @@ IndexNextWithReorder(IndexScanState *node)
 
 	if (scandesc == NULL)
 	{
+		Relation heapRel = node->ss.ss_currentRelation;
+		int	prefetch_target;
+		int	prefetch_reset;
+
+		/*
+		 * Determine number of heap pages to prefetch for this index. This is
+		 * essentially just effective_io_concurrency for the table (or the
+		 * tablespace it's in).
+		 *
+		 * XXX Should this also look at plan.plan_rows and maybe cap the target
+		 * to that? Pointless to prefetch more than we expect to use. Or maybe
+		 * just reset to that value during prefetching, after reading the next
+		 * index page (or rather after rescan)?
+		 */
+		prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+		prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
@@ -206,7 +243,9 @@ IndexNextWithReorder(IndexScanState *node)
 								   node->iss_RelationDesc,
 								   estate->es_snapshot,
 								   node->iss_NumScanKeys,
-								   node->iss_NumOrderByKeys);
+								   node->iss_NumOrderByKeys,
+								   prefetch_target,
+								   prefetch_reset);
 
 		node->iss_ScanDesc = scandesc;
 
@@ -1678,6 +1717,21 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 {
 	EState	   *estate = node->ss.ps.state;
 	ParallelIndexScanDesc piscan;
+	Relation	heapRel;
+	int			prefetch_target;
+	int			prefetch_reset;
+
+	/*
+	 * Determine number of heap pages to prefetch for this index. This is
+	 * essentially just effective_io_concurrency for the table (or the
+	 * tablespace it's in).
+	 *
+	 * XXX Maybe reduce the value with parallel workers?
+	 */
+	heapRel = node->ss.ss_currentRelation;
+
+	prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+	prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
 
 	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
 	index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1690,7 +1744,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 								 node->iss_RelationDesc,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 prefetch_target,
+								 prefetch_reset);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1726,6 +1782,14 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 							  ParallelWorkerContext *pwcxt)
 {
 	ParallelIndexScanDesc piscan;
+	Relation	heapRel;
+	int			prefetch_target;
+	int			prefetch_reset;
+
+	heapRel = node->ss.ss_currentRelation;
+
+	prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+	prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
 
 	piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->iss_ScanDesc =
@@ -1733,7 +1797,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 								 node->iss_RelationDesc,
 								 node->iss_NumScanKeys,
 								 node->iss_NumOrderByKeys,
-								 piscan);
+								 piscan,
+								 prefetch_target,
+								 prefetch_reset);
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076ea..0b02b6265d0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6218,7 +6218,7 @@ get_actual_variable_endpoint(Relation heapRel,
 
 	index_scan = index_beginscan(heapRel, indexRel,
 								 &SnapshotNonVacuumable,
-								 1, 0);
+								 1, 0, 0, 0);	/* XXX maybe do prefetch? */
 	/* Set it up for index-only scan */
 	index_scan->xs_want_itup = true;
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 4476ff7fba1..80fec7a11f9 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -160,7 +160,9 @@ typedef void (*amadjustmembers_function) (Oid opfamilyoid,
 /* prepare for index scan */
 typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
 											   int nkeys,
-											   int norderbys);
+											   int norderbys,
+											   int prefetch_maximum,
+											   int prefetch_reset);
 
 /* (re)start index scan */
 typedef void (*amrescan_function) (IndexScanDesc scan,
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 97ddc925b27..f17dcdffd86 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -96,7 +96,7 @@ extern bool brininsert(Relation idxRel, Datum *values, bool *nulls,
 					   IndexUniqueCheck checkUnique,
 					   bool indexUnchanged,
 					   struct IndexInfo *indexInfo);
-extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
 extern int64 bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void brinrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a3087956654..6a500c5aa1f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -152,7 +152,9 @@ extern bool index_insert(Relation indexRelation,
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
 									 Snapshot snapshot,
-									 int nkeys, int norderbys);
+									 int nkeys, int norderbys,
+									 int prefetch_target,
+									 int prefetch_reset);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 											Snapshot snapshot,
 											int nkeys);
@@ -169,7 +171,9 @@ extern void index_parallelscan_initialize(Relation heapRelation,
 extern void index_parallelrescan(IndexScanDesc scan);
 extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel, int nkeys, int norderbys,
-											  ParallelIndexScanDesc pscan);
+											  ParallelIndexScanDesc pscan,
+											  int prefetch_target,
+											  int prefetch_reset);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
 struct TupleTableSlot;
@@ -230,4 +234,45 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
 										  ScanDirection direction);
 extern void systable_endscan_ordered(SysScanDesc sysscan);
 
+
+
+void index_prefetch(IndexScanDesc scandesc, ScanDirection direction);
+
+/*
+ * XXX not sure it's the right place to define these callbacks etc.
+ */
+typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
+											  ScanDirection direction,
+											  int *start, int *end,
+											  bool *reset);
+
+typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
+													 ScanDirection direction,
+													 int index);
+
+typedef struct IndexPrefetchData
+{
+	/*
+	 * XXX We need to disable this in some cases (e.g. when using index-only
+	 * scans, we don't want to prefetch pages). Or maybe we should prefetch
+	 * only pages that are not all-visible, that'd be even better.
+	 */
+	int			prefetchIndex;	/* how far we already prefetched */
+	int			prefetchTarget;	/* how far we should be prefetching */
+	int			prefetchMaxTarget;	/* maximum prefetching distance */
+	int			prefetchReset;	/* reset to this distance on rescan */
+
+	/*
+	 * a small LRU cache of recently prefetched blocks
+	 *
+	 * XXX needs to be tiny, to make the (frequent) searches very cheap
+	 */
+	BlockNumber	cacheBlocks[8];
+	int			cacheIndex;
+
+	prefetcher_getblock_function	get_block;
+	prefetcher_getrange_function	get_range;
+
+} IndexPrefetchData;
+
 #endif							/* GENAM_H */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 6da64928b66..b4bd3b2e202 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -384,7 +384,7 @@ typedef struct GinScanOpaqueData
 
 typedef GinScanOpaqueData *GinScanOpaque;
 
-extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
 extern void ginendscan(IndexScanDesc scan);
 extern void ginrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					  ScanKey orderbys, int norderbys);
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 3edc740a3f3..e844a9eed84 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -176,6 +176,7 @@ typedef struct GISTScanOpaqueData
 	OffsetNumber curPageData;	/* next item to return */
 	MemoryContext pageDataCxt;	/* context holding the fetched tuples, for
 								 * index-only scans */
+	bool	didReset;			/* reset since last access? */
 } GISTScanOpaqueData;
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
diff --git a/src/include/access/gistscan.h b/src/include/access/gistscan.h
index 65911245f74..adf167a60b6 100644
--- a/src/include/access/gistscan.h
+++ b/src/include/access/gistscan.h
@@ -16,7 +16,7 @@
 
 #include "access/amapi.h"
 
-extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
 extern void gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
 					   ScanKey orderbys, int norderbys);
 extern void gistendscan(IndexScanDesc scan);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9e035270a16..743192997c5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -124,6 +124,8 @@ typedef struct HashScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
+	bool		didReset;
+
 	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
 } HashScanPosData;
 
@@ -370,7 +372,7 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
 					   struct IndexInfo *indexInfo);
 extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
-extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
 extern void hashendscan(IndexScanDesc scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d6847860959..8d053de461b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -984,6 +984,9 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
+	/* Did the position reset/rebuilt since the last time we checked it? */
+	bool		didReset;
+
 	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
 } BTScanPosData;
 
@@ -1019,6 +1022,7 @@ typedef BTScanPosData *BTScanPos;
 		(scanpos).buf = InvalidBuffer; \
 		(scanpos).lsn = InvalidXLogRecPtr; \
 		(scanpos).nextTupleOffset = 0; \
+		(scanpos).didReset = true; \
 	} while (0)
 
 /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
@@ -1127,7 +1131,7 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 					 IndexUniqueCheck checkUnique,
 					 bool indexUnchanged,
 					 struct IndexInfo *indexInfo);
-extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
 extern Size btestimateparallelscan(void);
 extern void btinitparallelscan(void *target);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac04..c119fe597d8 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -106,6 +106,12 @@ typedef struct IndexFetchTableData
 	Relation	rel;
 } IndexFetchTableData;
 
+/*
+ * Forward declaration, defined in genam.h.
+ */
+typedef struct IndexPrefetchData IndexPrefetchData;
+typedef struct IndexPrefetchData *IndexPrefetch;
+
 /*
  * We use the same IndexScanDescData structure for both amgettuple-based
  * and amgetbitmap-based index scans.  Some fields are only relevant in
@@ -162,6 +168,9 @@ typedef struct IndexScanDescData
 	bool	   *xs_orderbynulls;
 	bool		xs_recheckorderby;
 
+	/* prefetching state (or NULL if disabled) */
+	IndexPrefetchData *xs_prefetch;
+
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
 }			IndexScanDescData;
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index fe31d32dbe9..e1e2635597c 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -203,7 +203,7 @@ extern bool spginsert(Relation index, Datum *values, bool *isnull,
 					  struct IndexInfo *indexInfo);
 
 /* spgscan.c */
-extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz);
+extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int prefetch_reset);
 extern void spgendscan(IndexScanDesc scan);
 extern void spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					  ScanKey orderbys, int norderbys);
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index c6ef46fc206..e00d4fc90b6 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -144,7 +144,7 @@ typedef struct SpGistTypeDesc
 typedef struct SpGistState
 {
 	Relation	index;			/* index we're working with */
-
+	Relation	heap;			/* heap the index is defined on */
 	spgConfigOut config;		/* filled in by opclass config method */
 
 	SpGistTypeDesc attType;		/* type of values to be indexed/restored */
@@ -231,6 +231,7 @@ typedef struct SpGistScanOpaqueData
 	bool		recheckDistances[MaxIndexTuplesPerPage];	/* distance recheck
 															 * flags */
 	HeapTuple	reconTups[MaxIndexTuplesPerPage];	/* reconstructed tuples */
+	bool		didReset;		/* */
 
 	/* distances (for recheck) */
 	IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183bd..97dd3c2c421 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
+	int64		blks_prefetch_rounds;	/* # of prefetch rounds */
+	int64		blks_prefetches;	/* # of buffers prefetched */
 	instr_time	blk_read_time;	/* time spent reading blocks */
 	instr_time	blk_write_time; /* time spent writing blocks */
 	instr_time	temp_blk_read_time; /* time spent reading temp blocks */


  [application/pdf] prefetching-bench.pdf (426.9K, 3-prefetching-bench.pdf)
  download

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: index prefetching
@ 2023-06-08 18:56  Peter Geoghegan <[email protected]>
  parent: Tomas Vondra <[email protected]>
  2 siblings, 0 replies; 4+ messages in thread

From: Peter Geoghegan @ 2023-06-08 18:56 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
<[email protected]> wrote:
> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans, but
> I suspect the reasoning was that it only helps when there's many
> matching tuples, and that's what bitmap index scans are for. So it was
> not worth the implementation effort.

I have an educated guess as to why prefetching was limited to bitmap
index scans this whole time: it might have been due to issues with
ScalarArrayOpExpr quals.

Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
"natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
were supported by both index scans and index-only scans -- not just
bitmap scans, which could handle ScalarArrayOpExpr quals even without
nbtree directly understanding them. The commit was in late 2011,
shortly after the introduction of index-only scans -- which seems to
have been the real motivation. And so it seems to me that support for
ScalarArrayOpExpr was built with bitmap scans and index-only scans in
mind. Plain index scan ScalarArrayOpExpr quals do work, but support
for them seems kinda perfunctory to me (maybe you can think of a
specific counter-example where plain index scans really benefit from
ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
original motivation).

ScalarArrayOpExpr for plain index scans don't really make that much
sense right now because there is no heap prefetching in the index scan
case, which is almost certainly going to be the major bottleneck
there. At the same time, adding useful prefetching for
ScalarArrayOpExpr execution more or less requires that you first
improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
in mind that ScalarArrayOpExpr execution (whether for bitmap index
scans or index scans) is related to skip scan/MDAM techniques -- so
there are tricky dependencies that need to be considered together.

Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
descend the B-Tree for each array constant -- even though in principle
we could avoid all that work in cases that happen to have locality. In
other words we'll often descend the tree multiple times and land on
exactly the same leaf page again and again, without ever noticing that
we could have gotten away with only descending the tree once (it'd
also be possible to start the next "descent" one level up, not at the
root, intelligently reusing some of the work from an initial descent
-- but you don't need anything so fancy to greatly improve matters
here).

This lack of smarts around how many times we call _bt_first() to
descend the index is merely a silly annoyance when it happens in
btgetbitmap(). We do at least sort and deduplicate the array up-front
(inside _bt_sort_array_elements()), so there will be significant
locality of access each time we needlessly descend the tree.
Importantly, there is no prefetching "pipeline" to mess up in the
bitmap index scan case -- since that all happens later on. Not so for
the superficially similar (though actually rather different) plain
index scan case -- at least not once you add prefetching. If you're
uselessly processing the same leaf page multiple times, then there is
no way that heap prefetching can notice that it should be batching
things up. The context that would allow prefetching to work well isn't
really available right now. So the plain index scan case is kinda at a
gratuitous disadvantage (with prefetching) relative to the bitmap
index scan case.

Queries with (say) quals with many constants appearing in an "IN()"
are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a
committable patch. But you should definitely think about it now. I'm
strongly considering working on this problem for 17 anyway, so we may
end up collaborating on these aspects of prefetching. Smarter
ScalarArrayOpExpr execution for index scans is likely to be quite
compelling if it enables heap prefetching.

> But there's three shortcomings in logic:
>
> 1) It's not clear the thresholds for prefetching being beneficial and
> switching to bitmap index scans are the same value. And as I'll
> demonstrate later, the prefetching threshold is indeed much lower
> (perhaps a couple dozen matching tuples) on large tables.

As I mentioned during the pgCon unconference session, I really like
your framing of the problem; it makes a lot of sense to directly
compare an index scan's execution against a very similar bitmap index
scan execution -- there is an imaginary continuum between index scan
and bitmap index scan. If the details of when and how we scan the
index are rather similar in each case, then there is really no reason
why the performance shouldn't be fairly similar. I suspect that it
will be useful to ask the same question for various specific cases,
that you might not have thought about just yet. Things like
ScalarArrayOpExpr queries, where bitmap index scans might look like
they have a natural advantage due to an inherent need for random heap
access in the plain index scan case.

It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.

That's what I was mostly trying to get at when I talked about testing
strategy at the unconference session (this may have been unclear at
the time). It could be done in a way that helps you to think about the
problem from first principles. It could be really useful as a way of
avoiding confusing cases where plain index scan + prefetch does badly
due to implementation restrictions, versus cases where it's
*inherently* the wrong strategy. And a testing strategy that starts
with very basic ideas about what I/O is truly necessary might help you
to notice and fix regressions. The difference will never be perfectly
crisp, of course (isn't bitmap index scan basically just index scan
with a really huge prefetch buffer anyway?), but it still seems like a
useful direction to go in.

> Implementation
> --------------
>
> When I started looking at this, I only really thought about btree. If
> you look at BTScanPosData, which is what the index scans use to
> represent the current leaf page, you'll notice it has "items", which is
> the array of item pointers (TIDs) that we'll fetch from the heap. Which
> is exactly the thing we need.

> So I ended up moving most of the prefetching logic up into indexam.c,
> see the index_prefetch() function. It can't be entirely separate,
> because each AM represents the current state in a different way (e.g.
> SpGistScanOpaque and BTScanOpaque are very different).

Maybe you were right to do that, but I'm not entirely sure.

Bear in mind that the ScalarArrayOpExpr case already looks like a
single index scan whose qual involves an array to the executor, even
though nbtree more or less implements it as multiple index scans with
plain constant quals (one per unique-ified array element). Index scans
whose results can be "OR'd together". Is that a modularity violation?
And if so, why? As I've pointed out earlier in this email, we don't do
very much with that context right now -- but clearly we should.

In other words, maybe you're right to suspect that doing this in AMs
like nbtree is a modularity violation. OTOH, maybe it'll turn out that
that's exactly the right place to do it, because that's the only way
to make the full context available in one place. I myself struggled
with this when I reviewed the skip scan patch. I was sure that Tom
wouldn't like the way that the skip-scan patch doubles-down on adding
more intelligence/planning around how to execute queries with
skippable leading columns. But, it turned out that he saw the merit in
it, and basically accepted that general approach. Maybe this will turn
out to be a little like that situation, where (counter to intuition)
what you really need to do is add a new "layering violation".
Sometimes that's the only thing that'll allow the information to flow
to the right place. It's tricky.

> 4) per-leaf prefetching
>
> The code is restricted only prefetches items from one leaf page. If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page first before reading / prefetching the next one.
>
> I think this is acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using pairing heap), so let's leave this for the future.

I tend to agree that this sort of thing doesn't need to happen in the
first committed version.  But FWIW nbtree could be taught to scan
multiple index pages and act as if it had just processed them as one
single index page -- up to a point. This is at least possible with
plain index scans that use MVCC snapshots (though not index-only
scans), since we already drop the pin on the leaf page there anyway.
AFAICT stops us from teaching nbtree to "lie" to the executor and tell
it that we processed 1 leaf page, even though it was actually 5 leaf pages
(maybe there would also have to be restrictions for the markpos stuff).

> the results look a bit different:
>
>     rows      bitmapscan     master     patched     seqscan
>     1            52703.9       19.5        19.5     31145.6
>     10           51208.1       22.7        24.7     30983.5
>     100          49038.6       39.0        26.3     32085.3
>     1000         53760.4      193.9        48.4     31479.4
>     10000        56898.4     1600.7       187.5     32064.5
>     100000       50975.2    15978.7      1848.9     31587.1
>
> This is a good illustration of a query where bitmapscan is terrible
> (much worse than seqscan, in fact), and the patch is a massive
> improvement over master (about an order of magnitude).
>
> Of course, if you only scan a couple rows, the benefits are much more
> modest (say 40% for 100 rows, which is still significant).

Nice! And, it'll be nice to be able to use the kill_prior_tuple
optimization in many more cases (possible by teaching the optimizer to
favor index scans over bitmap index scans more often).

--
Peter Geoghegan

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: index prefetching
@ 2023-06-09 21:19  Gregory Smith <[email protected]>
  parent: Tomas Vondra <[email protected]>
  2 siblings, 0 replies; 4+ messages in thread

From: Gregory Smith @ 2023-06-09 21:19 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

On Thu, Jun 8, 2023 at 11:40 AM Tomas Vondra <[email protected]>
wrote:

> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans

At the point Greg Stark was hacking on this, the underlying OS async I/O
features were tricky to fix into PG's I/O model, and both of us did much
review work just to find working common ground that PG could plug into.
Linux POSIX advisories were completely different from Solaris's async
model, the other OS used for validation that the feature worked, with the
hope being that designing against two APIs would be better than just
focusing on Linux.  Since that foundation was all so brittle and limited,
scope was limited to just the heap scan, since it seemed to have the best
return on time invested given the parts of async I/O that did and didn't
scale as expected.

As I remember it, the idea was to get the basic feature out the door and
gather feedback about things like whether the effective_io_concurrency knob
worked as expected before moving onto other prefetching.  Then that got
lost in filesystem upheaval land, with so much drama around Solaris/ZFS and
Oracle's btrfs work.  I think it's just that no one ever got back to it.

I have all the workloads that I use for testing automated into
pgbench-tools now, and this change would be easy to fit into testing on
them as I'm very heavy on block I/O tests.  To get PG to reach full read
speed on newer storage I've had to do some strange tests, like doing index
range scans that touch 25+ pages.  Here's that one as a pgbench script:

\set range 67 * (:multiplier + 1)
\set limit 100000 * :scale
\set limit :limit - :range
\set aid random(1, :limit)
SELECT aid,abalance FROM pgbench_accounts WHERE aid >= :aid ORDER BY aid
LIMIT :range;

And then you use '-Dmultiplier=10' or such to crank it up.  Database 4X
RAM, multiplier=25 with 16 clients is my starting point on it when I want
to saturate storage.  Anything that lets me bring those numbers down would
be valuable.

--
Greg Smith  [email protected]
Director of Open Source Strategy

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: index prefetching
@ 2023-06-12 21:27  Tomasz Rybak <[email protected]>
  parent: Tomas Vondra <[email protected]>
  2 siblings, 0 replies; 4+ messages in thread

From: Tomasz Rybak @ 2023-06-12 21:27 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; +Cc: Georgios <[email protected]>

On Thu, 2023-06-08 at 17:40 +0200, Tomas Vondra wrote:
> Hi,
> 
> At pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits etc. The feedback was quite positive, so let me
> share the current patch more widely.
> 

I added entry to 
https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference
based on notes I took during that session.
Hope it helps.

-- 
Tomasz Rybak, Debian Developer <[email protected]>
GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

end of thread, other threads:[~2023-06-12 21:27 UTC | newest]

Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2023-06-08 15:40 index prefetching Tomas Vondra <[email protected]>
2023-06-08 18:56 ` Peter Geoghegan <[email protected]>
2023-06-09 21:19 ` Gregory Smith <[email protected]>
2023-06-12 21:27 ` Tomasz Rybak <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox