Re: New access method for b-tree.

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Alexandre Felipe <[email protected]>
To: Michał Kłeczek <[email protected]>
Cc: Ants Aasma <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Alexandre Felipe <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: New access method for b-tree.
Date: Thu, 5 Feb 2026 06:59:25 +0000
Message-ID: <CAE8JnxNM9Bh=LCGOzayewDgX3-kUNXdTDwNSDwsf+t=wKhPiCQ@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
References: <AS1PR02MB784695AFEC37179FFAF7EAE19A9DA@AS1PR02MB7846.eurprd02.prod.outlook.com>
	<[email protected]>
	<CANwKhkNd85u+4joaKR3YHoDOQSMg5SmJmsYJGo-tMyW=XVXTew@mail.gmail.com>
	<[email protected]>

Thank you for looking into this.

Now we can execute a, still narrow, family queries!

Maybe it helps to see this as a *social network feeds*. Imagine a social
network, you have a few friends, or follow a few people, and you want to
see their updates ordered by date. For each user we have a different
combination of users that we have to display. But maybe, even having
hundreds of users we will only show the first 10.

There is a low hanging fruit on the skip scan, if we need N rows, and one
group already has M rows we could stop there.
If Nx is the number of friends, and M is the number of posts to show.
This runs with complexity (Nx * M) rows, followed by an (Nx * M) sort,
instead of (Nx * N) followed by an (Nx * N) sort.
Where M = 10 and N is 1000 this is a significant improvement.
But if M ~ N, the merge scan that runs with M + Nx row accesses, (M + Nx)
heap operations.
If everything is on the same page the skip scan would win.

The cost estimation is probably far off.
I am also not considering the filters applied after this operator, and I
don't know if the planner infrastructure is able to adjust it by itself.
This is where I would like reviewer's feedback. I think that the planner
costs are something to be determined experimentally.

Next I will make it slightly more general handling
* More index columns: Index (a, b, s...) could support WHERE a IN (...)
ORDER BY b LIMIT N (ignoring s...)
* Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
* Non-leading prefix: WHERE b IN (...) AND a = const ORDER BY c on index
(a, b, c)

---
Kind Regards,
Alexandre

On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <[email protected]> wrote:

>
>
> On 3 Feb 2026, at 22:42, Ants Aasma <[email protected]> wrote:
>
> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <[email protected]> wrote:
>
> I'm also wondering how common is the targeted query pattern? How common
> it is to have an IN condition on the leading column in an index, and
> ORDER BY on the second one?
>
>
> I have seen this pattern multiple times. My nickname for it is the
> timeline view. Think of the social media timeline, showing posts from
> all followed accounts in timestamp order, returned in reasonably sized
> batches. The naive SQL query will have to scan all posts from all
> followed accounts and pass them through a top-N sort. When the total
> number of posts is much larger than the batch size this is much slower
> than what is proposed here (assuming I understand it correctly) -
> effectively equivalent to running N index scans through Merge Append.
>
>
> My workarounds I have proposed users have been either to rewrite the
> query as a UNION ALL of a set of single value prefix queries wrapped
> in an order by limit. This gives the exact needed merge append plan
> shape. But repeating the query N times can get unwieldy when the
> number of values grows, so the fallback is:
>
> SELECT * FROM unnest(:friends) id, LATERAL (
>    SELECT * FROM posts
>    WHERE user_id = id
>    ORDER BY tstamp DESC LIMIT 100)
> ORDER BY tstamp DESC LIMIT 100;
>
> The downside of this formulation is that we still have to fetch a
> batch worth of items from scans where we otherwise would have only had
> to look at one index tuple.
>
>
> GIST can be used to handle this kind of queries as it supports multiple
> sort orders.
> The only problem is that GIST does not support ORDER BY column.
> One possible workaround is [1] but as described there it does not play
> well with partitioning.
> I’ve started drafting support for ORDER BY column in GIST - see [2].
> I think it would be easier to implement and maintain than a new IAM (but I
> don’t have enough knowledge and experience to implement it myself)
>
> [1]
> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org
> [2]
> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org
>
> —
> Kind regards,
> Michal
>


Attachments:

  [application/octet-stream] 0002-MERGE-SCAN-Access-method.patch (49.1K, 3-0002-MERGE-SCAN-Access-method.patch)
  download | inline diff:
From d86b371499db011a36583d20963df68b09219190 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 30 Jan 2026 14:27:18 +0000
Subject: [PATCH 2/3] [MERGE-SCAN]: Access method

---
 .gitignore                                    |   8 +
 src/backend/access/nbtree/Makefile            |   1 +
 src/backend/access/nbtree/meson.build         |   1 +
 src/backend/access/nbtree/nbtmergescan.c      | 457 ++++++++++++++++++
 src/include/access/nbtree.h                   |  64 +++
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_btree_merge/Makefile    |  24 +
 .../expected/test_btree_merge.out             | 243 ++++++++++
 src/test/modules/test_btree_merge/meson.build |  33 ++
 .../test_btree_merge/sql/test_btree_merge.sql | 207 ++++++++
 .../test_btree_merge--1.0.sql                 |  43 ++
 .../test_btree_merge/test_btree_merge.c       | 389 +++++++++++++++
 .../test_btree_merge/test_btree_merge.control |   5 +
 13 files changed, 1476 insertions(+)
 create mode 100644 src/backend/access/nbtree/nbtmergescan.c
 create mode 100644 src/test/modules/test_btree_merge/Makefile
 create mode 100644 src/test/modules/test_btree_merge/expected/test_btree_merge.out
 create mode 100644 src/test/modules/test_btree_merge/meson.build
 create mode 100644 src/test/modules/test_btree_merge/sql/test_btree_merge.sql
 create mode 100644 src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
 create mode 100644 src/test/modules/test_btree_merge/test_btree_merge.c
 create mode 100644 src/test/modules/test_btree_merge/test_btree_merge.control

diff --git a/.gitignore b/.gitignore
index 4e911395fe3..ac1f95d9cf0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,3 +43,11 @@ lib*.pc
 /Release/
 /tmp_install/
 /portlock/
+
+# hidden files (e.g. .dbdata, .install, good practice to test locally in isolation)
+.*
+
+# Test output
+**/regression.diffs
+**/regression.out
+**/results/
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index 0daf640af96..72053cefdaa 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -16,6 +16,7 @@ OBJS = \
 	nbtcompare.o \
 	nbtdedup.o \
 	nbtinsert.o \
+	nbtmergescan.o \
 	nbtpage.o \
 	nbtpreprocesskeys.o \
 	nbtreadpage.o \
diff --git a/src/backend/access/nbtree/meson.build b/src/backend/access/nbtree/meson.build
index 812f067e710..1016fea62d5 100644
--- a/src/backend/access/nbtree/meson.build
+++ b/src/backend/access/nbtree/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
   'nbtcompare.c',
   'nbtdedup.c',
   'nbtinsert.c',
+  'nbtmergescan.c',
   'nbtpage.c',
   'nbtpreprocesskeys.c',
   'nbtreadpage.c',
diff --git a/src/backend/access/nbtree/nbtmergescan.c b/src/backend/access/nbtree/nbtmergescan.c
new file mode 100644
index 00000000000..70828dc73d3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtmergescan.c
@@ -0,0 +1,457 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtmergescan.c
+ *	  B-Tree merge scan for efficient evaluation of IN-list queries
+ *
+ * This module implements a K-way merge scan for B-tree indexes, optimized
+ * for queries of the form:
+ *   WHERE prefix IN (v1, v2, ..., vK) AND suffix >= b ORDER BY suffix LIMIT N
+ *
+ * The algorithm maintains a min-heap of cursors, one per prefix value.
+ * Each cursor tracks its position within the index for that prefix.
+ * Tuples are returned in suffix order by repeatedly extracting the
+ * minimum from the heap.
+ *
+ * Target behavior: Access at most N + K - 1 index tuples for LIMIT N.
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtmergescan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "lib/pairingheap.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/* Forward declarations of static functions */
+static int	bt_merge_heap_cmp(const pairingheap_node *a,
+							  const pairingheap_node *b,
+							  void *arg);
+static bool bt_merge_cursor_init(BTMergeScanState *state,
+								 IndexScanDesc scan,
+								 BTMergeCursor *cursor,
+								 Datum prefix_value,
+								 bool prefix_isnull);
+static bool bt_merge_cursor_advance(BTMergeScanState *state,
+									IndexScanDesc scan,
+									BTMergeCursor *cursor);
+static Datum bt_merge_extract_sortkey(BTMergeScanState *state,
+									  IndexScanDesc scan,
+									  BTMergeCursor *cursor,
+									  bool *isnull);
+
+
+/*
+ * bt_merge_heap_cmp
+ *	  Compare two cursors by their current sort key (suffix value).
+ *
+ * When sort keys are equal, uses prefix value as tiebreaker for
+ * deterministic ordering (ORDER BY suffix, prefix).
+ *
+ * Returns positive if a > b (pairingheap is a max-heap, we want min-heap
+ * behavior so we invert the comparison).
+ */
+static int
+bt_merge_heap_cmp(const pairingheap_node *a,
+				  const pairingheap_node *b,
+				  void *arg)
+{
+	BTMergeScanState *state = (BTMergeScanState *) arg;
+	BTMergeCursor *cursor_a = pairingheap_container(BTMergeCursor, ph_node,
+													(pairingheap_node *) a);
+	BTMergeCursor *cursor_b = pairingheap_container(BTMergeCursor, ph_node,
+													(pairingheap_node *) b);
+	Datum		key_a = cursor_a->sort_key;
+	Datum		key_b = cursor_b->sort_key;
+	bool		null_a = cursor_a->sort_key_isnull;
+	bool		null_b = cursor_b->sort_key_isnull;
+	int32		cmp;
+
+	/* Handle NULLs - NULLs sort last (NULLS LAST default for ASC) */
+	if (null_a && null_b)
+		return 0;
+	if (null_a)
+		return -1;				/* a is NULL, comes after b */
+	if (null_b)
+		return 1;				/* b is NULL, comes after a */
+
+	/* Compare using the suffix column's comparison function */
+	cmp = DatumGetInt32(FunctionCall2Coll(&state->suffix_cmp,
+										  state->suffix_collation,
+										  key_a, key_b));
+
+	/*
+	 * Use prefix value as tiebreaker for deterministic ordering.
+	 * This ensures ORDER BY suffix, prefix behavior.
+	 */
+	if (cmp == 0)
+	{
+		/* Compare prefix values (assumes pass-by-value int4 for now) */
+		int32		prefix_a = DatumGetInt32(cursor_a->prefix_value);
+		int32		prefix_b = DatumGetInt32(cursor_b->prefix_value);
+
+		if (prefix_a < prefix_b)
+			cmp = -1;
+		else if (prefix_a > prefix_b)
+			cmp = 1;
+	}
+
+	/* Negate for min-heap behavior */
+	return -cmp;
+}
+
+
+/*
+ * bt_merge_init
+ *	  Initialize a merge scan state.
+ *
+ * Creates the merge state with one cursor per prefix value.
+ * The cursors will be positioned at their first matching tuples
+ * when bt_merge_getnext is first called.
+ */
+BTMergeScanState *
+bt_merge_init(IndexScanDesc scan,
+			  Datum *prefix_values,
+			  bool *prefix_nulls,
+			  int num_prefixes,
+			  int prefix_attno,
+			  int suffix_attno,
+			  Oid suffix_cmp_oid,
+			  Oid suffix_collation)
+{
+	BTMergeScanState *state;
+	MemoryContext merge_context;
+	MemoryContext old_context;
+	int			i;
+
+	/* Create memory context for merge scan allocations */
+	merge_context = AllocSetContextCreate(CurrentMemoryContext,
+										  "BTMergeScan",
+										  ALLOCSET_DEFAULT_SIZES);
+	old_context = MemoryContextSwitchTo(merge_context);
+
+	/* Allocate main state structure */
+	state = palloc0(sizeof(BTMergeScanState));
+	state->merge_context = merge_context;
+	state->num_cursors = num_prefixes;
+	state->active_cursors = 0;
+	state->prefix_attno = prefix_attno;
+	state->suffix_attno = suffix_attno;
+	state->suffix_collation = suffix_collation;
+	state->direction = ForwardScanDirection;
+	state->initialized = false;
+	state->tuples_accessed = 0;
+
+	/* Set up suffix comparison function */
+	fmgr_info(suffix_cmp_oid, &state->suffix_cmp);
+
+	/* Allocate cursor array */
+	state->cursors = palloc0(num_prefixes * sizeof(BTMergeCursor));
+
+	/* Initialize cursor metadata (not positioned yet) */
+	for (i = 0; i < num_prefixes; i++)
+	{
+		BTMergeCursor *cursor = &state->cursors[i];
+
+		cursor->cursor_id = i;
+		cursor->prefix_value = datumCopy(prefix_values[i], true, sizeof(Datum));
+		cursor->prefix_isnull = prefix_nulls[i];
+		cursor->exhausted = prefix_nulls[i];	/* NULL prefix = exhausted */
+		cursor->sort_key_isnull = true;
+		BTScanPosInvalidate(cursor->pos);
+		cursor->tuples = NULL;
+	}
+
+	/* Initialize the merge heap */
+	state->merge_heap = pairingheap_allocate(bt_merge_heap_cmp, state);
+
+	MemoryContextSwitchTo(old_context);
+
+	return state;
+}
+
+
+/*
+ * bt_merge_getnext
+ *	  Get the next tuple from the merge scan.
+ *
+ * Returns true if a tuple was found, false if scan is exhausted.
+ * The tuple's TID is stored in scan->xs_heaptid.
+ */
+bool
+bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTMergeScanState *state = so->mergeState;
+	BTMergeCursor *cursor;
+	pairingheap_node *node;
+	int			i;
+
+	if (state == NULL)
+		return false;
+
+	/* Initialize cursors on first call */
+	if (!state->initialized)
+	{
+		state->initialized = true;
+		state->direction = dir;
+
+		for (i = 0; i < state->num_cursors; i++)
+		{
+			BTMergeCursor *c = &state->cursors[i];
+
+			if (!c->exhausted &&
+				bt_merge_cursor_init(state, scan, c,
+									 c->prefix_value, c->prefix_isnull))
+			{
+				/* Cursor has at least one tuple, add to heap */
+				pairingheap_add(state->merge_heap, &c->ph_node);
+				state->active_cursors++;
+			}
+		}
+	}
+
+	/* Get the cursor with the smallest suffix value */
+	if (pairingheap_is_empty(state->merge_heap))
+		return false;
+
+	node = pairingheap_remove_first(state->merge_heap);
+	cursor = pairingheap_container(BTMergeCursor, ph_node, node);
+
+	/* Set up the heap TID from the current cursor position */
+	Assert(BTScanPosIsValid(cursor->pos));
+	scan->xs_heaptid = cursor->pos.items[cursor->pos.itemIndex].heapTid;
+
+	/* Advance cursor to next tuple */
+	if (bt_merge_cursor_advance(state, scan, cursor))
+	{
+		/* Cursor still has tuples, re-add to heap */
+		pairingheap_add(state->merge_heap, &cursor->ph_node);
+	}
+	else
+	{
+		/* Cursor exhausted */
+		state->active_cursors--;
+	}
+
+	return true;
+}
+
+
+/*
+ * bt_merge_end
+ *	  Clean up merge scan state.
+ */
+void
+bt_merge_end(BTMergeScanState *state)
+{
+	if (state == NULL)
+		return;
+
+	/* Free the memory context, which frees all allocations */
+	MemoryContextDelete(state->merge_context);
+}
+
+
+/*
+ * bt_merge_cursor_init
+ *	  Initialize a cursor and position it at the first matching tuple.
+ *
+ * Returns true if the cursor found at least one matching tuple.
+ */
+static bool
+bt_merge_cursor_init(BTMergeScanState *state,
+					 IndexScanDesc scan,
+					 BTMergeCursor *cursor,
+					 Datum prefix_value,
+					 bool prefix_isnull)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	bool		found;
+
+	if (prefix_isnull)
+	{
+		cursor->exhausted = true;
+		return false;
+	}
+
+	/*
+	 * Modify the scan key to use this cursor's prefix value.
+	 * We reuse the scan's existing key infrastructure.
+	 */
+	for (int i = 0; i < so->numberOfKeys; i++)
+	{
+		if (so->keyData[i].sk_attno == state->prefix_attno)
+		{
+			so->keyData[i].sk_argument = prefix_value;
+			so->keyData[i].sk_flags &= ~(SK_SEARCHARRAY);
+			break;
+		}
+	}
+
+	/* Invalidate current position to force _bt_first */
+	BTScanPosInvalidate(so->currPos);
+
+	/* Disable array key handling for this cursor's scan */
+	so->numArrayKeys = 0;
+
+	/* Position at first matching tuple */
+	found = _bt_first(scan, state->direction);
+
+	if (found)
+	{
+		/* Copy position to cursor */
+		memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+
+		/* Extract the sort key for heap ordering */
+		cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
+													&cursor->sort_key_isnull);
+		cursor->exhausted = false;
+
+		/* Count this as a tuple access */
+		state->tuples_accessed++;
+
+		/* Invalidate main scan position */
+		BTScanPosInvalidate(so->currPos);
+	}
+	else
+	{
+		cursor->exhausted = true;
+	}
+
+	return found;
+}
+
+
+/*
+ * bt_merge_cursor_advance
+ *	  Advance a cursor to its next tuple.
+ *
+ * Returns true if the cursor now points to a valid tuple, false if exhausted.
+ */
+static bool
+bt_merge_cursor_advance(BTMergeScanState *state,
+						IndexScanDesc scan,
+						BTMergeCursor *cursor)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	bool		found = false;
+
+	if (cursor->exhausted)
+		return false;
+
+	/* Try to move to next tuple within current page's items array */
+	if (state->direction == ForwardScanDirection)
+	{
+		if (cursor->pos.itemIndex < cursor->pos.lastItem)
+		{
+			cursor->pos.itemIndex++;
+			found = true;
+		}
+	}
+	else
+	{
+		if (cursor->pos.itemIndex > cursor->pos.firstItem)
+		{
+			cursor->pos.itemIndex--;
+			found = true;
+		}
+	}
+
+	if (!found)
+	{
+		/*
+		 * Current page exhausted. Use _bt_next to get the next page.
+		 * We swap our cursor's position into the scan's currPos,
+		 * call _bt_next, then swap back.
+		 */
+		BTScanPosData save_pos;
+
+		memcpy(&save_pos, &so->currPos, sizeof(BTScanPosData));
+		memcpy(&so->currPos, &cursor->pos, sizeof(BTScanPosData));
+
+		found = _bt_next(scan, state->direction);
+
+		if (found)
+			memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+
+		memcpy(&so->currPos, &save_pos, sizeof(BTScanPosData));
+	}
+
+	if (found)
+	{
+		/* Extract new sort key */
+		cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
+													&cursor->sort_key_isnull);
+		state->tuples_accessed++;
+	}
+	else
+	{
+		cursor->exhausted = true;
+	}
+
+	return found;
+}
+
+
+/*
+ * bt_merge_extract_sortkey
+ *	  Extract the sort key (suffix column value) from the current tuple.
+ */
+static Datum
+bt_merge_extract_sortkey(BTMergeScanState *state,
+						 IndexScanDesc scan,
+						 BTMergeCursor *cursor,
+						 bool *isnull)
+{
+	Relation	rel = scan->indexRelation;
+	Buffer		buf;
+	Page		page;
+	OffsetNumber offnum;
+	ItemId		itemid;
+	IndexTuple	itup;
+	TupleDesc	tupdesc;
+	Datum		result;
+
+	if (cursor->pos.currPage == InvalidBlockNumber)
+	{
+		*isnull = true;
+		return (Datum) 0;
+	}
+
+	/* Read the page */
+	buf = ReadBuffer(rel, cursor->pos.currPage);
+	LockBuffer(buf, BT_READ);
+	page = BufferGetPage(buf);
+
+	offnum = cursor->pos.items[cursor->pos.itemIndex].indexOffset;
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	tupdesc = RelationGetDescr(rel);
+
+	/* Extract the suffix column value */
+	result = index_getattr(itup, state->suffix_attno, tupdesc, isnull);
+
+	/* Copy pass-by-reference values before releasing buffer */
+	if (!*isnull)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupdesc, state->suffix_attno - 1);
+
+		if (!attr->attbyval)
+			result = datumCopy(result, attr->attbyval, attr->attlen);
+	}
+
+	UnlockReleaseBuffer(buf);
+
+	return result;
+}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 77224859685..0d4e7440760 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -20,6 +20,7 @@
 #include "catalog/pg_am_d.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_index.h"
+#include "lib/pairingheap.h"
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/dsm.h"
@@ -1050,6 +1051,49 @@ typedef struct BTArrayKeyInfo
 	ScanKey		high_compare;	/* array's < or <= upper bound */
 } BTArrayKeyInfo;
 
+/*
+ * BTMergeCursor - tracks scan state for one prefix value in merge scan
+ *
+ * Each cursor maintains its own position within the index for a specific
+ * prefix value. Cursors are organized in a min-heap ordered by their
+ * current suffix key value for efficient K-way merge.
+ */
+typedef struct BTMergeCursor
+{
+	pairingheap_node ph_node;	/* pairing heap node for merge */
+	int			cursor_id;		/* index in merge state's cursors array */
+	Datum		prefix_value;	/* the prefix value for this sub-scan */
+	bool		prefix_isnull;	/* is prefix value NULL? */
+	Datum		sort_key;		/* current tuple's sort key (suffix) */
+	bool		sort_key_isnull;/* is sort key NULL? */
+	bool		exhausted;		/* no more tuples for this prefix */
+	BTScanPosData pos;			/* current position in index */
+	char	   *tuples;			/* tuple storage workspace (BLCKSZ) */
+} BTMergeCursor;
+
+/*
+ * BTMergeScanState - state for K-way merge scan
+ *
+ * This structure manages multiple cursors for a merge scan, allowing
+ * lazy evaluation of queries like:
+ *   WHERE prefix IN (v1, v2, ..., vK) AND suffix >= b ORDER BY suffix LIMIT N
+ */
+typedef struct BTMergeScanState
+{
+	int			num_cursors;	/* number of prefix values (K) */
+	int			active_cursors;	/* cursors not yet exhausted */
+	BTMergeCursor *cursors;		/* array of cursors */
+	pairingheap *merge_heap;	/* min-heap ordered by sort_key */
+	int			prefix_attno;	/* attribute number of prefix column (1-based) */
+	int			suffix_attno;	/* attribute number of suffix column (1-based) */
+	FmgrInfo	suffix_cmp;		/* comparison function for suffix */
+	Oid			suffix_collation;	/* collation for suffix comparison */
+	ScanDirection direction;	/* scan direction */
+	bool		initialized;	/* have cursors been initialized? */
+	MemoryContext merge_context;/* memory context for allocations */
+	int64		tuples_accessed;/* count of index tuples accessed */
+} BTMergeScanState;
+
 typedef struct BTScanOpaqueData
 {
 	/* these fields are set by _bt_preprocess_keys(): */
@@ -1089,6 +1133,12 @@ typedef struct BTScanOpaqueData
 	 */
 	int			markItemIndex;	/* itemIndex, or -1 if not valid */
 
+	/*
+	 * Merge scan state, if using merge scan optimization.
+	 * NULL if not using merge scan.
+	 */
+	BTMergeScanState *mergeState;
+
 	/* keep these last in struct for efficiency */
 	BTScanPosData currPos;		/* current position data */
 	BTScanPosData markPos;		/* marked position, if any */
@@ -1334,4 +1384,18 @@ extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
 
+/*
+ * prototypes for functions in nbtmergescan.c
+ */
+extern BTMergeScanState *bt_merge_init(IndexScanDesc scan,
+									   Datum *prefix_values,
+									   bool *prefix_nulls,
+									   int num_prefixes,
+									   int prefix_attno,
+									   int suffix_attno,
+									   Oid suffix_cmp_oid,
+									   Oid suffix_collation);
+extern bool bt_merge_getnext(IndexScanDesc scan, ScanDirection dir);
+extern void bt_merge_end(BTMergeScanState *state);
+
 #endif							/* NBTREE_H */
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 2634a519935..b7b802bfdde 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -18,6 +18,7 @@ subdir('ssl_passphrase_callback')
 subdir('test_aio')
 subdir('test_binaryheap')
 subdir('test_bitmapset')
+subdir('test_btree_merge')
 subdir('test_bloomfilter')
 subdir('test_cloexec')
 subdir('test_copy_callbacks')
diff --git a/src/test/modules/test_btree_merge/Makefile b/src/test/modules/test_btree_merge/Makefile
new file mode 100644
index 00000000000..540416a2c91
--- /dev/null
+++ b/src/test/modules/test_btree_merge/Makefile
@@ -0,0 +1,24 @@
+# src/test/modules/test_btree_merge/Makefile
+
+MODULE_big = test_btree_merge
+OBJS = \
+	$(WIN32RES) \
+	test_btree_merge.o
+
+PGFILEDESC = "test_btree_merge - test code for btree merge scan"
+
+EXTENSION = test_btree_merge
+DATA = test_btree_merge--1.0.sql
+
+REGRESS = test_btree_merge
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_btree_merge
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_btree_merge/expected/test_btree_merge.out b/src/test/modules/test_btree_merge/expected/test_btree_merge.out
new file mode 100644
index 00000000000..baf4d7937e0
--- /dev/null
+++ b/src/test/modules/test_btree_merge/expected/test_btree_merge.out
@@ -0,0 +1,243 @@
+-- Unit tests for B-tree merge scan implementation
+-- Tests the core merge scan algorithm directly, bypassing the planner
+CREATE EXTENSION test_btree_merge;
+-- ============================================================================
+-- Setup: Create test tables with known data distributions
+-- ============================================================================
+-- Test table with integer prefix and suffix
+CREATE TABLE merge_test_int (
+    prefix_col int4,
+    suffix_col int4
+);
+-- Insert data: 10 prefix values, 100 suffix values each = 1000 rows
+INSERT INTO merge_test_int
+SELECT p, s
+FROM generate_series(1, 10) AS p,
+     generate_series(1, 100) AS s;
+CREATE INDEX merge_test_int_idx ON merge_test_int (prefix_col, suffix_col);
+ANALYZE merge_test_int;
+-- Test table with integer prefix and timestamp suffix
+CREATE TABLE merge_test_ts (
+    user_id int4,
+    event_time timestamp
+);
+-- Insert data: 5 users, 100 events each
+INSERT INTO merge_test_ts
+SELECT u, '2026-01-01 00:00:00'::timestamp + (e || ' minutes')::interval
+FROM generate_series(1, 5) AS u,
+     generate_series(1, 100) AS e;
+CREATE INDEX merge_test_ts_idx ON merge_test_ts (user_id, event_time);
+ANALYZE merge_test_ts;
+-- ============================================================================
+-- Test 1: Basic integer merge scan
+-- Query: WHERE prefix IN (1,2,3) AND suffix >= 50 LIMIT 5
+-- K = 3 prefix values, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+SELECT 'Test 1: Basic integer merge scan' AS test_name;
+            test_name             
+----------------------------------
+ Test 1: Basic integer merge scan
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+    'merge_test_int',
+    'merge_test_int_idx',
+    ARRAY[1, 2, 3],
+    50,
+    5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches 
+-----------------+-----------------+--------------------------
+               5 |               8 |                        7
+(1 row)
+
+-- ============================================================================
+-- Test 2: More prefix values
+-- Query: WHERE prefix IN (1,2,3,4,5) AND suffix >= 80 LIMIT 3
+-- K = 5 prefix values, LIMIT = 3
+-- Expected tuples accessed: 3 + 5 - 1 = 7
+-- ============================================================================
+SELECT 'Test 2: More prefix values' AS test_name;
+         test_name          
+----------------------------
+ Test 2: More prefix values
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+    'merge_test_int',
+    'merge_test_int_idx',
+    ARRAY[1, 2, 3, 4, 5],
+    80,
+    3
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches 
+-----------------+-----------------+--------------------------
+               3 |               8 |                        7
+(1 row)
+
+-- ============================================================================
+-- Test 3: Single prefix value (degenerates to regular scan)
+-- K = 1, LIMIT = 5
+-- Expected tuples accessed: 5 + 1 - 1 = 5
+-- ============================================================================
+SELECT 'Test 3: Single prefix value' AS test_name;
+          test_name          
+-----------------------------
+ Test 3: Single prefix value
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+    'merge_test_int',
+    'merge_test_int_idx',
+    ARRAY[1],
+    50,
+    5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches 
+-----------------+-----------------+--------------------------
+               5 |               6 |                        5
+(1 row)
+
+-- ============================================================================
+-- Test 4: Large LIMIT (more than matching rows)
+-- K = 3, prefix values that have 51 rows each (suffix >= 50)
+-- LIMIT = 200 but only 153 rows exist
+-- ============================================================================
+SELECT 'Test 4: Large LIMIT' AS test_name;
+      test_name      
+---------------------
+ Test 4: Large LIMIT
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+    'merge_test_int',
+    'merge_test_int_idx',
+    ARRAY[1, 2, 3],
+    50,
+    200
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches 
+-----------------+-----------------+--------------------------
+             153 |             153 |                      153
+(1 row)
+
+-- ============================================================================
+-- Test 5: Non-contiguous prefix values
+-- Query: WHERE prefix IN (2,5,8) AND suffix >= 50 LIMIT 5
+-- Tests that merge scan works with gaps in prefix values
+-- K = 3 prefix values (non-adjacent), LIMIT = 5
+-- ============================================================================
+SELECT 'Test 5: Non-contiguous prefix values' AS test_name;
+              test_name               
+--------------------------------------
+ Test 5: Non-contiguous prefix values
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+    'merge_test_int',
+    'merge_test_int_idx',
+    ARRAY[2, 5, 8],
+    50,
+    5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches 
+-----------------+-----------------+--------------------------
+               5 |               8 |                        7
+(1 row)
+
+-- ============================================================================
+-- Test 6: Timestamp suffix column
+-- Query: WHERE user_id IN (1,2,3) AND event_time >= '2026-01-01 01:00:00' LIMIT 5
+-- K = 3, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+SELECT 'Test 6: Timestamp suffix' AS test_name;
+        test_name         
+--------------------------
+ Test 6: Timestamp suffix
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_ts(
+    'merge_test_ts',
+    'merge_test_ts_idx',
+    ARRAY[1, 2, 3],
+    '2026-01-01 01:00:00'::timestamp,
+    5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches 
+-----------------+-----------------+--------------------------
+               5 |               8 |                        7
+(1 row)
+
+-- ============================================================================
+-- Test 7: All users with timestamp
+-- K = 5, LIMIT = 10
+-- Expected tuples accessed: 10 + 5 - 1 = 14
+-- ============================================================================
+SELECT 'Test 7: All users timestamp' AS test_name;
+          test_name          
+-----------------------------
+ Test 7: All users timestamp
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_ts(
+    'merge_test_ts',
+    'merge_test_ts_idx',
+    ARRAY[1, 2, 3, 4, 5],
+    '2026-01-01 00:30:00'::timestamp,
+    10
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches 
+-----------------+-----------------+--------------------------
+              10 |              15 |                       14
+(1 row)
+
+-- ============================================================================
+-- Test 8: Correctness verification
+-- Verify merge scan returns rows in exact ORDER BY suffix_col, prefix_col order
+-- Using WITH ORDINALITY to compare row positions
+-- ============================================================================
+SELECT 'Test 8: Correctness verification' AS test_name;
+            test_name             
+----------------------------------
+ Test 8: Correctness verification
+(1 row)
+
+-- Compare merge scan vs regular query with row positions (should be empty)
+WITH merge_result AS (
+    SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+    FROM test_btree_merge_fetch_int(
+        'merge_test_int',
+        'merge_test_int_idx',
+        ARRAY[1, 2, 3],
+        90,
+        10
+    )
+),
+regular_result AS (
+    SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+    FROM (
+        SELECT prefix_col, suffix_col
+        FROM merge_test_int
+        WHERE prefix_col IN (1, 2, 3) AND suffix_col >= 90
+        ORDER BY suffix_col, prefix_col
+        LIMIT 10
+    ) t
+)
+SELECT 'MISMATCH' AS status, m.rn, m.prefix_col, m.suffix_col,
+       r.prefix_col AS expected_prefix, r.suffix_col AS expected_suffix
+FROM merge_result m
+FULL OUTER JOIN regular_result r ON m.rn = r.rn
+WHERE m.prefix_col IS DISTINCT FROM r.prefix_col
+   OR m.suffix_col IS DISTINCT FROM r.suffix_col;
+ status | rn | prefix_col | suffix_col | expected_prefix | expected_suffix 
+--------+----+------------+------------+-----------------+-----------------
+(0 rows)
+
+-- ============================================================================
+-- Cleanup
+-- ============================================================================
+DROP TABLE merge_test_int;
+DROP TABLE merge_test_ts;
+DROP EXTENSION test_btree_merge;
diff --git a/src/test/modules/test_btree_merge/meson.build b/src/test/modules/test_btree_merge/meson.build
new file mode 100644
index 00000000000..665d6cf443e
--- /dev/null
+++ b/src/test/modules/test_btree_merge/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+test_btree_merge_sources = files(
+  'test_btree_merge.c',
+)
+
+if host_system == 'windows'
+  test_btree_merge_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_btree_merge',
+    '--FILEDESC', 'test_btree_merge - test code for btree merge scan',])
+endif
+
+test_btree_merge = shared_module('test_btree_merge',
+  test_btree_merge_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_btree_merge
+
+test_install_data += files(
+  'test_btree_merge.control',
+  'test_btree_merge--1.0.sql',
+)
+
+tests += {
+  'name': 'test_btree_merge',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'test_btree_merge',
+    ],
+  },
+}
diff --git a/src/test/modules/test_btree_merge/sql/test_btree_merge.sql b/src/test/modules/test_btree_merge/sql/test_btree_merge.sql
new file mode 100644
index 00000000000..5828b343b34
--- /dev/null
+++ b/src/test/modules/test_btree_merge/sql/test_btree_merge.sql
@@ -0,0 +1,207 @@
+-- Unit tests for B-tree merge scan implementation
+-- Tests the core merge scan algorithm directly, bypassing the planner
+
+CREATE EXTENSION test_btree_merge;
+
+-- ============================================================================
+-- Setup: Create test tables with known data distributions
+-- ============================================================================
+
+-- Test table with integer prefix and suffix
+CREATE TABLE merge_test_int (
+    prefix_col int4,
+    suffix_col int4
+);
+
+-- Insert data: 10 prefix values, 100 suffix values each = 1000 rows
+INSERT INTO merge_test_int
+SELECT p, s
+FROM generate_series(1, 10) AS p,
+     generate_series(1, 100) AS s;
+
+CREATE INDEX merge_test_int_idx ON merge_test_int (prefix_col, suffix_col);
+ANALYZE merge_test_int;
+
+-- Test table with integer prefix and timestamp suffix
+CREATE TABLE merge_test_ts (
+    user_id int4,
+    event_time timestamp
+);
+
+-- Insert data: 5 users, 100 events each
+INSERT INTO merge_test_ts
+SELECT u, '2026-01-01 00:00:00'::timestamp + (e || ' minutes')::interval
+FROM generate_series(1, 5) AS u,
+     generate_series(1, 100) AS e;
+
+CREATE INDEX merge_test_ts_idx ON merge_test_ts (user_id, event_time);
+ANALYZE merge_test_ts;
+
+
+-- ============================================================================
+-- Test 1: Basic integer merge scan
+-- Query: WHERE prefix IN (1,2,3) AND suffix >= 50 LIMIT 5
+-- K = 3 prefix values, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 1: Basic integer merge scan' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+    'merge_test_int',
+    'merge_test_int_idx',
+    ARRAY[1, 2, 3],
+    50,
+    5
+);
+
+
+-- ============================================================================
+-- Test 2: More prefix values
+-- Query: WHERE prefix IN (1,2,3,4,5) AND suffix >= 80 LIMIT 3
+-- K = 5 prefix values, LIMIT = 3
+-- Expected tuples accessed: 3 + 5 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 2: More prefix values' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+    'merge_test_int',
+    'merge_test_int_idx',
+    ARRAY[1, 2, 3, 4, 5],
+    80,
+    3
+);
+
+
+-- ============================================================================
+-- Test 3: Single prefix value (degenerates to regular scan)
+-- K = 1, LIMIT = 5
+-- Expected tuples accessed: 5 + 1 - 1 = 5
+-- ============================================================================
+
+SELECT 'Test 3: Single prefix value' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+    'merge_test_int',
+    'merge_test_int_idx',
+    ARRAY[1],
+    50,
+    5
+);
+
+
+-- ============================================================================
+-- Test 4: Large LIMIT (more than matching rows)
+-- K = 3, prefix values that have 51 rows each (suffix >= 50)
+-- LIMIT = 200 but only 153 rows exist
+-- ============================================================================
+
+SELECT 'Test 4: Large LIMIT' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+    'merge_test_int',
+    'merge_test_int_idx',
+    ARRAY[1, 2, 3],
+    50,
+    200
+);
+
+
+-- ============================================================================
+-- Test 5: Non-contiguous prefix values
+-- Query: WHERE prefix IN (2,5,8) AND suffix >= 50 LIMIT 5
+-- Tests that merge scan works with gaps in prefix values
+-- K = 3 prefix values (non-adjacent), LIMIT = 5
+-- ============================================================================
+
+SELECT 'Test 5: Non-contiguous prefix values' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+    'merge_test_int',
+    'merge_test_int_idx',
+    ARRAY[2, 5, 8],
+    50,
+    5
+);
+
+
+-- ============================================================================
+-- Test 6: Timestamp suffix column
+-- Query: WHERE user_id IN (1,2,3) AND event_time >= '2026-01-01 01:00:00' LIMIT 5
+-- K = 3, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 6: Timestamp suffix' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_ts(
+    'merge_test_ts',
+    'merge_test_ts_idx',
+    ARRAY[1, 2, 3],
+    '2026-01-01 01:00:00'::timestamp,
+    5
+);
+
+
+-- ============================================================================
+-- Test 7: All users with timestamp
+-- K = 5, LIMIT = 10
+-- Expected tuples accessed: 10 + 5 - 1 = 14
+-- ============================================================================
+
+SELECT 'Test 7: All users timestamp' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_ts(
+    'merge_test_ts',
+    'merge_test_ts_idx',
+    ARRAY[1, 2, 3, 4, 5],
+    '2026-01-01 00:30:00'::timestamp,
+    10
+);
+
+
+-- ============================================================================
+-- Test 8: Correctness verification
+-- Verify merge scan returns rows in exact ORDER BY suffix_col, prefix_col order
+-- Using WITH ORDINALITY to compare row positions
+-- ============================================================================
+
+SELECT 'Test 8: Correctness verification' AS test_name;
+
+-- Compare merge scan vs regular query with row positions (should be empty)
+WITH merge_result AS (
+    SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+    FROM test_btree_merge_fetch_int(
+        'merge_test_int',
+        'merge_test_int_idx',
+        ARRAY[1, 2, 3],
+        90,
+        10
+    )
+),
+regular_result AS (
+    SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+    FROM (
+        SELECT prefix_col, suffix_col
+        FROM merge_test_int
+        WHERE prefix_col IN (1, 2, 3) AND suffix_col >= 90
+        ORDER BY suffix_col, prefix_col
+        LIMIT 10
+    ) t
+)
+SELECT 'MISMATCH' AS status, m.rn, m.prefix_col, m.suffix_col,
+       r.prefix_col AS expected_prefix, r.suffix_col AS expected_suffix
+FROM merge_result m
+FULL OUTER JOIN regular_result r ON m.rn = r.rn
+WHERE m.prefix_col IS DISTINCT FROM r.prefix_col
+   OR m.suffix_col IS DISTINCT FROM r.suffix_col;
+
+
+-- ============================================================================
+-- Cleanup
+-- ============================================================================
+
+DROP TABLE merge_test_int;
+DROP TABLE merge_test_ts;
+DROP EXTENSION test_btree_merge;
diff --git a/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql b/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
new file mode 100644
index 00000000000..9872947d7d7
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
@@ -0,0 +1,43 @@
+/* src/test/modules/test_btree_merge/test_btree_merge--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_btree_merge" to load this file. \quit
+
+-- Test merge scan with integer columns
+CREATE FUNCTION test_btree_merge_scan_int(
+    table_name text,
+    index_name text,
+    prefix_values int4[],
+    suffix_start int4,
+    limit_count int4
+) RETURNS TABLE (
+    tuples_returned int4,
+    tuples_accessed int4,
+    maximum_required_fetches int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
+-- Fetch actual rows from merge scan (for correctness verification)
+CREATE FUNCTION test_btree_merge_fetch_int(
+    table_name text,
+    index_name text,
+    prefix_values int4[],
+    suffix_start int4,
+    limit_count int4
+) RETURNS TABLE (
+    prefix_col int4,
+    suffix_col int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
+-- Test merge scan with timestamp suffix
+CREATE FUNCTION test_btree_merge_scan_ts(
+    table_name text,
+    index_name text,
+    prefix_values int4[],
+    suffix_start timestamp,
+    limit_count int4
+) RETURNS TABLE (
+    tuples_returned int4,
+    tuples_accessed int4,
+    maximum_required_fetches int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
diff --git a/src/test/modules/test_btree_merge/test_btree_merge.c b/src/test/modules/test_btree_merge/test_btree_merge.c
new file mode 100644
index 00000000000..78b22130ecf
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge.c
@@ -0,0 +1,389 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_btree_merge.c
+ *		Unit tests for B-tree Merge Scan implementation
+ *
+ * This module provides SQL-callable functions to directly test the
+ * merge scan algorithm without going through the planner.
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/nbtree.h"
+#include "access/table.h"
+#include "catalog/namespace.h"
+#include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
+#include "commands/defrem.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+#define MAX_RESULTS 10000
+
+/*
+ * MergeScanResult - holds results from a merge scan execution
+ */
+typedef struct MergeScanResult
+{
+	int			tuples_returned;
+	int64		tuples_accessed;
+	int			num_prefixes;
+	int			limit_count;
+	/* For fetch function: collected row data */
+	int32	   *prefixes;
+	int32	   *suffixes;
+} MergeScanResult;
+
+/*
+ * do_merge_scan - common merge scan execution
+ *
+ * Performs a merge scan with the given parameters and collects results.
+ * If collect_rows is true, fetches and stores actual row data.
+ */
+static void
+do_merge_scan(const char *table_name,
+			  const char *index_name,
+			  Datum *prefix_values,
+			  bool *prefix_nulls,
+			  int num_prefixes,
+			  Datum suffix_start,
+			  Oid suffix_type,
+			  RegProcedure suffix_eq_proc,
+			  RegProcedure suffix_ge_proc,
+			  int limit_count,
+			  bool collect_rows,
+			  MergeScanResult *result)
+{
+	Oid			table_oid;
+	Oid			index_oid;
+	Relation	heap_rel;
+	Relation	index_rel;
+	IndexScanDesc scan;
+	BTScanOpaque so;
+	BTMergeScanState *merge_state;
+	Snapshot	snapshot;
+	Oid			suffix_cmp_oid;
+	Oid			opfamily;
+	const char *opfamily_name;
+	int			tuples_returned = 0;
+	int			max_results;
+
+	/* Determine operator family based on suffix type */
+	if (suffix_type == INT4OID)
+		opfamily_name = "integer_ops";
+	else if (suffix_type == TIMESTAMPOID)
+		opfamily_name = "datetime_ops";
+	else
+		elog(ERROR, "unsupported suffix type: %u", suffix_type);
+
+	/* Look up table and index */
+	table_oid = RelnameGetRelid(table_name);
+	if (!OidIsValid(table_oid))
+		elog(ERROR, "table \"%s\" does not exist", table_name);
+
+	index_oid = RelnameGetRelid(index_name);
+	if (!OidIsValid(index_oid))
+		elog(ERROR, "index \"%s\" does not exist", index_name);
+
+	/* Open relations */
+	heap_rel = table_open(table_oid, AccessShareLock);
+	index_rel = index_open(index_oid, AccessShareLock);
+
+	/* Get comparison function for suffix type */
+	opfamily = get_opfamily_oid(BTREE_AM_OID,
+								list_make1(makeString(pstrdup(opfamily_name))),
+								false);
+	suffix_cmp_oid = get_opfamily_proc(opfamily, suffix_type, suffix_type,
+									   BTORDER_PROC);
+	if (!OidIsValid(suffix_cmp_oid))
+		elog(ERROR, "could not find comparison function for type %u", suffix_type);
+
+	/* Begin index scan */
+	snapshot = GetActiveSnapshot();
+	scan = index_beginscan(heap_rel, index_rel, snapshot, NULL, 2, 0);
+
+	/* Set up scan keys */
+	{
+		ScanKeyData keys[2];
+
+		ScanKeyInit(&keys[0], 1, BTEqualStrategyNumber, suffix_eq_proc,
+					prefix_values[0]);
+		ScanKeyInit(&keys[1], 2, BTGreaterEqualStrategyNumber, suffix_ge_proc,
+					suffix_start);
+		index_rescan(scan, keys, 2, NULL, 0);
+	}
+
+	so = (BTScanOpaque) scan->opaque;
+
+	/* Initialize merge scan */
+	merge_state = bt_merge_init(scan, prefix_values, prefix_nulls,
+								num_prefixes, 1, 2, suffix_cmp_oid, InvalidOid);
+	so->mergeState = merge_state;
+
+	/* Execute scan */
+	max_results = (limit_count > 0) ? limit_count : MAX_RESULTS;
+
+	while (tuples_returned < max_results)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		if (!bt_merge_getnext(scan, ForwardScanDirection))
+			break;
+
+		if (collect_rows && result->prefixes != NULL)
+		{
+			/* Fetch heap tuple to get actual values */
+			HeapTupleData heapTuple;
+			Buffer		heapBuffer;
+			bool		isnull;
+
+			heapTuple.t_self = scan->xs_heaptid;
+			if (heap_fetch(heap_rel, snapshot, &heapTuple, &heapBuffer, false))
+			{
+				result->prefixes[tuples_returned] =
+					DatumGetInt32(heap_getattr(&heapTuple, 1,
+											   RelationGetDescr(heap_rel), &isnull));
+				result->suffixes[tuples_returned] =
+					DatumGetInt32(heap_getattr(&heapTuple, 2,
+											   RelationGetDescr(heap_rel), &isnull));
+				ReleaseBuffer(heapBuffer);
+			}
+		}
+
+		tuples_returned++;
+
+		if (tuples_returned >= MAX_RESULTS)
+		{
+			elog(WARNING, "merge scan hit safety limit of %d tuples", MAX_RESULTS);
+			break;
+		}
+	}
+
+	/* Collect results before cleanup */
+	result->tuples_returned = tuples_returned;
+	result->tuples_accessed = merge_state->tuples_accessed;
+	result->num_prefixes = num_prefixes;
+	result->limit_count = limit_count;
+
+	/* Clean up */
+	bt_merge_end(merge_state);
+	so->mergeState = NULL;
+	index_endscan(scan);
+	index_close(index_rel, AccessShareLock);
+	table_close(heap_rel, AccessShareLock);
+}
+
+/*
+ * build_stats_result - build the stats result tuple
+ */
+static Datum
+build_stats_result(FunctionCallInfo fcinfo, MergeScanResult *result)
+{
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+	HeapTuple	tuple;
+	int			max_required_fetches;
+
+	/* Calculate expected max fetches */
+	if (result->tuples_returned < result->limit_count)
+		max_required_fetches = result->tuples_returned;
+	else
+		max_required_fetches = result->limit_count + result->num_prefixes - 1;
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("function returning record called in context "
+						"that cannot accept type record")));
+
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	values[0] = Int32GetDatum(result->tuples_returned);
+	values[1] = Int32GetDatum((int32) result->tuples_accessed);
+	values[2] = Int32GetDatum(max_required_fetches);
+
+	tuple = heap_form_tuple(tupdesc, values, nulls);
+	return HeapTupleGetDatum(tuple);
+}
+
+
+/*
+ * test_btree_merge_scan_int - test merge scan with integer columns
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_scan_int);
+
+Datum
+test_btree_merge_scan_int(PG_FUNCTION_ARGS)
+{
+	text	   *table_name = PG_GETARG_TEXT_PP(0);
+	text	   *index_name = PG_GETARG_TEXT_PP(1);
+	ArrayType  *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+	int32		suffix_start = PG_GETARG_INT32(3);
+	int32		limit_count = PG_GETARG_INT32(4);
+	Datum	   *prefix_values;
+	bool	   *prefix_nulls;
+	int			num_prefixes;
+	MergeScanResult result = {0};
+
+	deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+					  &prefix_values, &prefix_nulls, &num_prefixes);
+
+	if (num_prefixes == 0)
+		elog(ERROR, "prefix_values array cannot be empty");
+
+	do_merge_scan(text_to_cstring(table_name),
+				  text_to_cstring(index_name),
+				  prefix_values, prefix_nulls, num_prefixes,
+				  Int32GetDatum(suffix_start), INT4OID,
+				  F_INT4EQ, F_INT4GE,
+				  limit_count, false, &result);
+
+	return build_stats_result(fcinfo, &result);
+}
+
+
+/*
+ * test_btree_merge_scan_ts - test merge scan with timestamp suffix
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_scan_ts);
+
+Datum
+test_btree_merge_scan_ts(PG_FUNCTION_ARGS)
+{
+	text	   *table_name = PG_GETARG_TEXT_PP(0);
+	text	   *index_name = PG_GETARG_TEXT_PP(1);
+	ArrayType  *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+	Timestamp	suffix_start = PG_GETARG_TIMESTAMP(3);
+	int32		limit_count = PG_GETARG_INT32(4);
+	Datum	   *prefix_values;
+	bool	   *prefix_nulls;
+	int			num_prefixes;
+	MergeScanResult result = {0};
+
+	deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+					  &prefix_values, &prefix_nulls, &num_prefixes);
+
+	if (num_prefixes == 0)
+		elog(ERROR, "prefix_values array cannot be empty");
+
+	do_merge_scan(text_to_cstring(table_name),
+				  text_to_cstring(index_name),
+				  prefix_values, prefix_nulls, num_prefixes,
+				  TimestampGetDatum(suffix_start), TIMESTAMPOID,
+				  F_INT4EQ, F_TIMESTAMP_GE,
+				  limit_count, false, &result);
+
+	return build_stats_result(fcinfo, &result);
+}
+
+
+/*
+ * test_btree_merge_fetch_int - fetch actual rows from merge scan
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_fetch_int);
+
+Datum
+test_btree_merge_fetch_int(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+
+	typedef struct
+	{
+		int32	   *prefixes;
+		int32	   *suffixes;
+		int			num_results;
+		int			current_idx;
+	} FetchContext;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		text	   *table_name = PG_GETARG_TEXT_PP(0);
+		text	   *index_name = PG_GETARG_TEXT_PP(1);
+		ArrayType  *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+		int32		suffix_start = PG_GETARG_INT32(3);
+		int32		limit_count = PG_GETARG_INT32(4);
+		Datum	   *prefix_values;
+		bool	   *prefix_nulls;
+		int			num_prefixes;
+		MemoryContext oldcontext;
+		FetchContext *fctx;
+		MergeScanResult result = {0};
+		TupleDesc	tupdesc;
+		int			max_results;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+						  &prefix_values, &prefix_nulls, &num_prefixes);
+
+		if (num_prefixes == 0)
+			elog(ERROR, "prefix_values array cannot be empty");
+
+		/* Allocate result storage */
+		max_results = (limit_count > 0) ? limit_count : MAX_RESULTS;
+		fctx = palloc(sizeof(FetchContext));
+		fctx->prefixes = palloc(max_results * sizeof(int32));
+		fctx->suffixes = palloc(max_results * sizeof(int32));
+		fctx->current_idx = 0;
+
+		/* Point result to our storage */
+		result.prefixes = fctx->prefixes;
+		result.suffixes = fctx->suffixes;
+
+		do_merge_scan(text_to_cstring(table_name),
+					  text_to_cstring(index_name),
+					  prefix_values, prefix_nulls, num_prefixes,
+					  Int32GetDatum(suffix_start), INT4OID,
+					  F_INT4EQ, F_INT4GE,
+					  limit_count, true, &result);
+
+		fctx->num_results = result.tuples_returned;
+
+		/* Build result tuple descriptor */
+		tupdesc = CreateTemplateTupleDesc(2);
+		TupleDescInitEntry(tupdesc, 1, "prefix_col", INT4OID, -1, 0);
+		TupleDescInitEntry(tupdesc, 2, "suffix_col", INT4OID, -1, 0);
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+		funcctx->user_fctx = fctx;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+
+	{
+		FetchContext *fctx = funcctx->user_fctx;
+
+		if (fctx->current_idx < fctx->num_results)
+		{
+			Datum		values[2];
+			bool		nulls[2] = {false, false};
+			HeapTuple	tuple;
+
+			values[0] = Int32GetDatum(fctx->prefixes[fctx->current_idx]);
+			values[1] = Int32GetDatum(fctx->suffixes[fctx->current_idx]);
+			fctx->current_idx++;
+
+			tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+			SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+		}
+		else
+		{
+			SRF_RETURN_DONE(funcctx);
+		}
+	}
+}
diff --git a/src/test/modules/test_btree_merge/test_btree_merge.control b/src/test/modules/test_btree_merge/test_btree_merge.control
new file mode 100644
index 00000000000..f8146bd0f74
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge.control
@@ -0,0 +1,5 @@
+# test_btree_merge extension
+comment = 'Unit tests for B-tree merge scan'
+default_version = '1.0'
+module_pathname = '$libdir/test_btree_merge'
+relocatable = true
-- 
2.40.0



  [application/octet-stream] 0003-MERGE-SCAN-Planner-integration.patch (27.6K, 4-0003-MERGE-SCAN-Planner-integration.patch)
  download | inline diff:
From ad123a3f8da3d95262b2553e90dd9c8fbb8d2335 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Thu, 5 Feb 2026 05:09:48 +0000
Subject: [PATCH 3/3] [MERGE-SCAN] Planner integration

---
 src/backend/access/index/genam.c          |   2 +
 src/backend/access/nbtree/nbtmergescan.c  |  60 ++++++-
 src/backend/access/nbtree/nbtree.c        | 129 +++++++++++++++
 src/backend/executor/nodeIndexonlyscan.c  |   5 +-
 src/backend/executor/nodeIndexscan.c      |  11 ++
 src/backend/optimizer/path/indxpath.c     | 188 ++++++++++++++++++++++
 src/backend/optimizer/plan/createplan.c   |   8 +
 src/backend/optimizer/util/pathnode.c     |   2 +
 src/include/access/relscan.h              |   3 +
 src/include/nodes/execnodes.h             |   5 +
 src/include/nodes/pathnodes.h             |   1 +
 src/include/nodes/plannodes.h             |   4 +
 src/test/regress/expected/btree_merge.out |  16 +-
 src/test/regress/sql/btree_merge.sql      |   9 ++
 14 files changed, 437 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 5e89b86a62c..53615fb08d2 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
 
+	scan->xs_num_merge_prefixes = 0;
+
 	return scan;
 }
 
diff --git a/src/backend/access/nbtree/nbtmergescan.c b/src/backend/access/nbtree/nbtmergescan.c
index 70828dc73d3..eda1e683525 100644
--- a/src/backend/access/nbtree/nbtmergescan.c
+++ b/src/backend/access/nbtree/nbtmergescan.c
@@ -27,6 +27,7 @@
 #include "access/relscan.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
@@ -169,7 +170,8 @@ bt_merge_init(IndexScanDesc scan,
 		cursor->exhausted = prefix_nulls[i];	/* NULL prefix = exhausted */
 		cursor->sort_key_isnull = true;
 		BTScanPosInvalidate(cursor->pos);
-		cursor->tuples = NULL;
+		/* Allocate tuple workspace for index-only scans */
+		cursor->tuples = palloc(BLCKSZ);
 	}
 
 	/* Initialize the merge heap */
@@ -219,6 +221,15 @@ bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
 				state->active_cursors++;
 			}
 		}
+
+		/*
+		 * Track internal tuple reads for stats. We read active_cursors tuples
+		 * during initialization. One of these will be returned first and
+		 * counted by index_getnext_tid, so we count (active_cursors - 1) here.
+		 */
+		if (state->active_cursors > 1)
+			pgstat_count_index_tuples(scan->indexRelation,
+									  state->active_cursors - 1);
 	}
 
 	/* Get the cursor with the smallest suffix value */
@@ -228,9 +239,15 @@ bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
 	node = pairingheap_remove_first(state->merge_heap);
 	cursor = pairingheap_container(BTMergeCursor, ph_node, node);
 
-	/* Set up the heap TID from the current cursor position */
+	/* Set up the heap TID and index tuple from the current cursor position */
 	Assert(BTScanPosIsValid(cursor->pos));
-	scan->xs_heaptid = cursor->pos.items[cursor->pos.itemIndex].heapTid;
+	{
+		BTScanPosItem *currItem = &cursor->pos.items[cursor->pos.itemIndex];
+		scan->xs_heaptid = currItem->heapTid;
+		/* For index-only scans, set the index tuple pointer */
+		if (cursor->tuples)
+			scan->xs_itup = (IndexTuple) (cursor->tuples + currItem->tupleOffset);
+	}
 
 	/* Advance cursor to next tuple */
 	if (bt_merge_cursor_advance(state, scan, cursor))
@@ -255,9 +272,23 @@ bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
 void
 bt_merge_end(BTMergeScanState *state)
 {
+	int			i;
+
 	if (state == NULL)
 		return;
 
+	/* Release any buffer pins held by cursors */
+	for (i = 0; i < state->num_cursors; i++)
+	{
+		BTMergeCursor *cursor = &state->cursors[i];
+
+		if (BTScanPosIsValid(cursor->pos) && BufferIsValid(cursor->pos.buf))
+		{
+			ReleaseBuffer(cursor->pos.buf);
+			cursor->pos.buf = InvalidBuffer;
+		}
+	}
+
 	/* Free the memory context, which frees all allocations */
 	MemoryContextDelete(state->merge_context);
 }
@@ -302,8 +333,14 @@ bt_merge_cursor_init(BTMergeScanState *state,
 	/* Invalidate current position to force _bt_first */
 	BTScanPosInvalidate(so->currPos);
 
-	/* Disable array key handling for this cursor's scan */
+	/*
+	 * Disable array key handling for this cursor's scan.
+	 * We need to clear both numArrayKeys and needPrimScan to avoid
+	 * assertions in _bt_readfirstpage that expect array keys when
+	 * needPrimScan is set.
+	 */
 	so->numArrayKeys = 0;
+	so->needPrimScan = false;
 
 	/* Position at first matching tuple */
 	found = _bt_first(scan, state->direction);
@@ -313,6 +350,16 @@ bt_merge_cursor_init(BTMergeScanState *state,
 		/* Copy position to cursor */
 		memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
 
+		/*
+		 * Copy the tuple data for index-only scans.
+		 * The tuple workspace contains copies of index tuples referenced
+		 * by items in currPos.
+		 */
+		if (so->currTuples && so->currPos.nextTupleOffset > 0)
+		{
+			memcpy(cursor->tuples, so->currTuples, so->currPos.nextTupleOffset);
+		}
+
 		/* Extract the sort key for heap ordering */
 		cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
 													&cursor->sort_key_isnull);
@@ -390,6 +437,11 @@ bt_merge_cursor_advance(BTMergeScanState *state,
 
 	if (found)
 	{
+		/*
+		 * Don't count here - the advanced-to tuple will be returned later
+		 * and counted by index_getnext_tid at that time.
+		 */
+
 		/* Extract new sort key */
 		cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
 													&cursor->sort_key_isnull);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3dec1ee657d..0e55c4874b4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -21,6 +21,8 @@
 #include "access/nbtree.h"
 #include "access/relscan.h"
 #include "access/stratnum.h"
+#include "catalog/pg_amop.h"
+#include "utils/array.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "nodes/execnodes.h"
@@ -34,6 +36,7 @@
 #include "utils/datum.h"
 #include "utils/fmgrprotos.h"
 #include "utils/index_selfuncs.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 
 
@@ -98,6 +101,8 @@ static void _bt_parallel_serialize_arrays(Relation rel, BTParallelScanDesc btsca
 										  BTScanOpaque so);
 static void _bt_parallel_restore_arrays(Relation rel, BTParallelScanDesc btscan,
 										BTScanOpaque so);
+static bool bt_init_merge_scan_from_keys(IndexScanDesc scan);
+
 static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 IndexBulkDeleteCallback callback, void *callback_state,
 						 BTCycleId cycleid);
@@ -221,6 +226,106 @@ btinsert(Relation rel, Datum *values, bool *isnull,
 	return result;
 }
 
+/*
+ * bt_init_merge_scan_from_keys
+ *		Initialize merge scan state from the preprocessed scan keys.
+ *
+ * Returns true if merge scan was successfully initialized.
+ * Returns false if merge scan cannot be used (e.g., no suitable array key).
+ */
+static bool
+bt_init_merge_scan_from_keys(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel = scan->indexRelation;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	ScanKey		arrayKey = NULL;
+	ArrayType  *arr;
+	Datum	   *prefix_values;
+	bool	   *prefix_nulls;
+	int			num_prefixes;
+	int			prefix_attno;
+	int			suffix_attno;
+	Oid			suffix_cmp_oid;
+	Oid			suffix_collation;
+	Oid			opfamily;
+	Oid			elemtype;
+	int16		elemlen;
+	bool		elembyval;
+	char		elemalign;
+	int			i;
+
+	/* Look for SK_SEARCHARRAY on first column in the raw scan keys */
+	for (i = 0; i < scan->numberOfKeys; i++)
+	{
+		ScanKey		sk = &scan->keyData[i];
+
+		if ((sk->sk_flags & SK_SEARCHARRAY) &&
+			sk->sk_attno == 1 &&
+			sk->sk_strategy == BTEqualStrategyNumber)
+		{
+			arrayKey = sk;
+			break;
+		}
+	}
+
+	if (arrayKey == NULL)
+		return false;
+
+	/* Extract array values from the scan key */
+	arr = DatumGetArrayTypeP(arrayKey->sk_argument);
+	num_prefixes = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
+
+	if (num_prefixes < 2)
+		return false;
+
+	/* Get array element type info */
+	elemtype = ARR_ELEMTYPE(arr);
+	get_typlenbyvalalign(elemtype, &elemlen, &elembyval, &elemalign);
+
+	/* Deconstruct the array into individual elements */
+	deconstruct_array(arr, elemtype, elemlen, elembyval, elemalign,
+					  &prefix_values, &prefix_nulls, &num_prefixes);
+
+	/* Attribute numbers (1-based) */
+	prefix_attno = 1;
+	suffix_attno = 2;
+
+	/* Get the opfamily from the index */
+	opfamily = rel->rd_opfamily[suffix_attno - 1];
+
+	/* Get collation from the suffix column */
+	suffix_collation = TupleDescAttr(itupdesc, suffix_attno - 1)->attcollation;
+
+	/* Get the comparison function OID for the suffix column */
+	suffix_cmp_oid = get_opfamily_proc(opfamily,
+									   TupleDescAttr(itupdesc, suffix_attno - 1)->atttypid,
+									   TupleDescAttr(itupdesc, suffix_attno - 1)->atttypid,
+									   BTORDER_PROC);
+
+	if (!OidIsValid(suffix_cmp_oid))
+	{
+		pfree(prefix_values);
+		pfree(prefix_nulls);
+		return false;
+	}
+
+	/* Initialize the merge scan state */
+	so->mergeState = bt_merge_init(scan,
+								   prefix_values,
+								   prefix_nulls,
+								   num_prefixes,
+								   prefix_attno,
+								   suffix_attno,
+								   suffix_cmp_oid,
+								   suffix_collation);
+
+	pfree(prefix_values);
+	pfree(prefix_nulls);
+
+	return (so->mergeState != NULL);
+}
+
 /*
  *	btgettuple() -- Get the next tuple in the scan.
  */
@@ -235,6 +340,24 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 	/* btree indexes are never lossy */
 	scan->xs_recheck = false;
 
+	/*
+	 * Check if merge scan optimization should be used.
+	 * Initialize merge scan state on first call if needed.
+	 */
+	if (scan->xs_num_merge_prefixes > 0 && so->mergeState == NULL)
+	{
+		if (!bt_init_merge_scan_from_keys(scan))
+		{
+			/* Merge scan init failed, fall through to regular scan */
+			scan->xs_num_merge_prefixes = 0;
+		}
+	}
+
+	/* Use merge scan if initialized */
+	/* Use merge scan if initialized */
+	if (so->mergeState != NULL)
+		return bt_merge_getnext(scan, dir);
+
 	/* Each loop iteration performs another primitive index scan */
 	do
 	{
@@ -365,6 +488,9 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->killedItems = NULL;		/* until needed */
 	so->numKilled = 0;
 
+	/* Initialize merge scan state to NULL */
+	so->mergeState = NULL;
+
 	/*
 	 * We don't know yet whether the scan will be index-only, so we do not
 	 * allocate the tuple workspace arrays until btrescan.  However, we set up
@@ -486,6 +612,9 @@ btendscan(IndexScanDesc scan)
 		pfree(so->killedItems);
 	if (so->currTuples != NULL)
 		pfree(so->currTuples);
+	/* Clean up merge scan state */
+	if (so->mergeState != NULL)
+		bt_merge_end(so->mergeState);
 	/* so->markTuples should not be pfree'd, see btrescan */
 	pfree(so);
 }
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index c2d09374517..70483c4e767 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -98,6 +98,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 
 		node->ioss_ScanDesc = scandesc;
 
+		scandesc->xs_num_merge_prefixes = node->ioss_NumMergePrefixes;
 
 		/* Set it up for index-only scan */
 		node->ioss_ScanDesc->xs_want_itup = true;
@@ -615,7 +616,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	indexstate->ioss_RuntimeKeysReady = false;
 	indexstate->ioss_RuntimeKeys = NULL;
 	indexstate->ioss_NumRuntimeKeys = 0;
-
+	indexstate->ioss_NumMergePrefixes = node->num_merge_prefixes;
 	/*
 	 * build the index scan keys from the index qualification
 	 */
@@ -790,6 +791,7 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 								 node->ioss_NumOrderByKeys,
 								 piscan);
 	node->ioss_ScanDesc->xs_want_itup = true;
+	node->ioss_ScanDesc->xs_num_merge_prefixes = node->ioss_NumMergePrefixes;
 	node->ioss_VMBuffer = InvalidBuffer;
 
 	/*
@@ -856,6 +858,7 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 								 node->ioss_NumOrderByKeys,
 								 piscan);
 	node->ioss_ScanDesc->xs_want_itup = true;
+	node->ioss_ScanDesc->xs_num_merge_prefixes = node->ioss_NumMergePrefixes;
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index a616abff04c..9e62cacd2d3 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,6 +115,7 @@ IndexNext(IndexScanState *node)
 
 		node->iss_ScanDesc = scandesc;
 
+		scandesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
 		 * pass the scankeys to the index AM.
@@ -211,6 +212,8 @@ IndexNextWithReorder(IndexScanState *node)
 
 		node->iss_ScanDesc = scandesc;
 
+		scandesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
+
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
 		 * pass the scankeys to the index AM.
@@ -1086,6 +1089,11 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 		indexstate->iss_RuntimeContext = NULL;
 	}
 
+	/*
+	 * Initialize merge scan state from plan node
+	 */
+	indexstate->iss_NumMergePrefixes = node->num_merge_prefixes;
+
 	/*
 	 * all done.
 	 */
@@ -1725,6 +1733,8 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
 								 node->iss_NumOrderByKeys,
 								 piscan);
 
+	node->iss_ScanDesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
+
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
 	 * the scankeys to the index AM.
@@ -1789,6 +1799,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 								 node->iss_NumOrderByKeys,
 								 piscan);
 
+	node->iss_ScanDesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
 	 * the scankeys to the index AM.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 67d9dc35f44..44b79f91335 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -16,6 +16,7 @@
 #include "postgres.h"
 
 #include "access/stratnum.h"
+#include "utils/array.h"
 #include "access/sysattr.h"
 #include "access/transam.h"
 #include "catalog/pg_am.h"
@@ -102,6 +103,8 @@ static bool eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
 static void get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 							IndexOptInfo *index, IndexClauseSet *clauses,
 							List **bitindexpaths);
+static void consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
+									 IndexOptInfo *index, IndexClauseSet *clauses);
 static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 							   IndexOptInfo *index, IndexClauseSet *clauses,
 							   bool useful_predicate,
@@ -770,6 +773,191 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									   NULL);
 		*bitindexpaths = list_concat(*bitindexpaths, indexpaths);
 	}
+
+	/*
+	 * Consider merge scan optimization for queries with:
+	 * - ScalarArrayOpExpr (IN clause) on first index column
+	 * - ORDER BY on second column (different from index leading column)
+	 * - Optionally LIMIT
+	 */
+	consider_merge_scan_path(root, rel, index, clauses);
+}
+
+/*
+ * consider_merge_scan_path
+ *	  Check if this index can provide a merge scan path for queries of the form:
+ *	  WHERE prefix IN (...) AND suffix >= b ORDER BY suffix, prefix LIMIT N
+ *
+ *	  Merge scan allows lazily producing output sorted by (suffix, prefix) from
+ *	  an index on (prefix, suffix) by doing a K-way merge of K separate scans.
+ */
+static void
+consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
+						 IndexOptInfo *index, IndexClauseSet *clauses)
+{
+	IndexPath  *ipath;
+	List	   *index_clauses;
+	List	   *index_pathkeys;
+	List	   *merge_pathkeys;
+	ListCell   *lc;
+	int			num_prefixes = 0;
+	int			indexcol;
+	bool		has_saop_on_first = false;
+	bool		has_clause_on_second = false;
+
+	/* Need at least 2 index columns for merge scan */
+	if (index->nkeycolumns < 2)
+		return;
+
+	/* Index must be ordered and support gettuple */
+	if (index->sortopfamily == NULL || !index->amhasgettuple)
+		return;
+
+	/* Must have query pathkeys with at least 2 elements */
+	if (root->query_pathkeys == NIL || list_length(root->query_pathkeys) < 2)
+		return;
+
+	/*
+	 * Check for ScalarArrayOpExpr on first column.
+	 * Count the number of array elements (prefix values).
+	 */
+	foreach(lc, clauses->indexclauses[0])
+	{
+		IndexClause *iclause = (IndexClause *) lfirst(lc);
+		RestrictInfo *rinfo = iclause->rinfo;
+
+		if (IsA(rinfo->clause, ScalarArrayOpExpr))
+		{
+			ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
+			Node	   *arrayarg = (Node *) lsecond(saop->args);
+
+			has_saop_on_first = true;
+
+			/* Try to determine the number of array elements */
+			if (IsA(arrayarg, Const))
+			{
+				Const	   *con = (Const *) arrayarg;
+
+				if (!con->constisnull)
+				{
+					ArrayType  *arr = DatumGetArrayTypeP(con->constvalue);
+					num_prefixes = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
+				}
+			}
+			else
+			{
+				/* Can't determine size, estimate conservatively */
+				num_prefixes = 10;
+			}
+			break;
+		}
+	}
+
+	if (!has_saop_on_first || num_prefixes < 2)
+		return;
+
+	/* Check if there's any clause on second column */
+	if (clauses->indexclauses[1] != NIL)
+		has_clause_on_second = true;
+
+	if (!has_clause_on_second)
+		return;
+
+	/*
+	 * Get the natural index pathkeys (prefix, suffix order).
+	 * We need at least 2 pathkeys for merge scan to make sense.
+	 */
+	index_pathkeys = build_index_pathkeys(root, index, ForwardScanDirection);
+	if (list_length(index_pathkeys) < 2)
+		return;
+
+	/*
+	 * Check if query pathkeys are (suffix, prefix) - the REVERSED order.
+	 * query_pathkeys[0] should match index_pathkeys[1] (suffix)
+	 * query_pathkeys[1] should match index_pathkeys[0] (prefix)
+	 */
+	{
+		PathKey    *qpk0 = (PathKey *) linitial(root->query_pathkeys);
+		PathKey    *qpk1 = (PathKey *) lsecond(root->query_pathkeys);
+		PathKey    *ipk0 = (PathKey *) linitial(index_pathkeys);
+		PathKey    *ipk1 = (PathKey *) lsecond(index_pathkeys);
+
+		/* Query's first pathkey must match index's SECOND pathkey (suffix) */
+		if (qpk0->pk_eclass != ipk1->pk_eclass)
+			return;
+
+		/* Query's second pathkey must match index's FIRST pathkey (prefix) */
+		if (qpk1->pk_eclass != ipk0->pk_eclass)
+			return;
+	}
+
+	/*
+	 * The merge scan can satisfy the query's ORDER BY (suffix, prefix).
+	 * Use the query's pathkeys directly since we've verified they match.
+	 * This is critical: PostgreSQL compares pathkeys by pointer equality.
+	 */
+	merge_pathkeys = root->query_pathkeys;
+
+	/*
+	 * Build the index clause list (same as normal path).
+	 */
+	index_clauses = NIL;
+	for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
+	{
+		foreach(lc, clauses->indexclauses[indexcol])
+		{
+			IndexClause *iclause = (IndexClause *) lfirst(lc);
+			index_clauses = lappend(index_clauses, iclause);
+		}
+	}
+
+	/*
+	 * Create the merge scan path with (suffix, prefix) pathkeys.
+	 */
+	ipath = create_index_path(root, index,
+							  index_clauses,
+							  NIL,		/* no ORDER BY expressions */
+							  NIL,		/* no ORDER BY columns */
+							  merge_pathkeys,
+							  ForwardScanDirection,
+							  check_index_only(rel, index),
+							  NULL,		/* no outer relids */
+							  1.0,		/* loop_count */
+							  false);	/* not parallel */
+
+	/* Enable merge scan with K-way merge */
+	ipath->num_merge_prefixes = num_prefixes;
+
+	/*
+	 * Adjust costs and row estimate for merge scan.
+	 * Merge scan reads exactly (limit + K - 1) tuples instead of all matching.
+	 * The row estimate reflects actual tuple accesses, not total matches.
+	 */
+	if (root->limit_tuples > 0 && root->limit_tuples < ipath->path.rows)
+	{
+		double		merge_rows;
+		double		original_rows = ipath->path.rows;
+
+		/* Merge scan reads exactly (limit + K - 1) tuples */
+		merge_rows = root->limit_tuples + num_prefixes - 1;
+		if (merge_rows < original_rows)
+		{
+			double		ratio = merge_rows / original_rows;
+
+			/* Scale run cost by ratio of tuples accessed */
+			ipath->path.total_cost = ipath->path.startup_cost +
+				(ipath->path.total_cost - ipath->path.startup_cost) * ratio;
+
+			/* Add startup cost for K index descents */
+			ipath->path.startup_cost += num_prefixes * 0.01 * cpu_operator_cost;
+
+			/* Update row estimate to reflect merge scan efficiency */
+			ipath->path.rows = merge_rows;
+		}
+	}
+
+	/* Submit the path for consideration */
+	add_path(rel, (Path *) ipath);
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e5200f4b3ce..485b4b3e54e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -184,12 +184,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 								 Oid indexid, List *indexqual, List *indexqualorig,
 								 List *indexorderby, List *indexorderbyorig,
 								 List *indexorderbyops,
+								 int num_merge_prefixes,
 								 ScanDirection indexscandir);
 static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
 										 Index scanrelid, Oid indexid,
 										 List *indexqual, List *recheckqual,
 										 List *indexorderby,
 										 List *indextlist,
+										 int num_merge_prefixes,
 										 ScanDirection indexscandir);
 static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
 											  List *indexqual,
@@ -3009,6 +3011,7 @@ create_indexscan_plan(PlannerInfo *root,
 												stripped_indexquals,
 												fixed_indexorderbys,
 												indexinfo->indextlist,
+												best_path->num_merge_prefixes,
 												best_path->indexscandir);
 	else
 		scan_plan = (Scan *) make_indexscan(tlist,
@@ -3020,6 +3023,7 @@ create_indexscan_plan(PlannerInfo *root,
 											fixed_indexorderbys,
 											indexorderbys,
 											indexorderbyops,
+											best_path->num_merge_prefixes,
 											best_path->indexscandir);
 
 	copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5527,6 +5531,7 @@ make_indexscan(List *qptlist,
 			   List *indexorderby,
 			   List *indexorderbyorig,
 			   List *indexorderbyops,
+			   int num_merge_prefixes,
 			   ScanDirection indexscandir)
 {
 	IndexScan  *node = makeNode(IndexScan);
@@ -5543,6 +5548,7 @@ make_indexscan(List *qptlist,
 	node->indexorderby = indexorderby;
 	node->indexorderbyorig = indexorderbyorig;
 	node->indexorderbyops = indexorderbyops;
+	node->num_merge_prefixes = num_merge_prefixes;
 	node->indexorderdir = indexscandir;
 
 	return node;
@@ -5557,6 +5563,7 @@ make_indexonlyscan(List *qptlist,
 				   List *recheckqual,
 				   List *indexorderby,
 				   List *indextlist,
+				   int num_merge_prefixes,
 				   ScanDirection indexscandir)
 {
 	IndexOnlyScan *node = makeNode(IndexOnlyScan);
@@ -5572,6 +5579,7 @@ make_indexonlyscan(List *qptlist,
 	node->recheckqual = recheckqual;
 	node->indexorderby = indexorderby;
 	node->indextlist = indextlist;
+	node->num_merge_prefixes = num_merge_prefixes;
 	node->indexorderdir = indexscandir;
 
 	return node;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7b6c5d51e5d..21746cd684c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1075,6 +1075,8 @@ create_index_path(PlannerInfo *root,
 	pathnode->indexorderbycols = indexorderbycols;
 	pathnode->indexscandir = indexscandir;
 
+	pathnode->num_merge_prefixes = 0;
+
 	cost_index(pathnode, root, loop_count, partial_path);
 
 	return pathnode;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce340c076f8..fc55315ee07 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -190,6 +190,9 @@ typedef struct IndexScanDescData
 
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
+
+	/* Merge scan: K-way merge, ordered by an index suffix */
+	int			xs_num_merge_prefixes;
 } IndexScanDescData;
 
 /* Generic structure for parallel scans */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f8053d9e572..4433d1c2612 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1734,6 +1734,9 @@ typedef struct IndexScanState
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
 	Size		iss_PscanLen;
+
+	/* Merge scan: K-way merge */
+	int			iss_NumMergePrefixes;
 } IndexScanState;
 
 /* ----------------
@@ -1780,6 +1783,8 @@ typedef struct IndexOnlyScanState
 	Size		ioss_PscanLen;
 	AttrNumber *ioss_NameCStringAttNums;
 	int			ioss_NameCStringCount;
+	/* Merge scan: K-way merge */
+	int			ioss_NumMergePrefixes;
 } IndexOnlyScanState;
 
 /* ----------------
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index fb808823acf..ced7e224a87 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2040,6 +2040,7 @@ typedef struct IndexPath
 	ScanDirection indexscandir;
 	Cost		indextotalcost;
 	Selectivity indexselectivity;
+	int			num_merge_prefixes;
 } IndexPath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4bc6fb5670e..86d8c92e01f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -597,6 +597,8 @@ typedef struct IndexScan
 	List	   *indexorderbyops;
 	/* forward or backward or don't care */
 	ScanDirection indexorderdir;
+	/* Merge scan: K-way merge */
+	int			num_merge_prefixes;
 } IndexScan;
 
 /* ----------------
@@ -645,6 +647,8 @@ typedef struct IndexOnlyScan
 	List	   *indextlist;
 	/* forward or backward or don't care */
 	ScanDirection indexorderdir;
+	/* Merge scan: K-way merge */
+	int			num_merge_prefixes;
 } IndexOnlyScan;
 
 /* ----------------
diff --git a/src/test/regress/expected/btree_merge.out b/src/test/regress/expected/btree_merge.out
index 441ae1d0657..28509b331d7 100644
--- a/src/test/regress/expected/btree_merge.out
+++ b/src/test/regress/expected/btree_merge.out
@@ -82,6 +82,20 @@ SHOW track_counts;  -- should be 'on'
  on
 (1 row)
 
+-- Verify merge scan is used: no Sort node, rows=10 (N + K - 1 = 3 + 8 - 1)
+EXPLAIN (COSTS OFF)
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x
+LIMIT 3;
+                                     QUERY PLAN                                     
+------------------------------------------------------------------------------------
+ Limit
+   ->  Index Only Scan using btree_merge_test_idx on btree_merge_test
+         Index Cond: ((x = ANY ('{1,2,5,8,13,21,34,55}'::integer[])) AND (y >= 19))
+(3 rows)
+
 -- From the limited query proposition this can be computed with 10
 -- tupple accesses.
 SELECT x, y
@@ -107,7 +121,7 @@ FROM pg_stat_user_indexes
 WHERE indexrelname = 'btree_merge_test_idx';
  idx_scan | idx_tup_read | idx_tup_fetch 
 ----------+--------------+---------------
-        5 |           10 |            10
+        8 |            9 |             3
 (1 row)
 
 DROP TABLE btree_merge_test;
diff --git a/src/test/regress/sql/btree_merge.sql b/src/test/regress/sql/btree_merge.sql
index be00c33c2a5..ad9cf03f869 100644
--- a/src/test/regress/sql/btree_merge.sql
+++ b/src/test/regress/sql/btree_merge.sql
@@ -81,6 +81,15 @@ ANALYSE btree_merge_test;
 SET enable_seqscan = OFF;
 SET enable_bitmapscan = OFF;
 SHOW track_counts;  -- should be 'on'
+
+-- Verify merge scan is used: no Sort node, rows=10 (N + K - 1 = 3 + 8 - 1)
+EXPLAIN (COSTS OFF)
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x
+LIMIT 3;
+
 -- From the limited query proposition this can be computed with 10
 -- tupple accesses.
 SELECT x, y
-- 
2.40.0



  [application/octet-stream] 0001-MERGE-SCAN-Test-the-baseline.patch (7.5K, 5-0001-MERGE-SCAN-Test-the-baseline.patch)
  download | inline diff:
From 6dc67b16668edc64dd820c5a313c849cd47da6c3 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 30 Jan 2026 08:35:15 +0000
Subject: [PATCH 1/3] [MERGE-SCAN]: Test the baseline

---
 src/test/regress/expected/btree_merge.out | 113 ++++++++++++++++++++++
 src/test/regress/sql/btree_merge.sql      | 100 +++++++++++++++++++
 2 files changed, 213 insertions(+)
 create mode 100644 src/test/regress/expected/btree_merge.out
 create mode 100644 src/test/regress/sql/btree_merge.sql

diff --git a/src/test/regress/expected/btree_merge.out b/src/test/regress/expected/btree_merge.out
new file mode 100644
index 00000000000..441ae1d0657
--- /dev/null
+++ b/src/test/regress/expected/btree_merge.out
@@ -0,0 +1,113 @@
+-- B-Tree Merge Scan Access Method Test
+--
+-- B-Tree Merge Scan is an access method that allows lazily producing
+-- output sorted by a non-leading column when the prefix has few distinct values.
+--
+--
+-- Let S be an infinite set of lattic points (x,y).
+-- Let S(x=1,y>=b) be the sequence of points 
+--     SELECT * FROM S WHERE x = a and y >= b ORDER BY b;
+--     i.e. (a, b), (a, b+1), (a, b+2), ...
+-- Similarly, S(x IN X, y=b) being the sequence of points
+--     SELECT * FROM S WHERE x IN X and y = b ORDER BY x;
+--     i.e. (x[1], b), ..., (x[n], b), (x[1], b+1), ...
+-- The output of S(x IN X, y >= b) can be computed as a
+--
+-- Proposition (uncomputable): 
+-- S(x, IN X, y >= b) is the K-way merge of the sequences 
+-- {S(x=x[i], y >= b), x[i] in X}
+-- 
+--
+--
+-- Proposition (computable): Bounded suffix
+--
+-- S(x, IN X, b1 <= y <= b2) as bounded
+-- can be computed with (SELECT count(distinct x) + count(1) FROM bounded)
+-- tuple accesses.
+-- (Constructive) Proof:
+-- The result of 
+--    SELECT * FROM X 
+--    JOIN S on x = x[i] WHERE y BETWEEN b1 AND b2;
+-- is the same as 
+--    SELECT * FROM X,
+--    LATERAL (
+--        (SELECT * FROM S 
+--           WHERE x = x[i] AND y BETWEEN b1 AND b2
+--        ) AS subscan[i]
+--    ) as merged
+-- 
+-- Each of subscan[i] is covered by a single range in the index and can
+-- and require at most 
+--  (count(1) FROM subscan[i]) + 1    -- subscan tuple access count
+-- tupples to be accessed.
+-- The merged result can be computed using a K-way merge sort 
+-- whose number of rows is
+--   sum(count(1) FROM subscan[i])    -- query output rows
+-- Q.E.D.
+--
+--
+-- Proposition (computable): Limitted query
+-- The query
+--   S(x, IN X, y >= b) LIMIT N as limited
+-- Can be computed with at most 
+--   N + count(distinct X) - 1
+-- tuple accesses.
+--
+-- (Constructive) Proof:
+-- If an upper `u` bound for `MAX(y IN S(x, IN X, y >= b) LIMIT N)` is known,
+-- then the query can be rewritten as
+--   S(x, IN X, b <= y <= u) LIMIT N
+-- The K-way can produce the next element as soon as it has fetched
+-- the next element for each subquery
+-- 1 row can be produced after count(distinct X) fetches,
+-- After that it can produce one new row for each fetch.
+-- Thus, the total number of fetches is at most
+--   N + count(distinct X) - 1
+-- Q.E.D.
+-- Generate a table with lattice points 
+-- Could be infinite
+CREATE TABLE btree_merge_test AS (
+    SELECT x, y FROM 
+        generate_series(1, 50) AS x, 
+        generate_series(1, 50) AS y
+    ORDER BY random()
+);
+CREATE INDEX btree_merge_test_idx ON btree_merge_test USING btree (x, y);
+ANALYSE btree_merge_test;
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+SHOW track_counts;  -- should be 'on'
+ track_counts 
+--------------
+ on
+(1 row)
+
+-- From the limited query proposition this can be computed with 10
+-- tupple accesses.
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x -- sort x to make result unique
+LIMIT 3;
+ x | y  
+---+----
+ 1 | 19
+ 2 | 19
+ 5 | 19
+(3 rows)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT idx_scan, idx_tup_read, idx_tup_fetch 
+FROM pg_stat_user_indexes 
+WHERE indexrelname = 'btree_merge_test_idx';
+ idx_scan | idx_tup_read | idx_tup_fetch 
+----------+--------------+---------------
+        5 |           10 |            10
+(1 row)
+
+DROP TABLE btree_merge_test;
diff --git a/src/test/regress/sql/btree_merge.sql b/src/test/regress/sql/btree_merge.sql
new file mode 100644
index 00000000000..be00c33c2a5
--- /dev/null
+++ b/src/test/regress/sql/btree_merge.sql
@@ -0,0 +1,100 @@
+-- B-Tree Merge Scan Access Method Test
+--
+-- B-Tree Merge Scan is an access method that allows lazily producing
+-- output sorted by a non-leading column when the prefix has few distinct values.
+--
+--
+-- Let S be an infinite set of lattic points (x,y).
+-- Let S(x=1,y>=b) be the sequence of points 
+--     SELECT * FROM S WHERE x = a and y >= b ORDER BY b;
+--     i.e. (a, b), (a, b+1), (a, b+2), ...
+-- Similarly, S(x IN X, y=b) being the sequence of points
+--     SELECT * FROM S WHERE x IN X and y = b ORDER BY x;
+--     i.e. (x[1], b), ..., (x[n], b), (x[1], b+1), ...
+-- The output of S(x IN X, y >= b) can be computed as a
+--
+-- Proposition (uncomputable): 
+-- S(x, IN X, y >= b) is the K-way merge of the sequences 
+-- {S(x=x[i], y >= b), x[i] in X}
+-- 
+--
+--
+-- Proposition (computable): Bounded suffix
+--
+-- S(x, IN X, b1 <= y <= b2) as bounded
+-- can be computed with (SELECT count(distinct x) + count(1) FROM bounded)
+-- tuple accesses.
+-- (Constructive) Proof:
+-- The result of 
+--    SELECT * FROM X 
+--    JOIN S on x = x[i] WHERE y BETWEEN b1 AND b2;
+-- is the same as 
+--    SELECT * FROM X,
+--    LATERAL (
+--        (SELECT * FROM S 
+--           WHERE x = x[i] AND y BETWEEN b1 AND b2
+--        ) AS subscan[i]
+--    ) as merged
+-- 
+-- Each of subscan[i] is covered by a single range in the index and can
+-- and require at most 
+--  (count(1) FROM subscan[i]) + 1    -- subscan tuple access count
+-- tupples to be accessed.
+-- The merged result can be computed using a K-way merge sort 
+-- whose number of rows is
+--   sum(count(1) FROM subscan[i])    -- query output rows
+-- Q.E.D.
+--
+--
+-- Proposition (computable): Limitted query
+-- The query
+--   S(x, IN X, y >= b) LIMIT N as limited
+-- Can be computed with at most 
+--   N + count(distinct X) - 1
+-- tuple accesses.
+--
+-- (Constructive) Proof:
+-- If an upper `u` bound for `MAX(y IN S(x, IN X, y >= b) LIMIT N)` is known,
+-- then the query can be rewritten as
+--   S(x, IN X, b <= y <= u) LIMIT N
+-- The K-way can produce the next element as soon as it has fetched
+-- the next element for each subquery
+-- 1 row can be produced after count(distinct X) fetches,
+-- After that it can produce one new row for each fetch.
+-- Thus, the total number of fetches is at most
+--   N + count(distinct X) - 1
+-- Q.E.D.
+
+
+-- Generate a table with lattice points 
+-- Could be infinite
+CREATE TABLE btree_merge_test AS (
+    SELECT x, y FROM 
+        generate_series(1, 50) AS x, 
+        generate_series(1, 50) AS y
+    ORDER BY random()
+);
+CREATE INDEX btree_merge_test_idx ON btree_merge_test USING btree (x, y);
+
+ANALYSE btree_merge_test;
+
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+SHOW track_counts;  -- should be 'on'
+-- From the limited query proposition this can be computed with 10
+-- tupple accesses.
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x -- sort x to make result unique
+LIMIT 3;
+
+
+SELECT pg_stat_force_next_flush();
+
+
+SELECT idx_scan, idx_tup_read, idx_tup_fetch 
+FROM pg_stat_user_indexes 
+WHERE indexrelname = 'btree_merge_test_idx';
+
+DROP TABLE btree_merge_test;
\ No newline at end of file
-- 
2.40.0

view thread (12+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: New access method for b-tree.
  In-Reply-To: <CAE8JnxNM9Bh=LCGOzayewDgX3-kUNXdTDwNSDwsf+t=wKhPiCQ@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox