public inbox for [email protected]
* New access method for b-tree.
@ 2026-02-01 10:02 Alexandre Felipe <[email protected]>
2026-02-01 23:54 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
0 siblings, 1 reply; 12+ messages in thread
From: Alexandre Felipe @ 2026-02-01 10:02 UTC
To: pgsql-hackers
Hello Hackers,
Please check this out.
It is an access method that scans a table in order of the second column of an index while filtering on the first.
Queries like
SELECT x, y FROM grid
WHERE x in (array of Nx elements)
ORDER BY y, x
LIMIT M
can execute by streaming rows directly from disk instead of loading everything.
Using a btree index on (x, y), a query on an N x N grid will run by fetching only what is necessary.
A skip scan will run with O(N * Nx) I/O, O(N * Nx) space, and O(N * Nx * log(N * Nx)) compute (assuming a generic in-memory sort).
The proposed access method does it in O(M + Nx) I/O, O(Nx) space, and O(M * log(Nx)) compute.
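
To make the idea concrete, here is a minimal standalone sketch of the K-way merge in plain C. It is not the code from the patch: the names Cursor, Row and merge_first_m are hypothetical, and each per-x run of y values is an in-memory sorted array standing in for a btree sub-scan. The heap holds one cursor per IN-list value and is ordered by (y, x), so producing M rows costs about O(Nx) to build the heap and O(log(Nx)) per row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one index sub-scan: the y values for a fixed x,
 * already in ascending order, as a btree on (x, y) would return them. */
typedef struct {
    int x;           /* prefix value of this run */
    const int *ys;   /* y values, ascending */
    size_t len;      /* number of y values */
    size_t pos;      /* current position in ys */
} Cursor;

typedef struct { int x; int y; } Row;

/* True if cursor a's current row sorts before b's under ORDER BY y, x. */
static int cursor_less(const Cursor *a, const Cursor *b)
{
    int ya = a->ys[a->pos], yb = b->ys[b->pos];

    if (ya != yb)
        return ya < yb;
    return a->x < b->x;
}

/* Restore the min-heap property below index i. */
static void sift_down(Cursor **heap, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        Cursor *tmp;

        if (l < n && cursor_less(heap[l], heap[m])) m = l;
        if (r < n && cursor_less(heap[r], heap[m])) m = r;
        if (m == i)
            return;
        tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
        i = m;
    }
}

/* Write up to limit rows in (y, x) order into out and return the count.
 * heap is caller-provided scratch space for k cursor pointers. */
size_t merge_first_m(Cursor *cursors, size_t k, Cursor **heap,
                     Row *out, size_t limit)
{
    size_t n = 0, produced = 0;

    assert(heap != NULL || k == 0);

    /* Position every non-empty cursor on its first row. */
    for (size_t i = 0; i < k; i++) {
        cursors[i].pos = 0;
        if (cursors[i].len > 0)
            heap[n++] = &cursors[i];
    }
    /* Build the heap bottom-up. */
    for (size_t i = n / 2; i-- > 0;)
        sift_down(heap, n, i);

    while (produced < limit && n > 0) {
        Cursor *top = heap[0];

        out[produced].x = top->x;
        out[produced].y = top->ys[top->pos];
        produced++;

        /* Advance the winning cursor; drop it once its run is exhausted. */
        if (++top->pos == top->len)
            heap[0] = heap[--n];
        if (n > 0)
            sift_down(heap, n, 0);
    }
    return produced;
}
```

Only the cursor at the top of the heap is advanced after each returned row, which is why the amount of data touched grows with M + Nx rather than with the total number of matching rows.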
Kind Regards,
Alexandre Felipe
Research & Development Engineer
Attachments:
[application/octet-stream] btree_merge_rebased.patch (54.5K, 3-btree_merge_rebased.patch)
diff --git a/.gitignore b/.gitignore
index 4e911395fe3..ac1f95d9cf0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,3 +43,11 @@ lib*.pc
/Release/
/tmp_install/
/portlock/
+
+# hidden files (e.g. .dbdata, .install, good practice to test locally in isolation)
+.*
+
+# Test output
+**/regression.diffs
+**/regression.out
+**/results/
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index 0daf640af96..72053cefdaa 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -16,6 +16,7 @@ OBJS = \
nbtcompare.o \
nbtdedup.o \
nbtinsert.o \
+ nbtmergescan.o \
nbtpage.o \
nbtpreprocesskeys.o \
nbtreadpage.o \
diff --git a/src/backend/access/nbtree/meson.build b/src/backend/access/nbtree/meson.build
index 812f067e710..1016fea62d5 100644
--- a/src/backend/access/nbtree/meson.build
+++ b/src/backend/access/nbtree/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'nbtcompare.c',
'nbtdedup.c',
'nbtinsert.c',
+ 'nbtmergescan.c',
'nbtpage.c',
'nbtpreprocesskeys.c',
'nbtreadpage.c',
diff --git a/src/backend/access/nbtree/nbtmergescan.c b/src/backend/access/nbtree/nbtmergescan.c
new file mode 100644
index 00000000000..70828dc73d3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtmergescan.c
@@ -0,0 +1,457 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtmergescan.c
+ * B-Tree merge scan for efficient evaluation of IN-list queries
+ *
+ * This module implements a K-way merge scan for B-tree indexes, optimized
+ * for queries of the form:
+ * WHERE prefix IN (v1, v2, ..., vK) AND suffix >= b ORDER BY suffix LIMIT N
+ *
+ * The algorithm maintains a min-heap of cursors, one per prefix value.
+ * Each cursor tracks its position within the index for that prefix.
+ * Tuples are returned in suffix order by repeatedly extracting the
+ * minimum from the heap.
+ *
+ * Target behavior: Access at most N + K - 1 index tuples for LIMIT N.
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtmergescan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "lib/pairingheap.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/* Forward declarations of static functions */
+static int bt_merge_heap_cmp(const pairingheap_node *a,
+ const pairingheap_node *b,
+ void *arg);
+static bool bt_merge_cursor_init(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ Datum prefix_value,
+ bool prefix_isnull);
+static bool bt_merge_cursor_advance(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor);
+static Datum bt_merge_extract_sortkey(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ bool *isnull);
+
+
+/*
+ * bt_merge_heap_cmp
+ * Compare two cursors by their current sort key (suffix value).
+ *
+ * When sort keys are equal, uses prefix value as tiebreaker for
+ * deterministic ordering (ORDER BY suffix, prefix).
+ *
+ * Returns negative if a sorts after b: pairingheap is a max-heap, so the
+ * comparison is negated to get min-heap behavior.
+ */
+static int
+bt_merge_heap_cmp(const pairingheap_node *a,
+ const pairingheap_node *b,
+ void *arg)
+{
+ BTMergeScanState *state = (BTMergeScanState *) arg;
+ BTMergeCursor *cursor_a = pairingheap_container(BTMergeCursor, ph_node,
+ (pairingheap_node *) a);
+ BTMergeCursor *cursor_b = pairingheap_container(BTMergeCursor, ph_node,
+ (pairingheap_node *) b);
+ Datum key_a = cursor_a->sort_key;
+ Datum key_b = cursor_b->sort_key;
+ bool null_a = cursor_a->sort_key_isnull;
+ bool null_b = cursor_b->sort_key_isnull;
+ int32 cmp;
+
+ /* Handle NULLs - NULLs sort last (NULLS LAST default for ASC) */
+ if (null_a && null_b)
+ return 0;
+ if (null_a)
+ return -1; /* a is NULL, comes after b */
+ if (null_b)
+ return 1; /* b is NULL, comes after a */
+
+ /* Compare using the suffix column's comparison function */
+ cmp = DatumGetInt32(FunctionCall2Coll(&state->suffix_cmp,
+ state->suffix_collation,
+ key_a, key_b));
+
+ /*
+ * Use prefix value as tiebreaker for deterministic ordering.
+ * This ensures ORDER BY suffix, prefix behavior.
+ */
+ if (cmp == 0)
+ {
+ /* Compare prefix values (assumes pass-by-value int4 for now) */
+ int32 prefix_a = DatumGetInt32(cursor_a->prefix_value);
+ int32 prefix_b = DatumGetInt32(cursor_b->prefix_value);
+
+ if (prefix_a < prefix_b)
+ cmp = -1;
+ else if (prefix_a > prefix_b)
+ cmp = 1;
+ }
+
+ /* Negate for min-heap behavior */
+ return -cmp;
+}
+
+
+/*
+ * bt_merge_init
+ * Initialize a merge scan state.
+ *
+ * Creates the merge state with one cursor per prefix value.
+ * The cursors will be positioned at their first matching tuples
+ * when bt_merge_getnext is first called.
+ */
+BTMergeScanState *
+bt_merge_init(IndexScanDesc scan,
+ Datum *prefix_values,
+ bool *prefix_nulls,
+ int num_prefixes,
+ int prefix_attno,
+ int suffix_attno,
+ Oid suffix_cmp_oid,
+ Oid suffix_collation)
+{
+ BTMergeScanState *state;
+ MemoryContext merge_context;
+ MemoryContext old_context;
+ int i;
+
+ /* Create memory context for merge scan allocations */
+ merge_context = AllocSetContextCreate(CurrentMemoryContext,
+ "BTMergeScan",
+ ALLOCSET_DEFAULT_SIZES);
+ old_context = MemoryContextSwitchTo(merge_context);
+
+ /* Allocate main state structure */
+ state = palloc0(sizeof(BTMergeScanState));
+ state->merge_context = merge_context;
+ state->num_cursors = num_prefixes;
+ state->active_cursors = 0;
+ state->prefix_attno = prefix_attno;
+ state->suffix_attno = suffix_attno;
+ state->suffix_collation = suffix_collation;
+ state->direction = ForwardScanDirection;
+ state->initialized = false;
+ state->tuples_accessed = 0;
+
+ /* Set up suffix comparison function */
+ fmgr_info(suffix_cmp_oid, &state->suffix_cmp);
+
+ /* Allocate cursor array */
+ state->cursors = palloc0(num_prefixes * sizeof(BTMergeCursor));
+
+ /* Initialize cursor metadata (not positioned yet) */
+ for (i = 0; i < num_prefixes; i++)
+ {
+ BTMergeCursor *cursor = &state->cursors[i];
+
+ cursor->cursor_id = i;
+ cursor->prefix_value = datumCopy(prefix_values[i], true, sizeof(Datum));
+ cursor->prefix_isnull = prefix_nulls[i];
+ cursor->exhausted = prefix_nulls[i]; /* NULL prefix = exhausted */
+ cursor->sort_key_isnull = true;
+ BTScanPosInvalidate(cursor->pos);
+ cursor->tuples = NULL;
+ }
+
+ /* Initialize the merge heap */
+ state->merge_heap = pairingheap_allocate(bt_merge_heap_cmp, state);
+
+ MemoryContextSwitchTo(old_context);
+
+ return state;
+}
+
+
+/*
+ * bt_merge_getnext
+ * Get the next tuple from the merge scan.
+ *
+ * Returns true if a tuple was found, false if scan is exhausted.
+ * The tuple's TID is stored in scan->xs_heaptid.
+ */
+bool
+bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTMergeScanState *state = so->mergeState;
+ BTMergeCursor *cursor;
+ pairingheap_node *node;
+ int i;
+
+ if (state == NULL)
+ return false;
+
+ /* Initialize cursors on first call */
+ if (!state->initialized)
+ {
+ state->initialized = true;
+ state->direction = dir;
+
+ for (i = 0; i < state->num_cursors; i++)
+ {
+ BTMergeCursor *c = &state->cursors[i];
+
+ if (!c->exhausted &&
+ bt_merge_cursor_init(state, scan, c,
+ c->prefix_value, c->prefix_isnull))
+ {
+ /* Cursor has at least one tuple, add to heap */
+ pairingheap_add(state->merge_heap, &c->ph_node);
+ state->active_cursors++;
+ }
+ }
+ }
+
+ /* Get the cursor with the smallest suffix value */
+ if (pairingheap_is_empty(state->merge_heap))
+ return false;
+
+ node = pairingheap_remove_first(state->merge_heap);
+ cursor = pairingheap_container(BTMergeCursor, ph_node, node);
+
+ /* Set up the heap TID from the current cursor position */
+ Assert(BTScanPosIsValid(cursor->pos));
+ scan->xs_heaptid = cursor->pos.items[cursor->pos.itemIndex].heapTid;
+
+ /* Advance cursor to next tuple */
+ if (bt_merge_cursor_advance(state, scan, cursor))
+ {
+ /* Cursor still has tuples, re-add to heap */
+ pairingheap_add(state->merge_heap, &cursor->ph_node);
+ }
+ else
+ {
+ /* Cursor exhausted */
+ state->active_cursors--;
+ }
+
+ return true;
+}
+
+
+/*
+ * bt_merge_end
+ * Clean up merge scan state.
+ */
+void
+bt_merge_end(BTMergeScanState *state)
+{
+ if (state == NULL)
+ return;
+
+ /* Free the memory context, which frees all allocations */
+ MemoryContextDelete(state->merge_context);
+}
+
+
+/*
+ * bt_merge_cursor_init
+ * Initialize a cursor and position it at the first matching tuple.
+ *
+ * Returns true if the cursor found at least one matching tuple.
+ */
+static bool
+bt_merge_cursor_init(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ Datum prefix_value,
+ bool prefix_isnull)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool found;
+
+ if (prefix_isnull)
+ {
+ cursor->exhausted = true;
+ return false;
+ }
+
+ /*
+ * Modify the scan key to use this cursor's prefix value.
+ * We reuse the scan's existing key infrastructure.
+ */
+ for (int i = 0; i < so->numberOfKeys; i++)
+ {
+ if (so->keyData[i].sk_attno == state->prefix_attno)
+ {
+ so->keyData[i].sk_argument = prefix_value;
+ so->keyData[i].sk_flags &= ~(SK_SEARCHARRAY);
+ break;
+ }
+ }
+
+ /* Invalidate current position to force _bt_first */
+ BTScanPosInvalidate(so->currPos);
+
+ /* Disable array key handling for this cursor's scan */
+ so->numArrayKeys = 0;
+
+ /* Position at first matching tuple */
+ found = _bt_first(scan, state->direction);
+
+ if (found)
+ {
+ /* Copy position to cursor */
+ memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+
+ /* Extract the sort key for heap ordering */
+ cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
+ &cursor->sort_key_isnull);
+ cursor->exhausted = false;
+
+ /* Count this as a tuple access */
+ state->tuples_accessed++;
+
+ /* Invalidate main scan position */
+ BTScanPosInvalidate(so->currPos);
+ }
+ else
+ {
+ cursor->exhausted = true;
+ }
+
+ return found;
+}
+
+
+/*
+ * bt_merge_cursor_advance
+ * Advance a cursor to its next tuple.
+ *
+ * Returns true if the cursor now points to a valid tuple, false if exhausted.
+ */
+static bool
+bt_merge_cursor_advance(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool found = false;
+
+ if (cursor->exhausted)
+ return false;
+
+ /* Try to move to next tuple within current page's items array */
+ if (state->direction == ForwardScanDirection)
+ {
+ if (cursor->pos.itemIndex < cursor->pos.lastItem)
+ {
+ cursor->pos.itemIndex++;
+ found = true;
+ }
+ }
+ else
+ {
+ if (cursor->pos.itemIndex > cursor->pos.firstItem)
+ {
+ cursor->pos.itemIndex--;
+ found = true;
+ }
+ }
+
+ if (!found)
+ {
+ /*
+ * Current page exhausted. Use _bt_next to get the next page.
+ * We swap our cursor's position into the scan's currPos,
+ * call _bt_next, then swap back.
+ */
+ BTScanPosData save_pos;
+
+ memcpy(&save_pos, &so->currPos, sizeof(BTScanPosData));
+ memcpy(&so->currPos, &cursor->pos, sizeof(BTScanPosData));
+
+ found = _bt_next(scan, state->direction);
+
+ if (found)
+ memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+
+ memcpy(&so->currPos, &save_pos, sizeof(BTScanPosData));
+ }
+
+ if (found)
+ {
+ /* Extract new sort key */
+ cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
+ &cursor->sort_key_isnull);
+ state->tuples_accessed++;
+ }
+ else
+ {
+ cursor->exhausted = true;
+ }
+
+ return found;
+}
+
+
+/*
+ * bt_merge_extract_sortkey
+ * Extract the sort key (suffix column value) from the current tuple.
+ */
+static Datum
+bt_merge_extract_sortkey(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ bool *isnull)
+{
+ Relation rel = scan->indexRelation;
+ Buffer buf;
+ Page page;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ TupleDesc tupdesc;
+ Datum result;
+
+ if (cursor->pos.currPage == InvalidBlockNumber)
+ {
+ *isnull = true;
+ return (Datum) 0;
+ }
+
+ /* Read the page */
+ buf = ReadBuffer(rel, cursor->pos.currPage);
+ LockBuffer(buf, BT_READ);
+ page = BufferGetPage(buf);
+
+ offnum = cursor->pos.items[cursor->pos.itemIndex].indexOffset;
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ tupdesc = RelationGetDescr(rel);
+
+ /* Extract the suffix column value */
+ result = index_getattr(itup, state->suffix_attno, tupdesc, isnull);
+
+ /* Copy pass-by-reference values before releasing buffer */
+ if (!*isnull)
+ {
+ Form_pg_attribute attr = TupleDescAttr(tupdesc, state->suffix_attno - 1);
+
+ if (!attr->attbyval)
+ result = datumCopy(result, attr->attbyval, attr->attlen);
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ return result;
+}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 77224859685..0d4e7440760 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -20,6 +20,7 @@
#include "catalog/pg_am_d.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
+#include "lib/pairingheap.h"
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
@@ -1050,6 +1051,49 @@ typedef struct BTArrayKeyInfo
ScanKey high_compare; /* array's < or <= upper bound */
} BTArrayKeyInfo;
+/*
+ * BTMergeCursor - tracks scan state for one prefix value in merge scan
+ *
+ * Each cursor maintains its own position within the index for a specific
+ * prefix value. Cursors are organized in a min-heap ordered by their
+ * current suffix key value for efficient K-way merge.
+ */
+typedef struct BTMergeCursor
+{
+ pairingheap_node ph_node; /* pairing heap node for merge */
+ int cursor_id; /* index in merge state's cursors array */
+ Datum prefix_value; /* the prefix value for this sub-scan */
+ bool prefix_isnull; /* is prefix value NULL? */
+ Datum sort_key; /* current tuple's sort key (suffix) */
+ bool sort_key_isnull;/* is sort key NULL? */
+ bool exhausted; /* no more tuples for this prefix */
+ BTScanPosData pos; /* current position in index */
+ char *tuples; /* tuple storage workspace (BLCKSZ) */
+} BTMergeCursor;
+
+/*
+ * BTMergeScanState - state for K-way merge scan
+ *
+ * This structure manages multiple cursors for a merge scan, allowing
+ * lazy evaluation of queries like:
+ * WHERE prefix IN (v1, v2, ..., vK) AND suffix >= b ORDER BY suffix LIMIT N
+ */
+typedef struct BTMergeScanState
+{
+ int num_cursors; /* number of prefix values (K) */
+ int active_cursors; /* cursors not yet exhausted */
+ BTMergeCursor *cursors; /* array of cursors */
+ pairingheap *merge_heap; /* min-heap ordered by sort_key */
+ int prefix_attno; /* attribute number of prefix column (1-based) */
+ int suffix_attno; /* attribute number of suffix column (1-based) */
+ FmgrInfo suffix_cmp; /* comparison function for suffix */
+ Oid suffix_collation; /* collation for suffix comparison */
+ ScanDirection direction; /* scan direction */
+ bool initialized; /* have cursors been initialized? */
+ MemoryContext merge_context;/* memory context for allocations */
+ int64 tuples_accessed;/* count of index tuples accessed */
+} BTMergeScanState;
+
typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
@@ -1089,6 +1133,12 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /*
+ * Merge scan state, if using merge scan optimization.
+ * NULL if not using merge scan.
+ */
+ BTMergeScanState *mergeState;
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -1334,4 +1384,18 @@ extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+/*
+ * prototypes for functions in nbtmergescan.c
+ */
+extern BTMergeScanState *bt_merge_init(IndexScanDesc scan,
+ Datum *prefix_values,
+ bool *prefix_nulls,
+ int num_prefixes,
+ int prefix_attno,
+ int suffix_attno,
+ Oid suffix_cmp_oid,
+ Oid suffix_collation);
+extern bool bt_merge_getnext(IndexScanDesc scan, ScanDirection dir);
+extern void bt_merge_end(BTMergeScanState *state);
+
#endif /* NBTREE_H */
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 2634a519935..b7b802bfdde 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -18,6 +18,7 @@ subdir('ssl_passphrase_callback')
subdir('test_aio')
subdir('test_binaryheap')
subdir('test_bitmapset')
+subdir('test_btree_merge')
subdir('test_bloomfilter')
subdir('test_cloexec')
subdir('test_copy_callbacks')
diff --git a/src/test/modules/test_btree_merge/Makefile b/src/test/modules/test_btree_merge/Makefile
new file mode 100644
index 00000000000..540416a2c91
--- /dev/null
+++ b/src/test/modules/test_btree_merge/Makefile
@@ -0,0 +1,24 @@
+# src/test/modules/test_btree_merge/Makefile
+
+MODULE_big = test_btree_merge
+OBJS = \
+ $(WIN32RES) \
+ test_btree_merge.o
+
+PGFILEDESC = "test_btree_merge - test code for btree merge scan"
+
+EXTENSION = test_btree_merge
+DATA = test_btree_merge--1.0.sql
+
+REGRESS = test_btree_merge
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_btree_merge
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_btree_merge/expected/test_btree_merge.out b/src/test/modules/test_btree_merge/expected/test_btree_merge.out
new file mode 100644
index 00000000000..baf4d7937e0
--- /dev/null
+++ b/src/test/modules/test_btree_merge/expected/test_btree_merge.out
@@ -0,0 +1,243 @@
+-- Unit tests for B-tree merge scan implementation
+-- Tests the core merge scan algorithm directly, bypassing the planner
+CREATE EXTENSION test_btree_merge;
+-- ============================================================================
+-- Setup: Create test tables with known data distributions
+-- ============================================================================
+-- Test table with integer prefix and suffix
+CREATE TABLE merge_test_int (
+ prefix_col int4,
+ suffix_col int4
+);
+-- Insert data: 10 prefix values, 100 suffix values each = 1000 rows
+INSERT INTO merge_test_int
+SELECT p, s
+FROM generate_series(1, 10) AS p,
+ generate_series(1, 100) AS s;
+CREATE INDEX merge_test_int_idx ON merge_test_int (prefix_col, suffix_col);
+ANALYZE merge_test_int;
+-- Test table with integer prefix and timestamp suffix
+CREATE TABLE merge_test_ts (
+ user_id int4,
+ event_time timestamp
+);
+-- Insert data: 5 users, 100 events each
+INSERT INTO merge_test_ts
+SELECT u, '2026-01-01 00:00:00'::timestamp + (e || ' minutes')::interval
+FROM generate_series(1, 5) AS u,
+ generate_series(1, 100) AS e;
+CREATE INDEX merge_test_ts_idx ON merge_test_ts (user_id, event_time);
+ANALYZE merge_test_ts;
+-- ============================================================================
+-- Test 1: Basic integer merge scan
+-- Query: WHERE prefix IN (1,2,3) AND suffix >= 50 LIMIT 5
+-- K = 3 prefix values, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+SELECT 'Test 1: Basic integer merge scan' AS test_name;
+ test_name
+----------------------------------
+ Test 1: Basic integer merge scan
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 2: More prefix values
+-- Query: WHERE prefix IN (1,2,3,4,5) AND suffix >= 80 LIMIT 3
+-- K = 5 prefix values, LIMIT = 3
+-- Expected tuples accessed: 3 + 5 - 1 = 7
+-- ============================================================================
+SELECT 'Test 2: More prefix values' AS test_name;
+ test_name
+----------------------------
+ Test 2: More prefix values
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ 80,
+ 3
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 3 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 3: Single prefix value (degenerates to regular scan)
+-- K = 1, LIMIT = 5
+-- Expected tuples accessed: 5 + 1 - 1 = 5
+-- ============================================================================
+SELECT 'Test 3: Single prefix value' AS test_name;
+ test_name
+-----------------------------
+ Test 3: Single prefix value
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1],
+ 50,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 6 | 5
+(1 row)
+
+-- ============================================================================
+-- Test 4: Large LIMIT (more than matching rows)
+-- K = 3, prefix values that have 51 rows each (suffix >= 50)
+-- LIMIT = 200 but only 153 rows exist
+-- ============================================================================
+SELECT 'Test 4: Large LIMIT' AS test_name;
+ test_name
+---------------------
+ Test 4: Large LIMIT
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 200
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 153 | 153 | 153
+(1 row)
+
+-- ============================================================================
+-- Test 5: Non-contiguous prefix values
+-- Query: WHERE prefix IN (2,5,8) AND suffix >= 50 LIMIT 5
+-- Tests that merge scan works with gaps in prefix values
+-- K = 3 prefix values (non-adjacent), LIMIT = 5
+-- ============================================================================
+SELECT 'Test 5: Non-contiguous prefix values' AS test_name;
+ test_name
+--------------------------------------
+ Test 5: Non-contiguous prefix values
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[2, 5, 8],
+ 50,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 6: Timestamp suffix column
+-- Query: WHERE user_id IN (1,2,3) AND event_time >= '2026-01-01 01:00:00' LIMIT 5
+-- K = 3, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+SELECT 'Test 6: Timestamp suffix' AS test_name;
+ test_name
+--------------------------
+ Test 6: Timestamp suffix
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3],
+ '2026-01-01 01:00:00'::timestamp,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 7: All users with timestamp
+-- K = 5, LIMIT = 10
+-- Expected tuples accessed: 10 + 5 - 1 = 14
+-- ============================================================================
+SELECT 'Test 7: All users timestamp' AS test_name;
+ test_name
+-----------------------------
+ Test 7: All users timestamp
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ '2026-01-01 00:30:00'::timestamp,
+ 10
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 10 | 15 | 14
+(1 row)
+
+-- ============================================================================
+-- Test 8: Correctness verification
+-- Verify merge scan returns rows in exact ORDER BY suffix_col, prefix_col order
+-- Using WITH ORDINALITY to compare row positions
+-- ============================================================================
+SELECT 'Test 8: Correctness verification' AS test_name;
+ test_name
+----------------------------------
+ Test 8: Correctness verification
+(1 row)
+
+-- Compare merge scan vs regular query with row positions (should be empty)
+WITH merge_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM test_btree_merge_fetch_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 90,
+ 10
+ )
+),
+regular_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM (
+ SELECT prefix_col, suffix_col
+ FROM merge_test_int
+ WHERE prefix_col IN (1, 2, 3) AND suffix_col >= 90
+ ORDER BY suffix_col, prefix_col
+ LIMIT 10
+ ) t
+)
+SELECT 'MISMATCH' AS status, m.rn, m.prefix_col, m.suffix_col,
+ r.prefix_col AS expected_prefix, r.suffix_col AS expected_suffix
+FROM merge_result m
+FULL OUTER JOIN regular_result r ON m.rn = r.rn
+WHERE m.prefix_col IS DISTINCT FROM r.prefix_col
+ OR m.suffix_col IS DISTINCT FROM r.suffix_col;
+ status | rn | prefix_col | suffix_col | expected_prefix | expected_suffix
+--------+----+------------+------------+-----------------+-----------------
+(0 rows)
+
+-- ============================================================================
+-- Cleanup
+-- ============================================================================
+DROP TABLE merge_test_int;
+DROP TABLE merge_test_ts;
+DROP EXTENSION test_btree_merge;
diff --git a/src/test/modules/test_btree_merge/meson.build b/src/test/modules/test_btree_merge/meson.build
new file mode 100644
index 00000000000..665d6cf443e
--- /dev/null
+++ b/src/test/modules/test_btree_merge/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+test_btree_merge_sources = files(
+ 'test_btree_merge.c',
+)
+
+if host_system == 'windows'
+ test_btree_merge_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_btree_merge',
+ '--FILEDESC', 'test_btree_merge - test code for btree merge scan',])
+endif
+
+test_btree_merge = shared_module('test_btree_merge',
+ test_btree_merge_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_btree_merge
+
+test_install_data += files(
+ 'test_btree_merge.control',
+ 'test_btree_merge--1.0.sql',
+)
+
+tests += {
+ 'name': 'test_btree_merge',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_btree_merge',
+ ],
+ },
+}
diff --git a/src/test/modules/test_btree_merge/sql/test_btree_merge.sql b/src/test/modules/test_btree_merge/sql/test_btree_merge.sql
new file mode 100644
index 00000000000..5828b343b34
--- /dev/null
+++ b/src/test/modules/test_btree_merge/sql/test_btree_merge.sql
@@ -0,0 +1,207 @@
+-- Unit tests for B-tree merge scan implementation
+-- Tests the core merge scan algorithm directly, bypassing the planner
+
+CREATE EXTENSION test_btree_merge;
+
+-- ============================================================================
+-- Setup: Create test tables with known data distributions
+-- ============================================================================
+
+-- Test table with integer prefix and suffix
+CREATE TABLE merge_test_int (
+ prefix_col int4,
+ suffix_col int4
+);
+
+-- Insert data: 10 prefix values, 100 suffix values each = 1000 rows
+INSERT INTO merge_test_int
+SELECT p, s
+FROM generate_series(1, 10) AS p,
+ generate_series(1, 100) AS s;
+
+CREATE INDEX merge_test_int_idx ON merge_test_int (prefix_col, suffix_col);
+ANALYZE merge_test_int;
+
+-- Test table with integer prefix and timestamp suffix
+CREATE TABLE merge_test_ts (
+ user_id int4,
+ event_time timestamp
+);
+
+-- Insert data: 5 users, 100 events each
+INSERT INTO merge_test_ts
+SELECT u, '2026-01-01 00:00:00'::timestamp + (e || ' minutes')::interval
+FROM generate_series(1, 5) AS u,
+ generate_series(1, 100) AS e;
+
+CREATE INDEX merge_test_ts_idx ON merge_test_ts (user_id, event_time);
+ANALYZE merge_test_ts;
+
+
+-- ============================================================================
+-- Test 1: Basic integer merge scan
+-- Query: WHERE prefix IN (1,2,3) AND suffix >= 50 LIMIT 5
+-- K = 3 prefix values, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 1: Basic integer merge scan' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 2: More prefix values
+-- Query: WHERE prefix IN (1,2,3,4,5) AND suffix >= 80 LIMIT 3
+-- K = 5 prefix values, LIMIT = 3
+-- Expected tuples accessed: 3 + 5 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 2: More prefix values' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ 80,
+ 3
+);
+
+
+-- ============================================================================
+-- Test 3: Single prefix value (degenerates to regular scan)
+-- K = 1, LIMIT = 5
+-- Expected tuples accessed: 5 + 1 - 1 = 5
+-- ============================================================================
+
+SELECT 'Test 3: Single prefix value' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1],
+ 50,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 4: Large LIMIT (more than matching rows)
+-- K = 3, prefix values that have 51 rows each (suffix >= 50)
+-- LIMIT = 200 but only 153 rows exist
+-- ============================================================================
+
+SELECT 'Test 4: Large LIMIT' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 200
+);
+
+
+-- ============================================================================
+-- Test 5: Non-contiguous prefix values
+-- Query: WHERE prefix IN (2,5,8) AND suffix >= 50 LIMIT 5
+-- Tests that merge scan works with gaps in prefix values
+-- K = 3 prefix values (non-adjacent), LIMIT = 5
+-- ============================================================================
+
+SELECT 'Test 5: Non-contiguous prefix values' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[2, 5, 8],
+ 50,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 6: Timestamp suffix column
+-- Query: WHERE user_id IN (1,2,3) AND event_time >= '2026-01-01 01:00:00' LIMIT 5
+-- K = 3, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 6: Timestamp suffix' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3],
+ '2026-01-01 01:00:00'::timestamp,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 7: All users with timestamp
+-- K = 5, LIMIT = 10
+-- Expected tuples accessed: 10 + 5 - 1 = 14
+-- ============================================================================
+
+SELECT 'Test 7: All users timestamp' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ '2026-01-01 00:30:00'::timestamp,
+ 10
+);
+
+
+-- ============================================================================
+-- Test 8: Correctness verification
+-- Verify merge scan returns rows in exact ORDER BY suffix_col, prefix_col order
+-- Using WITH ORDINALITY to compare row positions
+-- ============================================================================
+
+SELECT 'Test 8: Correctness verification' AS test_name;
+
+-- Compare merge scan vs regular query with row positions (should be empty)
+WITH merge_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM test_btree_merge_fetch_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 90,
+ 10
+ )
+),
+regular_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM (
+ SELECT prefix_col, suffix_col
+ FROM merge_test_int
+ WHERE prefix_col IN (1, 2, 3) AND suffix_col >= 90
+ ORDER BY suffix_col, prefix_col
+ LIMIT 10
+ ) t
+)
+SELECT 'MISMATCH' AS status, m.rn, m.prefix_col, m.suffix_col,
+ r.prefix_col AS expected_prefix, r.suffix_col AS expected_suffix
+FROM merge_result m
+FULL OUTER JOIN regular_result r ON m.rn = r.rn
+WHERE m.prefix_col IS DISTINCT FROM r.prefix_col
+ OR m.suffix_col IS DISTINCT FROM r.suffix_col;
+
+
+-- ============================================================================
+-- Cleanup
+-- ============================================================================
+
+DROP TABLE merge_test_int;
+DROP TABLE merge_test_ts;
+DROP EXTENSION test_btree_merge;
diff --git a/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql b/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
new file mode 100644
index 00000000000..9872947d7d7
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
@@ -0,0 +1,43 @@
+/* src/test/modules/test_btree_merge/test_btree_merge--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_btree_merge" to load this file. \quit
+
+-- Test merge scan with integer columns
+CREATE FUNCTION test_btree_merge_scan_int(
+ table_name text,
+ index_name text,
+ prefix_values int4[],
+ suffix_start int4,
+ limit_count int4
+) RETURNS TABLE (
+ tuples_returned int4,
+ tuples_accessed int4,
+ maximum_required_fetches int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
+-- Fetch actual rows from merge scan (for correctness verification)
+CREATE FUNCTION test_btree_merge_fetch_int(
+ table_name text,
+ index_name text,
+ prefix_values int4[],
+ suffix_start int4,
+ limit_count int4
+) RETURNS TABLE (
+ prefix_col int4,
+ suffix_col int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
+-- Test merge scan with timestamp suffix
+CREATE FUNCTION test_btree_merge_scan_ts(
+ table_name text,
+ index_name text,
+ prefix_values int4[],
+ suffix_start timestamp,
+ limit_count int4
+) RETURNS TABLE (
+ tuples_returned int4,
+ tuples_accessed int4,
+ maximum_required_fetches int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
diff --git a/src/test/modules/test_btree_merge/test_btree_merge.c b/src/test/modules/test_btree_merge/test_btree_merge.c
new file mode 100644
index 00000000000..78b22130ecf
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge.c
@@ -0,0 +1,389 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_btree_merge.c
+ * Unit tests for B-tree Merge Scan implementation
+ *
+ * This module provides SQL-callable functions to directly test the
+ * merge scan algorithm without going through the planner.
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/nbtree.h"
+#include "access/table.h"
+#include "catalog/namespace.h"
+#include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
+#include "commands/defrem.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+#define MAX_RESULTS 10000
+
+/*
+ * MergeScanResult - holds results from a merge scan execution
+ */
+typedef struct MergeScanResult
+{
+ int tuples_returned;
+ int64 tuples_accessed;
+ int num_prefixes;
+ int limit_count;
+ /* For fetch function: collected row data */
+ int32 *prefixes;
+ int32 *suffixes;
+} MergeScanResult;
+
+/*
+ * do_merge_scan - common merge scan execution
+ *
+ * Performs a merge scan with the given parameters and collects results.
+ * If collect_rows is true, fetches and stores actual row data.
+ */
+static void
+do_merge_scan(const char *table_name,
+ const char *index_name,
+ Datum *prefix_values,
+ bool *prefix_nulls,
+ int num_prefixes,
+ Datum suffix_start,
+ Oid suffix_type,
+ RegProcedure suffix_eq_proc,
+ RegProcedure suffix_ge_proc,
+ int limit_count,
+ bool collect_rows,
+ MergeScanResult *result)
+{
+ Oid table_oid;
+ Oid index_oid;
+ Relation heap_rel;
+ Relation index_rel;
+ IndexScanDesc scan;
+ BTScanOpaque so;
+ BTMergeScanState *merge_state;
+ Snapshot snapshot;
+ Oid suffix_cmp_oid;
+ Oid opfamily;
+ const char *opfamily_name;
+ int tuples_returned = 0;
+ int max_results;
+
+ /* Determine operator family based on suffix type */
+ if (suffix_type == INT4OID)
+ opfamily_name = "integer_ops";
+ else if (suffix_type == TIMESTAMPOID)
+ opfamily_name = "datetime_ops";
+ else
+ elog(ERROR, "unsupported suffix type: %u", suffix_type);
+
+ /* Look up table and index */
+ table_oid = RelnameGetRelid(table_name);
+ if (!OidIsValid(table_oid))
+ elog(ERROR, "table \"%s\" does not exist", table_name);
+
+ index_oid = RelnameGetRelid(index_name);
+ if (!OidIsValid(index_oid))
+ elog(ERROR, "index \"%s\" does not exist", index_name);
+
+ /* Open relations */
+ heap_rel = table_open(table_oid, AccessShareLock);
+ index_rel = index_open(index_oid, AccessShareLock);
+
+ /* Get comparison function for suffix type */
+ opfamily = get_opfamily_oid(BTREE_AM_OID,
+ list_make1(makeString(pstrdup(opfamily_name))),
+ false);
+ suffix_cmp_oid = get_opfamily_proc(opfamily, suffix_type, suffix_type,
+ BTORDER_PROC);
+ if (!OidIsValid(suffix_cmp_oid))
+ elog(ERROR, "could not find comparison function for type %u", suffix_type);
+
+ /* Begin index scan */
+ snapshot = GetActiveSnapshot();
+ scan = index_beginscan(heap_rel, index_rel, snapshot, NULL, 2, 0);
+
+ /* Set up scan keys */
+ {
+ ScanKeyData keys[2];
+
+ ScanKeyInit(&keys[0], 1, BTEqualStrategyNumber, suffix_eq_proc,
+ prefix_values[0]);
+ ScanKeyInit(&keys[1], 2, BTGreaterEqualStrategyNumber, suffix_ge_proc,
+ suffix_start);
+ index_rescan(scan, keys, 2, NULL, 0);
+ }
+
+ so = (BTScanOpaque) scan->opaque;
+
+ /* Initialize merge scan */
+ merge_state = bt_merge_init(scan, prefix_values, prefix_nulls,
+ num_prefixes, 1, 2, suffix_cmp_oid, InvalidOid);
+ so->mergeState = merge_state;
+
+ /* Execute scan */
+ max_results = (limit_count > 0) ? limit_count : MAX_RESULTS;
+
+ while (tuples_returned < max_results)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (!bt_merge_getnext(scan, ForwardScanDirection))
+ break;
+
+ if (collect_rows && result->prefixes != NULL)
+ {
+ /* Fetch heap tuple to get actual values */
+ HeapTupleData heapTuple;
+ Buffer heapBuffer;
+ bool isnull;
+
+ heapTuple.t_self = scan->xs_heaptid;
+ if (heap_fetch(heap_rel, snapshot, &heapTuple, &heapBuffer, false))
+ {
+ result->prefixes[tuples_returned] =
+ DatumGetInt32(heap_getattr(&heapTuple, 1,
+ RelationGetDescr(heap_rel), &isnull));
+ result->suffixes[tuples_returned] =
+ DatumGetInt32(heap_getattr(&heapTuple, 2,
+ RelationGetDescr(heap_rel), &isnull));
+ ReleaseBuffer(heapBuffer);
+ }
+ }
+
+ tuples_returned++;
+
+ if (tuples_returned >= MAX_RESULTS)
+ {
+ elog(WARNING, "merge scan hit safety limit of %d tuples", MAX_RESULTS);
+ break;
+ }
+ }
+
+ /* Collect results before cleanup */
+ result->tuples_returned = tuples_returned;
+ result->tuples_accessed = merge_state->tuples_accessed;
+ result->num_prefixes = num_prefixes;
+ result->limit_count = limit_count;
+
+ /* Clean up */
+ bt_merge_end(merge_state);
+ so->mergeState = NULL;
+ index_endscan(scan);
+ index_close(index_rel, AccessShareLock);
+ table_close(heap_rel, AccessShareLock);
+}
+
+/*
+ * build_stats_result - build the stats result tuple
+ */
+static Datum
+build_stats_result(FunctionCallInfo fcinfo, MergeScanResult *result)
+{
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false, false, false};
+ HeapTuple tuple;
+ int max_required_fetches;
+
+ /* Calculate expected max fetches */
+ if (result->tuples_returned < result->limit_count)
+ max_required_fetches = result->tuples_returned;
+ else
+ max_required_fetches = result->limit_count + result->num_prefixes - 1;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function returning record called in context "
+ "that cannot accept type record")));
+
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ values[0] = Int32GetDatum(result->tuples_returned);
+ values[1] = Int32GetDatum((int32) result->tuples_accessed);
+ values[2] = Int32GetDatum(max_required_fetches);
+
+ tuple = heap_form_tuple(tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
+
+
+/*
+ * test_btree_merge_scan_int - test merge scan with integer columns
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_scan_int);
+
+Datum
+test_btree_merge_scan_int(PG_FUNCTION_ARGS)
+{
+ text *table_name = PG_GETARG_TEXT_PP(0);
+ text *index_name = PG_GETARG_TEXT_PP(1);
+ ArrayType *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+ int32 suffix_start = PG_GETARG_INT32(3);
+ int32 limit_count = PG_GETARG_INT32(4);
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ MergeScanResult result = {0};
+
+ deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ if (num_prefixes == 0)
+ elog(ERROR, "prefix_values array cannot be empty");
+
+ do_merge_scan(text_to_cstring(table_name),
+ text_to_cstring(index_name),
+ prefix_values, prefix_nulls, num_prefixes,
+ Int32GetDatum(suffix_start), INT4OID,
+ F_INT4EQ, F_INT4GE,
+ limit_count, false, &result);
+
+ return build_stats_result(fcinfo, &result);
+}
+
+
+/*
+ * test_btree_merge_scan_ts - test merge scan with timestamp suffix
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_scan_ts);
+
+Datum
+test_btree_merge_scan_ts(PG_FUNCTION_ARGS)
+{
+ text *table_name = PG_GETARG_TEXT_PP(0);
+ text *index_name = PG_GETARG_TEXT_PP(1);
+ ArrayType *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+ Timestamp suffix_start = PG_GETARG_TIMESTAMP(3);
+ int32 limit_count = PG_GETARG_INT32(4);
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ MergeScanResult result = {0};
+
+ deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ if (num_prefixes == 0)
+ elog(ERROR, "prefix_values array cannot be empty");
+
+ do_merge_scan(text_to_cstring(table_name),
+ text_to_cstring(index_name),
+ prefix_values, prefix_nulls, num_prefixes,
+ TimestampGetDatum(suffix_start), TIMESTAMPOID,
+ F_INT4EQ, F_TIMESTAMP_GE,
+ limit_count, false, &result);
+
+ return build_stats_result(fcinfo, &result);
+}
+
+
+/*
+ * test_btree_merge_fetch_int - fetch actual rows from merge scan
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_fetch_int);
+
+Datum
+test_btree_merge_fetch_int(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+
+ typedef struct
+ {
+ int32 *prefixes;
+ int32 *suffixes;
+ int num_results;
+ int current_idx;
+ } FetchContext;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ text *table_name = PG_GETARG_TEXT_PP(0);
+ text *index_name = PG_GETARG_TEXT_PP(1);
+ ArrayType *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+ int32 suffix_start = PG_GETARG_INT32(3);
+ int32 limit_count = PG_GETARG_INT32(4);
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ MemoryContext oldcontext;
+ FetchContext *fctx;
+ MergeScanResult result = {0};
+ TupleDesc tupdesc;
+ int max_results;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ if (num_prefixes == 0)
+ elog(ERROR, "prefix_values array cannot be empty");
+
+ /* Allocate result storage */
+ max_results = (limit_count > 0) ? limit_count : MAX_RESULTS;
+ fctx = palloc(sizeof(FetchContext));
+ fctx->prefixes = palloc(max_results * sizeof(int32));
+ fctx->suffixes = palloc(max_results * sizeof(int32));
+ fctx->current_idx = 0;
+
+ /* Point result to our storage */
+ result.prefixes = fctx->prefixes;
+ result.suffixes = fctx->suffixes;
+
+ do_merge_scan(text_to_cstring(table_name),
+ text_to_cstring(index_name),
+ prefix_values, prefix_nulls, num_prefixes,
+ Int32GetDatum(suffix_start), INT4OID,
+ F_INT4EQ, F_INT4GE,
+ limit_count, true, &result);
+
+ fctx->num_results = result.tuples_returned;
+
+ /* Build result tuple descriptor */
+ tupdesc = CreateTemplateTupleDesc(2);
+ TupleDescInitEntry(tupdesc, 1, "prefix_col", INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, 2, "suffix_col", INT4OID, -1, 0);
+ funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+ funcctx->user_fctx = fctx;
+
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ {
+ FetchContext *fctx = funcctx->user_fctx;
+
+ if (fctx->current_idx < fctx->num_results)
+ {
+ Datum values[2];
+ bool nulls[2] = {false, false};
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->prefixes[fctx->current_idx]);
+ values[1] = Int32GetDatum(fctx->suffixes[fctx->current_idx]);
+ fctx->current_idx++;
+
+ tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+ SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+ }
+}
diff --git a/src/test/modules/test_btree_merge/test_btree_merge.control b/src/test/modules/test_btree_merge/test_btree_merge.control
new file mode 100644
index 00000000000..f8146bd0f74
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge.control
@@ -0,0 +1,5 @@
+# test_btree_merge extension
+comment = 'Unit tests for B-tree merge scan'
+default_version = '1.0'
+module_pathname = '$libdir/test_btree_merge'
+relocatable = true
diff --git a/src/test/regress/expected/btree_merge.out b/src/test/regress/expected/btree_merge.out
new file mode 100644
index 00000000000..441ae1d0657
--- /dev/null
+++ b/src/test/regress/expected/btree_merge.out
@@ -0,0 +1,113 @@
+-- B-Tree Merge Scan Access Method Test
+--
+-- B-Tree Merge Scan is an access method that allows lazily producing
+-- output sorted by a non-leading column when the prefix has few distinct values.
+--
+--
+-- Let S be an infinite set of lattic points (x,y).
+-- Let S(x=1,y>=b) be the sequence of points
+-- SELECT * FROM S WHERE x = a and y >= b ORDER BY b;
+-- i.e. (a, b), (a, b+1), (a, b+2), ...
+-- Similarly, S(x IN X, y=b) being the sequence of points
+-- SELECT * FROM S WHERE x IN X and y = b ORDER BY x;
+-- i.e. (x[1], b), ..., (x[n], b), (x[1], b+1), ...
+-- The output of S(x IN X, y >= b) can be computed as a
+--
+-- Proposition (uncomputable):
+-- S(x, IN X, y >= b) is the K-way merge of the sequences
+-- {S(x=x[i], y >= b), x[i] in X}
+--
+--
+--
+-- Proposition (computable): Bounded suffix
+--
+-- S(x, IN X, b1 <= y <= b2) as bounded
+-- can be computed with (SELECT count(distinct x) + count(1) FROM bounded)
+-- tuple accesses.
+-- (Constructive) Proof:
+-- The result of
+-- SELECT * FROM X
+-- JOIN S on x = x[i] WHERE y BETWEEN b1 AND b2;
+-- is the same as
+-- SELECT * FROM X,
+-- LATERAL (
+-- (SELECT * FROM S
+-- WHERE x = x[i] AND y BETWEEN b1 AND b2
+-- ) AS subscan[i]
+-- ) as merged
+--
+-- Each of subscan[i] is covered by a single range in the index and can
+-- and require at most
+-- (count(1) FROM subscan[i]) + 1 -- subscan tuple access count
+-- tupples to be accessed.
+-- The merged result can be computed using a K-way merge sort
+-- whose number of rows is
+-- sum(count(1) FROM subscan[i]) -- query output rows
+-- Q.E.D.
+--
+--
+-- Proposition (computable): Limitted query
+-- The query
+-- S(x, IN X, y >= b) LIMIT N as limited
+-- Can be computed with at most
+-- N + count(distinct X) - 1
+-- tuple accesses.
+--
+-- (Constructive) Proof:
+-- If an upper `u` bound for `MAX(y IN S(x, IN X, y >= b) LIMIT N)` is known,
+-- then the query can be rewritten as
+-- S(x, IN X, b <= y <= u) LIMIT N
+-- The K-way can produce the next element as soon as it has fetched
+-- the next element for each subquery
+-- 1 row can be produced after count(distinct X) fetches,
+-- After that it can produce one new row for each fetch.
+-- Thus, the total number of fetches is at most
+-- N + count(distinct X) - 1
+-- Q.E.D.
+-- Generate a table with lattice points
+-- Could be infinite
+CREATE TABLE btree_merge_test AS (
+ SELECT x, y FROM
+ generate_series(1, 50) AS x,
+ generate_series(1, 50) AS y
+ ORDER BY random()
+);
+CREATE INDEX btree_merge_test_idx ON btree_merge_test USING btree (x, y);
+ANALYSE btree_merge_test;
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+SHOW track_counts; -- should be 'on'
+ track_counts
+--------------
+ on
+(1 row)
+
+-- From the limited query proposition this can be computed with 10
+-- tupple accesses.
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x -- sort x to make result unique
+LIMIT 3;
+ x | y
+---+----
+ 1 | 19
+ 2 | 19
+ 5 | 19
+(3 rows)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT idx_scan, idx_tup_read, idx_tup_fetch
+FROM pg_stat_user_indexes
+WHERE indexrelname = 'btree_merge_test_idx';
+ idx_scan | idx_tup_read | idx_tup_fetch
+----------+--------------+---------------
+ 5 | 10 | 10
+(1 row)
+
+DROP TABLE btree_merge_test;
diff --git a/src/test/regress/sql/btree_merge.sql b/src/test/regress/sql/btree_merge.sql
new file mode 100644
index 00000000000..be00c33c2a5
--- /dev/null
+++ b/src/test/regress/sql/btree_merge.sql
@@ -0,0 +1,100 @@
+-- B-Tree Merge Scan Access Method Test
+--
+-- B-Tree Merge Scan is an access method that allows lazily producing
+-- output sorted by a non-leading column when the prefix has few distinct values.
+--
+--
+-- Let S be an infinite set of lattic points (x,y).
+-- Let S(x=1,y>=b) be the sequence of points
+-- SELECT * FROM S WHERE x = a and y >= b ORDER BY b;
+-- i.e. (a, b), (a, b+1), (a, b+2), ...
+-- Similarly, S(x IN X, y=b) being the sequence of points
+-- SELECT * FROM S WHERE x IN X and y = b ORDER BY x;
+-- i.e. (x[1], b), ..., (x[n], b), (x[1], b+1), ...
+-- The output of S(x IN X, y >= b) can be computed as a
+--
+-- Proposition (uncomputable):
+-- S(x, IN X, y >= b) is the K-way merge of the sequences
+-- {S(x=x[i], y >= b), x[i] in X}
+--
+--
+--
+-- Proposition (computable): Bounded suffix
+--
+-- S(x, IN X, b1 <= y <= b2) as bounded
+-- can be computed with (SELECT count(distinct x) + count(1) FROM bounded)
+-- tuple accesses.
+-- (Constructive) Proof:
+-- The result of
+-- SELECT * FROM X
+-- JOIN S on x = x[i] WHERE y BETWEEN b1 AND b2;
+-- is the same as
+-- SELECT * FROM X,
+-- LATERAL (
+-- (SELECT * FROM S
+-- WHERE x = x[i] AND y BETWEEN b1 AND b2
+-- ) AS subscan[i]
+-- ) as merged
+--
+-- Each of subscan[i] is covered by a single range in the index and can
+-- and require at most
+-- (count(1) FROM subscan[i]) + 1 -- subscan tuple access count
+-- tupples to be accessed.
+-- The merged result can be computed using a K-way merge sort
+-- whose number of rows is
+-- sum(count(1) FROM subscan[i]) -- query output rows
+-- Q.E.D.
+--
+--
+-- Proposition (computable): Limitted query
+-- The query
+-- S(x, IN X, y >= b) LIMIT N as limited
+-- Can be computed with at most
+-- N + count(distinct X) - 1
+-- tuple accesses.
+--
+-- (Constructive) Proof:
+-- If an upper `u` bound for `MAX(y IN S(x, IN X, y >= b) LIMIT N)` is known,
+-- then the query can be rewritten as
+-- S(x, IN X, b <= y <= u) LIMIT N
+-- The K-way can produce the next element as soon as it has fetched
+-- the next element for each subquery
+-- 1 row can be produced after count(distinct X) fetches,
+-- After that it can produce one new row for each fetch.
+-- Thus, the total number of fetches is at most
+-- N + count(distinct X) - 1
+-- Q.E.D.
+
+
+-- Generate a table with lattice points
+-- Could be infinite
+CREATE TABLE btree_merge_test AS (
+ SELECT x, y FROM
+ generate_series(1, 50) AS x,
+ generate_series(1, 50) AS y
+ ORDER BY random()
+);
+CREATE INDEX btree_merge_test_idx ON btree_merge_test USING btree (x, y);
+
+ANALYSE btree_merge_test;
+
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+SHOW track_counts; -- should be 'on'
+-- From the limited query proposition this can be computed with 10
+-- tupple accesses.
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x -- sort x to make result unique
+LIMIT 3;
+
+
+SELECT pg_stat_force_next_flush();
+
+
+SELECT idx_scan, idx_tup_read, idx_tup_fetch
+FROM pg_stat_user_indexes
+WHERE indexrelname = 'btree_merge_test_idx';
+
+DROP TABLE btree_merge_test;
\ No newline at end of file
* Re: New access method for b-tree.
2026-02-01 10:02 New access method for b-tree. Alexandre Felipe <[email protected]>
@ 2026-02-01 23:54 ` Tomas Vondra <[email protected]>
2026-02-03 16:01 ` Re: New access method for b-tree. Matthias van de Meent <[email protected]>
2026-02-03 21:42 ` Re: New access method for b-tree. Ants Aasma <[email protected]>
0 siblings, 2 replies; 12+ messages in thread
From: Tomas Vondra @ 2026-02-01 23:54 UTC (permalink / raw)
To: Alexandre Felipe <[email protected]>; pgsql-hackers
Hello Felipe,
On 2/1/26 11:02, Alexandre Felipe wrote:
> Hello Hackers,
>
> Please check this out,
>
> It is an access method to scan a table sorting by the second column of
> an index, filtered on the first.
> Queries like
> SELECT x, y FROM grid
> WHERE x in (array of Nx elements)
> ORDER BY y, x
> LIMIT M
>
> Can execute streaming the rows directly from disk instead of loading
> everything.
>
> Using btree index on (x, y)
>
> On a grid with N x N will run by fetching only what is necessary
> A skip scan will run with O(N * Nx) I/O, O(N * Nx) space, O(N * Nx *
> log(N * Nx)) compute (assuming a generic in-memory sort)
>
> The proposed access method does it O(M + Nx) I/O, O(Nx) space, and O(M *
> log(Nx)) compute.
>
So how does this compare to skip scan in practice? It's hard to compare,
as the patch does not implement an actual access path, but I tried this:
CREATE TABLE merge_test_int (
prefix_col int4,
suffix_col int4
);
INSERT INTO merge_test_int
SELECT p, s
FROM generate_series(1, 10000) AS p,
generate_series(1, 1000) AS s;
CREATE INDEX merge_test_int_idx
ON merge_test_int (prefix_col, suffix_col);
and then
1) master
SELECT * FROM merge_test_int
WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
AND suffix_col >= 900
ORDER BY suffix_col LIMIT 100;
vs.
2) merge scan
SELECT * FROM test_btree_merge_scan_int(
'merge_test_int',
'merge_test_int_idx',
ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
900,
100);
And with explain analyze we get this:
1) Buffers: shared hit=26 read=25
2) Buffers: shared hit=143 read=17
So it seems to access many more buffers, even if the number of reads is
lower. Presumably the merge scan is not always better than skip scan,
probably depending on number of prefixes in the query etc. What is the
cost model to decide between those two?
If you had to construct the best case and worst cases (vs. skip scan),
what would that look like?
I'm also wondering how common the targeted query pattern is. How common
is it to have an IN condition on the leading column in an index, and
ORDER BY on the second one?
regards
--
Tomas Vondra
* Re: New access method for b-tree.
2026-02-01 10:02 New access method for b-tree. Alexandre Felipe <[email protected]>
2026-02-01 23:54 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
@ 2026-02-03 16:01 ` Matthias van de Meent <[email protected]>
2026-02-03 22:25 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
1 sibling, 1 reply; 12+ messages in thread
From: Matthias van de Meent @ 2026-02-03 16:01 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: Alexandre Felipe <[email protected]>; pgsql-hackers
On Mon, 2 Feb 2026 at 00:54, Tomas Vondra <[email protected]> wrote:
>
> Hello Felipe,
>
> On 2/1/26 11:02, Alexandre Felipe wrote:
> > Hello Hackers,
> >
> > Please check this out,
> >
> > It is an access method to scan a table sorting by the second column of
> > an index, filtered on the first.
> > Queries like
> > SELECT x, y FROM grid
> > WHERE x in (array of Nx elements)
> > ORDER BY y, x
> > LIMIT M
> >
> > Can execute streaming the rows directly from disk instead of loading
> > everything.
+1 for the idea, it does sound interesting. I haven't looked in depth
at the patch, so no comments on the execution yet.
> So how does this compare to skip scan in practice? It's hard to compare,
> as the patch does not implement an actual access path, but I tried this:
[...]
> 1) master
>
> SELECT * FROM merge_test_int
> WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
[...]
> 2) merge scan
>
> SELECT * FROM test_btree_merge_scan_int(
[...]
> ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
[...]
> And with explain analyze we get this:
>
> 1) Buffers: shared hit=26 read=25
> 2) Buffers: shared hit=143 read=17
(FYI; your first query was missing "2" from its IN list while it was
present in the merge scan input; this makes the difference worse by a
few pages)
> So it seems to access many more buffers, even if the number of reads is
> lower. Presumably the merge scan is not always better than skip scan,
> probably depending on number of prefixes in the query etc. What is the
> cost model to decide between those two?
Skip scan always returns data in index order, while this merge scan
would return tuples in suffix order. The cost model would thus weigh
the cost of sorting the result of an index skipscan against the cost
of doing a merge join on n_in_list_items distinct (partial) index
scans.
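To make the merge side of that comparison concrete, here is a minimal
sketch of a K-way merge with one scan head per IN-list entry. It is
illustrative Python, not the patch's code: the sorted list `index` stands
in for the btree on (prefix, suffix), and `merge_scan`, `fetch`, and the
way reads are counted are assumptions made for the example.

```python
import bisect
import heapq


def merge_scan(index, prefixes, suffix_start, limit):
    """Return (rows, reads) for prefix IN prefixes AND suffix >= suffix_start,
    ordered by (suffix, prefix), stopping after `limit` rows.

    `index` is a sorted list of (prefix, suffix) tuples standing in for a
    btree on (prefix, suffix). `reads` counts attempted index entry reads.
    """
    reads = 0

    def fetch(prefix, pos):
        # Hypothetical helper: read one entry for a scan head, or None
        # when the head has run past its prefix range.
        nonlocal reads
        reads += 1
        if pos < len(index) and index[pos][0] == prefix:
            return index[pos]
        return None

    # One scan head per distinct prefix, positioned at the first entry
    # with suffix >= suffix_start.
    heap = []
    for prefix in sorted(set(prefixes)):
        pos = bisect.bisect_left(index, (prefix, suffix_start))
        if fetch(prefix, pos) is not None:
            heapq.heappush(heap, (index[pos][1], prefix, pos))

    rows = []
    while heap and len(rows) < limit:
        suffix, prefix, pos = heapq.heappop(heap)
        rows.append((prefix, suffix))
        if len(rows) == limit:
            break  # do not refill a head whose next row is never needed
        entry = fetch(prefix, pos + 1)
        if entry is not None:
            heapq.heappush(heap, (entry[1], prefix, pos + 1))
    return rows, reads


# 50 x 50 grid, as in the patch's regression test.
grid = sorted((x, y) for x in range(1, 51) for y in range(1, 51))
rows, reads = merge_scan(grid, [1, 2, 5, 8, 13, 21, 34, 55], 19, 3)
```

With this way of counting, the example performs 3 + 8 - 1 = 10 reads,
which is the LIMIT + K - 1 bound stated in the patch's test comments.
The heap never holds more than one entry per scan head, so memory stays
proportional to the number of IN-list entries.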
As for when you would benefit in buffers accessed: The merge scan
would mainly benefit in number of buffers accessed when the selected
prefix values are non-sequential, and the prefixes cover multiple
pages at a time, and when there is a LIMIT clause on the scan. Normal
btree index skip scan infrastructure efficiently prevents new index
descents into the index when the selected SAOP key ranges are directly
adjacent, while merge scan would generally do at least one index
descent for each of its N scan heads (*) - which in the proposed
prototype patch guarantees O(index depth * num scan heads) buffer
accesses.
(*) It is theoretically possible to reuse an earlier index descent if
the SAOP entry's key range of the last descent starts and ends on the
leaf page that the next SAOP entry's key range also starts on
(applying the ideas of 5bf748b86b to this new multi-headed index scan
mode), but that infrastructure doesn't seem to be in place in the
current patch. That commit is also why your buffer access count for
master is so low compared to the merge scan's; if your chosen list of
numbers was multiples of 5 (so that matching tuples are not all
sequential) you'd probably see much more comparable buffer access
counts.
> If you had to construct the best case and worst cases (vs. skip scan),
> what would that look like?
Presumably the best case would be:
-- mytable.a has very few distinct values (e.g. bool or enum);
mytable.b many distinct values (e.g. uuid)
SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b;
which the index's merge scan would turn into an index scan that
behaves similarly to the following, possibly with the merge join pushed
down into the index:
SELECT * FROM (
SELECT ... FROM mytable WHERE a = 1
UNION
SELECT ... FROM mytable WHERE a = 2
) ORDER BY b.
The worst case would be the opposite:
-- mytable.a has many distinct values (random uuid); mytable.b few
(e.g. boolean; enum)
SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b
As the merge scan maintains one internal indexscan head per SAOP array
element, it'd have significant in-memory and scan startup overhead,
while few values are produced for each of those scan heads.
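The startup overhead can be illustrated with a small sketch. It is only
an analogy in Python, not the patch's code: `heapq.merge` reads one
element from every input before it yields its first result, much like
one initial read per scan head. The name `reads_before_first_row` and
the synthetic (suffix, prefix) rows are assumptions for illustration.

```python
import heapq


def reads_before_first_row(num_heads, rows_per_head):
    """Count how many entries are read before the merge yields one row."""
    reads = 0

    def head(prefix):
        # Each generator plays the role of one scan head, producing its
        # rows in suffix order and counting every entry it reads.
        nonlocal reads
        for suffix in range(rows_per_head):
            reads += 1
            yield (suffix, prefix)

    merged = heapq.merge(*(head(p) for p in range(num_heads)))
    first = next(merged)
    return first, reads


# Few heads, many rows each (best case): 2 reads before the first row.
best = reads_before_first_row(2, 1000)
# Many heads, few rows each (worst case): 1000 reads before the first row.
worst = reads_before_first_row(1000, 2)
```

In the second call the merge pays for 1000 reads up front while each
head can contribute at most two rows, which is the imbalance described
above.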
> I'm also wondering how common the targeted query pattern is. How common
> is it to have an IN condition on the leading column in an index, and
> ORDER BY on the second one?
I'm not sure, but it seems like it might be common/useful in
queue-like access patterns:
With an index on (state, updated_timestamp) you're probably interested
in all messages in just a subset of states, ordered by recent state
transitions. An index on (updated_timestamp, state) might be
considered more optimal, but won't be able to efficiently serve
queries that only want data on uncommon states: The leaf pages would
mainly contain data on common states, reducing the value of those leaf
pages.
Right now, you can rewrite the "prefix IN (...) ORDER BY SUFFIX" query
using UNION, or add an index for each conceivable IN list, but it'd be
great if the user didn't have to rewrite their query or create
n_combinations indexes with their respective space usage to get this
more efficient query execution.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
* Re: New access method for b-tree.
2026-02-01 10:02 New access method for b-tree. Alexandre Felipe <[email protected]>
2026-02-01 23:54 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
2026-02-03 16:01 ` Re: New access method for b-tree. Matthias van de Meent <[email protected]>
@ 2026-02-03 22:25 ` Tomas Vondra <[email protected]>
0 siblings, 0 replies; 12+ messages in thread
From: Tomas Vondra @ 2026-02-03 22:25 UTC (permalink / raw)
To: Matthias van de Meent <[email protected]>; +Cc: Alexandre Felipe <[email protected]>; pgsql-hackers
On 2/3/26 17:01, Matthias van de Meent wrote:
> On Mon, 2 Feb 2026 at 00:54, Tomas Vondra <[email protected]> wrote:
>>
>> Hello Felipe,
>>
>> On 2/1/26 11:02, Alexandre Felipe wrote:
>>> Hello Hackers,
>>>
>>> Please check this out,
>>>
>>> It is an access method to scan a table sorting by the second column of
>>> an index, filtered on the first.
>>> Queries like
>>> SELECT x, y FROM grid
>>> WHERE x in (array of Nx elements)
>>> ORDER BY y, x
>>> LIMIT M
>>>
>>> Can execute streaming the rows directly from disk instead of loading
>>> everything.
>
> +1 for the idea, it does sound interesting. I haven't looked in depth
> at the patch, so no comments on the execution yet.
>
>> So how does this compare to skip scan in practice? It's hard to compare,
>> as the patch does not implement an actual access path, but I tried this:
> [...]
>> 1) master
>>
>> SELECT * FROM merge_test_int
>> WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
> [...]
>> 2) merge scan
>>
>> SELECT * FROM test_btree_merge_scan_int(
> [...]
>> ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
> [...]
>> And with explain analyze we get this:
>>
>> 1) Buffers: shared hit=26 read=25
>> 2) Buffers: shared hit=143 read=17
>
> (FYI; your first query was missing "2" from its IN list while it was
> present in the merge scan input; this makes the difference worse by a
> few pages)
>
>> So it seems to access many more buffers, even if the number of reads is
>> lower. Presumably the merge scan is not always better than skip scan,
>> probably depending on number of prefixes in the query etc. What is the
>> cost model to decide between those two?
>
> Skip scan always returns data in index order, while this merge scan
> would return tuples in suffix order. The cost model would thus weigh
> the cost of sorting the result of an index skipscan against the cost
> of doing a merge join on n_in_list_items distinct (partial) index
> scans.
>
Makes sense.
> As for when you would benefit in buffers accessed: The merge scan
> would mainly benefit in number of buffers accessed when the selected
> prefix values are non-sequential, and the prefixes cover multiple
> pages at a time, and when there is a LIMIT clause on the scan. Normal
> btree index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are directly
> adjacent, while merge scan would generally do at least one index
> descent for each of its N scan heads (*) - which in the proposed
> prototype patch guarantees O(index depth * num scan heads) buffer
> accesses.
>
Do we have sufficient information to reliably make the right decision?
Can we actually cost the two cases well enough?
> (*) It is theoretically possible to reuse an earlier index descent if
> the SAOP entry's key range of the last descent starts and ends on the
> leaf page that the next SAOP entry's key range also starts on
> (applying the ideas of 5bf748b86b to this new multi-headed index scan
> mode), but that infrastructure doesn't seem to be in place in the
> current patch. That commit is also why your buffer access count for
> master is so low compared to the merge scan's; if your chosen list of
> numbers was multiples of 5 (so that matching tuples are not all
> sequential) you'd probably see much more comparable buffer access
> counts.
>
>> If you had to construct the best case and worst cases (vs. skip scan),
>> what would that look like?
>
> Presumably the best case would be:
>
> -- mytable.a has very few distinct values (e.g. bool or enum);
> mytable.b many distinct values (e.g. uuid)
> SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b;
>
> which the index's merge scan would turn into an index scan that
> behaves similar to the following, possibly with the merge join pushed
> down into the index:
>
> SELECT * FROM (
> SELECT ... FROM mytable WHERE a = 1
> UNION
> SELECT ... FROM mytable WHERE a = 2
> ) ORDER BY b.
>
>
> The worst case would be the opposite:
>
> -- mytable.a has many distinct values (random uuid); mytable.b few
> (e.g. boolean; enum)
> SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b
>
> As the merge scan maintains one internal indexscan head per SAOP array
> element, it'd have significant in-memory and scan startup overhead,
> while few values are produced for each of those scan heads.
>
OK. It'll be interesting to see how this performs in practice for the
whole gamut between the best and worst case.
>> I'm also wondering how common is the targeted query pattern? How common
>> it is to have an IN condition on the leading column in an index, and
>> ORDER BY on the second one?
>
> I'm not sure, but it seems like it might be common/useful in
> queue-like access patterns:
>
> With an index on (state, updated_timestamp) you're probably interested
> in all messages in just a subset of states, ordered by recent state
> transitions. An index on (updated_timestamp, state) might be
> considered more optimal, but won't be able to efficiently serve
> queries that only want data on uncommon states: The leaf pages would
> mainly contain data on common states, reducing the value of those leaf
> pages.
>
> Right now, you can rewrite the "prefix IN (...) ORDER BY SUFFIX" query
> using UNION, or add an index for each conceivable IN list, but it'd be
> great if the user didn't have to rewrite their query or create
> n_combinations indexes with their respective space usage to get this
> more efficient query execution.
>
I think the examples presented by Ants (with timeline view) are quite
plausible in practice.
regards
--
Tomas Vondra
* Re: New access method for b-tree.
2026-02-01 10:02 New access method for b-tree. Alexandre Felipe <[email protected]>
2026-02-01 23:54 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
@ 2026-02-03 21:42 ` Ants Aasma <[email protected]>
2026-02-03 22:41 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
2026-02-04 07:13 ` Re: New access method for b-tree. Michał Kłeczek <[email protected]>
1 sibling, 2 replies; 12+ messages in thread
From: Ants Aasma @ 2026-02-03 21:42 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: Alexandre Felipe <[email protected]>; pgsql-hackers
On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <[email protected]> wrote:
> I'm also wondering how common is the targeted query pattern? How common
> it is to have an IN condition on the leading column in an index, and
> ORDER BY on the second one?
I have seen this pattern multiple times. My nickname for it is the
timeline view. Think of the social media timeline, showing posts from
all followed accounts in timestamp order, returned in reasonably sized
batches. The naive SQL query will have to scan all posts from all
followed accounts and pass them through a top-N sort. When the total
number of posts is much larger than the batch size this is much slower
than what is proposed here (assuming I understand it correctly) -
effectively equivalent to running N index scans through Merge Append.
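To make the "N index scans through Merge Append" shape concrete, here is a minimal Python sketch. It is only a toy model, not PostgreSQL code: `index` is a hypothetical dict mapping each prefix value to its suffix values in ascending order (the timeline query would use descending timestamps), the prefixes are assumed to be distinct, and `fetched` counts how many index tuples the model reads.

```python
import heapq


def merge_scan(index, prefixes, limit):
    """K-way merge over per-prefix sorted streams, stopping at `limit` rows.

    Returns (rows, fetched) where rows are (suffix, prefix) pairs in
    ORDER BY suffix, prefix order and fetched is the number of tuples read.
    """
    fetched = 0
    heap = []
    cursors = {}
    # Position one cursor per prefix on its first tuple.
    for p in prefixes:
        it = iter(index.get(p, ()))
        first = next(it, None)
        if first is not None:
            fetched += 1
            heap.append((first, p))
            cursors[p] = it
    heapq.heapify(heap)

    rows = []
    while heap and len(rows) < limit:
        suffix, p = heapq.heappop(heap)
        rows.append((suffix, p))
        if len(rows) == limit:
            break  # do not read a tuple that will never be returned
        nxt = next(cursors[p], None)
        if nxt is not None:
            fetched += 1
            heapq.heappush(heap, (nxt, p))
    return rows, fetched
```

In this model a LIMIT of N over K prefixes reads at most N + K - 1 tuples: one per prefix to seed the heap, plus one replacement for each returned row except the last.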
The workarounds I have proposed to users have been either to rewrite the
query as a UNION ALL of a set of single value prefix queries wrapped
in an order by limit. This gives the exact needed merge append plan
shape. But repeating the query N times can get unwieldy when the
number of values grows, so the fallback is:
SELECT * FROM unnest(:friends) id, LATERAL (
SELECT * FROM posts
WHERE user_id = id
ORDER BY tstamp DESC LIMIT 100)
ORDER BY tstamp DESC LIMIT 100;
The downside of this formulation is that we still have to fetch a
batch worth of items from scans where we otherwise would have only had
to look at one index tuple.
The main problem I can see is that at planning time the cardinality of
the prefix array might not be known, and in theory could be in the
millions. Having millions of index scans open at the same time is not
viable, so the method needs to somehow degrade gracefully. The idea I
had is to pick some limit, based on work_mem and/or benchmarking, and
once the limit is hit, populate the first batch and then run the next
batch of index scans, merging with the first result. Or something like
that, I can imagine a few different ways to handle it with different
tradeoffs.
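One possible reading of that batching idea, as a hedged Python sketch rather than a design: `max_open` stands in for whatever limit work_mem or benchmarking would suggest, `index` is a hypothetical dict of prefix to ascending suffix values, and `prefix_stream` is an illustrative helper. Each round merges at most `max_open` prefix streams with the best rows found so far and keeps only the top `limit`.

```python
import heapq
from itertools import islice


def prefix_stream(index, prefix):
    """Yield (suffix, prefix) pairs for one prefix, in suffix order."""
    for suffix in index.get(prefix, ()):
        yield (suffix, prefix)


def batched_merge_scan(index, prefixes, limit, max_open):
    """Top-`limit` rows with at most `max_open` prefix streams open at once."""
    best = []  # sorted list holding the best rows seen so far
    for start in range(0, len(prefixes), max_open):
        batch = prefixes[start:start + max_open]
        streams = [prefix_stream(index, p) for p in batch]
        # heapq.merge is lazy, so each stream is only read as far as needed.
        best = list(islice(heapq.merge(best, *streams), limit))
    return best
```

The tradeoff in this variant is that every batch has to produce up to `limit` candidate rows again, so the early-termination benefit shrinks as the number of batches grows.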
I can imagine that this would really nicely benefit from ReadStream'ification.
One other connection I see is with block nested loops. In a perfect
future PostgreSQL could run the following as a set of merged index
scans that terminate early:
SELECT posts.*
FROM follows f
JOIN posts p ON f.followed_id = p.user_id
WHERE f.follower_id = :userid
ORDER BY p.tstamp DESC LIMIT 100;
In practice this is not a huge issue - it's not that hard to transform
this to array_agg and = ANY subqueries.
Regards,
Ants Aasma
* Re: New access method for b-tree.
2026-02-01 10:02 New access method for b-tree. Alexandre Felipe <[email protected]>
2026-02-01 23:54 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
2026-02-03 21:42 ` Re: New access method for b-tree. Ants Aasma <[email protected]>
@ 2026-02-03 22:41 ` Tomas Vondra <[email protected]>
2026-03-20 13:44 ` Re: New access method for b-tree. Alexandre Felipe <[email protected]>
1 sibling, 1 reply; 12+ messages in thread
From: Tomas Vondra @ 2026-02-03 22:41 UTC (permalink / raw)
To: Ants Aasma <[email protected]>; +Cc: Alexandre Felipe <[email protected]>; pgsql-hackers
On 2/3/26 22:42, Ants Aasma wrote:
> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <[email protected]> wrote:
>> I'm also wondering how common is the targeted query pattern? How common
>> it is to have an IN condition on the leading column in an index, and
>> ORDER BY on the second one?
>
> I have seen this pattern multiple times. My nickname for it is the
> timeline view. Think of the social media timeline, showing posts from
> all followed accounts in timestamp order, returned in reasonably sized
> batches. The naive SQL query will have to scan all posts from all
> followed accounts and pass them through a top-N sort. When the total
> number of posts is much larger than the batch size this is much slower
> than what is proposed here (assuming I understand it correctly) -
> effectively equivalent to running N index scans through Merge Append.
>
Makes sense. I guess filtering products by category + order by price
could also produce this query pattern.
> The workarounds I have proposed to users have been either to rewrite the
> query as a UNION ALL of a set of single value prefix queries wrapped
> in an order by limit. This gives the exact needed merge append plan
> shape. But repeating the query N times can get unwieldy when the
> number of values grows, so the fallback is:
>
> SELECT * FROM unnest(:friends) id, LATERAL (
> SELECT * FROM posts
> WHERE user_id = id
> ORDER BY tstamp DESC LIMIT 100)
> ORDER BY tstamp DESC LIMIT 100;
>
> The downside of this formulation is that we still have to fetch a
> batch worth of items from scans where we otherwise would have only had
> to look at one index tuple.
>
True. It's useful to think about the query this way, and it may be
better than full select + sort, but it has issues too.
> The main problem I can see is that at planning time the cardinality of
> the prefix array might not be known, and in theory could be in the
> millions. Having millions of index scans open at the same time is not
> viable, so the method needs to somehow degrade gracefully. The idea I
> had is to pick some limit, based on work_mem and/or benchmarking, and
> once the limit is hit, populate the first batch and then run the next
> batch of index scans, merging with the first result. Or something like
> that, I can imagine a few different ways to handle it with different
> tradeoffs.
>
Doesn't the proposed merge scan have a similar issue? Because that will
also have to keep all the index scans open (even if only internally).
Indeed, it needs to degrade gracefully, in some way. I'm afraid the
proposed batches execution will be rather complex, so I'd say v1 should
simply have a threshold, and do the full scan + sort for more items.
> I can imagine that this would really nicely benefit from ReadStream'ification.
>
Not sure, maybe.
> One other connection I see is with block nested loops. In a perfect
> future PostgreSQL could run the following as a set of merged index
> scans that terminate early:
>
> SELECT posts.*
> FROM follows f
> JOIN posts p ON f.followed_id = p.user_id
> WHERE f.follower_id = :userid
> ORDER BY p.tstamp DESC LIMIT 100;
>
> In practice this is not a huge issue - it's not that hard to transform
> this to array_agg and = ANY subqueries.
>
Automating that transformation seems quite non-trivial (to me).
regards
--
Tomas Vondra
* Re: New access method for b-tree.
2026-02-01 10:02 New access method for b-tree. Alexandre Felipe <[email protected]>
2026-02-01 23:54 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
2026-02-03 21:42 ` Re: New access method for b-tree. Ants Aasma <[email protected]>
2026-02-03 22:41 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
@ 2026-03-20 13:44 ` Alexandre Felipe <[email protected]>
0 siblings, 0 replies; 12+ messages in thread
From: Alexandre Felipe @ 2026-03-20 13:44 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: Ants Aasma <[email protected]>; Alexandre Felipe <[email protected]>; pgsql-hackers; [email protected]
Happy St. Patrick's day!
(This was sitting in my drafts.)
Based on what I said in previous emails, I see these alternative
proposals:
#1 Make it simpler by not changing the index access methods.
#2 Make it optimal by not using generic index searches
and not keeping multiple open index scans.
and
#3 Follow the pragmatic approach
The objective is to minimize the number of heap fetches,
staying as high level as possible and reusing existing functions
instead of writing custom code when possible.
Ants Aasma & Tomas Vondra
> > The workarounds I have proposed to users have been either to rewrite the
> > query as a UNION ALL of a set of single value prefix queries wrapped
> > in an order by limit. This gives the exact needed merge append plan
> > shape. But repeating the query N times can get unwieldy when the
> > number of values grows, so the fallback is:
> >
> > SELECT * FROM unnest(:friends) id, LATERAL (
> > SELECT * FROM posts
> > WHERE user_id = id
> > ORDER BY tstamp DESC LIMIT 100)
> > ORDER BY tstamp DESC LIMIT 100;
> >
> > The downside of this formulation is that we still have to fetch a
> > batch worth of items from scans where we otherwise would have only had
> > to look at one index tuple.
> >
> True. It's useful to think about the query this way, and it may be
> better than full select + sort, but it has issues too.
>
An issue with this query is generality: if it is joined with other
queries, we can't determine the limit in advance.
> > The main problem I can see is that at planning time the cardinality of
> > the prefix array might not be known, and in theory could be in the
> > millions. Having millions of index scans open at the same time is not
> > viable, so the method needs to somehow degrade gracefully. The idea I
> > had is to pick some limit, based on work_mem and/or benchmarking, and
> > once the limit is hit, populate the first batch and then run the next
> > batch of index scans, merging with the first result. Or something like
> > that, I can imagine a few different ways to handle it with different
> > tradeoffs.
> >
>
> Doesn't the proposed merge scan have a similar issue? Because that will
> also have to keep all the index scans open (even if only internally).
> Indeed, it needs to degrade gracefully, in some way.
It is true, but I think we can trust the planner.
This problem scales similarly in a memoize node.
Is ~24kB for each open index scan a good guess?
ALTERNATIVE #1 - More efficient
Or to avoid having N open index scans we could (??)
(1) find the index page for the head of each prefix.
(2) for each prefix
(2.a) load tuples from each head page, if we reach
(2.b) if we consume the last tuple in a page save a pointer
to the next page.
(2.c) check if tuples for the next prefix are in the same page
(2.d) Release the page.
(3) produce tuples in the suffix order
(3.b) when the tuples for a prefix are exhausted, load the
page saved in (2.b)
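The steps above can be modelled roughly as follows. This is a toy Python sketch under strong assumptions, not nbtree code: leaf pages are plain lists of (prefix, suffix) tuples, "releasing" a page just means keeping a copy of the matching suffixes plus the number of the next page, and step (2.c), sharing a page between adjacent prefixes, is not modelled. All function names are hypothetical.

```python
import heapq
from bisect import bisect_left


def build_pages(rows, page_size):
    """Split rows sorted by (prefix, suffix) into fixed-size leaf pages."""
    rows = sorted(rows)
    return [rows[i:i + page_size] for i in range(0, len(rows), page_size)]


def find_first_page(pages, prefix):
    """Step (1): number of the page that may hold the first tuple of prefix."""
    first_keys = [page[0] for page in pages]
    i = bisect_left(first_keys, (prefix,))  # first page starting at/after prefix
    if i > 0 and pages[i - 1][-1][0] >= prefix:
        return i - 1  # the prefix starts inside the previous page
    return i if i < len(pages) else None


def load_page(pages, page_no, prefix):
    """Steps (2.a), (2.b), (2.d): copy matching suffixes, remember next page."""
    page = pages[page_no]
    suffixes = [s for (p, s) in page if p == prefix]
    # The prefix can only continue if it runs up to the end of this page.
    more = page[-1][0] == prefix and page_no + 1 < len(pages)
    return suffixes, (page_no + 1 if more else None)


def paged_merge_scan(pages, prefixes, limit):
    """Step (3): emit (suffix, prefix) rows in suffix order, page by page."""
    heads, heap, out = {}, [], []
    pages_read = 0

    def refill(prefix, page_no):
        nonlocal pages_read
        while page_no is not None:
            pages_read += 1
            suffixes, page_no = load_page(pages, page_no, prefix)
            if suffixes:
                heads[prefix] = (iter(suffixes), page_no)
                heapq.heappush(heap, (next(heads[prefix][0]), prefix))
                return

    for prefix in prefixes:
        refill(prefix, find_first_page(pages, prefix))
    while heap and len(out) < limit:
        suffix, prefix = heapq.heappop(heap)
        out.append((suffix, prefix))
        buffered, next_page = heads[prefix]
        nxt = next(buffered, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, prefix))
        else:
            refill(prefix, next_page)  # step (3.b)
    return out, pages_read
```

In this model the per-prefix state is a small buffer and a page number rather than an open index scan, which is the point of the alternative. Whether the same holds for real leaf pages, with concurrent splits and deletions, is left open here.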
Matthias van de Meent, Feb 3
> btree index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are directly
> adjacent, while merge scan would generally do at least one index
> descent for each of its N scan heads (*) - which in the proposed
> prototype patch guarantees O(index depth * num scan heads) buffer
> accesses.
This could also be addressed if we do this custom descent.
I didn't bother with the depth factor because, with a few random prefixes,
doing so would probably save accesses only for the top level.
I would prefer to start with a very conceptual implementation
that can already provide 1000x speedup, but if you think this
way is better, I am open to try it. I think this can be done
without affecting the planner logic and the PrefixJoin node.
> I'm afraid the
> proposed batches execution will be rather complex, so I'd say v1 should
> simply have a threshold, and do the full scan + sort for more items.
Do you mean an executor node that performs the query as if it were written
like this?
ALTERNATIVE #2 - Simpler(??)
for each _prefix of prefixes:
result += (SELECT FROM table
WHERE prefix = _prefix AND qual(*)
ORDER BY suffix
LIMIT N)
return SELECT * FROM result
ORDER BY suffix
LIMIT N
This query may have to produce N * len(prefixes) rows, while the
original proposal would produce only N + len(prefixes) - 1.
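A small Python model of ALTERNATIVE #2 shows where the N * len(prefixes) figure comes from. It is a sketch under assumptions: `index` is a hypothetical dict of prefix to ascending suffix values, the extra qual is ignored, and `fetched` counts the rows the per-prefix subqueries produce.

```python
def prefix_batch_scan(index, prefixes, limit):
    """Run one 'ORDER BY suffix LIMIT n' per prefix, then sort and cut again."""
    rows = []
    fetched = 0
    for prefix in prefixes:
        batch = index.get(prefix, [])[:limit]  # per-prefix LIMIT
        fetched += len(batch)
        rows.extend((suffix, prefix) for suffix in batch)
    rows.sort()  # final ORDER BY suffix (prefix as tiebreaker)
    return rows[:limit], fetched


# Three prefixes with 100 rows each, LIMIT 10.
index = {p: list(range(100)) for p in (1, 2, 3)}
top, fetched = prefix_batch_scan(index, [1, 2, 3], 10)
```

Here `fetched` is 30, that is N * len(prefixes), whereas the bound quoted for the merge scan would be 10 + 3 - 1 = 12 for the same inputs.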
Alexandre Felipe, Feb 6
> | Method | Shared Hit | Shared Read | Exec Time |
> |------------|-----------:|------------:|----------:|
> | Merge | 13 | 119 | 13 ms |
> | IndexScan | 15,308 | 525,310 | 3,409 ms |
With this Prefix Batch Scan approach:
hit=62 read=773, Execution Time: 80.815 ms
> > I can imagine that this would really nicely benefit from
> > ReadStream'ification.
> >
>
> Not sure, maybe.
>
Actually as I was watching the index prefetch development I was
quite uncertain about how this would play with that, but we can
probably simply give a budget for each stream.
> > One other connection I see is with block nested loops. In a perfect
> > future PostgreSQL could run the following as a set of merged index
> > scans that terminate early:
> >
> > SELECT posts.*
> > FROM follows f
> > JOIN posts p ON f.followed_id = p.user_id
> > WHERE f.follower_id = :userid
> > ORDER BY p.tstamp DESC LIMIT 100;
> >
> > In practice this is not a huge issue - it's not that hard to transform
> > this to array_agg and = ANY subqueries.
> >
> Automating that transformation seems quite non-trivial (to me).
>
Well, not trivial. To give a rough idea:
wc -l *.patch
113 v2-0001-Test-the-baseline.patch
614 v2-0002-Access-method.patch
850 v2-0003-Planner-integration.patch
1958 v2-0004-Multi-column.patch
2439 v2-0005-Joins.patch
It is missing some important details, like prefix deduplication,
but for the scenario where the values in the other table
are known to be unique it works well.
The multi-column patch accepts things like A IN (...) AND B IN (...),
for which it computes the cartesian product, or (A, B) IN (...).
Regards,
Alexandre
* Re: New access method for b-tree.
2026-02-01 10:02 New access method for b-tree. Alexandre Felipe <[email protected]>
2026-02-01 23:54 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
2026-02-03 21:42 ` Re: New access method for b-tree. Ants Aasma <[email protected]>
@ 2026-02-04 07:13 ` Michał Kłeczek <[email protected]>
2026-02-05 06:59 ` Re: New access method for b-tree. Alexandre Felipe <[email protected]>
1 sibling, 1 reply; 12+ messages in thread
From: Michał Kłeczek @ 2026-02-04 07:13 UTC (permalink / raw)
To: Ants Aasma <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Alexandre Felipe <[email protected]>; pgsql-hackers
> On 3 Feb 2026, at 22:42, Ants Aasma <[email protected]> wrote:
>
> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <[email protected]> wrote:
>> I'm also wondering how common is the targeted query pattern? How common
>> it is to have an IN condition on the leading column in an index, and
>> ORDER BY on the second one?
>
> I have seen this pattern multiple times. My nickname for it is the
> timeline view. Think of the social media timeline, showing posts from
> all followed accounts in timestamp order, returned in reasonably sized
> batches. The naive SQL query will have to scan all posts from all
> followed accounts and pass them through a top-N sort. When the total
> number of posts is much larger than the batch size this is much slower
> than what is proposed here (assuming I understand it correctly) -
> effectively equivalent to running N index scans through Merge Append.
>
> The workarounds I have proposed to users have been either to rewrite the
> query as a UNION ALL of a set of single value prefix queries wrapped
> in an order by limit. This gives the exact needed merge append plan
> shape. But repeating the query N times can get unwieldy when the
> number of values grows, so the fallback is:
>
> SELECT * FROM unnest(:friends) id, LATERAL (
> SELECT * FROM posts
> WHERE user_id = id
> ORDER BY tstamp DESC LIMIT 100)
> ORDER BY tstamp DESC LIMIT 100;
>
> The downside of this formulation is that we still have to fetch a
> batch worth of items from scans where we otherwise would have only had
> to look at one index tuple.
GIST can be used to handle this kind of query as it supports multiple sort orders.
The only problem is that GIST does not support ORDER BY column.
One possible workaround is [1] but as described there it does not play well with partitioning.
I’ve started drafting support for ORDER BY column in GIST - see [2].
I think it would be easier to implement and maintain than a new IAM (but I don’t have enough knowledge and experience to implement it myself)
[1] https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org
[2] https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org
—
Kind regards,
Michal
* Re: New access method for b-tree.
2026-02-01 10:02 New access method for b-tree. Alexandre Felipe <[email protected]>
2026-02-01 23:54 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
2026-02-03 21:42 ` Re: New access method for b-tree. Ants Aasma <[email protected]>
2026-02-04 07:13 ` Re: New access method for b-tree. Michał Kłeczek <[email protected]>
@ 2026-02-05 06:59 ` Alexandre Felipe <[email protected]>
2026-02-06 10:52 ` Re: New access method for b-tree. Alexandre Felipe <[email protected]>
0 siblings, 1 reply; 12+ messages in thread
From: Alexandre Felipe @ 2026-02-05 06:59 UTC (permalink / raw)
To: Michał Kłeczek <[email protected]>; +Cc: Ants Aasma <[email protected]>; Tomas Vondra <[email protected]>; Alexandre Felipe <[email protected]>; pgsql-hackers
Thank you for looking into this.
Now we can execute a, still narrow, family of queries!
Maybe it helps to see this as a *social network feed*. Imagine a social
network: you have a few friends, or follow a few people, and you want to
see their updates ordered by date. For each user we have a different
combination of users whose posts we have to display. But maybe, even with
hundreds of users, we will only show the first 10.
There is a low hanging fruit on the skip scan: if we need M rows and one
group already has M rows, we could stop reading that group there.
Let Nx be the number of friends, N the number of posts per friend, and M
the number of posts to show.
This runs with complexity (Nx * M) rows, followed by an (Nx * M) sort,
instead of (Nx * N) followed by an (Nx * N) sort.
Where M = 10 and N is 1000 this is a significant improvement.
But if M ~ N that shortcut gains nothing, whereas the merge scan runs with
M + Nx row accesses and (M + Nx) heap operations.
If everything is on the same page the skip scan would win.
The cost estimation is probably far off.
I am also not considering the filters applied after this operator, and I
don't know if the planner infrastructure is able to adjust it by itself.
This is where I would like reviewers' feedback. I think that the planner
costs are something to be determined experimentally.
Next I will make it slightly more general, handling:
* More index columns: Index (a, b, s...) could support WHERE a IN (...)
ORDER BY b LIMIT N (ignoring s...)
* Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
* Non-leading prefix: WHERE b IN (...) AND a = const ORDER BY c on index
(a, b, c)
---
Kind Regards,
Alexandre
On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <[email protected]> wrote:
>
>
> On 3 Feb 2026, at 22:42, Ants Aasma <[email protected]> wrote:
>
> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <[email protected]> wrote:
>
> I'm also wondering how common is the targeted query pattern? How common
> it is to have an IN condition on the leading column in an index, and
> ORDER BY on the second one?
>
>
> I have seen this pattern multiple times. My nickname for it is the
> timeline view. Think of the social media timeline, showing posts from
> all followed accounts in timestamp order, returned in reasonably sized
> batches. The naive SQL query will have to scan all posts from all
> followed accounts and pass them through a top-N sort. When the total
> number of posts is much larger than the batch size this is much slower
> than what is proposed here (assuming I understand it correctly) -
> effectively equivalent to running N index scans through Merge Append.
>
>
> The workarounds I have proposed to users have been either to rewrite the
> query as a UNION ALL of a set of single value prefix queries wrapped
> in an order by limit. This gives the exact needed merge append plan
> shape. But repeating the query N times can get unwieldy when the
> number of values grows, so the fallback is:
>
> SELECT * FROM unnest(:friends) id, LATERAL (
> SELECT * FROM posts
> WHERE user_id = id
> ORDER BY tstamp DESC LIMIT 100)
> ORDER BY tstamp DESC LIMIT 100;
>
> The downside of this formulation is that we still have to fetch a
> batch worth of items from scans where we otherwise would have only had
> to look at one index tuple.
>
>
> GIST can be used to handle this kind of query as it supports multiple
> sort orders.
> The only problem is that GIST does not support ORDER BY column.
> One possible workaround is [1] but as described there it does not play
> well with partitioning.
> I’ve started drafting support for ORDER BY column in GIST - see [2].
> I think it would be easier to implement and maintain than a new IAM (but I
> don’t have enough knowledge and experience to implement it myself)
>
> [1]
> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org
> [2]
> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org
>
> —
> Kind regards,
> Michal
>
Attachments:
[application/octet-stream] 0002-MERGE-SCAN-Access-method.patch (49.1K, 3-0002-MERGE-SCAN-Access-method.patch)
download | inline diff:
From d86b371499db011a36583d20963df68b09219190 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 30 Jan 2026 14:27:18 +0000
Subject: [PATCH 2/3] [MERGE-SCAN]: Access method
---
.gitignore | 8 +
src/backend/access/nbtree/Makefile | 1 +
src/backend/access/nbtree/meson.build | 1 +
src/backend/access/nbtree/nbtmergescan.c | 457 ++++++++++++++++++
src/include/access/nbtree.h | 64 +++
src/test/modules/meson.build | 1 +
src/test/modules/test_btree_merge/Makefile | 24 +
.../expected/test_btree_merge.out | 243 ++++++++++
src/test/modules/test_btree_merge/meson.build | 33 ++
.../test_btree_merge/sql/test_btree_merge.sql | 207 ++++++++
.../test_btree_merge--1.0.sql | 43 ++
.../test_btree_merge/test_btree_merge.c | 389 +++++++++++++++
.../test_btree_merge/test_btree_merge.control | 5 +
13 files changed, 1476 insertions(+)
create mode 100644 src/backend/access/nbtree/nbtmergescan.c
create mode 100644 src/test/modules/test_btree_merge/Makefile
create mode 100644 src/test/modules/test_btree_merge/expected/test_btree_merge.out
create mode 100644 src/test/modules/test_btree_merge/meson.build
create mode 100644 src/test/modules/test_btree_merge/sql/test_btree_merge.sql
create mode 100644 src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
create mode 100644 src/test/modules/test_btree_merge/test_btree_merge.c
create mode 100644 src/test/modules/test_btree_merge/test_btree_merge.control
diff --git a/.gitignore b/.gitignore
index 4e911395fe3..ac1f95d9cf0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,3 +43,11 @@ lib*.pc
/Release/
/tmp_install/
/portlock/
+
+# hidden files (e.g. .dbdata, .install, good practice to test locally in isolation)
+.*
+
+# Test output
+**/regression.diffs
+**/regression.out
+**/results/
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index 0daf640af96..72053cefdaa 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -16,6 +16,7 @@ OBJS = \
nbtcompare.o \
nbtdedup.o \
nbtinsert.o \
+ nbtmergescan.o \
nbtpage.o \
nbtpreprocesskeys.o \
nbtreadpage.o \
diff --git a/src/backend/access/nbtree/meson.build b/src/backend/access/nbtree/meson.build
index 812f067e710..1016fea62d5 100644
--- a/src/backend/access/nbtree/meson.build
+++ b/src/backend/access/nbtree/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'nbtcompare.c',
'nbtdedup.c',
'nbtinsert.c',
+ 'nbtmergescan.c',
'nbtpage.c',
'nbtpreprocesskeys.c',
'nbtreadpage.c',
diff --git a/src/backend/access/nbtree/nbtmergescan.c b/src/backend/access/nbtree/nbtmergescan.c
new file mode 100644
index 00000000000..70828dc73d3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtmergescan.c
@@ -0,0 +1,457 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtmergescan.c
+ * B-Tree merge scan for efficient evaluation of IN-list queries
+ *
+ * This module implements a K-way merge scan for B-tree indexes, optimized
+ * for queries of the form:
+ * WHERE prefix IN (v1, v2, ..., vK) AND suffix >= b ORDER BY suffix LIMIT N
+ *
+ * The algorithm maintains a min-heap of cursors, one per prefix value.
+ * Each cursor tracks its position within the index for that prefix.
+ * Tuples are returned in suffix order by repeatedly extracting the
+ * minimum from the heap.
+ *
+ * Target behavior: Access at most N + K - 1 index tuples for LIMIT N.
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtmergescan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "lib/pairingheap.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/* Forward declarations of static functions */
+static int bt_merge_heap_cmp(const pairingheap_node *a,
+ const pairingheap_node *b,
+ void *arg);
+static bool bt_merge_cursor_init(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ Datum prefix_value,
+ bool prefix_isnull);
+static bool bt_merge_cursor_advance(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor);
+static Datum bt_merge_extract_sortkey(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ bool *isnull);
+
+
+/*
+ * bt_merge_heap_cmp
+ * Compare two cursors by their current sort key (suffix value).
+ *
+ * When sort keys are equal, uses prefix value as tiebreaker for
+ * deterministic ordering (ORDER BY suffix, prefix).
+ *
+ * pairingheap is a max-heap and we want min-heap behavior, so the result is
+ * inverted: returns positive if a sorts before b.
+ */
+static int
+bt_merge_heap_cmp(const pairingheap_node *a,
+ const pairingheap_node *b,
+ void *arg)
+{
+ BTMergeScanState *state = (BTMergeScanState *) arg;
+ BTMergeCursor *cursor_a = pairingheap_container(BTMergeCursor, ph_node,
+ (pairingheap_node *) a);
+ BTMergeCursor *cursor_b = pairingheap_container(BTMergeCursor, ph_node,
+ (pairingheap_node *) b);
+ Datum key_a = cursor_a->sort_key;
+ Datum key_b = cursor_b->sort_key;
+ bool null_a = cursor_a->sort_key_isnull;
+ bool null_b = cursor_b->sort_key_isnull;
+ int32 cmp;
+
+ /* Handle NULLs - NULLs sort last (NULLS LAST default for ASC) */
+ if (null_a && null_b)
+ return 0;
+ if (null_a)
+ return -1; /* a is NULL, comes after b */
+ if (null_b)
+ return 1; /* b is NULL, comes after a */
+
+ /* Compare using the suffix column's comparison function */
+ cmp = DatumGetInt32(FunctionCall2Coll(&state->suffix_cmp,
+ state->suffix_collation,
+ key_a, key_b));
+
+ /*
+ * Use prefix value as tiebreaker for deterministic ordering.
+ * This ensures ORDER BY suffix, prefix behavior.
+ */
+ if (cmp == 0)
+ {
+ /* Compare prefix values (assumes pass-by-value int4 for now) */
+ int32 prefix_a = DatumGetInt32(cursor_a->prefix_value);
+ int32 prefix_b = DatumGetInt32(cursor_b->prefix_value);
+
+ if (prefix_a < prefix_b)
+ cmp = -1;
+ else if (prefix_a > prefix_b)
+ cmp = 1;
+ }
+
+ /* Invert for min-heap behavior; written so INT_MIN is never negated */
+ return (cmp < 0) ? 1 : -cmp;
+}
+
+
+/*
+ * bt_merge_init
+ * Initialize a merge scan state.
+ *
+ * Creates the merge state with one cursor per prefix value.
+ * The cursors will be positioned at their first matching tuples
+ * when bt_merge_getnext is first called.
+ */
+BTMergeScanState *
+bt_merge_init(IndexScanDesc scan,
+ Datum *prefix_values,
+ bool *prefix_nulls,
+ int num_prefixes,
+ int prefix_attno,
+ int suffix_attno,
+ Oid suffix_cmp_oid,
+ Oid suffix_collation)
+{
+ BTMergeScanState *state;
+ MemoryContext merge_context;
+ MemoryContext old_context;
+ int i;
+
+ /* Create memory context for merge scan allocations */
+ merge_context = AllocSetContextCreate(CurrentMemoryContext,
+ "BTMergeScan",
+ ALLOCSET_DEFAULT_SIZES);
+ old_context = MemoryContextSwitchTo(merge_context);
+
+ /* Allocate main state structure */
+ state = palloc0(sizeof(BTMergeScanState));
+ state->merge_context = merge_context;
+ state->num_cursors = num_prefixes;
+ state->active_cursors = 0;
+ state->prefix_attno = prefix_attno;
+ state->suffix_attno = suffix_attno;
+ state->suffix_collation = suffix_collation;
+ state->direction = ForwardScanDirection;
+ state->initialized = false;
+ state->tuples_accessed = 0;
+
+ /* Set up suffix comparison function */
+ fmgr_info(suffix_cmp_oid, &state->suffix_cmp);
+
+ /* Allocate cursor array */
+ state->cursors = palloc0(num_prefixes * sizeof(BTMergeCursor));
+
+ /* Initialize cursor metadata (not positioned yet) */
+ for (i = 0; i < num_prefixes; i++)
+ {
+ BTMergeCursor *cursor = &state->cursors[i];
+
+ cursor->cursor_id = i;
+ cursor->prefix_value = datumCopy(prefix_values[i], true, sizeof(Datum));
+ cursor->prefix_isnull = prefix_nulls[i];
+ cursor->exhausted = prefix_nulls[i]; /* NULL prefix = exhausted */
+ cursor->sort_key_isnull = true;
+ BTScanPosInvalidate(cursor->pos);
+ cursor->tuples = NULL;
+ }
+
+ /* Initialize the merge heap */
+ state->merge_heap = pairingheap_allocate(bt_merge_heap_cmp, state);
+
+ MemoryContextSwitchTo(old_context);
+
+ return state;
+}
+
+
+/*
+ * bt_merge_getnext
+ * Get the next tuple from the merge scan.
+ *
+ * Returns true if a tuple was found, false if scan is exhausted.
+ * The tuple's TID is stored in scan->xs_heaptid.
+ */
+bool
+bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTMergeScanState *state = so->mergeState;
+ BTMergeCursor *cursor;
+ pairingheap_node *node;
+ int i;
+
+ if (state == NULL)
+ return false;
+
+ /* Initialize cursors on first call */
+ if (!state->initialized)
+ {
+ state->initialized = true;
+ state->direction = dir;
+
+ for (i = 0; i < state->num_cursors; i++)
+ {
+ BTMergeCursor *c = &state->cursors[i];
+
+ if (!c->exhausted &&
+ bt_merge_cursor_init(state, scan, c,
+ c->prefix_value, c->prefix_isnull))
+ {
+ /* Cursor has at least one tuple, add to heap */
+ pairingheap_add(state->merge_heap, &c->ph_node);
+ state->active_cursors++;
+ }
+ }
+ }
+
+ /* Get the cursor with the smallest suffix value */
+ if (pairingheap_is_empty(state->merge_heap))
+ return false;
+
+ node = pairingheap_remove_first(state->merge_heap);
+ cursor = pairingheap_container(BTMergeCursor, ph_node, node);
+
+ /* Set up the heap TID from the current cursor position */
+ Assert(BTScanPosIsValid(cursor->pos));
+ scan->xs_heaptid = cursor->pos.items[cursor->pos.itemIndex].heapTid;
+
+ /* Advance cursor to next tuple */
+ if (bt_merge_cursor_advance(state, scan, cursor))
+ {
+ /* Cursor still has tuples, re-add to heap */
+ pairingheap_add(state->merge_heap, &cursor->ph_node);
+ }
+ else
+ {
+ /* Cursor exhausted */
+ state->active_cursors--;
+ }
+
+ return true;
+}
+
+
+/*
+ * bt_merge_end
+ * Clean up merge scan state.
+ */
+void
+bt_merge_end(BTMergeScanState *state)
+{
+ if (state == NULL)
+ return;
+
+ /* Free the memory context, which frees all allocations */
+ MemoryContextDelete(state->merge_context);
+}
+
+
+/*
+ * bt_merge_cursor_init
+ * Initialize a cursor and position it at the first matching tuple.
+ *
+ * Returns true if the cursor found at least one matching tuple.
+ */
+static bool
+bt_merge_cursor_init(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ Datum prefix_value,
+ bool prefix_isnull)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool found;
+
+ if (prefix_isnull)
+ {
+ cursor->exhausted = true;
+ return false;
+ }
+
+ /*
+ * Modify the scan key to use this cursor's prefix value.
+ * We reuse the scan's existing key infrastructure.
+ */
+ for (int i = 0; i < so->numberOfKeys; i++)
+ {
+ if (so->keyData[i].sk_attno == state->prefix_attno)
+ {
+ so->keyData[i].sk_argument = prefix_value;
+ so->keyData[i].sk_flags &= ~(SK_SEARCHARRAY);
+ break;
+ }
+ }
+
+ /* Invalidate current position to force _bt_first */
+ BTScanPosInvalidate(so->currPos);
+
+ /* Disable array key handling for this cursor's scan */
+ so->numArrayKeys = 0;
+
+ /* Position at first matching tuple */
+ found = _bt_first(scan, state->direction);
+
+ if (found)
+ {
+ /* Copy position to cursor */
+ memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+
+ /* Extract the sort key for heap ordering */
+ cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
+ &cursor->sort_key_isnull);
+ cursor->exhausted = false;
+
+ /* Count this as a tuple access */
+ state->tuples_accessed++;
+
+ /* Invalidate main scan position */
+ BTScanPosInvalidate(so->currPos);
+ }
+ else
+ {
+ cursor->exhausted = true;
+ }
+
+ return found;
+}
+
+
+/*
+ * bt_merge_cursor_advance
+ * Advance a cursor to its next tuple.
+ *
+ * Returns true if the cursor now points to a valid tuple, false if exhausted.
+ */
+static bool
+bt_merge_cursor_advance(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool found = false;
+
+ if (cursor->exhausted)
+ return false;
+
+ /* Try to move to next tuple within current page's items array */
+ if (state->direction == ForwardScanDirection)
+ {
+ if (cursor->pos.itemIndex < cursor->pos.lastItem)
+ {
+ cursor->pos.itemIndex++;
+ found = true;
+ }
+ }
+ else
+ {
+ if (cursor->pos.itemIndex > cursor->pos.firstItem)
+ {
+ cursor->pos.itemIndex--;
+ found = true;
+ }
+ }
+
+ if (!found)
+ {
+ /*
+ * Current page exhausted. Use _bt_next to get the next page.
+ * We swap our cursor's position into the scan's currPos,
+ * call _bt_next, then swap back.
+ */
+ BTScanPosData save_pos;
+
+ memcpy(&save_pos, &so->currPos, sizeof(BTScanPosData));
+ memcpy(&so->currPos, &cursor->pos, sizeof(BTScanPosData));
+
+ found = _bt_next(scan, state->direction);
+
+ if (found)
+ memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+
+ memcpy(&so->currPos, &save_pos, sizeof(BTScanPosData));
+ }
+
+ if (found)
+ {
+ /* Extract new sort key */
+ cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
+ &cursor->sort_key_isnull);
+ state->tuples_accessed++;
+ }
+ else
+ {
+ cursor->exhausted = true;
+ }
+
+ return found;
+}
+
+
+/*
+ * bt_merge_extract_sortkey
+ * Extract the sort key (suffix column value) from the current tuple.
+ */
+static Datum
+bt_merge_extract_sortkey(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ bool *isnull)
+{
+ Relation rel = scan->indexRelation;
+ Buffer buf;
+ Page page;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ TupleDesc tupdesc;
+ Datum result;
+
+ if (cursor->pos.currPage == InvalidBlockNumber)
+ {
+ *isnull = true;
+ return (Datum) 0;
+ }
+
+ /* Read the page */
+ buf = ReadBuffer(rel, cursor->pos.currPage);
+ LockBuffer(buf, BT_READ);
+ page = BufferGetPage(buf);
+
+ offnum = cursor->pos.items[cursor->pos.itemIndex].indexOffset;
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ tupdesc = RelationGetDescr(rel);
+
+ /* Extract the suffix column value */
+ result = index_getattr(itup, state->suffix_attno, tupdesc, isnull);
+
+ /* Copy pass-by-reference values before releasing buffer */
+ if (!*isnull)
+ {
+ Form_pg_attribute attr = TupleDescAttr(tupdesc, state->suffix_attno - 1);
+
+ if (!attr->attbyval)
+ result = datumCopy(result, attr->attbyval, attr->attlen);
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ return result;
+}
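The K-way merge implemented in nbtmergescan.c can be illustrated outside the backend. The following is a minimal standalone sketch, not the patch's code: `Run`, `merge_limit` and the `lazy` flag are hypothetical names, plain sorted int arrays stand in for the per-prefix index cursors, and a small binary heap stands in for lib/pairingheap. It counts how many elements are read, under the assumption that positioning a cursor on an element costs one access. With `lazy` set, the winning run is advanced only when another row is actually requested, which gives the N + K - 1 target. With it clear, the run is advanced right after each row is returned, as bt_merge_getnext does, which gives N + K.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RUNS 16				/* arbitrary cap for this sketch */

/* Hypothetical stand-in for one per-prefix cursor. */
typedef struct Run
{
	const int  *vals;			/* ascending suffix values for one prefix */
	size_t		len;
	size_t		pos;			/* index of the current element */
	int			prefix;			/* tiebreaker: ORDER BY suffix, prefix */
} Run;

/* Nonzero if a's current element sorts before b's in (suffix, prefix) order. */
static int
run_before(const Run *a, const Run *b)
{
	if (a->vals[a->pos] != b->vals[b->pos])
		return a->vals[a->pos] < b->vals[b->pos];
	return a->prefix < b->prefix;
}

/* Restore the min-heap property below index i. */
static void
sift_down(Run **heap, size_t n, size_t i)
{
	for (;;)
	{
		size_t		l = 2 * i + 1, r = l + 1, m = i;
		Run		   *tmp;

		if (l < n && run_before(heap[l], heap[m]))
			m = l;
		if (r < n && run_before(heap[r], heap[m]))
			m = r;
		if (m == i)
			return;
		tmp = heap[i];
		heap[i] = heap[m];
		heap[m] = tmp;
		i = m;
	}
}

/* Step the top run to its next element, or drop it if it is exhausted. */
static void
advance_top(Run **heap, size_t *n, size_t *fetched)
{
	if (heap[0]->pos + 1 < heap[0]->len)
	{
		heap[0]->pos++;
		(*fetched)++;
	}
	else
		heap[0] = heap[--(*n)];
	sift_down(heap, *n, 0);
}

/*
 * Write up to 'limit' values to 'out' in merged order and return how many
 * were written. *fetched receives the number of elements read from the runs.
 */
size_t
merge_limit(Run *runs, size_t k, size_t limit, int lazy,
			int *out, size_t *fetched)
{
	Run		   *heap[MAX_RUNS];
	size_t		n = 0, emitted = 0, i;
	int			pending = 0;	/* lazy mode: top was emitted, not advanced */

	assert(k <= MAX_RUNS);
	*fetched = 0;

	/* Position every non-empty run on its first element: up to K reads. */
	for (i = 0; i < k; i++)
	{
		runs[i].pos = 0;
		if (runs[i].len > 0)
		{
			heap[n++] = &runs[i];
			(*fetched)++;
		}
	}
	for (i = n / 2; i-- > 0;)
		sift_down(heap, n, i);

	while (emitted < limit && n > 0)
	{
		if (pending)
		{
			advance_top(heap, &n, fetched);
			pending = 0;
			if (n == 0)
				break;
		}
		out[emitted++] = heap[0]->vals[heap[0]->pos];
		if (lazy)
			pending = 1;
		else
			advance_top(heap, &n, fetched);
	}
	return emitted;
}
```

For three runs that each hold 50..100 and a limit of 5, the eager variant reads 8 elements and the lazy one 7. That would be consistent with Test 1 reporting tuples_accessed = 8 against a target of 7, although this reading is an inference from the code above and is not stated in the patch.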
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 77224859685..0d4e7440760 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -20,6 +20,7 @@
#include "catalog/pg_am_d.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
+#include "lib/pairingheap.h"
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
@@ -1050,6 +1051,49 @@ typedef struct BTArrayKeyInfo
ScanKey high_compare; /* array's < or <= upper bound */
} BTArrayKeyInfo;
+/*
+ * BTMergeCursor - tracks scan state for one prefix value in merge scan
+ *
+ * Each cursor maintains its own position within the index for a specific
+ * prefix value. Cursors are organized in a min-heap ordered by their
+ * current suffix key value for efficient K-way merge.
+ */
+typedef struct BTMergeCursor
+{
+ pairingheap_node ph_node; /* pairing heap node for merge */
+ int cursor_id; /* index in merge state's cursors array */
+ Datum prefix_value; /* the prefix value for this sub-scan */
+ bool prefix_isnull; /* is prefix value NULL? */
+ Datum sort_key; /* current tuple's sort key (suffix) */
+ bool sort_key_isnull;/* is sort key NULL? */
+ bool exhausted; /* no more tuples for this prefix */
+ BTScanPosData pos; /* current position in index */
+ char *tuples; /* tuple storage workspace (BLCKSZ) */
+} BTMergeCursor;
+
+/*
+ * BTMergeScanState - state for K-way merge scan
+ *
+ * This structure manages multiple cursors for a merge scan, allowing
+ * lazy evaluation of queries like:
+ * WHERE prefix IN (v1, v2, ..., vK) AND suffix >= b ORDER BY suffix LIMIT N
+ */
+typedef struct BTMergeScanState
+{
+ int num_cursors; /* number of prefix values (K) */
+ int active_cursors; /* cursors not yet exhausted */
+ BTMergeCursor *cursors; /* array of cursors */
+ pairingheap *merge_heap; /* min-heap ordered by sort_key */
+ int prefix_attno; /* attribute number of prefix column (1-based) */
+ int suffix_attno; /* attribute number of suffix column (1-based) */
+ FmgrInfo suffix_cmp; /* comparison function for suffix */
+ Oid suffix_collation; /* collation for suffix comparison */
+ ScanDirection direction; /* scan direction */
+ bool initialized; /* have cursors been initialized? */
+ MemoryContext merge_context;/* memory context for allocations */
+ int64 tuples_accessed;/* count of index tuples accessed */
+} BTMergeScanState;
+
typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
@@ -1089,6 +1133,12 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /*
+ * Merge scan state, if using merge scan optimization.
+ * NULL if not using merge scan.
+ */
+ BTMergeScanState *mergeState;
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -1334,4 +1384,18 @@ extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+/*
+ * prototypes for functions in nbtmergescan.c
+ */
+extern BTMergeScanState *bt_merge_init(IndexScanDesc scan,
+ Datum *prefix_values,
+ bool *prefix_nulls,
+ int num_prefixes,
+ int prefix_attno,
+ int suffix_attno,
+ Oid suffix_cmp_oid,
+ Oid suffix_collation);
+extern bool bt_merge_getnext(IndexScanDesc scan, ScanDirection dir);
+extern void bt_merge_end(BTMergeScanState *state);
+
#endif /* NBTREE_H */
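The merge_heap field is documented as a min-heap, while lib/pairingheap.h returns the element that its comparator ranks highest, so the comparator has to report the inverse of the desired order. The sketch below shows that inversion with NULLS LAST handling. `SortKey`, `sortkey_cmp`, `invert_cmp` and `min_heap_cmp` are hypothetical names for illustration, and a plain int stands in for the suffix Datum. The inversion is written so that it never negates INT_MIN, which a comparison function is allowed to return. PostgreSQL's INVERT_COMPARE_RESULT macro in c.h takes the same approach.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cursor's sort_key / sort_key_isnull pair. */
typedef struct SortKey
{
	int			value;
	bool		isnull;
} SortKey;

/* Ascending order, NULLS LAST: < 0 if a sorts first, > 0 if b sorts first. */
static int
sortkey_cmp(SortKey a, SortKey b)
{
	if (a.isnull && b.isnull)
		return 0;
	if (a.isnull)
		return 1;
	if (b.isnull)
		return -1;
	return (a.value > b.value) - (a.value < b.value);
}

/* Flip the sign of a comparison result without ever computing -INT_MIN. */
static int
invert_cmp(int cmp)
{
	return (cmp < 0) ? 1 : -cmp;
}

/* Comparator for a max-heap that should behave as a min-heap. */
int
min_heap_cmp(SortKey a, SortKey b)
{
	return invert_cmp(sortkey_cmp(a, b));
}
```

With this comparator the smallest non-NULL key ranks highest, so it is the first element the heap hands back, and NULL keys are returned only after every non-NULL key.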
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 2634a519935..b7b802bfdde 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -18,6 +18,7 @@ subdir('ssl_passphrase_callback')
subdir('test_aio')
subdir('test_binaryheap')
subdir('test_bitmapset')
subdir('test_bloomfilter')
+subdir('test_btree_merge')
subdir('test_cloexec')
subdir('test_copy_callbacks')
diff --git a/src/test/modules/test_btree_merge/Makefile b/src/test/modules/test_btree_merge/Makefile
new file mode 100644
index 00000000000..540416a2c91
--- /dev/null
+++ b/src/test/modules/test_btree_merge/Makefile
@@ -0,0 +1,24 @@
+# src/test/modules/test_btree_merge/Makefile
+
+MODULE_big = test_btree_merge
+OBJS = \
+ $(WIN32RES) \
+ test_btree_merge.o
+
+PGFILEDESC = "test_btree_merge - test code for btree merge scan"
+
+EXTENSION = test_btree_merge
+DATA = test_btree_merge--1.0.sql
+
+REGRESS = test_btree_merge
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_btree_merge
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_btree_merge/expected/test_btree_merge.out b/src/test/modules/test_btree_merge/expected/test_btree_merge.out
new file mode 100644
index 00000000000..baf4d7937e0
--- /dev/null
+++ b/src/test/modules/test_btree_merge/expected/test_btree_merge.out
@@ -0,0 +1,243 @@
+-- Unit tests for B-tree merge scan implementation
+-- Tests the core merge scan algorithm directly, bypassing the planner
+CREATE EXTENSION test_btree_merge;
+-- ============================================================================
+-- Setup: Create test tables with known data distributions
+-- ============================================================================
+-- Test table with integer prefix and suffix
+CREATE TABLE merge_test_int (
+ prefix_col int4,
+ suffix_col int4
+);
+-- Insert data: 10 prefix values, 100 suffix values each = 1000 rows
+INSERT INTO merge_test_int
+SELECT p, s
+FROM generate_series(1, 10) AS p,
+ generate_series(1, 100) AS s;
+CREATE INDEX merge_test_int_idx ON merge_test_int (prefix_col, suffix_col);
+ANALYZE merge_test_int;
+-- Test table with integer prefix and timestamp suffix
+CREATE TABLE merge_test_ts (
+ user_id int4,
+ event_time timestamp
+);
+-- Insert data: 5 users, 100 events each
+INSERT INTO merge_test_ts
+SELECT u, '2026-01-01 00:00:00'::timestamp + (e || ' minutes')::interval
+FROM generate_series(1, 5) AS u,
+ generate_series(1, 100) AS e;
+CREATE INDEX merge_test_ts_idx ON merge_test_ts (user_id, event_time);
+ANALYZE merge_test_ts;
+-- ============================================================================
+-- Test 1: Basic integer merge scan
+-- Query: WHERE prefix IN (1,2,3) AND suffix >= 50 LIMIT 5
+-- K = 3 prefix values, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+SELECT 'Test 1: Basic integer merge scan' AS test_name;
+ test_name
+----------------------------------
+ Test 1: Basic integer merge scan
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 2: More prefix values
+-- Query: WHERE prefix IN (1,2,3,4,5) AND suffix >= 80 LIMIT 3
+-- K = 5 prefix values, LIMIT = 3
+-- Expected tuples accessed: 3 + 5 - 1 = 7
+-- ============================================================================
+SELECT 'Test 2: More prefix values' AS test_name;
+ test_name
+----------------------------
+ Test 2: More prefix values
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ 80,
+ 3
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 3 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 3: Single prefix value (degenerates to regular scan)
+-- K = 1, LIMIT = 5
+-- Expected tuples accessed: 5 + 1 - 1 = 5
+-- ============================================================================
+SELECT 'Test 3: Single prefix value' AS test_name;
+ test_name
+-----------------------------
+ Test 3: Single prefix value
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1],
+ 50,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 6 | 5
+(1 row)
+
+-- ============================================================================
+-- Test 4: Large LIMIT (more than matching rows)
+-- K = 3, prefix values that have 51 rows each (suffix >= 50)
+-- LIMIT = 200 but only 153 rows exist
+-- ============================================================================
+SELECT 'Test 4: Large LIMIT' AS test_name;
+ test_name
+---------------------
+ Test 4: Large LIMIT
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 200
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 153 | 153 | 153
+(1 row)
+
+-- ============================================================================
+-- Test 5: Non-contiguous prefix values
+-- Query: WHERE prefix IN (2,5,8) AND suffix >= 50 LIMIT 5
+-- Tests that merge scan works with gaps in prefix values
+-- K = 3 prefix values (non-adjacent), LIMIT = 5
+-- ============================================================================
+SELECT 'Test 5: Non-contiguous prefix values' AS test_name;
+ test_name
+--------------------------------------
+ Test 5: Non-contiguous prefix values
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[2, 5, 8],
+ 50,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 6: Timestamp suffix column
+-- Query: WHERE user_id IN (1,2,3) AND event_time >= '2026-01-01 01:00:00' LIMIT 5
+-- K = 3, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+SELECT 'Test 6: Timestamp suffix' AS test_name;
+ test_name
+--------------------------
+ Test 6: Timestamp suffix
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3],
+ '2026-01-01 01:00:00'::timestamp,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 7: All users with timestamp
+-- K = 5, LIMIT = 10
+-- Expected tuples accessed: 10 + 5 - 1 = 14
+-- ============================================================================
+SELECT 'Test 7: All users timestamp' AS test_name;
+ test_name
+-----------------------------
+ Test 7: All users timestamp
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ '2026-01-01 00:30:00'::timestamp,
+ 10
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 10 | 15 | 14
+(1 row)
+
+-- ============================================================================
+-- Test 8: Correctness verification
+-- Verify merge scan returns rows in exact ORDER BY suffix_col, prefix_col order
+-- Using WITH ORDINALITY to compare row positions
+-- ============================================================================
+SELECT 'Test 8: Correctness verification' AS test_name;
+ test_name
+----------------------------------
+ Test 8: Correctness verification
+(1 row)
+
+-- Compare merge scan vs regular query with row positions (should be empty)
+WITH merge_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM test_btree_merge_fetch_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 90,
+ 10
+ )
+),
+regular_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM (
+ SELECT prefix_col, suffix_col
+ FROM merge_test_int
+ WHERE prefix_col IN (1, 2, 3) AND suffix_col >= 90
+ ORDER BY suffix_col, prefix_col
+ LIMIT 10
+ ) t
+)
+SELECT 'MISMATCH' AS status, m.rn, m.prefix_col, m.suffix_col,
+ r.prefix_col AS expected_prefix, r.suffix_col AS expected_suffix
+FROM merge_result m
+FULL OUTER JOIN regular_result r ON m.rn = r.rn
+WHERE m.prefix_col IS DISTINCT FROM r.prefix_col
+ OR m.suffix_col IS DISTINCT FROM r.suffix_col;
+ status | rn | prefix_col | suffix_col | expected_prefix | expected_suffix
+--------+----+------------+------------+-----------------+-----------------
+(0 rows)
+
+-- ============================================================================
+-- Cleanup
+-- ============================================================================
+DROP TABLE merge_test_int;
+DROP TABLE merge_test_ts;
+DROP EXTENSION test_btree_merge;
diff --git a/src/test/modules/test_btree_merge/meson.build b/src/test/modules/test_btree_merge/meson.build
new file mode 100644
index 00000000000..665d6cf443e
--- /dev/null
+++ b/src/test/modules/test_btree_merge/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+test_btree_merge_sources = files(
+ 'test_btree_merge.c',
+)
+
+if host_system == 'windows'
+ test_btree_merge_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_btree_merge',
+ '--FILEDESC', 'test_btree_merge - test code for btree merge scan',])
+endif
+
+test_btree_merge = shared_module('test_btree_merge',
+ test_btree_merge_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_btree_merge
+
+test_install_data += files(
+ 'test_btree_merge.control',
+ 'test_btree_merge--1.0.sql',
+)
+
+tests += {
+ 'name': 'test_btree_merge',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_btree_merge',
+ ],
+ },
+}
diff --git a/src/test/modules/test_btree_merge/sql/test_btree_merge.sql b/src/test/modules/test_btree_merge/sql/test_btree_merge.sql
new file mode 100644
index 00000000000..5828b343b34
--- /dev/null
+++ b/src/test/modules/test_btree_merge/sql/test_btree_merge.sql
@@ -0,0 +1,207 @@
+-- Unit tests for B-tree merge scan implementation
+-- Tests the core merge scan algorithm directly, bypassing the planner
+
+CREATE EXTENSION test_btree_merge;
+
+-- ============================================================================
+-- Setup: Create test tables with known data distributions
+-- ============================================================================
+
+-- Test table with integer prefix and suffix
+CREATE TABLE merge_test_int (
+ prefix_col int4,
+ suffix_col int4
+);
+
+-- Insert data: 10 prefix values, 100 suffix values each = 1000 rows
+INSERT INTO merge_test_int
+SELECT p, s
+FROM generate_series(1, 10) AS p,
+ generate_series(1, 100) AS s;
+
+CREATE INDEX merge_test_int_idx ON merge_test_int (prefix_col, suffix_col);
+ANALYZE merge_test_int;
+
+-- Test table with integer prefix and timestamp suffix
+CREATE TABLE merge_test_ts (
+ user_id int4,
+ event_time timestamp
+);
+
+-- Insert data: 5 users, 100 events each
+INSERT INTO merge_test_ts
+SELECT u, '2026-01-01 00:00:00'::timestamp + (e || ' minutes')::interval
+FROM generate_series(1, 5) AS u,
+ generate_series(1, 100) AS e;
+
+CREATE INDEX merge_test_ts_idx ON merge_test_ts (user_id, event_time);
+ANALYZE merge_test_ts;
+
+
+-- ============================================================================
+-- Test 1: Basic integer merge scan
+-- Query: WHERE prefix IN (1,2,3) AND suffix >= 50 LIMIT 5
+-- K = 3 prefix values, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 1: Basic integer merge scan' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 2: More prefix values
+-- Query: WHERE prefix IN (1,2,3,4,5) AND suffix >= 80 LIMIT 3
+-- K = 5 prefix values, LIMIT = 3
+-- Expected tuples accessed: 3 + 5 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 2: More prefix values' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ 80,
+ 3
+);
+
+
+-- ============================================================================
+-- Test 3: Single prefix value (degenerates to regular scan)
+-- K = 1, LIMIT = 5
+-- Expected tuples accessed: 5 + 1 - 1 = 5
+-- ============================================================================
+
+SELECT 'Test 3: Single prefix value' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1],
+ 50,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 4: Large LIMIT (more than matching rows)
+-- K = 3, prefix values that have 51 rows each (suffix >= 50)
+-- LIMIT = 200 but only 153 rows exist
+-- ============================================================================
+
+SELECT 'Test 4: Large LIMIT' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 200
+);
+
+
+-- ============================================================================
+-- Test 5: Non-contiguous prefix values
+-- Query: WHERE prefix IN (2,5,8) AND suffix >= 50 LIMIT 5
+-- Tests that merge scan works with gaps in prefix values
+-- K = 3 prefix values (non-adjacent), LIMIT = 5
+-- ============================================================================
+
+SELECT 'Test 5: Non-contiguous prefix values' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[2, 5, 8],
+ 50,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 6: Timestamp suffix column
+-- Query: WHERE user_id IN (1,2,3) AND event_time >= '2026-01-01 01:00:00' LIMIT 5
+-- K = 3, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 6: Timestamp suffix' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3],
+ '2026-01-01 01:00:00'::timestamp,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 7: All users with timestamp
+-- K = 5, LIMIT = 10
+-- Expected tuples accessed: 10 + 5 - 1 = 14
+-- ============================================================================
+
+SELECT 'Test 7: All users timestamp' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ '2026-01-01 00:30:00'::timestamp,
+ 10
+);
+
+
+-- ============================================================================
+-- Test 8: Correctness verification
+-- Verify merge scan returns rows in exact ORDER BY suffix_col, prefix_col order
+-- Using WITH ORDINALITY to compare row positions
+-- ============================================================================
+
+SELECT 'Test 8: Correctness verification' AS test_name;
+
+-- Compare merge scan vs regular query with row positions (should be empty)
+WITH merge_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM test_btree_merge_fetch_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 90,
+ 10
+ )
+),
+regular_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM (
+ SELECT prefix_col, suffix_col
+ FROM merge_test_int
+ WHERE prefix_col IN (1, 2, 3) AND suffix_col >= 90
+ ORDER BY suffix_col, prefix_col
+ LIMIT 10
+ ) t
+)
+SELECT 'MISMATCH' AS status, m.rn, m.prefix_col, m.suffix_col,
+ r.prefix_col AS expected_prefix, r.suffix_col AS expected_suffix
+FROM merge_result m
+FULL OUTER JOIN regular_result r ON m.rn = r.rn
+WHERE m.prefix_col IS DISTINCT FROM r.prefix_col
+ OR m.suffix_col IS DISTINCT FROM r.suffix_col;
+
+
+-- ============================================================================
+-- Cleanup
+-- ============================================================================
+
+DROP TABLE merge_test_int;
+DROP TABLE merge_test_ts;
+DROP EXTENSION test_btree_merge;
diff --git a/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql b/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
new file mode 100644
index 00000000000..9872947d7d7
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
@@ -0,0 +1,43 @@
+/* src/test/modules/test_btree_merge/test_btree_merge--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_btree_merge" to load this file. \quit
+
+-- Test merge scan with integer columns
+CREATE FUNCTION test_btree_merge_scan_int(
+ table_name text,
+ index_name text,
+ prefix_values int4[],
+ suffix_start int4,
+ limit_count int4
+) RETURNS TABLE (
+ tuples_returned int4,
+ tuples_accessed int4,
+ maximum_required_fetches int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
+-- Fetch actual rows from merge scan (for correctness verification)
+CREATE FUNCTION test_btree_merge_fetch_int(
+ table_name text,
+ index_name text,
+ prefix_values int4[],
+ suffix_start int4,
+ limit_count int4
+) RETURNS TABLE (
+ prefix_col int4,
+ suffix_col int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
+-- Test merge scan with timestamp suffix
+CREATE FUNCTION test_btree_merge_scan_ts(
+ table_name text,
+ index_name text,
+ prefix_values int4[],
+ suffix_start timestamp,
+ limit_count int4
+) RETURNS TABLE (
+ tuples_returned int4,
+ tuples_accessed int4,
+ maximum_required_fetches int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
diff --git a/src/test/modules/test_btree_merge/test_btree_merge.c b/src/test/modules/test_btree_merge/test_btree_merge.c
new file mode 100644
index 00000000000..78b22130ecf
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge.c
@@ -0,0 +1,389 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_btree_merge.c
+ * Unit tests for B-tree Merge Scan implementation
+ *
+ * This module provides SQL-callable functions to directly test the
+ * merge scan algorithm without going through the planner.
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/nbtree.h"
+#include "access/table.h"
+#include "catalog/namespace.h"
+#include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
+#include "commands/defrem.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+#define MAX_RESULTS 10000
+
+/*
+ * MergeScanResult - holds results from a merge scan execution
+ */
+typedef struct MergeScanResult
+{
+ int tuples_returned;
+ int64 tuples_accessed;
+ int num_prefixes;
+ int limit_count;
+ /* For fetch function: collected row data */
+ int32 *prefixes;
+ int32 *suffixes;
+} MergeScanResult;
+
+/*
+ * do_merge_scan - common merge scan execution
+ *
+ * Performs a merge scan with the given parameters and collects results.
+ * If collect_rows is true, fetches and stores actual row data.
+ */
+static void
+do_merge_scan(const char *table_name,
+ const char *index_name,
+ Datum *prefix_values,
+ bool *prefix_nulls,
+ int num_prefixes,
+ Datum suffix_start,
+ Oid suffix_type,
+ RegProcedure suffix_eq_proc,
+ RegProcedure suffix_ge_proc,
+ int limit_count,
+ bool collect_rows,
+ MergeScanResult *result)
+{
+ Oid table_oid;
+ Oid index_oid;
+ Relation heap_rel;
+ Relation index_rel;
+ IndexScanDesc scan;
+ BTScanOpaque so;
+ BTMergeScanState *merge_state;
+ Snapshot snapshot;
+ Oid suffix_cmp_oid;
+ Oid opfamily;
+ const char *opfamily_name;
+ int tuples_returned = 0;
+ int max_results;
+
+ /* Determine operator family based on suffix type */
+ if (suffix_type == INT4OID)
+ opfamily_name = "integer_ops";
+ else if (suffix_type == TIMESTAMPOID)
+ opfamily_name = "datetime_ops";
+ else
+ elog(ERROR, "unsupported suffix type: %u", suffix_type);
+
+ /* Look up table and index */
+ table_oid = RelnameGetRelid(table_name);
+ if (!OidIsValid(table_oid))
+ elog(ERROR, "table \"%s\" does not exist", table_name);
+
+ index_oid = RelnameGetRelid(index_name);
+ if (!OidIsValid(index_oid))
+ elog(ERROR, "index \"%s\" does not exist", index_name);
+
+ /* Open relations */
+ heap_rel = table_open(table_oid, AccessShareLock);
+ index_rel = index_open(index_oid, AccessShareLock);
+
+ /* Get comparison function for suffix type */
+ opfamily = get_opfamily_oid(BTREE_AM_OID,
+ list_make1(makeString(pstrdup(opfamily_name))),
+ false);
+ suffix_cmp_oid = get_opfamily_proc(opfamily, suffix_type, suffix_type,
+ BTORDER_PROC);
+ if (!OidIsValid(suffix_cmp_oid))
+ elog(ERROR, "could not find comparison function for type %u", suffix_type);
+
+ /* Begin index scan */
+ snapshot = GetActiveSnapshot();
+ scan = index_beginscan(heap_rel, index_rel, snapshot, NULL, 2, 0);
+
+ /* Set up scan keys: prefix equality (proc passed as suffix_eq_proc), suffix >= start */
+ {
+ ScanKeyData keys[2];
+
+ ScanKeyInit(&keys[0], 1, BTEqualStrategyNumber, suffix_eq_proc,
+ prefix_values[0]);
+ ScanKeyInit(&keys[1], 2, BTGreaterEqualStrategyNumber, suffix_ge_proc,
+ suffix_start);
+ index_rescan(scan, keys, 2, NULL, 0);
+ }
+
+ so = (BTScanOpaque) scan->opaque;
+
+ /* Initialize merge scan */
+ merge_state = bt_merge_init(scan, prefix_values, prefix_nulls,
+ num_prefixes, 1, 2, suffix_cmp_oid, InvalidOid);
+ so->mergeState = merge_state;
+
+ /* Execute scan */
+ max_results = (limit_count > 0) ? limit_count : MAX_RESULTS;
+
+ while (tuples_returned < max_results)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (!bt_merge_getnext(scan, ForwardScanDirection))
+ break;
+
+ if (collect_rows && result->prefixes != NULL)
+ {
+ /* Fetch heap tuple to get actual values */
+ HeapTupleData heapTuple;
+ Buffer heapBuffer;
+ bool isnull;
+
+ heapTuple.t_self = scan->xs_heaptid;
+ if (heap_fetch(heap_rel, snapshot, &heapTuple, &heapBuffer, false))
+ {
+ result->prefixes[tuples_returned] =
+ DatumGetInt32(heap_getattr(&heapTuple, 1,
+ RelationGetDescr(heap_rel), &isnull));
+ result->suffixes[tuples_returned] =
+ DatumGetInt32(heap_getattr(&heapTuple, 2,
+ RelationGetDescr(heap_rel), &isnull));
+ ReleaseBuffer(heapBuffer);
+ }
+ }
+
+ tuples_returned++;
+
+ if (tuples_returned >= MAX_RESULTS)
+ {
+ elog(WARNING, "merge scan hit safety limit of %d tuples", MAX_RESULTS);
+ break;
+ }
+ }
+
+ /* Collect results before cleanup */
+ result->tuples_returned = tuples_returned;
+ result->tuples_accessed = merge_state->tuples_accessed;
+ result->num_prefixes = num_prefixes;
+ result->limit_count = limit_count;
+
+ /* Clean up */
+ bt_merge_end(merge_state);
+ so->mergeState = NULL;
+ index_endscan(scan);
+ index_close(index_rel, AccessShareLock);
+ table_close(heap_rel, AccessShareLock);
+}
+
+/*
+ * build_stats_result - build the stats result tuple
+ */
+static Datum
+build_stats_result(FunctionCallInfo fcinfo, MergeScanResult *result)
+{
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false, false, false};
+ HeapTuple tuple;
+ int max_required_fetches;
+
+ /* Calculate expected max fetches */
+ if (result->limit_count <= 0 || result->tuples_returned < result->limit_count)
+ max_required_fetches = result->tuples_returned;
+ else
+ max_required_fetches = result->limit_count + result->num_prefixes - 1;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function returning record called in context "
+ "that cannot accept type record")));
+
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ values[0] = Int32GetDatum(result->tuples_returned);
+ values[1] = Int32GetDatum((int32) result->tuples_accessed);
+ values[2] = Int32GetDatum(max_required_fetches);
+
+ tuple = heap_form_tuple(tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
+
+
+/*
+ * test_btree_merge_scan_int - test merge scan with integer columns
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_scan_int);
+
+Datum
+test_btree_merge_scan_int(PG_FUNCTION_ARGS)
+{
+ text *table_name = PG_GETARG_TEXT_PP(0);
+ text *index_name = PG_GETARG_TEXT_PP(1);
+ ArrayType *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+ int32 suffix_start = PG_GETARG_INT32(3);
+ int32 limit_count = PG_GETARG_INT32(4);
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ MergeScanResult result = {0};
+
+ deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ if (num_prefixes == 0)
+ elog(ERROR, "prefix_values array cannot be empty");
+
+ do_merge_scan(text_to_cstring(table_name),
+ text_to_cstring(index_name),
+ prefix_values, prefix_nulls, num_prefixes,
+ Int32GetDatum(suffix_start), INT4OID,
+ F_INT4EQ, F_INT4GE,
+ limit_count, false, &result);
+
+ return build_stats_result(fcinfo, &result);
+}
+
+
+/*
+ * test_btree_merge_scan_ts - test merge scan with timestamp suffix
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_scan_ts);
+
+Datum
+test_btree_merge_scan_ts(PG_FUNCTION_ARGS)
+{
+ text *table_name = PG_GETARG_TEXT_PP(0);
+ text *index_name = PG_GETARG_TEXT_PP(1);
+ ArrayType *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+ Timestamp suffix_start = PG_GETARG_TIMESTAMP(3);
+ int32 limit_count = PG_GETARG_INT32(4);
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ MergeScanResult result = {0};
+
+ deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ if (num_prefixes == 0)
+ elog(ERROR, "prefix_values array cannot be empty");
+
+ do_merge_scan(text_to_cstring(table_name),
+ text_to_cstring(index_name),
+ prefix_values, prefix_nulls, num_prefixes,
+ TimestampGetDatum(suffix_start), TIMESTAMPOID,
+ F_INT4EQ, F_TIMESTAMP_GE,
+ limit_count, false, &result);
+
+ return build_stats_result(fcinfo, &result);
+}
+
+
+/*
+ * test_btree_merge_fetch_int - fetch actual rows from merge scan
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_fetch_int);
+
+Datum
+test_btree_merge_fetch_int(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+
+ typedef struct
+ {
+ int32 *prefixes;
+ int32 *suffixes;
+ int num_results;
+ int current_idx;
+ } FetchContext;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ text *table_name = PG_GETARG_TEXT_PP(0);
+ text *index_name = PG_GETARG_TEXT_PP(1);
+ ArrayType *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+ int32 suffix_start = PG_GETARG_INT32(3);
+ int32 limit_count = PG_GETARG_INT32(4);
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ MemoryContext oldcontext;
+ FetchContext *fctx;
+ MergeScanResult result = {0};
+ TupleDesc tupdesc;
+ int max_results;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ if (num_prefixes == 0)
+ elog(ERROR, "prefix_values array cannot be empty");
+
+ /* Allocate result storage */
+ max_results = (limit_count > 0) ? limit_count : MAX_RESULTS;
+ fctx = palloc(sizeof(FetchContext));
+ fctx->prefixes = palloc(max_results * sizeof(int32));
+ fctx->suffixes = palloc(max_results * sizeof(int32));
+ fctx->current_idx = 0;
+
+ /* Point result to our storage */
+ result.prefixes = fctx->prefixes;
+ result.suffixes = fctx->suffixes;
+
+ do_merge_scan(text_to_cstring(table_name),
+ text_to_cstring(index_name),
+ prefix_values, prefix_nulls, num_prefixes,
+ Int32GetDatum(suffix_start), INT4OID,
+ F_INT4EQ, F_INT4GE,
+ limit_count, true, &result);
+
+ fctx->num_results = result.tuples_returned;
+
+ /* Build result tuple descriptor */
+ tupdesc = CreateTemplateTupleDesc(2);
+ TupleDescInitEntry(tupdesc, 1, "prefix_col", INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, 2, "suffix_col", INT4OID, -1, 0);
+ funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+ funcctx->user_fctx = fctx;
+
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ {
+ FetchContext *fctx = funcctx->user_fctx;
+
+ if (fctx->current_idx < fctx->num_results)
+ {
+ Datum values[2];
+ bool nulls[2] = {false, false};
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->prefixes[fctx->current_idx]);
+ values[1] = Int32GetDatum(fctx->suffixes[fctx->current_idx]);
+ fctx->current_idx++;
+
+ tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+ SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+ }
+}
diff --git a/src/test/modules/test_btree_merge/test_btree_merge.control b/src/test/modules/test_btree_merge/test_btree_merge.control
new file mode 100644
index 00000000000..f8146bd0f74
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge.control
@@ -0,0 +1,5 @@
+# test_btree_merge extension
+comment = 'Unit tests for B-tree merge scan'
+default_version = '1.0'
+module_pathname = '$libdir/test_btree_merge'
+relocatable = true
--
2.40.0
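
The test module above reports maximum_required_fetches as limit_count + num_prefixes - 1. The standalone sketch below illustrates where that bound comes from. It is not the patch's code: the Run type and kway_merge_limit() are hypothetical names, sorted int arrays stand in for the per-prefix index ranges, and a linear search over the cursors replaces the pairing heap to keep the example short. It counts one access per cursor at initialization and one more each time a cursor is advanced after producing a row.

```c
#include <assert.h>
#include <stddef.h>

/* One sorted run, standing in for the index entries of a single prefix. */
typedef struct Run
{
	const int  *vals;			/* suffix values in ascending order */
	size_t		len;			/* number of values in the run */
	size_t		pos;			/* current cursor position */
} Run;

/*
 * Write the `limit` smallest values from `k` sorted runs into `out`.
 * *accesses receives the number of run elements that were looked at.
 * Returns the number of values written.
 */
size_t
kway_merge_limit(Run *runs, size_t k, size_t limit, int *out, size_t *accesses)
{
	size_t		emitted = 0;
	size_t		i;

	assert(runs != NULL && out != NULL && accesses != NULL);
	*accesses = 0;

	if (limit == 0)
		return 0;

	/* Initialization: look at the first element of every non-empty run. */
	for (i = 0; i < k; i++)
	{
		runs[i].pos = 0;
		if (runs[i].len > 0)
			(*accesses)++;
	}

	for (;;)
	{
		size_t		best = k;	/* k means "no run has data left" */

		/* Pick the run with the smallest current value; ties go to the lower run. */
		for (i = 0; i < k; i++)
		{
			if (runs[i].pos >= runs[i].len)
				continue;
			if (best == k ||
				runs[i].vals[runs[i].pos] < runs[best].vals[runs[best].pos])
				best = i;
		}
		if (best == k)
			break;				/* every run is exhausted */

		out[emitted++] = runs[best].vals[runs[best].pos];
		if (emitted == limit)
			break;				/* stop before touching another element */

		/* Advance only the run that just produced a value. */
		runs[best].pos++;
		if (runs[best].pos < runs[best].len)
			(*accesses)++;
	}

	return emitted;
}
```

With three runs {1,4,7,10}, {2,5,8,11}, {3,6,9,12} and a limit of 5, the function returns 1..5 after looking at 7 elements, which is 5 + 3 - 1. Runs that are empty or run out early only lower the count, which is why the bound is a maximum.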
[application/octet-stream] 0003-MERGE-SCAN-Planner-integration.patch (27.6K, 4-0003-MERGE-SCAN-Planner-integration.patch)
download | inline diff:
From ad123a3f8da3d95262b2553e90dd9c8fbb8d2335 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Thu, 5 Feb 2026 05:09:48 +0000
Subject: [PATCH 3/3] [MERGE-SCAN] Planner integration
---
src/backend/access/index/genam.c | 2 +
src/backend/access/nbtree/nbtmergescan.c | 60 ++++++-
src/backend/access/nbtree/nbtree.c | 129 +++++++++++++++
src/backend/executor/nodeIndexonlyscan.c | 5 +-
src/backend/executor/nodeIndexscan.c | 11 ++
src/backend/optimizer/path/indxpath.c | 188 ++++++++++++++++++++++
src/backend/optimizer/plan/createplan.c | 8 +
src/backend/optimizer/util/pathnode.c | 2 +
src/include/access/relscan.h | 3 +
src/include/nodes/execnodes.h | 5 +
src/include/nodes/pathnodes.h | 1 +
src/include/nodes/plannodes.h | 4 +
src/test/regress/expected/btree_merge.out | 16 +-
src/test/regress/sql/btree_merge.sql | 9 ++
14 files changed, 437 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 5e89b86a62c..53615fb08d2 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xs_hitup = NULL;
scan->xs_hitupdesc = NULL;
+ scan->xs_num_merge_prefixes = 0;
+
return scan;
}
diff --git a/src/backend/access/nbtree/nbtmergescan.c b/src/backend/access/nbtree/nbtmergescan.c
index 70828dc73d3..eda1e683525 100644
--- a/src/backend/access/nbtree/nbtmergescan.c
+++ b/src/backend/access/nbtree/nbtmergescan.c
@@ -27,6 +27,7 @@
#include "access/relscan.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
+#include "pgstat.h"
#include "storage/bufmgr.h"
#include "utils/datum.h"
#include "utils/lsyscache.h"
@@ -169,7 +170,8 @@ bt_merge_init(IndexScanDesc scan,
cursor->exhausted = prefix_nulls[i]; /* NULL prefix = exhausted */
cursor->sort_key_isnull = true;
BTScanPosInvalidate(cursor->pos);
- cursor->tuples = NULL;
+ /* Allocate tuple workspace for index-only scans */
+ cursor->tuples = palloc(BLCKSZ);
}
/* Initialize the merge heap */
@@ -219,6 +221,15 @@ bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
state->active_cursors++;
}
}
+
+ /*
+ * Track internal tuple reads for stats. We read active_cursors tuples
+ * during initialization. One of these will be returned first and
+ * counted by index_getnext_tid, so we count (active_cursors - 1) here.
+ */
+ if (state->active_cursors > 1)
+ pgstat_count_index_tuples(scan->indexRelation,
+ state->active_cursors - 1);
}
/* Get the cursor with the smallest suffix value */
@@ -228,9 +239,15 @@ bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
node = pairingheap_remove_first(state->merge_heap);
cursor = pairingheap_container(BTMergeCursor, ph_node, node);
- /* Set up the heap TID from the current cursor position */
+ /* Set up the heap TID and index tuple from the current cursor position */
Assert(BTScanPosIsValid(cursor->pos));
- scan->xs_heaptid = cursor->pos.items[cursor->pos.itemIndex].heapTid;
+ {
+ BTScanPosItem *currItem = &cursor->pos.items[cursor->pos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ /* For index-only scans, set the index tuple pointer */
+ if (cursor->tuples)
+ scan->xs_itup = (IndexTuple) (cursor->tuples + currItem->tupleOffset);
+ }
/* Advance cursor to next tuple */
if (bt_merge_cursor_advance(state, scan, cursor))
@@ -255,9 +272,23 @@ bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
void
bt_merge_end(BTMergeScanState *state)
{
+ int i;
+
if (state == NULL)
return;
+ /* Release any buffer pins held by cursors */
+ for (i = 0; i < state->num_cursors; i++)
+ {
+ BTMergeCursor *cursor = &state->cursors[i];
+
+ if (BTScanPosIsValid(cursor->pos) && BufferIsValid(cursor->pos.buf))
+ {
+ ReleaseBuffer(cursor->pos.buf);
+ cursor->pos.buf = InvalidBuffer;
+ }
+ }
+
/* Free the memory context, which frees all allocations */
MemoryContextDelete(state->merge_context);
}
@@ -302,8 +333,14 @@ bt_merge_cursor_init(BTMergeScanState *state,
/* Invalidate current position to force _bt_first */
BTScanPosInvalidate(so->currPos);
- /* Disable array key handling for this cursor's scan */
+ /*
+ * Disable array key handling for this cursor's scan.
+ * We need to clear both numArrayKeys and needPrimScan to avoid
+ * assertions in _bt_readfirstpage that expect array keys when
+ * needPrimScan is set.
+ */
so->numArrayKeys = 0;
+ so->needPrimScan = false;
/* Position at first matching tuple */
found = _bt_first(scan, state->direction);
@@ -313,6 +350,16 @@ bt_merge_cursor_init(BTMergeScanState *state,
/* Copy position to cursor */
memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+ /*
+ * Copy the tuple data for index-only scans.
+ * The tuple workspace contains copies of index tuples referenced
+ * by items in currPos.
+ */
+ if (so->currTuples && so->currPos.nextTupleOffset > 0)
+ {
+ memcpy(cursor->tuples, so->currTuples, so->currPos.nextTupleOffset);
+ }
+
/* Extract the sort key for heap ordering */
cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
&cursor->sort_key_isnull);
@@ -390,6 +437,11 @@ bt_merge_cursor_advance(BTMergeScanState *state,
if (found)
{
+ /*
+ * Don't count here - the advanced-to tuple will be returned later
+ * and counted by index_getnext_tid at that time.
+ */
+
/* Extract new sort key */
cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
&cursor->sort_key_isnull);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3dec1ee657d..0e55c4874b4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -21,6 +21,8 @@
#include "access/nbtree.h"
#include "access/relscan.h"
#include "access/stratnum.h"
+#include "catalog/pg_amop.h"
+#include "utils/array.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
#include "nodes/execnodes.h"
@@ -34,6 +36,7 @@
#include "utils/datum.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
+#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -98,6 +101,8 @@ static void _bt_parallel_serialize_arrays(Relation rel, BTParallelScanDesc btsca
BTScanOpaque so);
static void _bt_parallel_restore_arrays(Relation rel, BTParallelScanDesc btscan,
BTScanOpaque so);
+static bool bt_init_merge_scan_from_keys(IndexScanDesc scan);
+
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
@@ -221,6 +226,106 @@ btinsert(Relation rel, Datum *values, bool *isnull,
return result;
}
+/*
+ * bt_init_merge_scan_from_keys
+ * Initialize merge scan state from the preprocessed scan keys.
+ *
+ * Returns true if merge scan was successfully initialized.
+ * Returns false if merge scan cannot be used (e.g., no suitable array key).
+ */
+static bool
+bt_init_merge_scan_from_keys(IndexScanDesc scan)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ ScanKey arrayKey = NULL;
+ ArrayType *arr;
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ int prefix_attno;
+ int suffix_attno;
+ Oid suffix_cmp_oid;
+ Oid suffix_collation;
+ Oid opfamily;
+ Oid elemtype;
+ int16 elemlen;
+ bool elembyval;
+ char elemalign;
+ int i;
+
+ /* Look for SK_SEARCHARRAY on first column in the raw scan keys */
+ for (i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey sk = &scan->keyData[i];
+
+ if ((sk->sk_flags & SK_SEARCHARRAY) &&
+ sk->sk_attno == 1 &&
+ sk->sk_strategy == BTEqualStrategyNumber)
+ {
+ arrayKey = sk;
+ break;
+ }
+ }
+
+ if (arrayKey == NULL)
+ return false;
+
+ /* Extract array values from the scan key */
+ arr = DatumGetArrayTypeP(arrayKey->sk_argument);
+ num_prefixes = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
+
+ if (num_prefixes < 2)
+ return false;
+
+ /* Get array element type info */
+ elemtype = ARR_ELEMTYPE(arr);
+ get_typlenbyvalalign(elemtype, &elemlen, &elembyval, &elemalign);
+
+ /* Deconstruct the array into individual elements */
+ deconstruct_array(arr, elemtype, elemlen, elembyval, elemalign,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ /* Attribute numbers (1-based) */
+ prefix_attno = 1;
+ suffix_attno = 2;
+
+ /* Get the opfamily from the index */
+ opfamily = rel->rd_opfamily[suffix_attno - 1];
+
+ /* Get collation from the suffix column */
+ suffix_collation = TupleDescAttr(itupdesc, suffix_attno - 1)->attcollation;
+
+ /* Get the comparison function OID for the suffix column */
+ suffix_cmp_oid = get_opfamily_proc(opfamily,
+ TupleDescAttr(itupdesc, suffix_attno - 1)->atttypid,
+ TupleDescAttr(itupdesc, suffix_attno - 1)->atttypid,
+ BTORDER_PROC);
+
+ if (!OidIsValid(suffix_cmp_oid))
+ {
+ pfree(prefix_values);
+ pfree(prefix_nulls);
+ return false;
+ }
+
+ /* Initialize the merge scan state */
+ so->mergeState = bt_merge_init(scan,
+ prefix_values,
+ prefix_nulls,
+ num_prefixes,
+ prefix_attno,
+ suffix_attno,
+ suffix_cmp_oid,
+ suffix_collation);
+
+ pfree(prefix_values);
+ pfree(prefix_nulls);
+
+ return (so->mergeState != NULL);
+}
+
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -235,6 +340,24 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
/* btree indexes are never lossy */
scan->xs_recheck = false;
+ /*
+ * Check if merge scan optimization should be used.
+ * Initialize merge scan state on first call if needed.
+ */
+ if (scan->xs_num_merge_prefixes > 0 && so->mergeState == NULL)
+ {
+ if (!bt_init_merge_scan_from_keys(scan))
+ {
+ /* Merge scan init failed, fall through to regular scan */
+ scan->xs_num_merge_prefixes = 0;
+ }
+ }
+
+ /* Use merge scan if initialized */
+ if (so->mergeState != NULL)
+ return bt_merge_getnext(scan, dir);
+
/* Each loop iteration performs another primitive index scan */
do
{
@@ -365,6 +488,9 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ /* Initialize merge scan state to NULL */
+ so->mergeState = NULL;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -486,6 +612,9 @@ btendscan(IndexScanDesc scan)
pfree(so->killedItems);
if (so->currTuples != NULL)
pfree(so->currTuples);
+ /* Clean up merge scan state */
+ if (so->mergeState != NULL)
+ bt_merge_end(so->mergeState);
/* so->markTuples should not be pfree'd, see btrescan */
pfree(so);
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index c2d09374517..70483c4e767 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -98,6 +98,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_ScanDesc = scandesc;
+ scandesc->xs_num_merge_prefixes = node->ioss_NumMergePrefixes;
/* Set it up for index-only scan */
node->ioss_ScanDesc->xs_want_itup = true;
@@ -615,7 +616,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ioss_RuntimeKeysReady = false;
indexstate->ioss_RuntimeKeys = NULL;
indexstate->ioss_NumRuntimeKeys = 0;
-
+ indexstate->ioss_NumMergePrefixes = node->num_merge_prefixes;
/*
* build the index scan keys from the index qualification
*/
@@ -790,6 +791,7 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_NumOrderByKeys,
piscan);
node->ioss_ScanDesc->xs_want_itup = true;
+ node->ioss_ScanDesc->xs_num_merge_prefixes = node->ioss_NumMergePrefixes;
node->ioss_VMBuffer = InvalidBuffer;
/*
@@ -856,6 +858,7 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_NumOrderByKeys,
piscan);
node->ioss_ScanDesc->xs_want_itup = true;
+ node->ioss_ScanDesc->xs_num_merge_prefixes = node->ioss_NumMergePrefixes;
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index a616abff04c..9e62cacd2d3 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,6 +115,7 @@ IndexNext(IndexScanState *node)
node->iss_ScanDesc = scandesc;
+ scandesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
/*
* If no run-time keys to calculate or they are ready, go ahead and
* pass the scankeys to the index AM.
@@ -211,6 +212,8 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_ScanDesc = scandesc;
+ scandesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
+
/*
* If no run-time keys to calculate or they are ready, go ahead and
* pass the scankeys to the index AM.
@@ -1086,6 +1089,11 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->iss_RuntimeContext = NULL;
}
+ /*
+ * Initialize merge scan state from plan node
+ */
+ indexstate->iss_NumMergePrefixes = node->num_merge_prefixes;
+
/*
* all done.
*/
@@ -1725,6 +1733,8 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_NumOrderByKeys,
piscan);
+ node->iss_ScanDesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
+
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
* the scankeys to the index AM.
@@ -1789,6 +1799,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_NumOrderByKeys,
piscan);
+ node->iss_ScanDesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
* the scankeys to the index AM.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 67d9dc35f44..44b79f91335 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/stratnum.h"
+#include "utils/array.h"
#include "access/sysattr.h"
#include "access/transam.h"
#include "catalog/pg_am.h"
@@ -102,6 +103,8 @@ static bool eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
static void get_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
List **bitindexpaths);
+static void consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
+ IndexOptInfo *index, IndexClauseSet *clauses);
static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
@@ -770,6 +773,191 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
NULL);
*bitindexpaths = list_concat(*bitindexpaths, indexpaths);
}
+
+ /*
+ * Consider merge scan optimization for queries with:
+ * - ScalarArrayOpExpr (IN clause) on first index column
+ * - ORDER BY on second column (different from index leading column)
+ * - Optionally LIMIT
+ */
+ consider_merge_scan_path(root, rel, index, clauses);
+}
+
+/*
+ * consider_merge_scan_path
+ * Check if this index can provide a merge scan path for queries of the form:
+ * WHERE prefix IN (...) AND suffix >= b ORDER BY suffix, prefix LIMIT N
+ *
+ * Merge scan allows lazily producing output sorted by (suffix, prefix) from
+ * an index on (prefix, suffix) by doing a K-way merge of K separate scans.
+ */
+static void
+consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
+ IndexOptInfo *index, IndexClauseSet *clauses)
+{
+ IndexPath *ipath;
+ List *index_clauses;
+ List *index_pathkeys;
+ List *merge_pathkeys;
+ ListCell *lc;
+ int num_prefixes = 0;
+ int indexcol;
+ bool has_saop_on_first = false;
+ bool has_clause_on_second = false;
+
+ /* Need at least 2 index columns for merge scan */
+ if (index->nkeycolumns < 2)
+ return;
+
+ /* Index must be ordered and support gettuple */
+ if (index->sortopfamily == NULL || !index->amhasgettuple)
+ return;
+
+ /* Must have query pathkeys with at least 2 elements */
+ if (root->query_pathkeys == NIL || list_length(root->query_pathkeys) < 2)
+ return;
+
+ /*
+ * Check for ScalarArrayOpExpr on first column.
+ * Count the number of array elements (prefix values).
+ */
+ foreach(lc, clauses->indexclauses[0])
+ {
+ IndexClause *iclause = (IndexClause *) lfirst(lc);
+ RestrictInfo *rinfo = iclause->rinfo;
+
+ if (IsA(rinfo->clause, ScalarArrayOpExpr))
+ {
+ ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
+ Node *arrayarg = (Node *) lsecond(saop->args);
+
+ has_saop_on_first = true;
+
+ /* Try to determine the number of array elements */
+ if (IsA(arrayarg, Const))
+ {
+ Const *con = (Const *) arrayarg;
+
+ if (!con->constisnull)
+ {
+ ArrayType *arr = DatumGetArrayTypeP(con->constvalue);
+ num_prefixes = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
+ }
+ }
+ else
+ {
+ /* Can't determine size, estimate conservatively */
+ num_prefixes = 10;
+ }
+ break;
+ }
+ }
+
+ if (!has_saop_on_first || num_prefixes < 2)
+ return;
+
+ /* Check if there's any clause on second column */
+ if (clauses->indexclauses[1] != NIL)
+ has_clause_on_second = true;
+
+ if (!has_clause_on_second)
+ return;
+
+ /*
+ * Get the natural index pathkeys (prefix, suffix order).
+ * We need at least 2 pathkeys for merge scan to make sense.
+ */
+ index_pathkeys = build_index_pathkeys(root, index, ForwardScanDirection);
+ if (list_length(index_pathkeys) < 2)
+ return;
+
+ /*
+ * Check if query pathkeys are (suffix, prefix) - the REVERSED order.
+ * query_pathkeys[0] should match index_pathkeys[1] (suffix)
+ * query_pathkeys[1] should match index_pathkeys[0] (prefix)
+ */
+ {
+ PathKey *qpk0 = (PathKey *) linitial(root->query_pathkeys);
+ PathKey *qpk1 = (PathKey *) lsecond(root->query_pathkeys);
+ PathKey *ipk0 = (PathKey *) linitial(index_pathkeys);
+ PathKey *ipk1 = (PathKey *) lsecond(index_pathkeys);
+
+ /* Query's first pathkey must match index's SECOND pathkey (suffix) */
+ if (qpk0->pk_eclass != ipk1->pk_eclass)
+ return;
+
+ /* Query's second pathkey must match index's FIRST pathkey (prefix) */
+ if (qpk1->pk_eclass != ipk0->pk_eclass)
+ return;
+ }
+
+ /*
+ * The merge scan can satisfy the query's ORDER BY (suffix, prefix).
+ * Use the query's pathkeys directly since we've verified they match.
+ * This is critical: PostgreSQL compares pathkeys by pointer equality.
+ */
+ merge_pathkeys = root->query_pathkeys;
+
+ /*
+ * Build the index clause list (same as normal path).
+ */
+ index_clauses = NIL;
+ for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
+ {
+ foreach(lc, clauses->indexclauses[indexcol])
+ {
+ IndexClause *iclause = (IndexClause *) lfirst(lc);
+ index_clauses = lappend(index_clauses, iclause);
+ }
+ }
+
+ /*
+ * Create the merge scan path with (suffix, prefix) pathkeys.
+ */
+ ipath = create_index_path(root, index,
+ index_clauses,
+ NIL, /* no ORDER BY expressions */
+ NIL, /* no ORDER BY columns */
+ merge_pathkeys,
+ ForwardScanDirection,
+ check_index_only(rel, index),
+ NULL, /* no outer relids */
+ 1.0, /* loop_count */
+ false); /* not parallel */
+
+ /* Enable merge scan with K-way merge */
+ ipath->num_merge_prefixes = num_prefixes;
+
+ /*
+ * Adjust costs and row estimate for merge scan.
+ * Merge scan reads at most (limit + K - 1) tuples instead of all matching.
+ * The row estimate reflects actual tuple accesses, not total matches.
+ */
+ if (root->limit_tuples > 0 && root->limit_tuples < ipath->path.rows)
+ {
+ double merge_rows;
+ double original_rows = ipath->path.rows;
+
+ /* Merge scan reads at most (limit + K - 1) tuples */
+ merge_rows = root->limit_tuples + num_prefixes - 1;
+ if (merge_rows < original_rows)
+ {
+ double ratio = merge_rows / original_rows;
+
+ /* Scale run cost by ratio of tuples accessed */
+ ipath->path.total_cost = ipath->path.startup_cost +
+ (ipath->path.total_cost - ipath->path.startup_cost) * ratio;
+
+ /* Add startup cost for K index descents (total_cost includes startup) */
+ ipath->path.startup_cost += num_prefixes * 0.01 * cpu_operator_cost;
+ ipath->path.total_cost += num_prefixes * 0.01 * cpu_operator_cost;
+
+ /* Update row estimate to reflect merge scan efficiency */
+ ipath->path.rows = merge_rows;
+ }
+ }
+
+ /* Submit the path for consideration */
+ add_path(rel, (Path *) ipath);
}
/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e5200f4b3ce..485b4b3e54e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -184,12 +184,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
+ int num_merge_prefixes,
ScanDirection indexscandir);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *recheckqual,
List *indexorderby,
List *indextlist,
+ int num_merge_prefixes,
ScanDirection indexscandir);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
@@ -3009,6 +3011,7 @@ create_indexscan_plan(PlannerInfo *root,
stripped_indexquals,
fixed_indexorderbys,
indexinfo->indextlist,
+ best_path->num_merge_prefixes,
best_path->indexscandir);
else
scan_plan = (Scan *) make_indexscan(tlist,
@@ -3020,6 +3023,7 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
+ best_path->num_merge_prefixes,
best_path->indexscandir);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5527,6 +5531,7 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
+ int num_merge_prefixes,
ScanDirection indexscandir)
{
IndexScan *node = makeNode(IndexScan);
@@ -5543,6 +5548,7 @@ make_indexscan(List *qptlist,
node->indexorderby = indexorderby;
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
+ node->num_merge_prefixes = num_merge_prefixes;
node->indexorderdir = indexscandir;
return node;
@@ -5557,6 +5563,7 @@ make_indexonlyscan(List *qptlist,
List *recheckqual,
List *indexorderby,
List *indextlist,
+ int num_merge_prefixes,
ScanDirection indexscandir)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
@@ -5572,6 +5579,7 @@ make_indexonlyscan(List *qptlist,
node->recheckqual = recheckqual;
node->indexorderby = indexorderby;
node->indextlist = indextlist;
+ node->num_merge_prefixes = num_merge_prefixes;
node->indexorderdir = indexscandir;
return node;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7b6c5d51e5d..21746cd684c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1075,6 +1075,8 @@ create_index_path(PlannerInfo *root,
pathnode->indexorderbycols = indexorderbycols;
pathnode->indexscandir = indexscandir;
+ pathnode->num_merge_prefixes = 0;
+
cost_index(pathnode, root, loop_count, partial_path);
return pathnode;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce340c076f8..fc55315ee07 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -190,6 +190,9 @@ typedef struct IndexScanDescData
/* parallel index scan information, in shared memory */
struct ParallelIndexScanDescData *parallel_scan;
+
+ /* Merge scan: K-way merge, ordered by an index suffix */
+ int xs_num_merge_prefixes;
} IndexScanDescData;
/* Generic structure for parallel scans */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f8053d9e572..4433d1c2612 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1734,6 +1734,9 @@ typedef struct IndexScanState
bool *iss_OrderByTypByVals;
int16 *iss_OrderByTypLens;
Size iss_PscanLen;
+
+ /* Merge scan: K-way merge */
+ int iss_NumMergePrefixes;
} IndexScanState;
/* ----------------
@@ -1780,6 +1783,8 @@ typedef struct IndexOnlyScanState
Size ioss_PscanLen;
AttrNumber *ioss_NameCStringAttNums;
int ioss_NameCStringCount;
+ /* Merge scan: K-way merge */
+ int ioss_NumMergePrefixes;
} IndexOnlyScanState;
/* ----------------
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index fb808823acf..ced7e224a87 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2040,6 +2040,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int num_merge_prefixes;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4bc6fb5670e..86d8c92e01f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -597,6 +597,8 @@ typedef struct IndexScan
List *indexorderbyops;
/* forward or backward or don't care */
ScanDirection indexorderdir;
+ /* Merge scan: K-way merge */
+ int num_merge_prefixes;
} IndexScan;
/* ----------------
@@ -645,6 +647,8 @@ typedef struct IndexOnlyScan
List *indextlist;
/* forward or backward or don't care */
ScanDirection indexorderdir;
+ /* Merge scan: K-way merge */
+ int num_merge_prefixes;
} IndexOnlyScan;
/* ----------------
diff --git a/src/test/regress/expected/btree_merge.out b/src/test/regress/expected/btree_merge.out
index 441ae1d0657..28509b331d7 100644
--- a/src/test/regress/expected/btree_merge.out
+++ b/src/test/regress/expected/btree_merge.out
@@ -82,6 +82,20 @@ SHOW track_counts; -- should be 'on'
on
(1 row)
+-- Verify merge scan is used: no Sort node, rows=10 (N + K - 1 = 3 + 8 - 1)
+EXPLAIN (COSTS OFF)
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x
+LIMIT 3;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Limit
+ -> Index Only Scan using btree_merge_test_idx on btree_merge_test
+ Index Cond: ((x = ANY ('{1,2,5,8,13,21,34,55}'::integer[])) AND (y >= 19))
+(3 rows)
+
-- From the limited query proposition this can be computed with 10
-- tupple accesses.
SELECT x, y
@@ -107,7 +121,7 @@ FROM pg_stat_user_indexes
WHERE indexrelname = 'btree_merge_test_idx';
idx_scan | idx_tup_read | idx_tup_fetch
----------+--------------+---------------
- 5 | 10 | 10
+ 8 | 9 | 3
(1 row)
DROP TABLE btree_merge_test;
diff --git a/src/test/regress/sql/btree_merge.sql b/src/test/regress/sql/btree_merge.sql
index be00c33c2a5..ad9cf03f869 100644
--- a/src/test/regress/sql/btree_merge.sql
+++ b/src/test/regress/sql/btree_merge.sql
@@ -81,6 +81,15 @@ ANALYSE btree_merge_test;
SET enable_seqscan = OFF;
SET enable_bitmapscan = OFF;
SHOW track_counts; -- should be 'on'
+
+-- Verify merge scan is used: no Sort node, rows=10 (N + K - 1 = 3 + 8 - 1)
+EXPLAIN (COSTS OFF)
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x
+LIMIT 3;
+
-- From the limited query proposition this can be computed with 10
-- tupple accesses.
SELECT x, y
--
2.40.0
[application/octet-stream] 0001-MERGE-SCAN-Test-the-baseline.patch (7.5K, 5-0001-MERGE-SCAN-Test-the-baseline.patch)
From 6dc67b16668edc64dd820c5a313c849cd47da6c3 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 30 Jan 2026 08:35:15 +0000
Subject: [PATCH 1/3] [MERGE-SCAN]: Test the baseline
---
src/test/regress/expected/btree_merge.out | 113 ++++++++++++++++++++++
src/test/regress/sql/btree_merge.sql | 100 +++++++++++++++++++
2 files changed, 213 insertions(+)
create mode 100644 src/test/regress/expected/btree_merge.out
create mode 100644 src/test/regress/sql/btree_merge.sql
diff --git a/src/test/regress/expected/btree_merge.out b/src/test/regress/expected/btree_merge.out
new file mode 100644
index 00000000000..441ae1d0657
--- /dev/null
+++ b/src/test/regress/expected/btree_merge.out
@@ -0,0 +1,113 @@
+-- B-Tree Merge Scan Access Method Test
+--
+-- B-Tree Merge Scan is an access method that allows lazily producing
+-- output sorted by a non-leading column when the prefix has few distinct values.
+--
+--
+-- Let S be an infinite set of lattic points (x,y).
+-- Let S(x=1,y>=b) be the sequence of points
+-- SELECT * FROM S WHERE x = a and y >= b ORDER BY b;
+-- i.e. (a, b), (a, b+1), (a, b+2), ...
+-- Similarly, S(x IN X, y=b) being the sequence of points
+-- SELECT * FROM S WHERE x IN X and y = b ORDER BY x;
+-- i.e. (x[1], b), ..., (x[n], b), (x[1], b+1), ...
+-- The output of S(x IN X, y >= b) can be computed as a
+--
+-- Proposition (uncomputable):
+-- S(x, IN X, y >= b) is the K-way merge of the sequences
+-- {S(x=x[i], y >= b), x[i] in X}
+--
+--
+--
+-- Proposition (computable): Bounded suffix
+--
+-- S(x, IN X, b1 <= y <= b2) as bounded
+-- can be computed with (SELECT count(distinct x) + count(1) FROM bounded)
+-- tuple accesses.
+-- (Constructive) Proof:
+-- The result of
+-- SELECT * FROM X
+-- JOIN S on x = x[i] WHERE y BETWEEN b1 AND b2;
+-- is the same as
+-- SELECT * FROM X,
+-- LATERAL (
+-- (SELECT * FROM S
+-- WHERE x = x[i] AND y BETWEEN b1 AND b2
+-- ) AS subscan[i]
+-- ) as merged
+--
+-- Each of subscan[i] is covered by a single range in the index and can
+-- and require at most
+-- (count(1) FROM subscan[i]) + 1 -- subscan tuple access count
+-- tupples to be accessed.
+-- The merged result can be computed using a K-way merge sort
+-- whose number of rows is
+-- sum(count(1) FROM subscan[i]) -- query output rows
+-- Q.E.D.
+--
+--
+-- Proposition (computable): Limitted query
+-- The query
+-- S(x, IN X, y >= b) LIMIT N as limited
+-- Can be computed with at most
+-- N + count(distinct X) - 1
+-- tuple accesses.
+--
+-- (Constructive) Proof:
+-- If an upper `u` bound for `MAX(y IN S(x, IN X, y >= b) LIMIT N)` is known,
+-- then the query can be rewritten as
+-- S(x, IN X, b <= y <= u) LIMIT N
+-- The K-way can produce the next element as soon as it has fetched
+-- the next element for each subquery
+-- 1 row can be produced after count(distinct X) fetches,
+-- After that it can produce one new row for each fetch.
+-- Thus, the total number of fetches is at most
+-- N + count(distinct X) - 1
+-- Q.E.D.
+-- Generate a table with lattice points
+-- Could be infinite
+CREATE TABLE btree_merge_test AS (
+ SELECT x, y FROM
+ generate_series(1, 50) AS x,
+ generate_series(1, 50) AS y
+ ORDER BY random()
+);
+CREATE INDEX btree_merge_test_idx ON btree_merge_test USING btree (x, y);
+ANALYSE btree_merge_test;
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+SHOW track_counts; -- should be 'on'
+ track_counts
+--------------
+ on
+(1 row)
+
+-- From the limited query proposition this can be computed with 10
+-- tupple accesses.
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x -- sort x to make result unique
+LIMIT 3;
+ x | y
+---+----
+ 1 | 19
+ 2 | 19
+ 5 | 19
+(3 rows)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT idx_scan, idx_tup_read, idx_tup_fetch
+FROM pg_stat_user_indexes
+WHERE indexrelname = 'btree_merge_test_idx';
+ idx_scan | idx_tup_read | idx_tup_fetch
+----------+--------------+---------------
+ 5 | 10 | 10
+(1 row)
+
+DROP TABLE btree_merge_test;
diff --git a/src/test/regress/sql/btree_merge.sql b/src/test/regress/sql/btree_merge.sql
new file mode 100644
index 00000000000..be00c33c2a5
--- /dev/null
+++ b/src/test/regress/sql/btree_merge.sql
@@ -0,0 +1,100 @@
+-- B-Tree Merge Scan Access Method Test
+--
+-- B-Tree Merge Scan is an access method that allows lazily producing
+-- output sorted by a non-leading column when the prefix has few distinct values.
+--
+--
+-- Let S be an infinite set of lattic points (x,y).
+-- Let S(x=1,y>=b) be the sequence of points
+-- SELECT * FROM S WHERE x = a and y >= b ORDER BY b;
+-- i.e. (a, b), (a, b+1), (a, b+2), ...
+-- Similarly, S(x IN X, y=b) being the sequence of points
+-- SELECT * FROM S WHERE x IN X and y = b ORDER BY x;
+-- i.e. (x[1], b), ..., (x[n], b), (x[1], b+1), ...
+-- The output of S(x IN X, y >= b) can be computed as a
+--
+-- Proposition (uncomputable):
+-- S(x, IN X, y >= b) is the K-way merge of the sequences
+-- {S(x=x[i], y >= b), x[i] in X}
+--
+--
+--
+-- Proposition (computable): Bounded suffix
+--
+-- S(x, IN X, b1 <= y <= b2) as bounded
+-- can be computed with (SELECT count(distinct x) + count(1) FROM bounded)
+-- tuple accesses.
+-- (Constructive) Proof:
+-- The result of
+-- SELECT * FROM X
+-- JOIN S on x = x[i] WHERE y BETWEEN b1 AND b2;
+-- is the same as
+-- SELECT * FROM X,
+-- LATERAL (
+-- (SELECT * FROM S
+-- WHERE x = x[i] AND y BETWEEN b1 AND b2
+-- ) AS subscan[i]
+-- ) as merged
+--
+-- Each of subscan[i] is covered by a single range in the index and can
+-- and require at most
+-- (count(1) FROM subscan[i]) + 1 -- subscan tuple access count
+-- tupples to be accessed.
+-- The merged result can be computed using a K-way merge sort
+-- whose number of rows is
+-- sum(count(1) FROM subscan[i]) -- query output rows
+-- Q.E.D.
+--
+--
+-- Proposition (computable): Limitted query
+-- The query
+-- S(x, IN X, y >= b) LIMIT N as limited
+-- Can be computed with at most
+-- N + count(distinct X) - 1
+-- tuple accesses.
+--
+-- (Constructive) Proof:
+-- If an upper `u` bound for `MAX(y IN S(x, IN X, y >= b) LIMIT N)` is known,
+-- then the query can be rewritten as
+-- S(x, IN X, b <= y <= u) LIMIT N
+-- The K-way can produce the next element as soon as it has fetched
+-- the next element for each subquery
+-- 1 row can be produced after count(distinct X) fetches,
+-- After that it can produce one new row for each fetch.
+-- Thus, the total number of fetches is at most
+-- N + count(distinct X) - 1
+-- Q.E.D.
+
+
+-- Generate a table with lattice points
+-- Could be infinite
+CREATE TABLE btree_merge_test AS (
+ SELECT x, y FROM
+ generate_series(1, 50) AS x,
+ generate_series(1, 50) AS y
+ ORDER BY random()
+);
+CREATE INDEX btree_merge_test_idx ON btree_merge_test USING btree (x, y);
+
+ANALYSE btree_merge_test;
+
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+SHOW track_counts; -- should be 'on'
+-- From the limited query proposition this can be computed with 10
+-- tupple accesses.
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x -- sort x to make result unique
+LIMIT 3;
+
+
+SELECT pg_stat_force_next_flush();
+
+
+SELECT idx_scan, idx_tup_read, idx_tup_fetch
+FROM pg_stat_user_indexes
+WHERE indexrelname = 'btree_merge_test_idx';
+
+DROP TABLE btree_merge_test;
\ No newline at end of file
--
2.40.0
* Re: New access method for b-tree.
2026-02-01 10:02 New access method for b-tree. Alexandre Felipe <[email protected]>
2026-02-01 23:54 ` Re: New access method for b-tree. Tomas Vondra <[email protected]>
2026-02-03 21:42 ` Re: New access method for b-tree. Ants Aasma <[email protected]>
2026-02-04 07:13 ` Re: New access method for b-tree. Michał Kłeczek <[email protected]>
2026-02-05 06:59 ` Re: New access method for b-tree. Alexandre Felipe <[email protected]>
@ 2026-02-06 10:52 ` Alexandre Felipe <[email protected]>
2026-02-23 22:08 ` Re: New access method for b-tree. Alexandre Felipe <[email protected]>
0 siblings, 1 reply; 12+ messages in thread
From: Alexandre Felipe @ 2026-02-06 10:52 UTC (permalink / raw)
To: pgsql-hackers; [email protected]; [email protected]; [email protected]; [email protected] <[email protected]>; +Cc: Ants Aasma <[email protected]>; Tomas Vondra <[email protected]>; Alexandre Felipe <[email protected]>; Michał Kłeczek <[email protected]>; [email protected]
Hello again hackers!
[email protected] <[email protected]>: That seems to be the one that is probably the
most familiar with the index scan (based on the commits).
[email protected] <[email protected]> , [email protected]
<[email protected]> , [email protected] <[email protected]> as the
top 3 committers to nbtree over the last ~6 months.
I have made substantial progress on adding a few features. I have
questions, but I will let you go first :)
Motivation:
*In technical terms:* this proposal is to take advantage of a btree index
when the query is filtered by a few distinct prefixes and ordered by a
suffix and has a limit.
*In non technical:* This could help to efficiently render a social network
feed, where each user can select a list of users whose posts they want to
see, and the posts must be ordered from newest to oldest.
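To make the access pattern concrete, here is a minimal sketch in C. It is not the code from the patch: the `Cursor` type and `kway_merge_limit` function are hypothetical names, each per-prefix index range is modelled as a sorted in-memory array, and the smallest key is picked with a linear scan over the cursors, whereas the patch uses a pairing heap. It only illustrates why a limit of M over K prefixes needs about M + K - 1 fetches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor: one sorted run of suffix keys per prefix value. */
typedef struct Cursor
{
    const int  *keys;   /* suffix values for this prefix, ascending */
    size_t      len;    /* number of keys in the run */
    size_t      pos;    /* index of the key currently buffered */
} Cursor;

/*
 * Merge at most `limit` keys from `k` sorted runs into `out`.
 * `*fetches` receives the number of keys read from the runs.
 * Returns the number of keys written to `out`.
 */
size_t
kway_merge_limit(Cursor *cursors, size_t k, size_t limit,
                 int *out, size_t *fetches)
{
    size_t      produced = 0;

    *fetches = 0;
    if (limit == 0)
        return 0;

    /* Prime every cursor: one fetch per non-empty run. */
    for (size_t i = 0; i < k; i++)
    {
        cursors[i].pos = 0;
        if (cursors[i].len > 0)
            (*fetches)++;
    }

    while (produced < limit)
    {
        size_t      best = k;   /* k means "no candidate yet" */

        /* Linear scan for the smallest buffered key; ties keep cursor order. */
        for (size_t i = 0; i < k; i++)
        {
            if (cursors[i].pos >= cursors[i].len)
                continue;       /* this run is exhausted */
            if (best == k ||
                cursors[i].keys[cursors[i].pos] <
                cursors[best].keys[cursors[best].pos])
                best = i;
        }
        if (best == k)
            break;              /* every run is exhausted */

        out[produced++] = cursors[best].keys[cursors[best].pos];
        if (produced == limit)
            break;              /* no need to fetch a replacement */

        /* Refill only the cursor that was just consumed. */
        cursors[best].pos++;
        if (cursors[best].pos < cursors[best].len)
            (*fetches)++;
    }
    return produced;
}
```

With three runs that each start at 19 and a limit of 4, this sketch reports 6 fetches, i.e. 4 + 3 - 1. Replacing the linear scan with a heap changes the CPU cost per row to O(log K) but not the number of fetches.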
*Performance Comparison*
I did a test with a toy table, please find more details below.
With limit 100
| Method | Shared Hit | Shared Read | Exec Time |
|------------|-----------:|------------:|----------:|
| Merge | 13 | 119 | 13 ms |
| Sequential | 15,308 | 525,310 | 3,409 ms |
With limit 1,000,000
| Method | Shared Hit | Shared Read | Temp Read | Temp Written | Exec Time |
|------------|-----------:|---------:|-------:|-------:|----------:|
| Merge | 980,318 | 19,721 | 0 | 0 | 2,128 ms |
| Sequential | 15,208 | 525,410 | 20,207 | 35,384 | 3,762 ms |
| Bitmap | 629 | 113,759 | 20,207 | 35,385 | 5,487 ms |
| IndexScan | 7,880,619 | 126,706 | 20,945 | 35,386 | 5,874 ms |
Sequential and bitmap scans significantly reduce the number of buffers
accessed in this case, because the table has only four integer columns and
these methods can read all the rows on a given page at once.
However, that comes at the cost of resorting to an on-disk sort.
For the query with limit 100 there are no temp files, because a top-N
heapsort is used.
make check passes
*Experiment details*
Consider a 100M-row table containing every (a, b, c, d) \in 100 x 100 x 100 x 100:
```sql
CREATE TABLE grid AS (
SELECT a, b, c, d FROM
generate_series(1, 100) AS a,
generate_series(1, 100) AS b,
generate_series(1, 100) AS c,
generate_series(1, 100) AS d
);
CREATE INDEX grid_a_b_c_idx ON grid (a, b, c);
ANALYSE grid;
```
Now let's say that we need to find a certain number of rows filtered by a
and ordered by b:
```sql
PREPARE grid_query(int) AS
SELECT sum(d) FROM (
SELECT * FROM grid
WHERE a IN (2,3,5,8,13,21,34,55) AND b >= 0
ORDER BY b
LIMIT $1) t;
```
---
Now with limit 100, with index merge scan (notice Index Prefixes in the
plan).
```sql
SET enable_indexmergescan = on;
EXPLAIN (ANALYSE) EXECUTE grid_query(100);
```
```text
Buffers: shared hit=13 read=119
-> Limit (cost=0.57..87.29 rows=100 width=16) (actual
time=5.528..12.999 rows=100.00 loops=1)
Buffers: shared hit=13 read=119
-> Index Scan using grid_a_b_c_idx on grid (cost=0.57..93.36
rows=107 width=16) (actual time=5.528..12.994 rows=100.00 loops=1)
Index Cond: (b >= 0)
*Index Prefixes: *(a = ANY
('{2,3,5,8,13,21,34,55}'::integer[]))
Index Searches: 8
Buffers: shared hit=13 read=119
Planning:
Buffers: shared hit=59 read=23
Planning Time: 4.619 ms
Execution Time: 13.055 ms
```
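As a reading of the plan above (not a statement about the patch internals): the row estimate on the Index Scan node matches the N + K - 1 bound from the regression test comments, with a limit of 100 and K = 8 prefix values.

```latex
M + K - 1 = 100 + 8 - 1 = 107
```

The same arithmetic gives 1,000,007 for a limit of 1,000,000.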
```sql
SET enable_indexmergescan = off;
EXPLAIN (ANALYSE) EXECUTE grid_query(100);
```
```text
Aggregate (cost=1603588.06..1603588.07 rows=1 width=8) (actual
time=3406.624..3408.710 rows=1.00 loops=1)
Buffers: shared hit=15308 read=525310
-> Limit (cost=1603575.17..1603586.81 rows=100 width=16) (actual
time=3406.601..3408.702 rows=100.00 loops=1)
Buffers: shared hit=15308 read=525310
-> Gather Merge (cost=1603575.17..2514342.92 rows=7819999
width=16) (actual time=3406.598..3408.695 rows=100.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15308 read=525310
-> Sort (cost=1602575.14..1610720.98 rows=3258333
width=16) (actual time=3393.782..3393.784 rows=100.00 loops=3)
Sort Key: grid.b
Sort Method: top-N heapsort Memory: 32kB
Buffers: shared hit=15308 read=525310
Worker 0: Sort Method: top-N heapsort Memory: 32kB
Worker 1: Sort Method: top-N heapsort Memory: 32kB
-> *Parallel Seq Scan* on grid
(cost=0.00..1478044.00 rows=3258333 width=16) (actual time=0.944..3129.896
rows=2666666.67 loops=3)
Filter: ((b >= 0) AND (a = ANY
('{2,3,5,8,13,21,34,55}'::integer[])))
Rows Removed by Filter: 30666667
Buffers: shared hit=15234 read=525310
Planning Time: 0.370 ms
Execution Time: 3409.134 ms
```
Now queries with limit 1,000,000
```sql
SET enable_indexmergescan = on;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
Query executed with the proposed access method. Notice Index Prefixes and
Index Cond in the plan.
```text
Buffers: shared hit=980318 read=19721
-> Limit (cost=0.57..867259.84 rows=1000000 width=16) (actual
time=2.854..2103.438 rows=1000000.00 loops=1)
Buffers: shared hit=980318 read=19721
-> Index Scan using grid_a_b_c_idx on grid (cost=0.57..867265.91
rows=1000007 width=16) (actual time=2.852..2066.205 rows=1000000.00 loops=1)
Index Cond: (b >= 0)
*Index Prefixes:* (a = ANY
('{2,3,5,8,13,21,34,55}'::integer[]))
Index Searches: 8
Buffers: shared hit=980318 read=19721
Planning Time: 0.328 ms
Execution Time: 2127.811 ms
```
If we turn enable_indexmergescan off, we naturally fall back to a sequential scan.
```sql
SET enable_indexmergescan = off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text
Buffers: shared hit=15208 read=525410, temp read=20207 written=35384
-> Limit (cost=1942895.64..2059362.12 rows=1000000 width=16) (actual
time=3467.012..3712.044 rows=1000000.00 loops=1)
Buffers: shared hit=15208 read=525410, temp read=20207
written=35384
-> Gather Merge (cost=1942895.64..2853663.39 rows=7819999
width=16) (actual time=3467.010..3671.220 rows=1000000.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15208 read=525410, temp read=20207
written=35384
-> Sort (cost=1941895.62..1950041.45 rows=3258333
width=16) (actual time=3455.852..3476.358 rows=334576.33 loops=3)
Sort Key: grid.b
Sort Method: *external merge Disk: 47016kB*
Buffers: shared hit=15208 read=525410, temp read=20207
written=35384
Worker 0: Sort Method: external merge Disk: 46976kB
Worker 1: Sort Method: external merge Disk: 47000kB
-> *Parallel Seq Scan* on grid
(cost=0.00..1478044.00 rows=3258333 width=16) (actual time=2.789..2779.483
rows=2666666.67 loops=3)
Filter: ((b >= 0) AND (a = ANY
('{2,3,5,8,13,21,34,55}'::integer[])))
Rows Removed by Filter: 30666667
Buffers: shared hit=15134 read=525410
Planning Time: 0.332 ms
Execution Time: 3761.866 ms
```
If we disable sequential scans, then we get a bitmap scan
```sql
SET enable_seqscan = off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text
Buffers: shared hit=629 read=113759 written=2, temp read=20207
written=35385
-> Limit (cost=1998199.78..2114666.26 rows=1000000 width=16) (actual
time=5170.456..5453.433 rows=1000000.00 loops=1)
Buffers: shared hit=629 read=113759 written=2, temp read=20207
written=35385
-> Gather Merge (cost=1998199.78..2908967.53 rows=7819999
width=16) (actual time=5170.455..5413.254 rows=1000000.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=629 read=113759 written=2, temp
read=20207 written=35385
-> Sort (cost=1997199.75..2005345.59 rows=3258333
width=16) (actual time=5156.929..5177.507 rows=334500.67 loops=3)
Sort Key: grid.b
Sort Method: external merge Disk: 47032kB
Buffers: shared hit=629 read=113759 written=2, temp
read=20207 written=35385
Worker 0: Sort Method: external merge Disk: 47280kB
Worker 1: Sort Method: external merge Disk: 46680kB
-> Parallel Bitmap Heap Scan on grid
(cost=107691.54..1533348.13 rows=3258333 width=16) (actual
time=299.891..4489.787 rows=2666666.67 loops=3)
Recheck Cond: ((a = ANY
('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
Rows *Removed by Index Recheck*: 2410242
Heap Blocks: exact=13100 lossy=22639
Buffers: shared hit=615 read=113759 written=2
Worker 0: Heap Blocks: exact=13077 lossy=22755
Worker 1: Heap Blocks: exact=13036 lossy=22421
-> *Bitmap Index Scan* on grid_a_b_c_idx
(cost=0.00..105736.54 rows=7820000 width=0) (actual time=297.651..297.651
rows=8000000.00 loops=1)
Index Cond: ((a = ANY
('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
Index Searches: 7
Buffers: shared hit=13 read=7293 written=2
Planning Time: 0.165 ms
Execution Time: 5487.213 ms
```
If we disable bitmap scans we finally get an index scan
```sql
SET enable_bitmapscan = off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text
Buffers: shared hit=7883221 read=124111, temp read=20699 written=35385
-> Limit (cost=7201203.08..7317669.55 rows=1000000 width=16) (actual
time=4414.478..4674.400 rows=1000000.00 loops=1)
Buffers: shared hit=7883221 read=124111, temp read=20699
written=35385
-> Gather Merge (cost=7201203.08..8111970.83 rows=7819999
width=16) (actual time=4414.476..4633.982 rows=1000000.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=7883221 read=124111, temp read=20699
written=35385
-> Sort (cost=7200203.05..7208348.88 rows=3258333
width=16) (actual time=4390.625..4411.896 rows=334567.00 loops=3)
Sort Key: grid.b
Sort Method: *external merge Disk: 47304kB*
Buffers: shared hit=7883221 read=124111, temp
read=20699 written=35385
Worker 0: Sort Method: external merge Disk: 47304kB
Worker 1: Sort Method: external merge Disk: 46384kB
-> *Parallel Index Scan* using grid_a_b_c_idx on grid
(cost=0.57..6736351.43 rows=3258333 width=16) (actual
time=46.925..3796.915 rows=2666666.67 loops=3)
Index Cond: ((a = ANY
('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
Index Searches: 7
Buffers: shared hit=7883208 read=124110
Planning Time: 0.385 ms
Execution Time: 4713.325 ms
```
On Thu, Feb 5, 2026 at 6:59 AM Alexandre Felipe <
[email protected]> wrote:
> Thank you for looking into this.
>
> Now we can execute a (still narrow) family of queries!
>
> Maybe it helps to see this as a *social network feeds*. Imagine a social
> network, you have a few friends, or follow a few people, and you want to
> see their updates ordered by date. For each user we have a different
> combination of users that we have to display. But maybe, even having
> hundreds of users we will only show the first 10.
>
> There is some low-hanging fruit in the skip scan: if we need M rows and
> one group has already yielded M rows, we could stop scanning that group.
> Let Nx be the number of friends, M the number of posts to show, and N the
> number of posts per friend.
> This reads (Nx * M) rows, followed by an (Nx * M) sort, instead of
> (Nx * N) rows followed by an (Nx * N) sort.
> Where M = 10 and N is 1000 this is a significant improvement.
> But if M ~ N, the merge scan still runs with M + Nx row accesses and
> (M + Nx) heap operations.
> If everything is on the same page the skip scan would win.
>
> The cost estimation is probably far off.
> I am also not considering the filters applied after this operator, and I
> don't know if the planner infrastructure is able to adjust it by itself.
> This is where I would like reviewers' feedback. I think that the planner
> costs are something to be determined experimentally.
>
> Next I will make it slightly more general, handling:
> * More index columns: Index (a, b, s...) could support WHERE a IN (...)
> ORDER BY b LIMIT N (ignoring s...)
> * Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
> * Non-leading prefix: WHERE b IN (...) AND a = const ORDER BY c on index
> (a, b, c)
>
> ---
> Kind Regards,
> Alexandre
>
> On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <[email protected]> wrote:
>
>>
>>
>> On 3 Feb 2026, at 22:42, Ants Aasma <[email protected]> wrote:
>>
>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <[email protected]> wrote:
>>
>> I'm also wondering how common is the targeted query pattern? How common
>> it is to have an IN condition on the leading column in an index, and
>> ORDER BY on the second one?
>>
>>
>> I have seen this pattern multiple times. My nickname for it is the
>> timeline view. Think of the social media timeline, showing posts from
>> all followed accounts in timestamp order, returned in reasonably sized
>> batches. The naive SQL query will have to scan all posts from all
>> followed accounts and pass them through a top-N sort. When the total
>> number of posts is much larger than the batch size this is much slower
>> than what is proposed here (assuming I understand it correctly) -
>> effectively equivalent to running N index scans through Merge Append.
>>
>>
>> The workarounds I have proposed to users have been either to rewrite the
>> query as a UNION ALL of a set of single value prefix queries wrapped
>> in an order by limit. This gives the exact needed merge append plan
>> shape. But repeating the query N times can get unwieldy when the
>> number of values grows, so the fallback is:
>>
>> SELECT * FROM unnest(:friends) id, LATERAL (
>> SELECT * FROM posts
>> WHERE user_id = id
>> ORDER BY tstamp DESC LIMIT 100)
>> ORDER BY tstamp DESC LIMIT 100;
>>
>> The downside of this formulation is that we still have to fetch a
>> batch worth of items from scans where we otherwise would have only had
>> to look at one index tuple.
>>
>>
>> GIST can be used to handle this kind of query as it supports multiple
>> sort orders.
>> The only problem is that GIST does not support ORDER BY column.
>> One possible workaround is [1] but as described there it does not play
>> well with partitioning.
>> I’ve started drafting support for ORDER BY column in GIST - see [2].
>> I think it would be easier to implement and maintain than a new IAM (but
>> I don’t have enough knowledge and experience to implement it myself)
>>
>> [1]
>> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org
>> [2]
>> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org
>>
>> —
>> Kind regards,
>> Michal
>>
>
Attachments:
[application/octet-stream] 0003-MERGE-SCAN-Planner-integration.patch (27.6K, 3-0003-MERGE-SCAN-Planner-integration.patch)
From ad123a3f8da3d95262b2553e90dd9c8fbb8d2335 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Thu, 5 Feb 2026 05:09:48 +0000
Subject: [PATCH 3/4] [MERGE-SCAN] Planner integration
---
src/backend/access/index/genam.c | 2 +
src/backend/access/nbtree/nbtmergescan.c | 60 ++++++-
src/backend/access/nbtree/nbtree.c | 129 +++++++++++++++
src/backend/executor/nodeIndexonlyscan.c | 5 +-
src/backend/executor/nodeIndexscan.c | 11 ++
src/backend/optimizer/path/indxpath.c | 188 ++++++++++++++++++++++
src/backend/optimizer/plan/createplan.c | 8 +
src/backend/optimizer/util/pathnode.c | 2 +
src/include/access/relscan.h | 3 +
src/include/nodes/execnodes.h | 5 +
src/include/nodes/pathnodes.h | 1 +
src/include/nodes/plannodes.h | 4 +
src/test/regress/expected/btree_merge.out | 16 +-
src/test/regress/sql/btree_merge.sql | 9 ++
14 files changed, 437 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 5e89b86a62c..53615fb08d2 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xs_hitup = NULL;
scan->xs_hitupdesc = NULL;
+ scan->xs_num_merge_prefixes = 0;
+
return scan;
}
diff --git a/src/backend/access/nbtree/nbtmergescan.c b/src/backend/access/nbtree/nbtmergescan.c
index 70828dc73d3..eda1e683525 100644
--- a/src/backend/access/nbtree/nbtmergescan.c
+++ b/src/backend/access/nbtree/nbtmergescan.c
@@ -27,6 +27,7 @@
#include "access/relscan.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
+#include "pgstat.h"
#include "storage/bufmgr.h"
#include "utils/datum.h"
#include "utils/lsyscache.h"
@@ -169,7 +170,8 @@ bt_merge_init(IndexScanDesc scan,
cursor->exhausted = prefix_nulls[i]; /* NULL prefix = exhausted */
cursor->sort_key_isnull = true;
BTScanPosInvalidate(cursor->pos);
- cursor->tuples = NULL;
+ /* Allocate tuple workspace for index-only scans */
+ cursor->tuples = palloc(BLCKSZ);
}
/* Initialize the merge heap */
@@ -219,6 +221,15 @@ bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
state->active_cursors++;
}
}
+
+ /*
+ * Track internal tuple reads for stats. We read active_cursors tuples
+ * during initialization. One of these will be returned first and
+ * counted by index_getnext_tid, so we count (active_cursors - 1) here.
+ */
+ if (state->active_cursors > 1)
+ pgstat_count_index_tuples(scan->indexRelation,
+ state->active_cursors - 1);
}
/* Get the cursor with the smallest suffix value */
@@ -228,9 +239,15 @@ bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
node = pairingheap_remove_first(state->merge_heap);
cursor = pairingheap_container(BTMergeCursor, ph_node, node);
- /* Set up the heap TID from the current cursor position */
+ /* Set up the heap TID and index tuple from the current cursor position */
Assert(BTScanPosIsValid(cursor->pos));
- scan->xs_heaptid = cursor->pos.items[cursor->pos.itemIndex].heapTid;
+ {
+ BTScanPosItem *currItem = &cursor->pos.items[cursor->pos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ /* For index-only scans, set the index tuple pointer */
+ if (cursor->tuples)
+ scan->xs_itup = (IndexTuple) (cursor->tuples + currItem->tupleOffset);
+ }
/* Advance cursor to next tuple */
if (bt_merge_cursor_advance(state, scan, cursor))
@@ -255,9 +272,23 @@ bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
void
bt_merge_end(BTMergeScanState *state)
{
+ int i;
+
if (state == NULL)
return;
+ /* Release any buffer pins held by cursors */
+ for (i = 0; i < state->num_cursors; i++)
+ {
+ BTMergeCursor *cursor = &state->cursors[i];
+
+ if (BTScanPosIsValid(cursor->pos) && BufferIsValid(cursor->pos.buf))
+ {
+ ReleaseBuffer(cursor->pos.buf);
+ cursor->pos.buf = InvalidBuffer;
+ }
+ }
+
/* Free the memory context, which frees all allocations */
MemoryContextDelete(state->merge_context);
}
@@ -302,8 +333,14 @@ bt_merge_cursor_init(BTMergeScanState *state,
/* Invalidate current position to force _bt_first */
BTScanPosInvalidate(so->currPos);
- /* Disable array key handling for this cursor's scan */
+ /*
+ * Disable array key handling for this cursor's scan.
+ * We need to clear both numArrayKeys and needPrimScan to avoid
+ * assertions in _bt_readfirstpage that expect array keys when
+ * needPrimScan is set.
+ */
so->numArrayKeys = 0;
+ so->needPrimScan = false;
/* Position at first matching tuple */
found = _bt_first(scan, state->direction);
@@ -313,6 +350,16 @@ bt_merge_cursor_init(BTMergeScanState *state,
/* Copy position to cursor */
memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+ /*
+ * Copy the tuple data for index-only scans.
+ * The tuple workspace contains copies of index tuples referenced
+ * by items in currPos.
+ */
+ if (so->currTuples && so->currPos.nextTupleOffset > 0)
+ {
+ memcpy(cursor->tuples, so->currTuples, so->currPos.nextTupleOffset);
+ }
+
/* Extract the sort key for heap ordering */
cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
&cursor->sort_key_isnull);
@@ -390,6 +437,11 @@ bt_merge_cursor_advance(BTMergeScanState *state,
if (found)
{
+ /*
+ * Don't count here - the advanced-to tuple will be returned later
+ * and counted by index_getnext_tid at that time.
+ */
+
/* Extract new sort key */
cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
&cursor->sort_key_isnull);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3dec1ee657d..0e55c4874b4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -21,6 +21,8 @@
#include "access/nbtree.h"
#include "access/relscan.h"
#include "access/stratnum.h"
+#include "catalog/pg_amop.h"
+#include "utils/array.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
#include "nodes/execnodes.h"
@@ -34,6 +36,7 @@
#include "utils/datum.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
+#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -98,6 +101,8 @@ static void _bt_parallel_serialize_arrays(Relation rel, BTParallelScanDesc btsca
BTScanOpaque so);
static void _bt_parallel_restore_arrays(Relation rel, BTParallelScanDesc btscan,
BTScanOpaque so);
+static bool bt_init_merge_scan_from_keys(IndexScanDesc scan);
+
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
@@ -221,6 +226,106 @@ btinsert(Relation rel, Datum *values, bool *isnull,
return result;
}
+/*
+ * bt_init_merge_scan_from_keys
+ * Initialize merge scan state from the preprocessed scan keys.
+ *
+ * Returns true if merge scan was successfully initialized.
+ * Returns false if merge scan cannot be used (e.g., no suitable array key).
+ */
+static bool
+bt_init_merge_scan_from_keys(IndexScanDesc scan)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ ScanKey arrayKey = NULL;
+ ArrayType *arr;
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ int prefix_attno;
+ int suffix_attno;
+ Oid suffix_cmp_oid;
+ Oid suffix_collation;
+ Oid opfamily;
+ Oid elemtype;
+ int16 elemlen;
+ bool elembyval;
+ char elemalign;
+ int i;
+
+ /* Look for SK_SEARCHARRAY on first column in the raw scan keys */
+ for (i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey sk = &scan->keyData[i];
+
+ if ((sk->sk_flags & SK_SEARCHARRAY) &&
+ sk->sk_attno == 1 &&
+ sk->sk_strategy == BTEqualStrategyNumber)
+ {
+ arrayKey = sk;
+ break;
+ }
+ }
+
+ if (arrayKey == NULL)
+ return false;
+
+ /* Extract array values from the scan key */
+ arr = DatumGetArrayTypeP(arrayKey->sk_argument);
+ num_prefixes = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
+
+ if (num_prefixes < 2)
+ return false;
+
+ /* Get array element type info */
+ elemtype = ARR_ELEMTYPE(arr);
+ get_typlenbyvalalign(elemtype, &elemlen, &elembyval, &elemalign);
+
+ /* Deconstruct the array into individual elements */
+ deconstruct_array(arr, elemtype, elemlen, elembyval, elemalign,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ /* Attribute numbers (1-based) */
+ prefix_attno = 1;
+ suffix_attno = 2;
+
+ /* Get the opfamily from the index */
+ opfamily = rel->rd_opfamily[suffix_attno - 1];
+
+ /* Get collation from the suffix column */
+ suffix_collation = TupleDescAttr(itupdesc, suffix_attno - 1)->attcollation;
+
+ /* Get the comparison function OID for the suffix column */
+ suffix_cmp_oid = get_opfamily_proc(opfamily,
+ TupleDescAttr(itupdesc, suffix_attno - 1)->atttypid,
+ TupleDescAttr(itupdesc, suffix_attno - 1)->atttypid,
+ BTORDER_PROC);
+
+ if (!OidIsValid(suffix_cmp_oid))
+ {
+ pfree(prefix_values);
+ pfree(prefix_nulls);
+ return false;
+ }
+
+ /* Initialize the merge scan state */
+ so->mergeState = bt_merge_init(scan,
+ prefix_values,
+ prefix_nulls,
+ num_prefixes,
+ prefix_attno,
+ suffix_attno,
+ suffix_cmp_oid,
+ suffix_collation);
+
+ pfree(prefix_values);
+ pfree(prefix_nulls);
+
+ return (so->mergeState != NULL);
+}
+
/*
* btgettuple() -- Get the next tuple in the scan.
*/
@@ -235,6 +340,24 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
/* btree indexes are never lossy */
scan->xs_recheck = false;
+ /*
+ * Check if merge scan optimization should be used.
+ * Initialize merge scan state on first call if needed.
+ */
+ if (scan->xs_num_merge_prefixes > 0 && so->mergeState == NULL)
+ {
+ if (!bt_init_merge_scan_from_keys(scan))
+ {
+ /* Merge scan init failed, fall through to regular scan */
+ scan->xs_num_merge_prefixes = 0;
+ }
+ }
+
+ /* Use merge scan if initialized */
+ if (so->mergeState != NULL)
+ return bt_merge_getnext(scan, dir);
+
/* Each loop iteration performs another primitive index scan */
do
{
@@ -365,6 +488,9 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ /* Initialize merge scan state to NULL */
+ so->mergeState = NULL;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -486,6 +612,9 @@ btendscan(IndexScanDesc scan)
pfree(so->killedItems);
if (so->currTuples != NULL)
pfree(so->currTuples);
+ /* Clean up merge scan state */
+ if (so->mergeState != NULL)
+ bt_merge_end(so->mergeState);
/* so->markTuples should not be pfree'd, see btrescan */
pfree(so);
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index c2d09374517..70483c4e767 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -98,6 +98,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_ScanDesc = scandesc;
+ scandesc->xs_num_merge_prefixes = node->ioss_NumMergePrefixes;
/* Set it up for index-only scan */
node->ioss_ScanDesc->xs_want_itup = true;
@@ -615,7 +616,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ioss_RuntimeKeysReady = false;
indexstate->ioss_RuntimeKeys = NULL;
indexstate->ioss_NumRuntimeKeys = 0;
-
+ indexstate->ioss_NumMergePrefixes = node->num_merge_prefixes;
/*
* build the index scan keys from the index qualification
*/
@@ -790,6 +791,7 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_NumOrderByKeys,
piscan);
node->ioss_ScanDesc->xs_want_itup = true;
+ node->ioss_ScanDesc->xs_num_merge_prefixes = node->ioss_NumMergePrefixes;
node->ioss_VMBuffer = InvalidBuffer;
/*
@@ -856,6 +858,7 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_NumOrderByKeys,
piscan);
node->ioss_ScanDesc->xs_want_itup = true;
+ node->ioss_ScanDesc->xs_num_merge_prefixes = node->ioss_NumMergePrefixes;
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index a616abff04c..9e62cacd2d3 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,6 +115,7 @@ IndexNext(IndexScanState *node)
node->iss_ScanDesc = scandesc;
+ scandesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
/*
* If no run-time keys to calculate or they are ready, go ahead and
* pass the scankeys to the index AM.
@@ -211,6 +212,8 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_ScanDesc = scandesc;
+ scandesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
+
/*
* If no run-time keys to calculate or they are ready, go ahead and
* pass the scankeys to the index AM.
@@ -1086,6 +1089,11 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->iss_RuntimeContext = NULL;
}
+ /*
+ * Initialize merge scan state from plan node
+ */
+ indexstate->iss_NumMergePrefixes = node->num_merge_prefixes;
+
/*
* all done.
*/
@@ -1725,6 +1733,8 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_NumOrderByKeys,
piscan);
+ node->iss_ScanDesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
+
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
* the scankeys to the index AM.
@@ -1789,6 +1799,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_NumOrderByKeys,
piscan);
+ node->iss_ScanDesc->xs_num_merge_prefixes = node->iss_NumMergePrefixes;
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
* the scankeys to the index AM.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 67d9dc35f44..44b79f91335 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/stratnum.h"
+#include "utils/array.h"
#include "access/sysattr.h"
#include "access/transam.h"
#include "catalog/pg_am.h"
@@ -102,6 +103,8 @@ static bool eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
static void get_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
List **bitindexpaths);
+static void consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
+ IndexOptInfo *index, IndexClauseSet *clauses);
static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
@@ -770,6 +773,191 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
NULL);
*bitindexpaths = list_concat(*bitindexpaths, indexpaths);
}
+
+ /*
+ * Consider merge scan optimization for queries with:
+ * - ScalarArrayOpExpr (IN clause) on first index column
+ * - ORDER BY on second column (different from index leading column)
+ * - Optionally LIMIT
+ */
+ consider_merge_scan_path(root, rel, index, clauses);
+}
+
+/*
+ * consider_merge_scan_path
+ * Check if this index can provide a merge scan path for queries of the form:
+ * WHERE prefix IN (...) AND suffix >= b ORDER BY suffix, prefix LIMIT N
+ *
+ * Merge scan allows lazily producing output sorted by (suffix, prefix) from
+ * an index on (prefix, suffix) by doing a K-way merge of K separate scans.
+ */
+static void
+consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
+ IndexOptInfo *index, IndexClauseSet *clauses)
+{
+ IndexPath *ipath;
+ List *index_clauses;
+ List *index_pathkeys;
+ List *merge_pathkeys;
+ ListCell *lc;
+ int num_prefixes = 0;
+ int indexcol;
+ bool has_saop_on_first = false;
+ bool has_clause_on_second = false;
+
+ /* Need at least 2 index columns for merge scan */
+ if (index->nkeycolumns < 2)
+ return;
+
+ /* Index must be ordered and support gettuple */
+ if (index->sortopfamily == NULL || !index->amhasgettuple)
+ return;
+
+ /* Must have query pathkeys with at least 2 elements */
+ if (root->query_pathkeys == NIL || list_length(root->query_pathkeys) < 2)
+ return;
+
+ /*
+ * Check for ScalarArrayOpExpr on first column.
+ * Count the number of array elements (prefix values).
+ */
+ foreach(lc, clauses->indexclauses[0])
+ {
+ IndexClause *iclause = (IndexClause *) lfirst(lc);
+ RestrictInfo *rinfo = iclause->rinfo;
+
+ if (IsA(rinfo->clause, ScalarArrayOpExpr))
+ {
+ ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
+ Node *arrayarg = (Node *) lsecond(saop->args);
+
+ has_saop_on_first = true;
+
+ /* Try to determine the number of array elements */
+ if (IsA(arrayarg, Const))
+ {
+ Const *con = (Const *) arrayarg;
+
+ if (!con->constisnull)
+ {
+ ArrayType *arr = DatumGetArrayTypeP(con->constvalue);
+ num_prefixes = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
+ }
+ }
+ else
+ {
+ /* Can't determine size, estimate conservatively */
+ num_prefixes = 10;
+ }
+ break;
+ }
+ }
+
+ if (!has_saop_on_first || num_prefixes < 2)
+ return;
+
+ /* Check if there's any clause on second column */
+ if (clauses->indexclauses[1] != NIL)
+ has_clause_on_second = true;
+
+ if (!has_clause_on_second)
+ return;
+
+ /*
+ * Get the natural index pathkeys (prefix, suffix order).
+ * We need at least 2 pathkeys for merge scan to make sense.
+ */
+ index_pathkeys = build_index_pathkeys(root, index, ForwardScanDirection);
+ if (list_length(index_pathkeys) < 2)
+ return;
+
+ /*
+ * Check if query pathkeys are (suffix, prefix) - the REVERSED order.
+ * query_pathkeys[0] should match index_pathkeys[1] (suffix)
+ * query_pathkeys[1] should match index_pathkeys[0] (prefix)
+ */
+ {
+ PathKey *qpk0 = (PathKey *) linitial(root->query_pathkeys);
+ PathKey *qpk1 = (PathKey *) lsecond(root->query_pathkeys);
+ PathKey *ipk0 = (PathKey *) linitial(index_pathkeys);
+ PathKey *ipk1 = (PathKey *) lsecond(index_pathkeys);
+
+ /* Query's first pathkey must match index's SECOND pathkey (suffix) */
+ if (qpk0->pk_eclass != ipk1->pk_eclass)
+ return;
+
+ /* Query's second pathkey must match index's FIRST pathkey (prefix) */
+ if (qpk1->pk_eclass != ipk0->pk_eclass)
+ return;
+ }
+
+ /*
+ * The merge scan can satisfy the query's ORDER BY (suffix, prefix).
+ * Use the query's pathkeys directly since we've verified they match.
+ * This is critical: PostgreSQL compares pathkeys by pointer equality.
+ */
+ merge_pathkeys = root->query_pathkeys;
+
+ /*
+ * Build the index clause list (same as normal path).
+ */
+ index_clauses = NIL;
+ for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
+ {
+ foreach(lc, clauses->indexclauses[indexcol])
+ {
+ IndexClause *iclause = (IndexClause *) lfirst(lc);
+ index_clauses = lappend(index_clauses, iclause);
+ }
+ }
+
+ /*
+ * Create the merge scan path with (suffix, prefix) pathkeys.
+ */
+ ipath = create_index_path(root, index,
+ index_clauses,
+ NIL, /* no ORDER BY expressions */
+ NIL, /* no ORDER BY columns */
+ merge_pathkeys,
+ ForwardScanDirection,
+ check_index_only(rel, index),
+ NULL, /* no outer relids */
+ 1.0, /* loop_count */
+ false); /* not parallel */
+
+ /* Enable merge scan with K-way merge */
+ ipath->num_merge_prefixes = num_prefixes;
+
+ /*
+ * Adjust costs and row estimate for merge scan.
+ * Merge scan reads exactly (limit + K - 1) tuples instead of all matching.
+ * The row estimate reflects actual tuple accesses, not total matches.
+ */
+ if (root->limit_tuples > 0 && root->limit_tuples < ipath->path.rows)
+ {
+ double merge_rows;
+ double original_rows = ipath->path.rows;
+
+ /* Merge scan reads exactly (limit + K - 1) tuples */
+ merge_rows = root->limit_tuples + num_prefixes - 1;
+ if (merge_rows < original_rows)
+ {
+ double ratio = merge_rows / original_rows;
+
+ /* Scale run cost by ratio of tuples accessed */
+ ipath->path.total_cost = ipath->path.startup_cost +
+ (ipath->path.total_cost - ipath->path.startup_cost) * ratio;
+
+ /* Add startup cost for K index descents */
+ ipath->path.startup_cost += num_prefixes * 0.01 * cpu_operator_cost;
+
+ /* Update row estimate to reflect merge scan efficiency */
+ ipath->path.rows = merge_rows;
+ }
+ }
+
+ /* Submit the path for consideration */
+ add_path(rel, (Path *) ipath);
}
/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e5200f4b3ce..485b4b3e54e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -184,12 +184,14 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
+ int num_merge_prefixes,
ScanDirection indexscandir);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *recheckqual,
List *indexorderby,
List *indextlist,
+ int num_merge_prefixes,
ScanDirection indexscandir);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
@@ -3009,6 +3011,7 @@ create_indexscan_plan(PlannerInfo *root,
stripped_indexquals,
fixed_indexorderbys,
indexinfo->indextlist,
+ best_path->num_merge_prefixes,
best_path->indexscandir);
else
scan_plan = (Scan *) make_indexscan(tlist,
@@ -3020,6 +3023,7 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
+ best_path->num_merge_prefixes,
best_path->indexscandir);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5527,6 +5531,7 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
+ int num_merge_prefixes,
ScanDirection indexscandir)
{
IndexScan *node = makeNode(IndexScan);
@@ -5543,6 +5548,7 @@ make_indexscan(List *qptlist,
node->indexorderby = indexorderby;
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
+ node->num_merge_prefixes = num_merge_prefixes;
node->indexorderdir = indexscandir;
return node;
@@ -5557,6 +5563,7 @@ make_indexonlyscan(List *qptlist,
List *recheckqual,
List *indexorderby,
List *indextlist,
+ int num_merge_prefixes,
ScanDirection indexscandir)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
@@ -5572,6 +5579,7 @@ make_indexonlyscan(List *qptlist,
node->recheckqual = recheckqual;
node->indexorderby = indexorderby;
node->indextlist = indextlist;
+ node->num_merge_prefixes = num_merge_prefixes;
node->indexorderdir = indexscandir;
return node;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7b6c5d51e5d..21746cd684c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1075,6 +1075,8 @@ create_index_path(PlannerInfo *root,
pathnode->indexorderbycols = indexorderbycols;
pathnode->indexscandir = indexscandir;
+ pathnode->num_merge_prefixes = 0;
+
cost_index(pathnode, root, loop_count, partial_path);
return pathnode;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce340c076f8..fc55315ee07 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -190,6 +190,9 @@ typedef struct IndexScanDescData
/* parallel index scan information, in shared memory */
struct ParallelIndexScanDescData *parallel_scan;
+
+ /* Merge scan: K-way merge, ordered by an index suffix */
+ int xs_num_merge_prefixes;
} IndexScanDescData;
/* Generic structure for parallel scans */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f8053d9e572..4433d1c2612 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1734,6 +1734,9 @@ typedef struct IndexScanState
bool *iss_OrderByTypByVals;
int16 *iss_OrderByTypLens;
Size iss_PscanLen;
+
+ /* Merge scan: K-way merge */
+ int iss_NumMergePrefixes;
} IndexScanState;
/* ----------------
@@ -1780,6 +1783,8 @@ typedef struct IndexOnlyScanState
Size ioss_PscanLen;
AttrNumber *ioss_NameCStringAttNums;
int ioss_NameCStringCount;
+ /* Merge scan: K-way merge */
+ int ioss_NumMergePrefixes;
} IndexOnlyScanState;
/* ----------------
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index fb808823acf..ced7e224a87 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2040,6 +2040,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int num_merge_prefixes;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4bc6fb5670e..86d8c92e01f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -597,6 +597,8 @@ typedef struct IndexScan
List *indexorderbyops;
/* forward or backward or don't care */
ScanDirection indexorderdir;
+ /* Merge scan: K-way merge */
+ int num_merge_prefixes;
} IndexScan;
/* ----------------
@@ -645,6 +647,8 @@ typedef struct IndexOnlyScan
List *indextlist;
/* forward or backward or don't care */
ScanDirection indexorderdir;
+ /* Merge scan: K-way merge */
+ int num_merge_prefixes;
} IndexOnlyScan;
/* ----------------
diff --git a/src/test/regress/expected/btree_merge.out b/src/test/regress/expected/btree_merge.out
index 441ae1d0657..28509b331d7 100644
--- a/src/test/regress/expected/btree_merge.out
+++ b/src/test/regress/expected/btree_merge.out
@@ -82,6 +82,20 @@ SHOW track_counts; -- should be 'on'
on
(1 row)
+-- Verify merge scan is used: no Sort node, rows=10 (N + K - 1 = 3 + 8 - 1)
+EXPLAIN (COSTS OFF)
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x
+LIMIT 3;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Limit
+ -> Index Only Scan using btree_merge_test_idx on btree_merge_test
+ Index Cond: ((x = ANY ('{1,2,5,8,13,21,34,55}'::integer[])) AND (y >= 19))
+(3 rows)
+
-- From the limited query proposition this can be computed with 10
-- tupple accesses.
SELECT x, y
@@ -107,7 +121,7 @@ FROM pg_stat_user_indexes
WHERE indexrelname = 'btree_merge_test_idx';
idx_scan | idx_tup_read | idx_tup_fetch
----------+--------------+---------------
- 5 | 10 | 10
+ 8 | 9 | 3
(1 row)
DROP TABLE btree_merge_test;
diff --git a/src/test/regress/sql/btree_merge.sql b/src/test/regress/sql/btree_merge.sql
index be00c33c2a5..ad9cf03f869 100644
--- a/src/test/regress/sql/btree_merge.sql
+++ b/src/test/regress/sql/btree_merge.sql
@@ -81,6 +81,15 @@ ANALYSE btree_merge_test;
SET enable_seqscan = OFF;
SET enable_bitmapscan = OFF;
SHOW track_counts; -- should be 'on'
+
+-- Verify merge scan is used: no Sort node, rows=10 (N + K - 1 = 3 + 8 - 1)
+EXPLAIN (COSTS OFF)
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x
+LIMIT 3;
+
-- From the limited query proposition this can be computed with 10
-- tupple accesses.
SELECT x, y
--
2.40.0
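The cost comment in consider_merge_scan_path above estimates (limit + K - 1) tuple reads. A minimal standalone sketch of where that number comes from, not the patch's code: RunCursor and kway_merge_limit are hypothetical names, a linear minimum search stands in for the pairing heap, and the cursor is advanced lazily (only when another row is requested), whereas bt_merge_getnext appears to advance right after popping.

```c
#include <stddef.h>

/* Hypothetical stand-in for one per-prefix index cursor: a sorted run of
 * suffix values plus the current read position. */
typedef struct
{
    const int  *vals;   /* suffix values, ascending */
    size_t      len;    /* number of values in this run */
    size_t      pos;    /* index of the current (already read) value */
} RunCursor;

/*
 * Merge up to `limit` values from `k` sorted runs into `out`.
 * Returns the number of values written; *reads counts every value
 * fetched from any run.
 */
size_t
kway_merge_limit(RunCursor *runs, size_t k, size_t limit,
                 int *out, size_t *reads)
{
    size_t      n = 0;
    size_t      pending = k;    /* k means "no cursor awaiting advance" */

    *reads = 0;

    /* Position every cursor on its first value: one read per non-empty run. */
    for (size_t i = 0; i < k; i++)
    {
        runs[i].pos = 0;
        if (runs[i].len > 0)
            (*reads)++;
    }

    while (n < limit)
    {
        size_t      best = k;

        /* Lazily advance the cursor that produced the previous output. */
        if (pending < k)
        {
            runs[pending].pos++;
            if (runs[pending].pos < runs[pending].len)
                (*reads)++;
            pending = k;
        }

        /* Pick the smallest current value; the lowest run index wins ties. */
        for (size_t i = 0; i < k; i++)
        {
            if (runs[i].pos >= runs[i].len)
                continue;       /* this run is exhausted */
            if (best == k ||
                runs[i].vals[runs[i].pos] < runs[best].vals[runs[best].pos])
                best = i;
        }
        if (best == k)
            break;              /* every run is exhausted */

        out[n++] = runs[best].vals[runs[best].pos];
        pending = best;
    }
    return n;
}
```

With K non-empty runs that are long enough, the sketch performs K reads to position the cursors and one more read for each returned row except the last, which gives limit + K - 1. Empty or short runs reduce the count.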
[application/octet-stream] 0004-MERGE-SCAN-Multi-column.patch (61.3K, 4-0004-MERGE-SCAN-Multi-column.patch)
From e8377401efd1af0d6489fc12eaba5bfd0d396b37 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Thu, 5 Feb 2026 10:36:51 +0000
Subject: [PATCH 4/4] [MERGE-SCAN] Multi column
Handle equality or SAOP constraints on multiple leading columns.
Imposes the correct order on the entire prefix, honoring
(ASC|DESC) NULLS (FIRST|LAST).
Supports backward scans.
Adds the enable_indexmergescan parameter.
---
src/backend/access/nbtree/nbtmergescan.c | 341 ++++++++++++----------
src/backend/access/nbtree/nbtree.c | 215 ++++++++++----
src/backend/commands/explain.c | 94 ++++--
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/indxpath.c | 183 +++++++-----
src/backend/optimizer/plan/createplan.c | 84 +++++-
src/backend/optimizer/util/pathnode.c | 1 +
src/backend/utils/misc/guc_parameters.dat | 7 +
src/include/access/nbtree.h | 36 +--
src/include/nodes/pathnodes.h | 1 +
src/include/nodes/plannodes.h | 4 +
src/include/optimizer/cost.h | 1 +
src/test/regress/expected/btree_merge.out | 278 +++++++++++++++++-
src/test/regress/sql/btree_merge.sql | 159 +++++++++-
14 files changed, 1058 insertions(+), 347 deletions(-)
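The commit message mentions (ASC|DESC) NULLS (FIRST|LAST) and backward scans. A minimal sketch of how those three options can be folded into one comparison sign, under stated assumptions: NullableInt and scan_order_cmp are hypothetical names, the key is a plain int, and the result uses emission order (negative means "a first"), whereas bt_merge_heap_cmp in the patch returns the negated value for the max-heap.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of one nullable integer key column. */
typedef struct
{
    int         value;
    bool        isnull;
} NullableInt;

/*
 * Return <0 if a is emitted before b, >0 if after, 0 if tied, for a column
 * declared with the given options and scanned in the given direction.
 */
int
scan_order_cmp(NullableInt a, NullableInt b,
               bool desc, bool nulls_first, bool backward)
{
    int         cmp;

    if (a.isnull && b.isnull)
        return 0;

    if (a.isnull || b.isnull)
    {
        /* NULL placement depends on NULLS FIRST/LAST, not on ASC/DESC. */
        cmp = a.isnull ? 1 : -1;    /* NULLS LAST: the NULL sorts after */
        if (nulls_first)
            cmp = -cmp;
    }
    else
    {
        cmp = (a.value > b.value) - (a.value < b.value);
        if (desc)
            cmp = -cmp;             /* DESC inverts non-NULL comparisons */
    }

    /* A backward scan walks the same physical order in reverse. */
    return backward ? -cmp : cmp;
}
```

Keeping the NULL branch separate from the DESC inversion mirrors the way the index options are stored as two independent flags per column.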
diff --git a/src/backend/access/nbtree/nbtmergescan.c b/src/backend/access/nbtree/nbtmergescan.c
index eda1e683525..0f1444b49b6 100644
--- a/src/backend/access/nbtree/nbtmergescan.c
+++ b/src/backend/access/nbtree/nbtmergescan.c
@@ -23,6 +23,7 @@
*/
#include "postgres.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relscan.h"
#include "lib/pairingheap.h"
@@ -40,27 +41,43 @@ static int bt_merge_heap_cmp(const pairingheap_node *a,
void *arg);
static bool bt_merge_cursor_init(BTMergeScanState *state,
IndexScanDesc scan,
- BTMergeCursor *cursor,
- Datum prefix_value,
- bool prefix_isnull);
+ BTMergeCursor *cursor);
static bool bt_merge_cursor_advance(BTMergeScanState *state,
IndexScanDesc scan,
BTMergeCursor *cursor);
-static Datum bt_merge_extract_sortkey(BTMergeScanState *state,
- IndexScanDesc scan,
- BTMergeCursor *cursor,
- bool *isnull);
+static IndexTuple bt_merge_get_index_tuple(BTMergeCursor *cursor);
+/*
+ * bt_merge_get_index_tuple
+ * Get the current index tuple from a cursor.
+ *
+ * Returns the IndexTuple pointer from cursor->tuples, or NULL if exhausted.
+ */
+static IndexTuple
+bt_merge_get_index_tuple(BTMergeCursor *cursor)
+{
+ BTScanPosItem *currItem;
+
+ if (cursor->exhausted || cursor->tuples == NULL)
+ return NULL;
+
+ currItem = &cursor->pos.items[cursor->pos.itemIndex];
+ return (IndexTuple) (cursor->tuples + currItem->tupleOffset);
+}
+
/*
* bt_merge_heap_cmp
- * Compare two cursors by their current sort key (suffix value).
+ * Compare two cursors by their current sort key (all suffix columns).
*
- * When sort keys are equal, uses prefix value as tiebreaker for
- * deterministic ordering (ORDER BY suffix, prefix).
+ * Compares all suffix columns in order. When all suffix columns are equal,
+ * uses cursor_id as tiebreaker for deterministic ordering (preserves
+ * original prefix array order).
*
- * Returns positive if a > b (pairingheap is a max-heap, we want min-heap
- * behavior so we invert the comparison).
+ * Returns (pairingheap is a max-heap, so the sign is inverted):
+ * > 0 if a must be returned before b
+ * < 0 if b must be returned before a
+ * 0 if a and b are equal
*/
static int
bt_merge_heap_cmp(const pairingheap_node *a,
@@ -72,41 +89,65 @@ bt_merge_heap_cmp(const pairingheap_node *a,
(pairingheap_node *) a);
BTMergeCursor *cursor_b = pairingheap_container(BTMergeCursor, ph_node,
(pairingheap_node *) b);
- Datum key_a = cursor_a->sort_key;
- Datum key_b = cursor_b->sort_key;
- bool null_a = cursor_a->sort_key_isnull;
- bool null_b = cursor_b->sort_key_isnull;
- int32 cmp;
-
- /* Handle NULLs - NULLs sort last (NULLS LAST default for ASC) */
- if (null_a && null_b)
- return 0;
- if (null_a)
- return -1; /* a is NULL, comes after b */
- if (null_b)
- return 1; /* b is NULL, comes after a */
-
- /* Compare using the suffix column's comparison function */
- cmp = DatumGetInt32(FunctionCall2Coll(&state->suffix_cmp,
- state->suffix_collation,
- key_a, key_b));
-
- /*
- * Use prefix value as tiebreaker for deterministic ordering.
- * This ensures ORDER BY suffix, prefix behavior.
- */
- if (cmp == 0)
+ IndexTuple itup_a;
+ IndexTuple itup_b;
+ int32 cmp = 0;
+ int col;
+
+ /* Get the index tuples from each cursor */
+ itup_a = bt_merge_get_index_tuple(cursor_a);
+ itup_b = bt_merge_get_index_tuple(cursor_b);
+
+ /* Handle exhausted cursors */
+ if (itup_a == NULL && itup_b == NULL)
+ return cursor_b->cursor_id - cursor_a->cursor_id;
+ if (itup_a == NULL)
+ return -1; /* a is exhausted, comes after b */
+ if (itup_b == NULL)
+ return 1; /* b is exhausted, comes after a */
+
+ /* Compare all suffix columns in order */
+ for (col = 0; col < state->index_rel->rd_index->indnkeyatts - state->num_prefix_cols && cmp == 0; col++)
{
- /* Compare prefix values (assumes pass-by-value int4 for now) */
- int32 prefix_a = DatumGetInt32(cursor_a->prefix_value);
- int32 prefix_b = DatumGetInt32(cursor_b->prefix_value);
-
- if (prefix_a < prefix_b)
- cmp = -1;
- else if (prefix_a > prefix_b)
- cmp = 1;
+ int attno = state->num_prefix_cols + col + 1;
+ int16 indoption = state->index_rel->rd_indoption[attno - 1];
+ bool null_a,
+ null_b;
+ Datum key_a,
+ key_b;
+
+ key_a = index_getattr(itup_a, attno, state->index_tupdesc, &null_a);
+ key_b = index_getattr(itup_b, attno, state->index_tupdesc, &null_b);
+
+ /* Handle NULLs: sign folds in NULLS FIRST/LAST and scan direction, already in heap order */
+ if (null_a || null_b)
+ {
+ if (null_a && null_b)
+ continue; /* Both NULL, try next column */
+
+ return (null_a ? -1 : 1)
+ * ((indoption & INDOPTION_NULLS_FIRST) ? -1 : 1)
+ * (state->direction == BackwardScanDirection ? -1 : 1);
+ }
+
+ /* Compare using index's comparison function and collation */
+ cmp = DatumGetInt32(FunctionCall2Coll(index_getprocinfo(state->index_rel, attno, BTORDER_PROC),
+ TupleDescAttr(state->index_tupdesc, attno - 1)->attcollation,
+ key_a, key_b));
+
+ /* For DESC columns, invert to match physical index order */
+ if ((indoption & INDOPTION_DESC))
+ cmp = -cmp;
}
+ /* For backward scan, invert the suffix comparison */
+ if (state->direction == BackwardScanDirection)
+ cmp = -cmp;
+
+ /* Use cursor_id as tiebreaker (always ascending for determinism) */
+ if (cmp == 0)
+ cmp = cursor_a->cursor_id - cursor_b->cursor_id;
+
/* Negate for min-heap behavior */
return -cmp;
}
@@ -116,24 +157,32 @@ bt_merge_heap_cmp(const pairingheap_node *a,
* bt_merge_init
* Initialize a merge scan state.
*
- * Creates the merge state with one cursor per prefix value.
+ * Creates the merge state with one cursor per prefix combination.
* The cursors will be positioned at their first matching tuples
* when bt_merge_getnext is first called.
+ *
+ * Prefix columns are assumed to be 1..num_prefix_cols.
+ * Suffix columns are (num_prefix_cols+1)..indnkeyatts.
+ * Comparison functions are looked up from the index relation.
*/
BTMergeScanState *
bt_merge_init(IndexScanDesc scan,
- Datum *prefix_values,
- bool *prefix_nulls,
- int num_prefixes,
- int prefix_attno,
- int suffix_attno,
- Oid suffix_cmp_oid,
- Oid suffix_collation)
+ Datum **prefix_tuples,
+ bool **prefix_nulls,
+ int num_cursors,
+ int num_prefix_cols)
{
BTMergeScanState *state;
+ Relation rel = scan->indexRelation;
+ TupleDesc tupdesc = RelationGetDescr(rel);
MemoryContext merge_context;
MemoryContext old_context;
int i;
+ int j;
+
+ /* Check there are suffix columns to order by */
+ if (rel->rd_index->indnkeyatts <= num_prefix_cols)
+ return NULL;
/* Create memory context for merge scan allocations */
merge_context = AllocSetContextCreate(CurrentMemoryContext,
@@ -144,33 +193,57 @@ bt_merge_init(IndexScanDesc scan,
/* Allocate main state structure */
state = palloc0(sizeof(BTMergeScanState));
state->merge_context = merge_context;
- state->num_cursors = num_prefixes;
+ state->num_cursors = num_cursors;
state->active_cursors = 0;
- state->prefix_attno = prefix_attno;
- state->suffix_attno = suffix_attno;
- state->suffix_collation = suffix_collation;
+ state->num_prefix_cols = num_prefix_cols;
state->direction = ForwardScanDirection;
state->initialized = false;
state->tuples_accessed = 0;
+ state->index_tupdesc = tupdesc;
- /* Set up suffix comparison function */
- fmgr_info(suffix_cmp_oid, &state->suffix_cmp);
+ /* Store reference to index relation (for cmp funcs, collations, indoption) */
+ state->index_rel = rel;
/* Allocate cursor array */
- state->cursors = palloc0(num_prefixes * sizeof(BTMergeCursor));
+ state->cursors = palloc0(num_cursors * sizeof(BTMergeCursor));
/* Initialize cursor metadata (not positioned yet) */
- for (i = 0; i < num_prefixes; i++)
+ for (i = 0; i < num_cursors; i++)
{
BTMergeCursor *cursor = &state->cursors[i];
+ bool has_null = false;
cursor->cursor_id = i;
- cursor->prefix_value = datumCopy(prefix_values[i], true, sizeof(Datum));
- cursor->prefix_isnull = prefix_nulls[i];
- cursor->exhausted = prefix_nulls[i]; /* NULL prefix = exhausted */
- cursor->sort_key_isnull = true;
+
+ /* Check if any prefix value is NULL */
+ for (j = 0; j < num_prefix_cols; j++)
+ {
+ if (prefix_nulls[i][j])
+ {
+ has_null = true;
+ break;
+ }
+ }
+
+ /* Skip cursors with NULL prefixes - they would match nothing */
+ if (has_null)
+ {
+ cursor->prefix_values = NULL;
+ cursor->exhausted = true;
+ cursor->tuples = NULL;
+ BTScanPosInvalidate(cursor->pos);
+ continue;
+ }
+
+ /* Copy prefix values for this cursor */
+ cursor->prefix_values = palloc(num_prefix_cols * sizeof(Datum));
+ for (j = 0; j < num_prefix_cols; j++)
+ {
+ cursor->prefix_values[j] = datumCopy(prefix_tuples[i][j], true, sizeof(Datum));
+ }
+ cursor->exhausted = false;
BTScanPosInvalidate(cursor->pos);
- /* Allocate tuple workspace for index-only scans */
+ /* Allocate tuple workspace for suffix key extraction */
cursor->tuples = palloc(BLCKSZ);
}
@@ -212,9 +285,7 @@ bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
{
BTMergeCursor *c = &state->cursors[i];
- if (!c->exhausted &&
- bt_merge_cursor_init(state, scan, c,
- c->prefix_value, c->prefix_isnull))
+ if (!c->exhausted && bt_merge_cursor_init(state, scan, c))
{
/* Cursor has at least one tuple, add to heap */
pairingheap_add(state->merge_heap, &c->ph_node);
@@ -303,33 +374,38 @@ bt_merge_end(BTMergeScanState *state)
static bool
bt_merge_cursor_init(BTMergeScanState *state,
IndexScanDesc scan,
- BTMergeCursor *cursor,
- Datum prefix_value,
- bool prefix_isnull)
+ BTMergeCursor *cursor)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool found;
-
- if (prefix_isnull)
- {
- cursor->exhausted = true;
- return false;
- }
+ bool save_want_itup;
+ int col;
/*
- * Modify the scan key to use this cursor's prefix value.
- * We reuse the scan's existing key infrastructure.
+ * Modify the scan keys to use this cursor's prefix values.
+ * We modify scan->keyData (original keys) because _bt_first calls
+ * _bt_preprocess_keys which re-processes scan->keyData into so->keyData.
+ * Prefix columns are 1..num_prefix_cols.
*/
- for (int i = 0; i < so->numberOfKeys; i++)
+ for (col = 0; col < state->num_prefix_cols; col++)
{
- if (so->keyData[i].sk_attno == state->prefix_attno)
+ int attno = col + 1; /* 1-based attribute number */
+
+ for (int i = 0; i < scan->numberOfKeys; i++)
{
- so->keyData[i].sk_argument = prefix_value;
- so->keyData[i].sk_flags &= ~(SK_SEARCHARRAY);
- break;
+ if (scan->keyData[i].sk_attno == attno &&
+ scan->keyData[i].sk_strategy == BTEqualStrategyNumber)
+ {
+ scan->keyData[i].sk_argument = cursor->prefix_values[col];
+ scan->keyData[i].sk_flags &= ~(SK_SEARCHARRAY);
+ break;
+ }
}
}
+ /* Force key re-preprocessing for this cursor's prefix values */
+ so->numberOfKeys = 0;
+
/* Invalidate current position to force _bt_first */
BTScanPosInvalidate(so->currPos);
@@ -342,6 +418,14 @@ bt_merge_cursor_init(BTMergeScanState *state,
so->numArrayKeys = 0;
so->needPrimScan = false;
+ /*
+ * Force tuple data to be copied for suffix key extraction.
+ * This is needed even for regular (non-index-only) scans because
+ * the merge comparison function needs access to the suffix column.
+ */
+ save_want_itup = scan->xs_want_itup;
+ scan->xs_want_itup = true;
+
/* Position at first matching tuple */
found = _bt_first(scan, state->direction);
@@ -351,7 +435,7 @@ bt_merge_cursor_init(BTMergeScanState *state,
memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
/*
- * Copy the tuple data for index-only scans.
+ * Copy the tuple data for suffix key extraction during heap comparison.
* The tuple workspace contains copies of index tuples referenced
* by items in currPos.
*/
@@ -360,12 +444,7 @@ bt_merge_cursor_init(BTMergeScanState *state,
memcpy(cursor->tuples, so->currTuples, so->currPos.nextTupleOffset);
}
- /* Extract the sort key for heap ordering */
- cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
- &cursor->sort_key_isnull);
cursor->exhausted = false;
-
- /* Count this as a tuple access */
state->tuples_accessed++;
/* Invalidate main scan position */
@@ -376,6 +455,9 @@ bt_merge_cursor_init(BTMergeScanState *state,
cursor->exhausted = true;
}
+ /* Restore original setting */
+ scan->xs_want_itup = save_want_itup;
+
return found;
}
@@ -423,28 +505,38 @@ bt_merge_cursor_advance(BTMergeScanState *state,
* call _bt_next, then swap back.
*/
BTScanPosData save_pos;
+ bool save_want_itup;
memcpy(&save_pos, &so->currPos, sizeof(BTScanPosData));
memcpy(&so->currPos, &cursor->pos, sizeof(BTScanPosData));
+ /* Force tuple data to be copied for suffix key extraction */
+ save_want_itup = scan->xs_want_itup;
+ scan->xs_want_itup = true;
+
found = _bt_next(scan, state->direction);
if (found)
+ {
memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+ /*
+ * Copy the new page's tuple data for suffix key extraction.
+ */
+ if (so->currTuples && so->currPos.nextTupleOffset > 0)
+ {
+ memcpy(cursor->tuples, so->currTuples, so->currPos.nextTupleOffset);
+ }
+ }
+
+ /* Restore original setting */
+ scan->xs_want_itup = save_want_itup;
+
memcpy(&so->currPos, &save_pos, sizeof(BTScanPosData));
}
if (found)
{
- /*
- * Don't count here - the advanced-to tuple will be returned later
- * and counted by index_getnext_tid at that time.
- */
-
- /* Extract new sort key */
- cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
- &cursor->sort_key_isnull);
state->tuples_accessed++;
}
else
@@ -454,56 +546,3 @@ bt_merge_cursor_advance(BTMergeScanState *state,
return found;
}
-
-
-/*
- * bt_merge_extract_sortkey
- * Extract the sort key (suffix column value) from the current tuple.
- */
-static Datum
-bt_merge_extract_sortkey(BTMergeScanState *state,
- IndexScanDesc scan,
- BTMergeCursor *cursor,
- bool *isnull)
-{
- Relation rel = scan->indexRelation;
- Buffer buf;
- Page page;
- OffsetNumber offnum;
- ItemId itemid;
- IndexTuple itup;
- TupleDesc tupdesc;
- Datum result;
-
- if (cursor->pos.currPage == InvalidBlockNumber)
- {
- *isnull = true;
- return (Datum) 0;
- }
-
- /* Read the page */
- buf = ReadBuffer(rel, cursor->pos.currPage);
- LockBuffer(buf, BT_READ);
- page = BufferGetPage(buf);
-
- offnum = cursor->pos.items[cursor->pos.itemIndex].indexOffset;
- itemid = PageGetItemId(page, offnum);
- itup = (IndexTuple) PageGetItem(page, itemid);
- tupdesc = RelationGetDescr(rel);
-
- /* Extract the suffix column value */
- result = index_getattr(itup, state->suffix_attno, tupdesc, isnull);
-
- /* Copy pass-by-reference values before releasing buffer */
- if (!*isnull)
- {
- Form_pg_attribute attr = TupleDescAttr(tupdesc, state->suffix_attno - 1);
-
- if (!attr->attbyval)
- result = datumCopy(result, attr->attbyval, attr->attlen);
- }
-
- UnlockReleaseBuffer(buf);
-
- return result;
-}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 0e55c4874b4..ee6b6c6783b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -226,102 +226,197 @@ btinsert(Relation rel, Datum *values, bool *isnull,
return result;
}
+/*
+ * PrefixColConstraint - holds constraint info for one prefix column
+ */
+typedef struct PrefixColConstraint
+{
+ int attno; /* attribute number (1-based) */
+ int num_values; /* number of values (1 for equality, N for IN) */
+ Datum *values; /* array of values */
+ bool *nulls; /* array of null flags */
+} PrefixColConstraint;
+
/*
* bt_init_merge_scan_from_keys
- * Initialize merge scan state from the preprocessed scan keys.
+ * Initialize merge scan state from scan keys with multi-column support.
+ *
+ * Handles multiple prefix columns with equality or IN constraints.
+ * Expands Cartesian product of all prefix combinations.
*
* Returns true if merge scan was successfully initialized.
- * Returns false if merge scan cannot be used (e.g., no suitable array key).
+ * Returns false if merge scan cannot be used.
*/
static bool
bt_init_merge_scan_from_keys(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- TupleDesc itupdesc = RelationGetDescr(rel);
- ScanKey arrayKey = NULL;
- ArrayType *arr;
- Datum *prefix_values;
- bool *prefix_nulls;
- int num_prefixes;
- int prefix_attno;
- int suffix_attno;
- Oid suffix_cmp_oid;
- Oid suffix_collation;
- Oid opfamily;
- Oid elemtype;
- int16 elemlen;
- bool elembyval;
- char elemalign;
+ PrefixColConstraint *constraints;
+ int num_prefix_cols;
+ int total_cursors;
+ Datum **prefix_tuples;
+ bool **prefix_nulls;
int i;
+ int j;
+ int col;
- /* Look for SK_SEARCHARRAY on first column in the raw scan keys */
- for (i = 0; i < scan->numberOfKeys; i++)
+ /*
+ * Find prefix columns: all columns with equality/IN constraints before
+ * the suffix column. For now, assume columns 1..N are prefixes if they
+ * have equality constraints, and column N+1 is the suffix.
+ */
+ num_prefix_cols = 0;
+ for (col = 1; col <= rel->rd_index->indnkeyatts; col++)
{
- ScanKey sk = &scan->keyData[i];
+ bool has_equality = false;
- if ((sk->sk_flags & SK_SEARCHARRAY) &&
- sk->sk_attno == 1 &&
- sk->sk_strategy == BTEqualStrategyNumber)
+ for (i = 0; i < scan->numberOfKeys; i++)
{
- arrayKey = sk;
- break;
+ ScanKey sk = &scan->keyData[i];
+
+ if (sk->sk_attno == col &&
+ sk->sk_strategy == BTEqualStrategyNumber)
+ {
+ has_equality = true;
+ break;
+ }
}
+
+ if (has_equality)
+ num_prefix_cols++;
+ else
+ break; /* First column without equality is suffix */
}
- if (arrayKey == NULL)
+ if (num_prefix_cols == 0)
return false;
- /* Extract array values from the scan key */
- arr = DatumGetArrayTypeP(arrayKey->sk_argument);
- num_prefixes = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
-
- if (num_prefixes < 2)
- return false;
+ /* Allocate constraint array */
+ constraints = palloc0(num_prefix_cols * sizeof(PrefixColConstraint));
- /* Get array element type info */
- elemtype = ARR_ELEMTYPE(arr);
- get_typlenbyvalalign(elemtype, &elemlen, &elembyval, &elemalign);
+ /* Collect constraints for each prefix column */
+ total_cursors = 1;
+ for (col = 0; col < num_prefix_cols; col++)
+ {
+ int attno = col + 1;
+ PrefixColConstraint *c = &constraints[col];
- /* Deconstruct the array into individual elements */
- deconstruct_array(arr, elemtype, elemlen, elembyval, elemalign,
- &prefix_values, &prefix_nulls, &num_prefixes);
+ c->attno = attno;
- /* Attribute numbers (1-based) */
- prefix_attno = 1;
- suffix_attno = 2;
+ /* Look for array or scalar equality on this column */
+ for (i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey sk = &scan->keyData[i];
- /* Get the opfamily from the index */
- opfamily = rel->rd_opfamily[suffix_attno - 1];
+ if (sk->sk_attno == attno &&
+ sk->sk_strategy == BTEqualStrategyNumber)
+ {
+ if (sk->sk_flags & SK_SEARCHARRAY)
+ {
+ /* IN clause - extract array elements */
+ ArrayType *arr = DatumGetArrayTypeP(sk->sk_argument);
+ Oid elemtype = ARR_ELEMTYPE(arr);
+ int16 elemlen;
+ bool elembyval;
+ char elemalign;
+
+ get_typlenbyvalalign(elemtype, &elemlen, &elembyval, &elemalign);
+ deconstruct_array(arr, elemtype, elemlen, elembyval, elemalign,
+ &c->values, &c->nulls, &c->num_values);
+ }
+ else
+ {
+ /* Simple equality - single value */
+ c->num_values = 1;
+ c->values = palloc(sizeof(Datum));
+ c->nulls = palloc(sizeof(bool));
+ c->values[0] = sk->sk_argument;
+ c->nulls[0] = (sk->sk_flags & SK_ISNULL) != 0;
+ }
+ break;
+ }
+ }
- /* Get collation from the suffix column */
- suffix_collation = TupleDescAttr(itupdesc, suffix_attno - 1)->attcollation;
+ if (c->num_values == 0)
+ {
+ /* No constraint found - shouldn't happen */
+ pfree(constraints);
+ return false;
+ }
- /* Get the comparison function OID for the suffix column */
- suffix_cmp_oid = get_opfamily_proc(opfamily,
- TupleDescAttr(itupdesc, suffix_attno - 1)->atttypid,
- TupleDescAttr(itupdesc, suffix_attno - 1)->atttypid,
- BTORDER_PROC);
+ total_cursors *= c->num_values;
+ }
- if (!OidIsValid(suffix_cmp_oid))
+ if (total_cursors < 2)
{
- pfree(prefix_values);
- pfree(prefix_nulls);
+ /* Not enough combinations for merge scan */
+ for (col = 0; col < num_prefix_cols; col++)
+ {
+ if (constraints[col].values)
+ pfree(constraints[col].values);
+ if (constraints[col].nulls)
+ pfree(constraints[col].nulls);
+ }
+ pfree(constraints);
return false;
}
+ /*
+ * Expand Cartesian product of all prefix column values.
+ * Each cursor gets one combination of prefix values.
+ */
+ prefix_tuples = palloc(total_cursors * sizeof(Datum *));
+ prefix_nulls = palloc(total_cursors * sizeof(bool *));
+
+ for (i = 0; i < total_cursors; i++)
+ {
+ int idx = i;
+
+ prefix_tuples[i] = palloc(num_prefix_cols * sizeof(Datum));
+ prefix_nulls[i] = palloc(num_prefix_cols * sizeof(bool));
+
+ /* Compute which value from each column for cursor i */
+ for (j = num_prefix_cols - 1; j >= 0; j--)
+ {
+ int val_idx = idx % constraints[j].num_values;
+
+ prefix_tuples[i][j] = constraints[j].values[val_idx];
+ prefix_nulls[i][j] = constraints[j].nulls[val_idx];
+ idx /= constraints[j].num_values;
+ }
+ }
+
+ /*
+ * Prefix tuples are passed to bt_merge_init in their current order.
+ * The cursor_id assignment preserves this order, which serves as
+ * tiebreaker when suffix values are equal. Future enhancement:
+ * allow executor to sort prefixes by arbitrary expressions.
+ */
+
/* Initialize the merge scan state */
so->mergeState = bt_merge_init(scan,
- prefix_values,
+ prefix_tuples,
prefix_nulls,
- num_prefixes,
- prefix_attno,
- suffix_attno,
- suffix_cmp_oid,
- suffix_collation);
+ total_cursors,
+ num_prefix_cols);
- pfree(prefix_values);
+ /* Cleanup temporary allocations (bt_merge_init copies what it needs) */
+ for (i = 0; i < total_cursors; i++)
+ {
+ pfree(prefix_tuples[i]);
+ pfree(prefix_nulls[i]);
+ }
+ pfree(prefix_tuples);
pfree(prefix_nulls);
+ for (col = 0; col < num_prefix_cols; col++)
+ {
+ if (constraints[col].values)
+ pfree(constraints[col].values);
+ if (constraints[col].nulls)
+ pfree(constraints[col].nulls);
+ }
+ pfree(constraints);
return (so->mergeState != NULL);
}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7bb111688c..1e2c3d5f9fb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -87,6 +87,10 @@ static void show_qual(List *qual, const char *qlabel,
static void show_scan_qual(List *qual, const char *qlabel,
PlanState *planstate, List *ancestors,
ExplainState *es);
+static void show_index_qual_with_prefix(List *suffix_qual, List *prefix_qual,
+ List *default_qual,
+ PlanState *planstate, List *ancestors,
+ ExplainState *es);
static void show_upper_qual(List *qual, const char *qlabel,
PlanState *planstate, List *ancestors,
ExplainState *es);
@@ -1961,35 +1965,47 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
- show_scan_qual(((IndexScan *) plan)->indexqualorig,
- "Index Cond", planstate, ancestors, es);
- if (((IndexScan *) plan)->indexqualorig)
- show_instrumentation_count("Rows Removed by Index Recheck", 2,
- planstate, es);
- show_scan_qual(((IndexScan *) plan)->indexorderbyorig,
- "Order By", planstate, ancestors, es);
- show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
- if (plan->qual)
- show_instrumentation_count("Rows Removed by Filter", 1,
- planstate, es);
- show_indexsearches_info(planstate, es);
+ {
+ IndexScan *iscan = (IndexScan *) plan;
+
+ show_index_qual_with_prefix(iscan->indexqualorig,
+ iscan->indexprefixqual,
+ iscan->indexqualorig,
+ planstate, ancestors, es);
+ if (iscan->indexqualorig)
+ show_instrumentation_count("Rows Removed by Index Recheck", 2,
+ planstate, es);
+ show_scan_qual(iscan->indexorderbyorig,
+ "Order By", planstate, ancestors, es);
+ show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (plan->qual)
+ show_instrumentation_count("Rows Removed by Filter", 1,
+ planstate, es);
+ show_indexsearches_info(planstate, es);
+ }
break;
case T_IndexOnlyScan:
- show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
- "Index Cond", planstate, ancestors, es);
- if (((IndexOnlyScan *) plan)->recheckqual)
- show_instrumentation_count("Rows Removed by Index Recheck", 2,
- planstate, es);
- show_scan_qual(((IndexOnlyScan *) plan)->indexorderby,
- "Order By", planstate, ancestors, es);
- show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
- if (plan->qual)
- show_instrumentation_count("Rows Removed by Filter", 1,
- planstate, es);
- if (es->analyze)
- ExplainPropertyFloat("Heap Fetches", NULL,
- planstate->instrument->ntuples2, 0, es);
- show_indexsearches_info(planstate, es);
+ {
+ IndexOnlyScan *ioscan = (IndexOnlyScan *) plan;
+
+ show_index_qual_with_prefix(ioscan->recheckqual,
+ ioscan->indexprefixqual,
+ ioscan->indexqual,
+ planstate, ancestors, es);
+ if (ioscan->recheckqual)
+ show_instrumentation_count("Rows Removed by Index Recheck", 2,
+ planstate, es);
+ show_scan_qual(ioscan->indexorderby,
+ "Order By", planstate, ancestors, es);
+ show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+ if (plan->qual)
+ show_instrumentation_count("Rows Removed by Filter", 1,
+ planstate, es);
+ if (es->analyze)
+ ExplainPropertyFloat("Heap Fetches", NULL,
+ planstate->instrument->ntuples2, 0, es);
+ show_indexsearches_info(planstate, es);
+ }
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -2555,6 +2571,30 @@ show_scan_qual(List *qual, const char *qlabel,
show_qual(qual, qlabel, planstate, ancestors, useprefix, es);
}
+/*
+ * Show index quals with optional prefix separation for merge scans.
+ *
+ * For merge scans, shows "Index Cond" (suffix_qual) and "Index Prefixes"
+ * (prefix_qual) separately. For regular scans, shows default_qual as
+ * "Index Cond".
+ */
+static void
+show_index_qual_with_prefix(List *suffix_qual, List *prefix_qual,
+ List *default_qual,
+ PlanState *planstate, List *ancestors,
+ ExplainState *es)
+{
+ if (prefix_qual)
+ {
+ show_scan_qual(suffix_qual, "Index Cond", planstate, ancestors, es);
+ show_scan_qual(prefix_qual, "Index Prefixes", planstate, ancestors, es);
+ }
+ else
+ {
+ show_scan_qual(default_qual, "Index Cond", planstate, ancestors, es);
+ }
+}
+
/*
* Show a qualifier expression for an upper-level plan node
*/
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c30d6e84672..1567551f9dd 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -145,6 +145,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexmergescan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 44b79f91335..55d635e9524 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -784,44 +784,17 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
}
/*
- * consider_merge_scan_path
- * Check if this index can provide a merge scan path for queries of the form:
- * WHERE prefix IN (...) AND suffix >= b ORDER BY suffix, prefix LIMIT N
+ * count_equality_values
+ * Count the number of equality values for index clauses on a column.
*
- * Merge scan allows lazily producing output sorted by (suffix, prefix) from
- * an index on (prefix, suffix) by doing a K-way merge of K separate scans.
+ * Returns 1 for simple equality, N for IN-list with N elements, 0 if none.
*/
-static void
-consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
- IndexOptInfo *index, IndexClauseSet *clauses)
+static int
+count_equality_values(List *indexclauses)
{
- IndexPath *ipath;
- List *index_clauses;
- List *index_pathkeys;
- List *merge_pathkeys;
ListCell *lc;
- int num_prefixes = 0;
- int indexcol;
- bool has_saop_on_first = false;
- bool has_clause_on_second = false;
- /* Need at least 2 index columns for merge scan */
- if (index->nkeycolumns < 2)
- return;
-
- /* Index must be ordered and support gettuple */
- if (index->sortopfamily == NULL || !index->amhasgettuple)
- return;
-
- /* Must have query pathkeys with at least 2 elements */
- if (root->query_pathkeys == NIL || list_length(root->query_pathkeys) < 2)
- return;
-
- /*
- * Check for ScalarArrayOpExpr on first column.
- * Count the number of array elements (prefix values).
- */
- foreach(lc, clauses->indexclauses[0])
+ foreach(lc, indexclauses)
{
IndexClause *iclause = (IndexClause *) lfirst(lc);
RestrictInfo *rinfo = iclause->rinfo;
@@ -831,9 +804,6 @@ consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
Node *arrayarg = (Node *) lsecond(saop->args);
- has_saop_on_first = true;
-
- /* Try to determine the number of array elements */
if (IsA(arrayarg, Const))
{
Const *con = (Const *) arrayarg;
@@ -841,61 +811,135 @@ consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
if (!con->constisnull)
{
ArrayType *arr = DatumGetArrayTypeP(con->constvalue);
- num_prefixes = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
+
+ return ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
}
}
else
{
/* Can't determine size, estimate conservatively */
- num_prefixes = 10;
+ return 10;
}
- break;
+ }
+ else if (IsA(rinfo->clause, OpExpr))
+ {
+ /* Simple equality constraint = 1 value */
+ return 1;
}
}
- if (!has_saop_on_first || num_prefixes < 2)
+ return 0;
+}
+
+/*
+ * consider_merge_scan_path
+ * Check if this index can provide a merge scan path for queries with
+ * equality/IN constraints on prefix columns and ORDER BY on suffix.
+ *
+ * Supports multiple prefix columns:
+ * - a = const AND b IN B -> len(B) cursors
+ * - a IN A AND b IN B -> len(A) * len(B) cursors
+ * - a IN A AND b = const -> len(A) cursors
+ *
+ * Merge scan allows lazily producing output sorted by suffix from an
+ * index on (prefixes..., suffix) by doing a K-way merge of K separate scans.
+ */
+static void
+consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
+ IndexOptInfo *index, IndexClauseSet *clauses)
+{
+ IndexPath *ipath;
+ List *index_clauses;
+ List *merge_pathkeys;
+ ListCell *lc;
+ int num_prefixes;
+ int suffix_indexcol;
+ int indexcol;
+ PathKey *query_first_pk;
+ ScanDirection scandirection;
+
+ if (!enable_indexmergescan)
return;
- /* Check if there's any clause on second column */
- if (clauses->indexclauses[1] != NIL)
- has_clause_on_second = true;
+ /* Need at least 2 index columns for merge scan */
+ if (index->nkeycolumns < 2)
+ return;
- if (!has_clause_on_second)
+ /* Index must be ordered and support gettuple */
+ if (index->sortopfamily == NULL || !index->amhasgettuple)
return;
- /*
- * Get the natural index pathkeys (prefix, suffix order).
- * We need at least 2 pathkeys for merge scan to make sense.
- */
- index_pathkeys = build_index_pathkeys(root, index, ForwardScanDirection);
- if (list_length(index_pathkeys) < 2)
+ /* Must have query pathkeys */
+ if (root->query_pathkeys == NIL)
return;
/*
- * Check if query pathkeys are (suffix, prefix) - the REVERSED order.
- * query_pathkeys[0] should match index_pathkeys[1] (suffix)
- * query_pathkeys[1] should match index_pathkeys[0] (prefix)
+ * Find the suffix column: the index column (not the first) that matches
+ * the query's first ORDER BY column. We don't use build_index_pathkeys()
+ * because equality-constrained prefix columns don't produce pathkeys.
+ *
+ * Instead, we directly check each index column's expression against the
+ * query's first pathkey equivalence class.
*/
+ query_first_pk = (PathKey *) linitial(root->query_pathkeys);
+ suffix_indexcol = -1;
+
+ for (indexcol = 1; indexcol < index->nkeycolumns; indexcol++)
{
- PathKey *qpk0 = (PathKey *) linitial(root->query_pathkeys);
- PathKey *qpk1 = (PathKey *) lsecond(root->query_pathkeys);
- PathKey *ipk0 = (PathKey *) linitial(index_pathkeys);
- PathKey *ipk1 = (PathKey *) lsecond(index_pathkeys);
+ TargetEntry *indextle = (TargetEntry *) list_nth(index->indextlist, indexcol);
+ EquivalenceMember *em;
+
+ /* Check if this index column is in the query's first pathkey EC */
+ em = find_ec_member_matching_expr(query_first_pk->pk_eclass,
+ indextle->expr,
+ index->rel->relids);
+ if (em != NULL)
+ {
+ suffix_indexcol = indexcol;
+ break;
+ }
+ }
- /* Query's first pathkey must match index's SECOND pathkey (suffix) */
- if (qpk0->pk_eclass != ipk1->pk_eclass)
- return;
+ if (suffix_indexcol < 1)
+ return; /* No suitable suffix column found */
- /* Query's second pathkey must match index's FIRST pathkey (prefix) */
- if (qpk1->pk_eclass != ipk0->pk_eclass)
- return;
+ /*
+ * Determine scan direction based on query's sort direction and index's
+ * natural order. If both match, use forward; if opposite, use backward.
+ */
+ {
+ bool query_is_desc = (query_first_pk->pk_cmptype == COMPARE_GT);
+ bool index_is_desc = index->reverse_sort[suffix_indexcol];
+
+ if (query_is_desc == index_is_desc)
+ scandirection = ForwardScanDirection;
+ else
+ scandirection = BackwardScanDirection;
}
/*
- * The merge scan can satisfy the query's ORDER BY (suffix, prefix).
- * Use the query's pathkeys directly since we've verified they match.
- * This is critical: PostgreSQL compares pathkeys by pointer equality.
+ * Count prefix combinations: product of equality values for all columns
+ * before the suffix column. Each column must have an equality constraint.
*/
+ num_prefixes = 1;
+ for (indexcol = 0; indexcol < suffix_indexcol; indexcol++)
+ {
+ int col_count = count_equality_values(clauses->indexclauses[indexcol]);
+
+ if (col_count == 0)
+ return; /* Gap in prefix - can't use merge scan */
+
+ num_prefixes *= col_count;
+ }
+
+ if (num_prefixes < 2)
+ return; /* Need at least 2 cursors for merge scan */
+
+ /* Must have a clause on the suffix column */
+ if (clauses->indexclauses[suffix_indexcol] == NIL)
+ return;
+
+ /* Reuse the query's pathkeys: pathkeys are compared by pointer equality */
merge_pathkeys = root->query_pathkeys;
/*
@@ -907,19 +951,20 @@ consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
foreach(lc, clauses->indexclauses[indexcol])
{
IndexClause *iclause = (IndexClause *) lfirst(lc);
+
index_clauses = lappend(index_clauses, iclause);
}
}
/*
- * Create the merge scan path with (suffix, prefix) pathkeys.
+ * Create the merge scan path with the query's pathkeys.
*/
ipath = create_index_path(root, index,
index_clauses,
NIL, /* no ORDER BY expressions */
NIL, /* no ORDER BY columns */
merge_pathkeys,
- ForwardScanDirection,
+ scandirection,
check_index_only(rel, index),
NULL, /* no outer relids */
1.0, /* loop_count */
@@ -927,11 +972,11 @@ consider_merge_scan_path(PlannerInfo *root, RelOptInfo *rel,
/* Enable merge scan with K-way merge */
ipath->num_merge_prefixes = num_prefixes;
+ ipath->suffix_indexcol = suffix_indexcol;
/*
* Adjust costs and row estimate for merge scan.
* Merge scan reads exactly (limit + K - 1) tuples instead of all matching.
- * The row estimate reflects actual tuple accesses, not total matches.
*/
if (root->limit_tuples > 0 && root->limit_tuples < ipath->path.rows)
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 485b4b3e54e..7f7d9c26045 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -181,11 +181,11 @@ static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
TableSampleClause *tsc);
static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
- Oid indexid, List *indexqual, List *indexqualorig,
- List *indexorderby, List *indexorderbyorig,
- List *indexorderbyops,
- int num_merge_prefixes,
- ScanDirection indexscandir);
+ Oid indexid, List *indexqual, List *indexqualorig,
+ List *indexorderby, List *indexorderbyorig,
+ List *indexorderbyops,
+ int num_merge_prefixes,
+ ScanDirection indexscandir);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *recheckqual,
@@ -193,6 +193,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
List *indextlist,
int num_merge_prefixes,
ScanDirection indexscandir);
+static void set_merge_scan_qual_info(Scan *scan_plan, IndexPath *best_path,
+ List *stripped_indexquals, bool indexonly);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -3026,6 +3028,9 @@ create_indexscan_plan(PlannerInfo *root,
best_path->num_merge_prefixes,
best_path->indexscandir);
+ /* For merge scan, separate prefix and suffix quals for EXPLAIN */
+ set_merge_scan_qual_info(scan_plan, best_path, stripped_indexquals, indexonly);
+
copy_generic_path_info(&scan_plan->plan, &best_path->path);
return scan_plan;
@@ -5585,6 +5590,75 @@ make_indexonlyscan(List *qptlist,
return node;
}
+/*
+ * set_merge_scan_qual_info
+ * For merge scan, extract prefix quals for EXPLAIN output.
+ *
+ * Prefix quals are those on index columns before suffix_indexcol.
+ * This separates the equality/IN constraints (prefixes) from the
+ * range constraint (suffix) to make EXPLAIN output clearer.
+ */
+static void
+set_merge_scan_qual_info(Scan *scan_plan, IndexPath *best_path,
+ List *stripped_indexquals, bool indexonly)
+{
+ List *prefix_quals = NIL;
+ List *suffix_quals = NIL;
+ ListCell *lc;
+
+ /* Only process if this is a merge scan */
+ if (best_path->num_merge_prefixes <= 0 || best_path->suffix_indexcol < 0)
+ return;
+
+ /*
+ * Partition quals into prefix (columns before suffix) and suffix.
+ * We match each qual against the IndexClauses to determine which
+ * index column it references.
+ */
+ foreach(lc, stripped_indexquals)
+ {
+ Node *clause = (Node *) lfirst(lc);
+ bool is_prefix = false;
+ ListCell *ic;
+
+ foreach(ic, best_path->indexclauses)
+ {
+ IndexClause *iclause = (IndexClause *) lfirst(ic);
+
+ if (iclause->indexcol < best_path->suffix_indexcol &&
+ equal(clause, iclause->rinfo->clause))
+ {
+ is_prefix = true;
+ break;
+ }
+ }
+
+ if (is_prefix)
+ prefix_quals = lappend(prefix_quals, clause);
+ else
+ suffix_quals = lappend(suffix_quals, clause);
+ }
+
+ /*
+ * Store the separated quals in the plan node. Prefix quals (equality/IN)
+ * are exact matches, so only suffix quals go in recheckqual/indexqualorig.
+ */
+ if (indexonly)
+ {
+ IndexOnlyScan *ios = (IndexOnlyScan *) scan_plan;
+
+ ios->indexprefixqual = prefix_quals;
+ ios->recheckqual = suffix_quals;
+ }
+ else
+ {
+ IndexScan *iscan = (IndexScan *) scan_plan;
+
+ iscan->indexprefixqual = prefix_quals;
+ iscan->indexqualorig = suffix_quals;
+ }
+}
+
static BitmapIndexScan *
make_bitmap_indexscan(Index scanrelid,
Oid indexid,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 21746cd684c..ed5993cb49d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1076,6 +1076,7 @@ create_index_path(PlannerInfo *root,
pathnode->indexscandir = indexscandir;
pathnode->num_merge_prefixes = 0;
+ pathnode->suffix_indexcol = -1;
cost_index(pathnode, root, loop_count, partial_path);
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index f0260e6e412..0678fe5741b 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -877,6 +877,13 @@
boot_val => 'true',
},
+{ name => 'enable_indexmergescan', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
+ short_desc => 'Enables the planner\'s use of index merge-scan plans.',
+ flags => 'GUC_EXPLAIN',
+ variable => 'enable_indexmergescan',
+ boot_val => 'true',
+},
+
{ name => 'enable_indexonlyscan', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
short_desc => 'Enables the planner\'s use of index-only-scan plans.',
flags => 'GUC_EXPLAIN',
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 0d4e7440760..0dff24ac151 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1052,20 +1052,20 @@ typedef struct BTArrayKeyInfo
} BTArrayKeyInfo;
/*
- * BTMergeCursor - tracks scan state for one prefix value in merge scan
+ * BTMergeCursor - tracks scan state for one prefix in merge scan
*
* Each cursor maintains its own position within the index for a specific
- * prefix value. Cursors are organized in a min-heap ordered by their
- * current suffix key value for efficient K-way merge.
+ * combination of prefix values. Cursors are organized in a min-heap ordered
+ * by their current suffix key value for efficient K-way merge.
+ *
+ * Note: cursors with any NULL prefix are marked exhausted (they would match nothing).
+ * The suffix key is extracted on-demand from the tuple data during comparison.
*/
typedef struct BTMergeCursor
{
pairingheap_node ph_node; /* pairing heap node for merge */
int cursor_id; /* index in merge state's cursors array */
- Datum prefix_value; /* the prefix value for this sub-scan */
- bool prefix_isnull; /* is prefix value NULL? */
- Datum sort_key; /* current tuple's sort key (suffix) */
- bool sort_key_isnull;/* is sort key NULL? */
+ Datum *prefix_values; /* array of prefix values for this sub-scan */
bool exhausted; /* no more tuples for this prefix */
BTScanPosData pos; /* current position in index */
char *tuples; /* tuple storage workspace (BLCKSZ) */
@@ -1080,18 +1080,17 @@ typedef struct BTMergeCursor
*/
typedef struct BTMergeScanState
{
- int num_cursors; /* number of prefix values (K) */
+ int num_cursors; /* number of prefix combinations (K) */
int active_cursors; /* cursors not yet exhausted */
BTMergeCursor *cursors; /* array of cursors */
- pairingheap *merge_heap; /* min-heap ordered by sort_key */
- int prefix_attno; /* attribute number of prefix column (1-based) */
- int suffix_attno; /* attribute number of suffix column (1-based) */
- FmgrInfo suffix_cmp; /* comparison function for suffix */
- Oid suffix_collation; /* collation for suffix comparison */
+ pairingheap *merge_heap; /* min-heap ordered by suffix key */
+ int num_prefix_cols;/* number of prefix columns (attno 1..N) */
ScanDirection direction; /* scan direction */
bool initialized; /* have cursors been initialized? */
MemoryContext merge_context;/* memory context for allocations */
int64 tuples_accessed;/* count of index tuples accessed */
+ Relation index_rel; /* index relation (for cmp funcs, indoption) */
+ TupleDesc index_tupdesc; /* index tuple descriptor for suffix extraction */
} BTMergeScanState;
typedef struct BTScanOpaqueData
@@ -1388,13 +1387,10 @@ extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
* prototypes for functions in nbtmergescan.c
*/
extern BTMergeScanState *bt_merge_init(IndexScanDesc scan,
- Datum *prefix_values,
- bool *prefix_nulls,
- int num_prefixes,
- int prefix_attno,
- int suffix_attno,
- Oid suffix_cmp_oid,
- Oid suffix_collation);
+ Datum **prefix_tuples,
+ bool **prefix_nulls,
+ int num_cursors,
+ int num_prefix_cols);
extern bool bt_merge_getnext(IndexScanDesc scan, ScanDirection dir);
extern void bt_merge_end(BTMergeScanState *state);
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index ced7e224a87..d7a40995213 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2041,6 +2041,7 @@ typedef struct IndexPath
Cost indextotalcost;
Selectivity indexselectivity;
int num_merge_prefixes;
+ int suffix_indexcol;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 86d8c92e01f..1725542744f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -599,6 +599,8 @@ typedef struct IndexScan
ScanDirection indexorderdir;
/* Merge scan: K-way merge */
int num_merge_prefixes;
+ /* Merge scan: constraints on prefix columns for EXPLAIN */
+ List *indexprefixqual;
} IndexScan;
/* ----------------
@@ -649,6 +651,8 @@ typedef struct IndexOnlyScan
ScanDirection indexorderdir;
/* Merge scan: K-way merge */
int num_merge_prefixes;
+ /* Merge scan: prefix quals (equality/IN on prefix columns) for EXPLAIN */
+ List *indexprefixqual;
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index f2fd5d31507..a32cac4d0c7 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -52,6 +52,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexmergescan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/test/regress/expected/btree_merge.out b/src/test/regress/expected/btree_merge.out
index 28509b331d7..a1e69e894ab 100644
--- a/src/test/regress/expected/btree_merge.out
+++ b/src/test/regress/expected/btree_merge.out
@@ -82,26 +82,27 @@ SHOW track_counts; -- should be 'on'
on
(1 row)
--- Verify merge scan is used: no Sort node, rows=10 (N + K - 1 = 3 + 8 - 1)
+-- Verify merge scan is used: no Sort node when ORDER BY suffix only
+-- K = 8 prefixes, LIMIT 3 -> reads at most 3 + 8 - 1 = 10 tuples
EXPLAIN (COSTS OFF)
SELECT x, y
FROM btree_merge_test
WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
-ORDER BY y, x
+ORDER BY y
LIMIT 3;
- QUERY PLAN
-------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------
Limit
-> Index Only Scan using btree_merge_test_idx on btree_merge_test
- Index Cond: ((x = ANY ('{1,2,5,8,13,21,34,55}'::integer[])) AND (y >= 19))
-(3 rows)
+ Index Cond: (y >= 19)
+ Index Prefixes: (x = ANY ('{1,2,5,8,13,21,34,55}'::integer[]))
+(4 rows)
--- From the limited query proposition this can be computed with 10
--- tupple accesses.
+-- Verify the query produces correct results (sorted by y)
SELECT x, y
FROM btree_merge_test
WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
-ORDER BY y, x -- sort x to make result unique
+ORDER BY y
LIMIT 3;
x | y
---+----
@@ -125,3 +126,262 @@ WHERE indexrelname = 'btree_merge_test_idx';
(1 row)
DROP TABLE btree_merge_test;
+-- ============================================
+-- Multi-column prefix tests
+-- ============================================
+-- Create a 3-column table for multi-prefix testing
+CREATE TABLE btree_merge_multi AS (
+ SELECT a, b, c FROM
+ generate_series(1, 10) AS a,
+ generate_series(1, 10) AS b,
+ generate_series(1, 20) AS c
+ ORDER BY random()
+);
+CREATE INDEX btree_merge_multi_idx ON btree_merge_multi USING btree (a, b, c);
+ANALYSE btree_merge_multi;
+-- Test 1: a = const AND b IN B -> 3 cursors (just the IN list)
+-- Merge scan triggered, no Sort node when ORDER BY suffix only
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a = 1 AND b IN (1, 2, 3) AND c >= 5
+ORDER BY c
+LIMIT 3;
+ QUERY PLAN
+------------------------------------------------------------------------
+ Limit
+ -> Index Only Scan using btree_merge_multi_idx on btree_merge_multi
+ Index Cond: (c >= 5)
+ Index Prefixes: ((a = 1) AND (b = ANY ('{1,2,3}'::integer[])))
+(4 rows)
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a = 1 AND b IN (1, 2, 3) AND c >= 5
+ORDER BY c
+LIMIT 3;
+ a | b | c
+---+---+---
+ 1 | 1 | 5
+ 1 | 2 | 5
+ 1 | 3 | 5
+(3 rows)
+
+-- Test 2: a IN A AND b IN B -> len(A) * len(B) cursors (Cartesian product)
+-- With a IN (1,2), b IN (1,2,3), ORDER BY c LIMIT 4
+-- Should use 6 cursors (2*3), no Sort node needed
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2) AND b IN (1, 2, 3) AND c >= 10
+ORDER BY c
+LIMIT 4;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------
+ Limit
+ -> Index Only Scan using btree_merge_multi_idx on btree_merge_multi
+ Index Cond: (c >= 10)
+ Index Prefixes: ((a = ANY ('{1,2}'::integer[])) AND (b = ANY ('{1,2,3}'::integer[])))
+(4 rows)
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2) AND b IN (1, 2, 3) AND c >= 10
+ORDER BY c
+LIMIT 4;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+ 1 | 2 | 10
+ 1 | 3 | 10
+ 2 | 1 | 10
+(4 rows)
+
+-- Test 3: a IN A AND b = const -> len(A) cursors
+-- With a IN (1,2,3,4), b=5, ORDER BY c LIMIT 2
+-- Should use 4 cursors, no Sort node needed
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3, 4) AND b = 5 AND c >= 8
+ORDER BY c
+LIMIT 2;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Limit
+ -> Index Only Scan using btree_merge_multi_idx on btree_merge_multi
+ Index Cond: (c >= 8)
+ Index Prefixes: ((a = ANY ('{1,2,3,4}'::integer[])) AND (b = 5))
+(4 rows)
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3, 4) AND b = 5 AND c >= 8
+ORDER BY c
+LIMIT 2;
+ a | b | c
+---+---+---
+ 1 | 5 | 8
+ 2 | 5 | 8
+(2 rows)
+
+-- Test 4: Backward scan direction (ORDER BY DESC)
+-- With a IN (1,2,3), b IN (1,2), ORDER BY c DESC LIMIT 3
+-- Should use 6 cursors (3*2), no Sort node needed
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b IN (1, 2) AND c <= 15
+ORDER BY c DESC
+LIMIT 3;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------
+ Limit
+ -> Index Only Scan Backward using btree_merge_multi_idx on btree_merge_multi
+ Index Cond: (c <= 15)
+ Index Prefixes: ((a = ANY ('{1,2,3}'::integer[])) AND (b = ANY ('{1,2}'::integer[])))
+(4 rows)
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b IN (1, 2) AND c <= 15
+ORDER BY c DESC
+LIMIT 3;
+ a | b | c
+---+---+----
+ 1 | 1 | 15
+ 1 | 2 | 15
+ 2 | 1 | 15
+(3 rows)
+
+-- =================================================================
+-- Multi-column suffix tests
+-- Index is on (a, b, c), testing with prefix on 'a' only
+-- =================================================================
+-- Test 5: ORDER BY b (single column suffix)
+-- With a IN (1,2,3), ORDER BY b LIMIT 6
+-- Prefix: a, Suffix: b
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b >= 1
+ORDER BY b
+LIMIT 6;
+ QUERY PLAN
+------------------------------------------------------------------------
+ Limit
+ -> Index Only Scan using btree_merge_multi_idx on btree_merge_multi
+ Index Cond: (b >= 1)
+ Index Prefixes: (a = ANY ('{1,2,3}'::integer[]))
+(4 rows)
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b >= 1
+ORDER BY b
+LIMIT 6;
+ a | b | c
+---+---+---
+ 1 | 1 | 1
+ 2 | 1 | 1
+ 3 | 1 | 1
+ 1 | 1 | 2
+ 2 | 1 | 2
+ 3 | 1 | 2
+(6 rows)
+
+-- Test 6: ORDER BY b DESC (single column suffix, backward)
+-- With a IN (1,2,3), ORDER BY b DESC LIMIT 6
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b <= 10
+ORDER BY b DESC
+LIMIT 6;
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Limit
+ -> Index Only Scan Backward using btree_merge_multi_idx on btree_merge_multi
+ Index Cond: (b <= 10)
+ Index Prefixes: (a = ANY ('{1,2,3}'::integer[]))
+(4 rows)
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b <= 10
+ORDER BY b DESC
+LIMIT 6;
+ a | b | c
+---+----+----
+ 1 | 10 | 20
+ 2 | 10 | 20
+ 3 | 10 | 20
+ 1 | 10 | 19
+ 2 | 10 | 19
+ 3 | 10 | 19
+(6 rows)
+
+-- Test 7: ORDER BY b, c (multi-column suffix)
+-- With a IN (1,2,3), ORDER BY b, c LIMIT 6
+-- Prefix: a, Suffix: (b, c)
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b >= 1
+ORDER BY b, c
+LIMIT 6;
+ QUERY PLAN
+------------------------------------------------------------------------
+ Limit
+ -> Index Only Scan using btree_merge_multi_idx on btree_merge_multi
+ Index Cond: (b >= 1)
+ Index Prefixes: (a = ANY ('{1,2,3}'::integer[]))
+(4 rows)
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b >= 1
+ORDER BY b, c
+LIMIT 6;
+ a | b | c
+---+---+---
+ 1 | 1 | 1
+ 2 | 1 | 1
+ 3 | 1 | 1
+ 1 | 1 | 2
+ 2 | 1 | 2
+ 3 | 1 | 2
+(6 rows)
+
+-- Test 8: ORDER BY b DESC, c DESC (multi-column suffix, backward)
+-- With a IN (1,2,3), ORDER BY b DESC, c DESC LIMIT 6
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b <= 10
+ORDER BY b DESC, c DESC
+LIMIT 6;
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Limit
+ -> Index Only Scan Backward using btree_merge_multi_idx on btree_merge_multi
+ Index Cond: (b <= 10)
+ Index Prefixes: (a = ANY ('{1,2,3}'::integer[]))
+(4 rows)
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b <= 10
+ORDER BY b DESC, c DESC
+LIMIT 6;
+ a | b | c
+---+----+----
+ 1 | 10 | 20
+ 2 | 10 | 20
+ 3 | 10 | 20
+ 1 | 10 | 19
+ 2 | 10 | 19
+ 3 | 10 | 19
+(6 rows)
+
+DROP TABLE btree_merge_multi;
diff --git a/src/test/regress/sql/btree_merge.sql b/src/test/regress/sql/btree_merge.sql
index ad9cf03f869..792159b0c17 100644
--- a/src/test/regress/sql/btree_merge.sql
+++ b/src/test/regress/sql/btree_merge.sql
@@ -82,20 +82,20 @@ SET enable_seqscan = OFF;
SET enable_bitmapscan = OFF;
SHOW track_counts; -- should be 'on'
--- Verify merge scan is used: no Sort node, rows=10 (N + K - 1 = 3 + 8 - 1)
+-- Verify merge scan is used: no Sort node when ORDER BY suffix only
+-- K = 8 prefixes, LIMIT 3 -> reads at most 3 + 8 - 1 = 10 tuples
EXPLAIN (COSTS OFF)
SELECT x, y
FROM btree_merge_test
WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
-ORDER BY y, x
+ORDER BY y
LIMIT 3;
--- From the limited query proposition this can be computed with 10
--- tupple accesses.
+-- Verify the query produces correct results (sorted by y)
SELECT x, y
FROM btree_merge_test
WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
-ORDER BY y, x -- sort x to make result unique
+ORDER BY y
LIMIT 3;
@@ -106,4 +106,151 @@ SELECT idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexrelname = 'btree_merge_test_idx';
-DROP TABLE btree_merge_test;
\ No newline at end of file
+DROP TABLE btree_merge_test;
+
+-- ============================================
+-- Multi-column prefix tests
+-- ============================================
+
+-- Create a 3-column table for multi-prefix testing
+CREATE TABLE btree_merge_multi AS (
+ SELECT a, b, c FROM
+ generate_series(1, 10) AS a,
+ generate_series(1, 10) AS b,
+ generate_series(1, 20) AS c
+ ORDER BY random()
+);
+CREATE INDEX btree_merge_multi_idx ON btree_merge_multi USING btree (a, b, c);
+ANALYSE btree_merge_multi;
+
+-- Test 1: a = const AND b IN B -> 3 cursors (just the IN list)
+-- Merge scan triggered, no Sort node when ORDER BY suffix only
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a = 1 AND b IN (1, 2, 3) AND c >= 5
+ORDER BY c
+LIMIT 3;
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a = 1 AND b IN (1, 2, 3) AND c >= 5
+ORDER BY c
+LIMIT 3;
+
+-- Test 2: a IN A AND b IN B -> len(A) * len(B) cursors (Cartesian product)
+-- With a IN (1,2), b IN (1,2,3), ORDER BY c LIMIT 4
+-- Should use 6 cursors (2*3), no Sort node needed
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2) AND b IN (1, 2, 3) AND c >= 10
+ORDER BY c
+LIMIT 4;
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2) AND b IN (1, 2, 3) AND c >= 10
+ORDER BY c
+LIMIT 4;
+
+-- Test 3: a IN A AND b = const -> len(A) cursors
+-- With a IN (1,2,3,4), b=5, ORDER BY c LIMIT 2
+-- Should use 4 cursors, no Sort node needed
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3, 4) AND b = 5 AND c >= 8
+ORDER BY c
+LIMIT 2;
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3, 4) AND b = 5 AND c >= 8
+ORDER BY c
+LIMIT 2;
+
+-- Test 4: Backward scan direction (ORDER BY DESC)
+-- With a IN (1,2,3), b IN (1,2), ORDER BY c DESC LIMIT 3
+-- Should use 6 cursors (3*2), no Sort node needed
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b IN (1, 2) AND c <= 15
+ORDER BY c DESC
+LIMIT 3;
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b IN (1, 2) AND c <= 15
+ORDER BY c DESC
+LIMIT 3;
+
+-- =================================================================
+-- Multi-column suffix tests
+-- Index is on (a, b, c), testing with prefix on 'a' only
+-- =================================================================
+
+-- Test 5: ORDER BY b (single column suffix)
+-- With a IN (1,2,3), ORDER BY b LIMIT 6
+-- Prefix: a, Suffix: b
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b >= 1
+ORDER BY b
+LIMIT 6;
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b >= 1
+ORDER BY b
+LIMIT 6;
+
+-- Test 6: ORDER BY b DESC (single column suffix, backward)
+-- With a IN (1,2,3), ORDER BY b DESC LIMIT 6
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b <= 10
+ORDER BY b DESC
+LIMIT 6;
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b <= 10
+ORDER BY b DESC
+LIMIT 6;
+
+-- Test 7: ORDER BY b, c (multi-column suffix)
+-- With a IN (1,2,3), ORDER BY b, c LIMIT 6
+-- Prefix: a, Suffix: (b, c)
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b >= 1
+ORDER BY b, c
+LIMIT 6;
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b >= 1
+ORDER BY b, c
+LIMIT 6;
+
+-- Test 8: ORDER BY b DESC, c DESC (multi-column suffix, backward)
+-- With a IN (1,2,3), ORDER BY b DESC, c DESC LIMIT 6
+EXPLAIN (COSTS OFF)
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b <= 10
+ORDER BY b DESC, c DESC
+LIMIT 6;
+
+SELECT a, b, c
+FROM btree_merge_multi
+WHERE a IN (1, 2, 3) AND b <= 10
+ORDER BY b DESC, c DESC
+LIMIT 6;
+
+DROP TABLE btree_merge_multi;
\ No newline at end of file
--
2.40.0
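Editorial sketch (not part of any attachment): the multi-column prefix tests above expect one cursor per combination of prefix values, e.g. "a IN (1,2) AND b IN (1,2,3)" gives 2*3 = 6 cursors. The fragment below illustrates that enumeration as a plain Cartesian product over integer lists. The names SketchInList and sketch_prefix_combinations are hypothetical and do not appear in the patch; real prefix values would be Datums, and the patch may build its cursors differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical: one IN-list (or a single equality constant) per prefix column. */
typedef struct SketchInList
{
    const int  *vals;   /* values allowed for this prefix column */
    size_t      nvals;  /* number of values; 1 for "col = const" */
} SketchInList;

/*
 * Build the Cartesian product of the per-column lists.
 *
 * Returns a malloc'd row-major array of *ncombos rows with ncols ints each;
 * the caller frees it.  Returns NULL if the product is empty or allocation
 * fails.  Overflow of the product is not checked in this sketch.
 */
static int *
sketch_prefix_combinations(const SketchInList *lists, size_t ncols,
                           size_t *ncombos)
{
    size_t      total = 1;
    size_t      i;
    size_t      row;
    int        *out;

    assert(lists != NULL && ncombos != NULL && ncols > 0);

    for (i = 0; i < ncols; i++)
        total *= lists[i].nvals;
    *ncombos = total;
    if (total == 0)
        return NULL;

    out = malloc(total * ncols * sizeof(int));
    if (out == NULL)
    {
        *ncombos = 0;
        return NULL;
    }

    for (row = 0; row < total; row++)
    {
        size_t      rem = row;

        /* Decode row as a mixed-radix number; the last column varies fastest. */
        for (i = ncols; i-- > 0;)
        {
            out[row * ncols + i] = lists[i].vals[rem % lists[i].nvals];
            rem /= lists[i].nvals;
        }
    }
    return out;
}
```

With lists {1,2} and {1,2,3} this yields the rows (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), which is also the order in which ties on c are broken in the expected output of Test 2.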
[application/octet-stream] 0001-MERGE-SCAN-Test-the-baseline.patch (7.5K, 5-0001-MERGE-SCAN-Test-the-baseline.patch)
From 6dc67b16668edc64dd820c5a313c849cd47da6c3 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 30 Jan 2026 08:35:15 +0000
Subject: [PATCH 1/4] [MERGE-SCAN]: Test the baseline
---
src/test/regress/expected/btree_merge.out | 113 ++++++++++++++++++++++
src/test/regress/sql/btree_merge.sql | 100 +++++++++++++++++++
2 files changed, 213 insertions(+)
create mode 100644 src/test/regress/expected/btree_merge.out
create mode 100644 src/test/regress/sql/btree_merge.sql
diff --git a/src/test/regress/expected/btree_merge.out b/src/test/regress/expected/btree_merge.out
new file mode 100644
index 00000000000..441ae1d0657
--- /dev/null
+++ b/src/test/regress/expected/btree_merge.out
@@ -0,0 +1,113 @@
+-- B-Tree Merge Scan Access Method Test
+--
+-- B-Tree Merge Scan is an access method that allows lazily producing
+-- output sorted by a non-leading column when the prefix has few distinct values.
+--
+--
+-- Let S be an infinite set of lattice points (x,y).
+-- Let S(x=a,y>=b) be the sequence of points
+-- SELECT * FROM S WHERE x = a and y >= b ORDER BY y;
+-- i.e. (a, b), (a, b+1), (a, b+2), ...
+-- Similarly, let S(x IN X, y>=b) be the sequence of points
+-- SELECT * FROM S WHERE x IN X and y >= b ORDER BY y, x;
+-- i.e. (x[1], b), ..., (x[n], b), (x[1], b+1), ...
+-- The output of S(x IN X, y >= b) can be computed as a K-way merge:
+--
+-- Proposition (uncomputable):
+-- S(x IN X, y >= b) is the K-way merge of the sequences
+-- {S(x=x[i], y >= b), x[i] in X}
+--
+--
+--
+-- Proposition (computable): Bounded suffix
+--
+-- S(x, IN X, b1 <= y <= b2) as bounded
+-- can be computed with (SELECT count(distinct x) + count(1) FROM bounded)
+-- tuple accesses.
+-- (Constructive) Proof:
+-- The result of
+-- SELECT * FROM X
+-- JOIN S on x = x[i] WHERE y BETWEEN b1 AND b2;
+-- is the same as
+-- SELECT * FROM X,
+-- LATERAL (
+-- (SELECT * FROM S
+-- WHERE x = x[i] AND y BETWEEN b1 AND b2
+-- ) AS subscan[i]
+-- ) as merged
+--
+-- Each subscan[i] is covered by a single range in the index and
+-- requires at most
+-- (count(1) FROM subscan[i]) + 1 -- subscan tuple access count
+-- tuples to be accessed.
+-- The merged result can be computed using a K-way merge sort
+-- whose number of rows is
+-- sum(count(1) FROM subscan[i]) -- query output rows
+-- Q.E.D.
+--
+--
+-- Proposition (computable): Limited query
+-- The query
+-- S(x IN X, y >= b) LIMIT N as limited
+-- can be computed with at most
+-- N + count(distinct X) - 1
+-- tuple accesses.
+--
+-- (Constructive) Proof:
+-- If an upper bound `u` for `MAX(y IN S(x IN X, y >= b) LIMIT N)` is known,
+-- then the query can be rewritten as
+-- S(x IN X, b <= y <= u) LIMIT N
+-- The K-way merge can produce the next element as soon as it has fetched
+-- the next element for each subquery
+-- 1 row can be produced after count(distinct X) fetches,
+-- After that it can produce one new row for each fetch.
+-- Thus, the total number of fetches is at most
+-- N + count(distinct X) - 1
+-- Q.E.D.
+-- Generate a table with lattice points
+-- Could be infinite
+CREATE TABLE btree_merge_test AS (
+ SELECT x, y FROM
+ generate_series(1, 50) AS x,
+ generate_series(1, 50) AS y
+ ORDER BY random()
+);
+CREATE INDEX btree_merge_test_idx ON btree_merge_test USING btree (x, y);
+ANALYSE btree_merge_test;
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+SHOW track_counts; -- should be 'on'
+ track_counts
+--------------
+ on
+(1 row)
+
+-- From the limited query proposition this can be computed with 10
+-- tupple accesses.
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x -- sort x to make result unique
+LIMIT 3;
+ x | y
+---+----
+ 1 | 19
+ 2 | 19
+ 5 | 19
+(3 rows)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT idx_scan, idx_tup_read, idx_tup_fetch
+FROM pg_stat_user_indexes
+WHERE indexrelname = 'btree_merge_test_idx';
+ idx_scan | idx_tup_read | idx_tup_fetch
+----------+--------------+---------------
+ 5 | 10 | 10
+(1 row)
+
+DROP TABLE btree_merge_test;
diff --git a/src/test/regress/sql/btree_merge.sql b/src/test/regress/sql/btree_merge.sql
new file mode 100644
index 00000000000..be00c33c2a5
--- /dev/null
+++ b/src/test/regress/sql/btree_merge.sql
@@ -0,0 +1,100 @@
+-- B-Tree Merge Scan Access Method Test
+--
+-- B-Tree Merge Scan is an access method that allows lazily producing
+-- output sorted by a non-leading column when the prefix has few distinct values.
+--
+--
+-- Let S be an infinite set of lattice points (x,y).
+-- Let S(x=a,y>=b) be the sequence of points
+-- SELECT * FROM S WHERE x = a and y >= b ORDER BY y;
+-- i.e. (a, b), (a, b+1), (a, b+2), ...
+-- Similarly, let S(x IN X, y>=b) be the sequence of points
+-- SELECT * FROM S WHERE x IN X and y >= b ORDER BY y, x;
+-- i.e. (x[1], b), ..., (x[n], b), (x[1], b+1), ...
+-- The output of S(x IN X, y >= b) can be computed as a K-way merge:
+--
+-- Proposition (uncomputable):
+-- S(x IN X, y >= b) is the K-way merge of the sequences
+-- {S(x=x[i], y >= b), x[i] in X}
+--
+--
+--
+-- Proposition (computable): Bounded suffix
+--
+-- S(x, IN X, b1 <= y <= b2) as bounded
+-- can be computed with (SELECT count(distinct x) + count(1) FROM bounded)
+-- tuple accesses.
+-- (Constructive) Proof:
+-- The result of
+-- SELECT * FROM X
+-- JOIN S on x = x[i] WHERE y BETWEEN b1 AND b2;
+-- is the same as
+-- SELECT * FROM X,
+-- LATERAL (
+-- (SELECT * FROM S
+-- WHERE x = x[i] AND y BETWEEN b1 AND b2
+-- ) AS subscan[i]
+-- ) as merged
+--
+-- Each subscan[i] is covered by a single range in the index and
+-- requires at most
+-- (count(1) FROM subscan[i]) + 1 -- subscan tuple access count
+-- tuples to be accessed.
+-- The merged result can be computed using a K-way merge sort
+-- whose number of rows is
+-- sum(count(1) FROM subscan[i]) -- query output rows
+-- Q.E.D.
+--
+--
+-- Proposition (computable): Limited query
+-- The query
+-- S(x IN X, y >= b) LIMIT N as limited
+-- can be computed with at most
+-- N + count(distinct X) - 1
+-- tuple accesses.
+--
+-- (Constructive) Proof:
+-- If an upper bound `u` for `MAX(y IN S(x IN X, y >= b) LIMIT N)` is known,
+-- then the query can be rewritten as
+-- S(x IN X, b <= y <= u) LIMIT N
+-- The K-way merge can produce the next element as soon as it has fetched
+-- the next element for each subquery
+-- 1 row can be produced after count(distinct X) fetches,
+-- After that it can produce one new row for each fetch.
+-- Thus, the total number of fetches is at most
+-- N + count(distinct X) - 1
+-- Q.E.D.
+
+
+-- Generate a table with lattice points
+-- Could be infinite
+CREATE TABLE btree_merge_test AS (
+ SELECT x, y FROM
+ generate_series(1, 50) AS x,
+ generate_series(1, 50) AS y
+ ORDER BY random()
+);
+CREATE INDEX btree_merge_test_idx ON btree_merge_test USING btree (x, y);
+
+ANALYSE btree_merge_test;
+
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+SHOW track_counts; -- should be 'on'
+-- From the limited query proposition this can be computed with 10
+-- tupple accesses.
+SELECT x, y
+FROM btree_merge_test
+WHERE x IN (1,2,5,8,13,21,34,55) AND y >= 19
+ORDER BY y, x -- sort x to make result unique
+LIMIT 3;
+
+
+SELECT pg_stat_force_next_flush();
+
+
+SELECT idx_scan, idx_tup_read, idx_tup_fetch
+FROM pg_stat_user_indexes
+WHERE indexrelname = 'btree_merge_test_idx';
+
+DROP TABLE btree_merge_test;
\ No newline at end of file
--
2.40.0
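Editorial sketch (not part of any attachment): the "limited query" proposition in the test header above claims that LIMIT N over K prefixes needs at most N + K - 1 tuple accesses. The fragment below illustrates that counting argument on plain sorted integer arrays. It is a simplification under stated assumptions: the names SketchCursor, SketchMerge and sketch_merge_next are hypothetical, the minimum is found by a linear scan rather than a pairing heap, only successful fetches are counted, and a cursor is advanced lazily at the start of the next call so that no fetch is spent after the last returned row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sub-scan: suffix values for one prefix. */
typedef struct SketchCursor
{
    const int  *vals;       /* suffix values, ascending */
    size_t      len;        /* number of values */
    size_t      pos;        /* index of the current value */
    bool        loaded;     /* has a first value been fetched? */
    bool        exhausted;  /* no more values for this prefix */
} SketchCursor;

typedef struct SketchMerge
{
    SketchCursor *cursors;
    size_t      ncursors;
    int         pending;    /* cursor to advance before next output, or -1 */
    bool        started;
    long        fetches;    /* simulated tuple accesses */
} SketchMerge;

/* Move a cursor to its next value; a successful move counts as one access. */
static void
sketch_fetch(SketchMerge *m, SketchCursor *c)
{
    if (c->loaded)
        c->pos++;
    c->loaded = true;
    if (c->pos >= c->len)
        c->exhausted = true;
    else
        m->fetches++;
}

static void
sketch_merge_init(SketchMerge *m, SketchCursor *cursors, size_t ncursors)
{
    size_t      i;

    m->cursors = cursors;
    m->ncursors = ncursors;
    m->pending = -1;
    m->started = false;
    m->fetches = 0;
    for (i = 0; i < ncursors; i++)
    {
        cursors[i].pos = 0;
        cursors[i].loaded = false;
        cursors[i].exhausted = false;
    }
}

/* Store the next smallest value in *out; return false when all are exhausted. */
static bool
sketch_merge_next(SketchMerge *m, int *out)
{
    int         best = -1;
    size_t      i;

    assert(m != NULL && out != NULL);

    if (!m->started)
    {
        /* First call: position every cursor (K fetches at most). */
        for (i = 0; i < m->ncursors; i++)
            sketch_fetch(m, &m->cursors[i]);
        m->started = true;
    }
    else if (m->pending >= 0)
    {
        /* Later calls: advance only the cursor returned last time. */
        sketch_fetch(m, &m->cursors[m->pending]);
    }

    /* Strict '<' keeps the lowest cursor index on ties (prefix order). */
    for (i = 0; i < m->ncursors; i++)
    {
        SketchCursor *c = &m->cursors[i];

        if (c->exhausted)
            continue;
        if (best < 0 ||
            c->vals[c->pos] < m->cursors[best].vals[m->cursors[best].pos])
            best = (int) i;
    }

    m->pending = best;
    if (best < 0)
        return false;
    *out = m->cursors[best].vals[m->cursors[best].pos];
    return true;
}
```

Calling sketch_merge_next() N times on K non-empty cursors performs K fetches on the first call and one fetch on each later call, i.e. N + K - 1 in total, which is the bound stated in the proposition.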
[application/octet-stream] 0002-MERGE-SCAN-Access-method.patch (49.1K, 6-0002-MERGE-SCAN-Access-method.patch)
From d86b371499db011a36583d20963df68b09219190 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 30 Jan 2026 14:27:18 +0000
Subject: [PATCH 2/4] [MERGE-SCAN]: Access method
---
.gitignore | 8 +
src/backend/access/nbtree/Makefile | 1 +
src/backend/access/nbtree/meson.build | 1 +
src/backend/access/nbtree/nbtmergescan.c | 457 ++++++++++++++++++
src/include/access/nbtree.h | 64 +++
src/test/modules/meson.build | 1 +
src/test/modules/test_btree_merge/Makefile | 24 +
.../expected/test_btree_merge.out | 243 ++++++++++
src/test/modules/test_btree_merge/meson.build | 33 ++
.../test_btree_merge/sql/test_btree_merge.sql | 207 ++++++++
.../test_btree_merge--1.0.sql | 43 ++
.../test_btree_merge/test_btree_merge.c | 389 +++++++++++++++
.../test_btree_merge/test_btree_merge.control | 5 +
13 files changed, 1476 insertions(+)
create mode 100644 src/backend/access/nbtree/nbtmergescan.c
create mode 100644 src/test/modules/test_btree_merge/Makefile
create mode 100644 src/test/modules/test_btree_merge/expected/test_btree_merge.out
create mode 100644 src/test/modules/test_btree_merge/meson.build
create mode 100644 src/test/modules/test_btree_merge/sql/test_btree_merge.sql
create mode 100644 src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
create mode 100644 src/test/modules/test_btree_merge/test_btree_merge.c
create mode 100644 src/test/modules/test_btree_merge/test_btree_merge.control
diff --git a/.gitignore b/.gitignore
index 4e911395fe3..ac1f95d9cf0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,3 +43,11 @@ lib*.pc
/Release/
/tmp_install/
/portlock/
+
+# hidden files (e.g. .dbdata, .install, good practice to test locally in isolation)
+.*
+
+# Test output
+**/regression.diffs
+**/regression.out
+**/results/
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index 0daf640af96..72053cefdaa 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -16,6 +16,7 @@ OBJS = \
nbtcompare.o \
nbtdedup.o \
nbtinsert.o \
+ nbtmergescan.o \
nbtpage.o \
nbtpreprocesskeys.o \
nbtreadpage.o \
diff --git a/src/backend/access/nbtree/meson.build b/src/backend/access/nbtree/meson.build
index 812f067e710..1016fea62d5 100644
--- a/src/backend/access/nbtree/meson.build
+++ b/src/backend/access/nbtree/meson.build
@@ -4,6 +4,7 @@ backend_sources += files(
'nbtcompare.c',
'nbtdedup.c',
'nbtinsert.c',
+ 'nbtmergescan.c',
'nbtpage.c',
'nbtpreprocesskeys.c',
'nbtreadpage.c',
diff --git a/src/backend/access/nbtree/nbtmergescan.c b/src/backend/access/nbtree/nbtmergescan.c
new file mode 100644
index 00000000000..70828dc73d3
--- /dev/null
+++ b/src/backend/access/nbtree/nbtmergescan.c
@@ -0,0 +1,457 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtmergescan.c
+ * B-Tree merge scan for efficient evaluation of IN-list queries
+ *
+ * This module implements a K-way merge scan for B-tree indexes, optimized
+ * for queries of the form:
+ * WHERE prefix IN (v1, v2, ..., vK) AND suffix >= b ORDER BY suffix LIMIT N
+ *
+ * The algorithm maintains a min-heap of cursors, one per prefix value.
+ * Each cursor tracks its position within the index for that prefix.
+ * Tuples are returned in suffix order by repeatedly extracting the
+ * minimum from the heap.
+ *
+ * Target behavior: Access at most N + K - 1 index tuples for LIMIT N.
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtmergescan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "lib/pairingheap.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/* Forward declarations of static functions */
+static int bt_merge_heap_cmp(const pairingheap_node *a,
+ const pairingheap_node *b,
+ void *arg);
+static bool bt_merge_cursor_init(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ Datum prefix_value,
+ bool prefix_isnull);
+static bool bt_merge_cursor_advance(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor);
+static Datum bt_merge_extract_sortkey(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ bool *isnull);
+
+
+/*
+ * bt_merge_heap_cmp
+ * Compare two cursors by their current sort key (suffix value).
+ *
+ * When sort keys are equal, uses prefix value as tiebreaker for
+ * deterministic ordering (ORDER BY suffix, prefix).
+ *
+ * Returns a negative value if a sorts after b: pairingheap is a max-heap,
+ * so the comparison is inverted to get min-heap behavior.
+ */
+static int
+bt_merge_heap_cmp(const pairingheap_node *a,
+ const pairingheap_node *b,
+ void *arg)
+{
+ BTMergeScanState *state = (BTMergeScanState *) arg;
+ BTMergeCursor *cursor_a = pairingheap_container(BTMergeCursor, ph_node,
+ (pairingheap_node *) a);
+ BTMergeCursor *cursor_b = pairingheap_container(BTMergeCursor, ph_node,
+ (pairingheap_node *) b);
+ Datum key_a = cursor_a->sort_key;
+ Datum key_b = cursor_b->sort_key;
+ bool null_a = cursor_a->sort_key_isnull;
+ bool null_b = cursor_b->sort_key_isnull;
+ int32 cmp;
+
+ /* Handle NULLs - NULLs sort last (NULLS LAST default for ASC) */
+ if (null_a && null_b)
+ return 0;
+ if (null_a)
+ return -1; /* a is NULL, comes after b */
+ if (null_b)
+ return 1; /* b is NULL, comes after a */
+
+ /* Compare using the suffix column's comparison function */
+ cmp = DatumGetInt32(FunctionCall2Coll(&state->suffix_cmp,
+ state->suffix_collation,
+ key_a, key_b));
+
+ /*
+ * Use prefix value as tiebreaker for deterministic ordering.
+ * This ensures ORDER BY suffix, prefix behavior.
+ */
+ if (cmp == 0)
+ {
+ /* Compare prefix values (assumes pass-by-value int4 for now) */
+ int32 prefix_a = DatumGetInt32(cursor_a->prefix_value);
+ int32 prefix_b = DatumGetInt32(cursor_b->prefix_value);
+
+ if (prefix_a < prefix_b)
+ cmp = -1;
+ else if (prefix_a > prefix_b)
+ cmp = 1;
+ }
+
+ /* Negate for min-heap behavior */
+ return -cmp;
+}
+
+
+/*
+ * bt_merge_init
+ * Initialize a merge scan state.
+ *
+ * Creates the merge state with one cursor per prefix value.
+ * The cursors will be positioned at their first matching tuples
+ * when bt_merge_getnext is first called.
+ */
+BTMergeScanState *
+bt_merge_init(IndexScanDesc scan,
+ Datum *prefix_values,
+ bool *prefix_nulls,
+ int num_prefixes,
+ int prefix_attno,
+ int suffix_attno,
+ Oid suffix_cmp_oid,
+ Oid suffix_collation)
+{
+ BTMergeScanState *state;
+ MemoryContext merge_context;
+ MemoryContext old_context;
+ int i;
+
+ /* Create memory context for merge scan allocations */
+ merge_context = AllocSetContextCreate(CurrentMemoryContext,
+ "BTMergeScan",
+ ALLOCSET_DEFAULT_SIZES);
+ old_context = MemoryContextSwitchTo(merge_context);
+
+ /* Allocate main state structure */
+ state = palloc0(sizeof(BTMergeScanState));
+ state->merge_context = merge_context;
+ state->num_cursors = num_prefixes;
+ state->active_cursors = 0;
+ state->prefix_attno = prefix_attno;
+ state->suffix_attno = suffix_attno;
+ state->suffix_collation = suffix_collation;
+ state->direction = ForwardScanDirection;
+ state->initialized = false;
+ state->tuples_accessed = 0;
+
+ /* Set up suffix comparison function */
+ fmgr_info(suffix_cmp_oid, &state->suffix_cmp);
+
+ /* Allocate cursor array */
+ state->cursors = palloc0(num_prefixes * sizeof(BTMergeCursor));
+
+ /* Initialize cursor metadata (not positioned yet) */
+ for (i = 0; i < num_prefixes; i++)
+ {
+ BTMergeCursor *cursor = &state->cursors[i];
+
+ cursor->cursor_id = i;
+ cursor->prefix_value = datumCopy(prefix_values[i], true, sizeof(Datum));
+ cursor->prefix_isnull = prefix_nulls[i];
+ cursor->exhausted = prefix_nulls[i]; /* NULL prefix = exhausted */
+ cursor->sort_key_isnull = true;
+ BTScanPosInvalidate(cursor->pos);
+ cursor->tuples = NULL;
+ }
+
+ /* Initialize the merge heap */
+ state->merge_heap = pairingheap_allocate(bt_merge_heap_cmp, state);
+
+ MemoryContextSwitchTo(old_context);
+
+ return state;
+}
+
+
+/*
+ * bt_merge_getnext
+ * Get the next tuple from the merge scan.
+ *
+ * Returns true if a tuple was found, false if scan is exhausted.
+ * The tuple's TID is stored in scan->xs_heaptid.
+ */
+bool
+bt_merge_getnext(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTMergeScanState *state = so->mergeState;
+ BTMergeCursor *cursor;
+ pairingheap_node *node;
+ int i;
+
+ if (state == NULL)
+ return false;
+
+ /* Initialize cursors on first call */
+ if (!state->initialized)
+ {
+ state->initialized = true;
+ state->direction = dir;
+
+ for (i = 0; i < state->num_cursors; i++)
+ {
+ BTMergeCursor *c = &state->cursors[i];
+
+ if (!c->exhausted &&
+ bt_merge_cursor_init(state, scan, c,
+ c->prefix_value, c->prefix_isnull))
+ {
+ /* Cursor has at least one tuple, add to heap */
+ pairingheap_add(state->merge_heap, &c->ph_node);
+ state->active_cursors++;
+ }
+ }
+ }
+
+ /* Get the cursor with the smallest suffix value */
+ if (pairingheap_is_empty(state->merge_heap))
+ return false;
+
+ node = pairingheap_remove_first(state->merge_heap);
+ cursor = pairingheap_container(BTMergeCursor, ph_node, node);
+
+ /* Set up the heap TID from the current cursor position */
+ Assert(BTScanPosIsValid(cursor->pos));
+ scan->xs_heaptid = cursor->pos.items[cursor->pos.itemIndex].heapTid;
+
+ /* Advance cursor to next tuple */
+ if (bt_merge_cursor_advance(state, scan, cursor))
+ {
+ /* Cursor still has tuples, re-add to heap */
+ pairingheap_add(state->merge_heap, &cursor->ph_node);
+ }
+ else
+ {
+ /* Cursor exhausted */
+ state->active_cursors--;
+ }
+
+ return true;
+}
+
+
+/*
+ * bt_merge_end
+ * Clean up merge scan state.
+ */
+void
+bt_merge_end(BTMergeScanState *state)
+{
+ if (state == NULL)
+ return;
+
+ /* Free the memory context, which frees all allocations */
+ MemoryContextDelete(state->merge_context);
+}
+
+
+/*
+ * bt_merge_cursor_init
+ * Initialize a cursor and position it at the first matching tuple.
+ *
+ * Returns true if the cursor found at least one matching tuple.
+ */
+static bool
+bt_merge_cursor_init(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ Datum prefix_value,
+ bool prefix_isnull)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool found;
+
+ if (prefix_isnull)
+ {
+ cursor->exhausted = true;
+ return false;
+ }
+
+ /*
+ * Modify the scan key to use this cursor's prefix value.
+ * We reuse the scan's existing key infrastructure.
+ */
+ for (int i = 0; i < so->numberOfKeys; i++)
+ {
+ if (so->keyData[i].sk_attno == state->prefix_attno)
+ {
+ so->keyData[i].sk_argument = prefix_value;
+ so->keyData[i].sk_flags &= ~(SK_SEARCHARRAY);
+ break;
+ }
+ }
+
+ /* Invalidate current position to force _bt_first */
+ BTScanPosInvalidate(so->currPos);
+
+ /* Disable array key handling for this cursor's scan */
+ so->numArrayKeys = 0;
+
+ /* Position at first matching tuple */
+ found = _bt_first(scan, state->direction);
+
+ if (found)
+ {
+ /* Copy position to cursor */
+ memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+
+ /* Extract the sort key for heap ordering */
+ cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
+ &cursor->sort_key_isnull);
+ cursor->exhausted = false;
+
+ /* Count this as a tuple access */
+ state->tuples_accessed++;
+
+ /* Invalidate main scan position */
+ BTScanPosInvalidate(so->currPos);
+ }
+ else
+ {
+ cursor->exhausted = true;
+ }
+
+ return found;
+}
+
+
+/*
+ * bt_merge_cursor_advance
+ * Advance a cursor to its next tuple.
+ *
+ * Returns true if the cursor now points to a valid tuple, false if exhausted.
+ */
+static bool
+bt_merge_cursor_advance(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool found = false;
+
+ if (cursor->exhausted)
+ return false;
+
+ /* Try to move to next tuple within current page's items array */
+ if (state->direction == ForwardScanDirection)
+ {
+ if (cursor->pos.itemIndex < cursor->pos.lastItem)
+ {
+ cursor->pos.itemIndex++;
+ found = true;
+ }
+ }
+ else
+ {
+ if (cursor->pos.itemIndex > cursor->pos.firstItem)
+ {
+ cursor->pos.itemIndex--;
+ found = true;
+ }
+ }
+
+ if (!found)
+ {
+ /*
+ * Current page exhausted. Use _bt_next to get the next page.
+ * We swap our cursor's position into the scan's currPos,
+ * call _bt_next, then swap back.
+ */
+ BTScanPosData save_pos;
+
+ memcpy(&save_pos, &so->currPos, sizeof(BTScanPosData));
+ memcpy(&so->currPos, &cursor->pos, sizeof(BTScanPosData));
+
+ found = _bt_next(scan, state->direction);
+
+ if (found)
+ memcpy(&cursor->pos, &so->currPos, sizeof(BTScanPosData));
+
+ memcpy(&so->currPos, &save_pos, sizeof(BTScanPosData));
+ }
+
+ if (found)
+ {
+ /* Extract new sort key */
+ cursor->sort_key = bt_merge_extract_sortkey(state, scan, cursor,
+ &cursor->sort_key_isnull);
+ state->tuples_accessed++;
+ }
+ else
+ {
+ cursor->exhausted = true;
+ }
+
+ return found;
+}
+
+
+/*
+ * bt_merge_extract_sortkey
+ * Extract the sort key (suffix column value) from the current tuple.
+ */
+static Datum
+bt_merge_extract_sortkey(BTMergeScanState *state,
+ IndexScanDesc scan,
+ BTMergeCursor *cursor,
+ bool *isnull)
+{
+ Relation rel = scan->indexRelation;
+ Buffer buf;
+ Page page;
+ OffsetNumber offnum;
+ ItemId itemid;
+ IndexTuple itup;
+ TupleDesc tupdesc;
+ Datum result;
+
+ if (cursor->pos.currPage == InvalidBlockNumber)
+ {
+ *isnull = true;
+ return (Datum) 0;
+ }
+
+ /* Read the page */
+ buf = ReadBuffer(rel, cursor->pos.currPage);
+ LockBuffer(buf, BT_READ);
+ page = BufferGetPage(buf);
+
+ offnum = cursor->pos.items[cursor->pos.itemIndex].indexOffset;
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ tupdesc = RelationGetDescr(rel);
+
+ /* Extract the suffix column value */
+ result = index_getattr(itup, state->suffix_attno, tupdesc, isnull);
+
+ /* Copy pass-by-reference values before releasing buffer */
+ if (!*isnull)
+ {
+ Form_pg_attribute attr = TupleDescAttr(tupdesc, state->suffix_attno - 1);
+
+ if (!attr->attbyval)
+ result = datumCopy(result, attr->attbyval, attr->attlen);
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ return result;
+}
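
For readers following the patch, the K-way merge in bt_merge_getnext can be sketched outside PostgreSQL as below. This is a minimal illustration under stated assumptions, not the patch's code: plain sorted int arrays stand in for the per-prefix index cursors, a small binary min-heap stands in for lib/pairingheap, and the names Cursor, Row, sift_down and merge_limit are hypothetical. Like the patch, it orders by (suffix, prefix) and advances a cursor immediately after emitting its row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one per-prefix cursor: a sorted run of suffixes. */
typedef struct Cursor
{
	int			prefix;		/* prefix value this run belongs to */
	const int  *suffixes;	/* suffix values in ascending order */
	size_t		len;		/* number of suffix values */
	size_t		pos;		/* index of the current element */
} Cursor;

typedef struct Row
{
	int			prefix;
	int			suffix;
} Row;

/* True if a's current row sorts before b's, by (suffix, prefix). */
static int
cursor_less(const Cursor *a, const Cursor *b)
{
	int			sa = a->suffixes[a->pos];
	int			sb = b->suffixes[b->pos];

	if (sa != sb)
		return sa < sb;
	return a->prefix < b->prefix;
}

/* Restore the min-heap property below index i. */
static void
sift_down(Cursor **heap, size_t n, size_t i)
{
	for (;;)
	{
		size_t		l = 2 * i + 1;
		size_t		r = l + 1;
		size_t		m = i;
		Cursor	   *tmp;

		if (l < n && cursor_less(heap[l], heap[m]))
			m = l;
		if (r < n && cursor_less(heap[r], heap[m]))
			m = r;
		if (m == i)
			return;
		tmp = heap[i];
		heap[i] = heap[m];
		heap[m] = tmp;
		i = m;
	}
}

/*
 * Merge at most "limit" rows from k cursors into out[] and return the number
 * of rows written.  heap[] is caller-provided scratch space for k pointers.
 * *accessed counts every element a cursor was positioned on, in the same
 * spirit as the patch's tuples_accessed counter.
 */
size_t
merge_limit(Cursor *cursors, size_t k, Cursor **heap,
			size_t limit, Row *out, size_t *accessed)
{
	size_t		n = 0;
	size_t		produced = 0;

	assert(cursors != NULL && heap != NULL && out != NULL && accessed != NULL);
	*accessed = 0;

	/* Position every non-empty cursor on its first element. */
	for (size_t i = 0; i < k; i++)
	{
		cursors[i].pos = 0;
		if (cursors[i].len > 0)
		{
			heap[n++] = &cursors[i];
			(*accessed)++;
		}
	}
	for (size_t i = n / 2; i-- > 0;)
		sift_down(heap, n, i);

	while (produced < limit && n > 0)
	{
		Cursor	   *top = heap[0];

		out[produced].prefix = top->prefix;
		out[produced].suffix = top->suffixes[top->pos];
		produced++;

		/* Eager advance: step the cursor right after emitting its row. */
		if (top->pos + 1 < top->len)
		{
			top->pos++;
			(*accessed)++;
		}
		else
			heap[0] = heap[--n];	/* cursor exhausted, drop it */
		sift_down(heap, n, 0);
	}
	return produced;
}
```

With K non-empty cursors and a limit of M that does not exhaust any of them, this sketch touches K + M elements, because the cursor that produced the last row is advanced once more before the loop stops. That appears consistent with the expected regression output in this patch, where K = 3 and LIMIT 5 report tuples_accessed = 8 against maximum_required_fetches = 7; deferring the advance until the next call would be one way to reach M + K - 1.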
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 77224859685..0d4e7440760 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -20,6 +20,7 @@
#include "catalog/pg_am_d.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
+#include "lib/pairingheap.h"
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
@@ -1050,6 +1051,49 @@ typedef struct BTArrayKeyInfo
ScanKey high_compare; /* array's < or <= upper bound */
} BTArrayKeyInfo;
+/*
+ * BTMergeCursor - tracks scan state for one prefix value in merge scan
+ *
+ * Each cursor maintains its own position within the index for a specific
+ * prefix value. Cursors are organized in a min-heap ordered by their
+ * current suffix key value for efficient K-way merge.
+ */
+typedef struct BTMergeCursor
+{
+ pairingheap_node ph_node; /* pairing heap node for merge */
+ int cursor_id; /* index in merge state's cursors array */
+ Datum prefix_value; /* the prefix value for this sub-scan */
+ bool prefix_isnull; /* is prefix value NULL? */
+ Datum sort_key; /* current tuple's sort key (suffix) */
+ bool sort_key_isnull;/* is sort key NULL? */
+ bool exhausted; /* no more tuples for this prefix */
+ BTScanPosData pos; /* current position in index */
+ char *tuples; /* tuple storage workspace (BLCKSZ) */
+} BTMergeCursor;
+
+/*
+ * BTMergeScanState - state for K-way merge scan
+ *
+ * This structure manages multiple cursors for a merge scan, allowing
+ * lazy evaluation of queries like:
+ * WHERE prefix IN (v1, v2, ..., vK) AND suffix >= b ORDER BY suffix LIMIT N
+ */
+typedef struct BTMergeScanState
+{
+ int num_cursors; /* number of prefix values (K) */
+ int active_cursors; /* cursors not yet exhausted */
+ BTMergeCursor *cursors; /* array of cursors */
+ pairingheap *merge_heap; /* min-heap ordered by sort_key, then prefix */
+ int prefix_attno; /* attribute number of prefix column (1-based) */
+ int suffix_attno; /* attribute number of suffix column (1-based) */
+ FmgrInfo suffix_cmp; /* comparison function for suffix */
+ Oid suffix_collation; /* collation for suffix comparison */
+ ScanDirection direction; /* scan direction */
+ bool initialized; /* have cursors been initialized? */
+ MemoryContext merge_context;/* memory context for allocations */
+ int64 tuples_accessed;/* count of index tuples accessed */
+} BTMergeScanState;
+
typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
@@ -1089,6 +1133,12 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /*
+ * Merge scan state, if using merge scan optimization.
+ * NULL if not using merge scan.
+ */
+ BTMergeScanState *mergeState;
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -1334,4 +1384,18 @@ extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+/*
+ * prototypes for functions in nbtmergescan.c
+ */
+extern BTMergeScanState *bt_merge_init(IndexScanDesc scan,
+ Datum *prefix_values,
+ bool *prefix_nulls,
+ int num_prefixes,
+ int prefix_attno,
+ int suffix_attno,
+ Oid suffix_cmp_oid,
+ Oid suffix_collation);
+extern bool bt_merge_getnext(IndexScanDesc scan, ScanDirection dir);
+extern void bt_merge_end(BTMergeScanState *state);
+
#endif /* NBTREE_H */
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 2634a519935..b7b802bfdde 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -18,6 +18,7 @@ subdir('ssl_passphrase_callback')
subdir('test_aio')
subdir('test_binaryheap')
subdir('test_bitmapset')
+subdir('test_btree_merge')
subdir('test_bloomfilter')
subdir('test_cloexec')
subdir('test_copy_callbacks')
diff --git a/src/test/modules/test_btree_merge/Makefile b/src/test/modules/test_btree_merge/Makefile
new file mode 100644
index 00000000000..540416a2c91
--- /dev/null
+++ b/src/test/modules/test_btree_merge/Makefile
@@ -0,0 +1,24 @@
+# src/test/modules/test_btree_merge/Makefile
+
+MODULE_big = test_btree_merge
+OBJS = \
+ $(WIN32RES) \
+ test_btree_merge.o
+
+PGFILEDESC = "test_btree_merge - test code for btree merge scan"
+
+EXTENSION = test_btree_merge
+DATA = test_btree_merge--1.0.sql
+
+REGRESS = test_btree_merge
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_btree_merge
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_btree_merge/expected/test_btree_merge.out b/src/test/modules/test_btree_merge/expected/test_btree_merge.out
new file mode 100644
index 00000000000..baf4d7937e0
--- /dev/null
+++ b/src/test/modules/test_btree_merge/expected/test_btree_merge.out
@@ -0,0 +1,243 @@
+-- Unit tests for B-tree merge scan implementation
+-- Tests the core merge scan algorithm directly, bypassing the planner
+CREATE EXTENSION test_btree_merge;
+-- ============================================================================
+-- Setup: Create test tables with known data distributions
+-- ============================================================================
+-- Test table with integer prefix and suffix
+CREATE TABLE merge_test_int (
+ prefix_col int4,
+ suffix_col int4
+);
+-- Insert data: 10 prefix values, 100 suffix values each = 1000 rows
+INSERT INTO merge_test_int
+SELECT p, s
+FROM generate_series(1, 10) AS p,
+ generate_series(1, 100) AS s;
+CREATE INDEX merge_test_int_idx ON merge_test_int (prefix_col, suffix_col);
+ANALYZE merge_test_int;
+-- Test table with integer prefix and timestamp suffix
+CREATE TABLE merge_test_ts (
+ user_id int4,
+ event_time timestamp
+);
+-- Insert data: 5 users, 100 events each
+INSERT INTO merge_test_ts
+SELECT u, '2026-01-01 00:00:00'::timestamp + (e || ' minutes')::interval
+FROM generate_series(1, 5) AS u,
+ generate_series(1, 100) AS e;
+CREATE INDEX merge_test_ts_idx ON merge_test_ts (user_id, event_time);
+ANALYZE merge_test_ts;
+-- ============================================================================
+-- Test 1: Basic integer merge scan
+-- Query: WHERE prefix IN (1,2,3) AND suffix >= 50 LIMIT 5
+-- K = 3 prefix values, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+SELECT 'Test 1: Basic integer merge scan' AS test_name;
+ test_name
+----------------------------------
+ Test 1: Basic integer merge scan
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 2: More prefix values
+-- Query: WHERE prefix IN (1,2,3,4,5) AND suffix >= 80 LIMIT 3
+-- K = 5 prefix values, LIMIT = 3
+-- Expected tuples accessed: 3 + 5 - 1 = 7
+-- ============================================================================
+SELECT 'Test 2: More prefix values' AS test_name;
+ test_name
+----------------------------
+ Test 2: More prefix values
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ 80,
+ 3
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 3 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 3: Single prefix value (degenerates to regular scan)
+-- K = 1, LIMIT = 5
+-- Expected tuples accessed: 5 + 1 - 1 = 5
+-- ============================================================================
+SELECT 'Test 3: Single prefix value' AS test_name;
+ test_name
+-----------------------------
+ Test 3: Single prefix value
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1],
+ 50,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 6 | 5
+(1 row)
+
+-- ============================================================================
+-- Test 4: Large LIMIT (more than matching rows)
+-- K = 3, prefix values that have 51 rows each (suffix >= 50)
+-- LIMIT = 200 but only 153 rows exist
+-- ============================================================================
+SELECT 'Test 4: Large LIMIT' AS test_name;
+ test_name
+---------------------
+ Test 4: Large LIMIT
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 200
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 153 | 153 | 153
+(1 row)
+
+-- ============================================================================
+-- Test 5: Non-contiguous prefix values
+-- Query: WHERE prefix IN (2,5,8) AND suffix >= 50 LIMIT 5
+-- Tests that merge scan works with gaps in prefix values
+-- K = 3 prefix values (non-adjacent), LIMIT = 5
+-- ============================================================================
+SELECT 'Test 5: Non-contiguous prefix values' AS test_name;
+ test_name
+--------------------------------------
+ Test 5: Non-contiguous prefix values
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[2, 5, 8],
+ 50,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 6: Timestamp suffix column
+-- Query: WHERE user_id IN (1,2,3) AND event_time >= '2026-01-01 01:00:00' LIMIT 5
+-- K = 3, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+SELECT 'Test 6: Timestamp suffix' AS test_name;
+ test_name
+--------------------------
+ Test 6: Timestamp suffix
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3],
+ '2026-01-01 01:00:00'::timestamp,
+ 5
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 5 | 8 | 7
+(1 row)
+
+-- ============================================================================
+-- Test 7: All users with timestamp
+-- K = 5, LIMIT = 10
+-- Expected tuples accessed: 10 + 5 - 1 = 14
+-- ============================================================================
+SELECT 'Test 7: All users timestamp' AS test_name;
+ test_name
+-----------------------------
+ Test 7: All users timestamp
+(1 row)
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ '2026-01-01 00:30:00'::timestamp,
+ 10
+);
+ tuples_returned | tuples_accessed | maximum_required_fetches
+-----------------+-----------------+--------------------------
+ 10 | 15 | 14
+(1 row)
+
+-- ============================================================================
+-- Test 8: Correctness verification
+-- Verify merge scan returns rows in exact ORDER BY suffix_col, prefix_col order
+-- Using WITH ORDINALITY to compare row positions
+-- ============================================================================
+SELECT 'Test 8: Correctness verification' AS test_name;
+ test_name
+----------------------------------
+ Test 8: Correctness verification
+(1 row)
+
+-- Compare merge scan vs regular query with row positions (should be empty)
+WITH merge_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM test_btree_merge_fetch_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 90,
+ 10
+ )
+),
+regular_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM (
+ SELECT prefix_col, suffix_col
+ FROM merge_test_int
+ WHERE prefix_col IN (1, 2, 3) AND suffix_col >= 90
+ ORDER BY suffix_col, prefix_col
+ LIMIT 10
+ ) t
+)
+SELECT 'MISMATCH' AS status, m.rn, m.prefix_col, m.suffix_col,
+ r.prefix_col AS expected_prefix, r.suffix_col AS expected_suffix
+FROM merge_result m
+FULL OUTER JOIN regular_result r ON m.rn = r.rn
+WHERE m.prefix_col IS DISTINCT FROM r.prefix_col
+ OR m.suffix_col IS DISTINCT FROM r.suffix_col;
+ status | rn | prefix_col | suffix_col | expected_prefix | expected_suffix
+--------+----+------------+------------+-----------------+-----------------
+(0 rows)
+
+-- ============================================================================
+-- Cleanup
+-- ============================================================================
+DROP TABLE merge_test_int;
+DROP TABLE merge_test_ts;
+DROP EXTENSION test_btree_merge;
diff --git a/src/test/modules/test_btree_merge/meson.build b/src/test/modules/test_btree_merge/meson.build
new file mode 100644
index 00000000000..665d6cf443e
--- /dev/null
+++ b/src/test/modules/test_btree_merge/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+test_btree_merge_sources = files(
+ 'test_btree_merge.c',
+)
+
+if host_system == 'windows'
+ test_btree_merge_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_btree_merge',
+ '--FILEDESC', 'test_btree_merge - test code for btree merge scan',])
+endif
+
+test_btree_merge = shared_module('test_btree_merge',
+ test_btree_merge_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_btree_merge
+
+test_install_data += files(
+ 'test_btree_merge.control',
+ 'test_btree_merge--1.0.sql',
+)
+
+tests += {
+ 'name': 'test_btree_merge',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_btree_merge',
+ ],
+ },
+}
diff --git a/src/test/modules/test_btree_merge/sql/test_btree_merge.sql b/src/test/modules/test_btree_merge/sql/test_btree_merge.sql
new file mode 100644
index 00000000000..5828b343b34
--- /dev/null
+++ b/src/test/modules/test_btree_merge/sql/test_btree_merge.sql
@@ -0,0 +1,207 @@
+-- Unit tests for B-tree merge scan implementation
+-- Tests the core merge scan algorithm directly, bypassing the planner
+
+CREATE EXTENSION test_btree_merge;
+
+-- ============================================================================
+-- Setup: Create test tables with known data distributions
+-- ============================================================================
+
+-- Test table with integer prefix and suffix
+CREATE TABLE merge_test_int (
+ prefix_col int4,
+ suffix_col int4
+);
+
+-- Insert data: 10 prefix values, 100 suffix values each = 1000 rows
+INSERT INTO merge_test_int
+SELECT p, s
+FROM generate_series(1, 10) AS p,
+ generate_series(1, 100) AS s;
+
+CREATE INDEX merge_test_int_idx ON merge_test_int (prefix_col, suffix_col);
+ANALYZE merge_test_int;
+
+-- Test table with integer prefix and timestamp suffix
+CREATE TABLE merge_test_ts (
+ user_id int4,
+ event_time timestamp
+);
+
+-- Insert data: 5 users, 100 events each
+INSERT INTO merge_test_ts
+SELECT u, '2026-01-01 00:00:00'::timestamp + (e || ' minutes')::interval
+FROM generate_series(1, 5) AS u,
+ generate_series(1, 100) AS e;
+
+CREATE INDEX merge_test_ts_idx ON merge_test_ts (user_id, event_time);
+ANALYZE merge_test_ts;
+
+
+-- ============================================================================
+-- Test 1: Basic integer merge scan
+-- Query: WHERE prefix IN (1,2,3) AND suffix >= 50 LIMIT 5
+-- K = 3 prefix values, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 1: Basic integer merge scan' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 2: More prefix values
+-- Query: WHERE prefix IN (1,2,3,4,5) AND suffix >= 80 LIMIT 3
+-- K = 5 prefix values, LIMIT = 3
+-- Expected tuples accessed: 3 + 5 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 2: More prefix values' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ 80,
+ 3
+);
+
+
+-- ============================================================================
+-- Test 3: Single prefix value (degenerates to regular scan)
+-- K = 1, LIMIT = 5
+-- Expected tuples accessed: 5 + 1 - 1 = 5
+-- ============================================================================
+
+SELECT 'Test 3: Single prefix value' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1],
+ 50,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 4: Large LIMIT (more than matching rows)
+-- K = 3, prefix values that have 51 rows each (suffix >= 50)
+-- LIMIT = 200 but only 153 rows exist
+-- ============================================================================
+
+SELECT 'Test 4: Large LIMIT' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 50,
+ 200
+);
+
+
+-- ============================================================================
+-- Test 5: Non-contiguous prefix values
+-- Query: WHERE prefix IN (2,5,8) AND suffix >= 50 LIMIT 5
+-- Tests that merge scan works with gaps in prefix values
+-- K = 3 prefix values (non-adjacent), LIMIT = 5
+-- ============================================================================
+
+SELECT 'Test 5: Non-contiguous prefix values' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[2, 5, 8],
+ 50,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 6: Timestamp suffix column
+-- Query: WHERE user_id IN (1,2,3) AND event_time >= '2026-01-01 01:00:00' LIMIT 5
+-- K = 3, LIMIT = 5
+-- Expected tuples accessed: 5 + 3 - 1 = 7
+-- ============================================================================
+
+SELECT 'Test 6: Timestamp suffix' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3],
+ '2026-01-01 01:00:00'::timestamp,
+ 5
+);
+
+
+-- ============================================================================
+-- Test 7: All users with timestamp
+-- K = 5, LIMIT = 10
+-- Expected tuples accessed: 10 + 5 - 1 = 14
+-- ============================================================================
+
+SELECT 'Test 7: All users timestamp' AS test_name;
+
+SELECT * FROM test_btree_merge_scan_ts(
+ 'merge_test_ts',
+ 'merge_test_ts_idx',
+ ARRAY[1, 2, 3, 4, 5],
+ '2026-01-01 00:30:00'::timestamp,
+ 10
+);
+
+
+-- ============================================================================
+-- Test 8: Correctness verification
+-- Verify merge scan returns rows in exact ORDER BY suffix_col, prefix_col order
+-- Using WITH ORDINALITY to compare row positions
+-- ============================================================================
+
+SELECT 'Test 8: Correctness verification' AS test_name;
+
+-- Compare merge scan vs regular query with row positions (should be empty)
+WITH merge_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM test_btree_merge_fetch_int(
+ 'merge_test_int',
+ 'merge_test_int_idx',
+ ARRAY[1, 2, 3],
+ 90,
+ 10
+ )
+),
+regular_result AS (
+ SELECT row_number() OVER () AS rn, prefix_col, suffix_col
+ FROM (
+ SELECT prefix_col, suffix_col
+ FROM merge_test_int
+ WHERE prefix_col IN (1, 2, 3) AND suffix_col >= 90
+ ORDER BY suffix_col, prefix_col
+ LIMIT 10
+ ) t
+)
+SELECT 'MISMATCH' AS status, m.rn, m.prefix_col, m.suffix_col,
+ r.prefix_col AS expected_prefix, r.suffix_col AS expected_suffix
+FROM merge_result m
+FULL OUTER JOIN regular_result r ON m.rn = r.rn
+WHERE m.prefix_col IS DISTINCT FROM r.prefix_col
+ OR m.suffix_col IS DISTINCT FROM r.suffix_col;
+
+
+-- ============================================================================
+-- Cleanup
+-- ============================================================================
+
+DROP TABLE merge_test_int;
+DROP TABLE merge_test_ts;
+DROP EXTENSION test_btree_merge;
diff --git a/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql b/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
new file mode 100644
index 00000000000..9872947d7d7
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge--1.0.sql
@@ -0,0 +1,43 @@
+/* src/test/modules/test_btree_merge/test_btree_merge--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_btree_merge" to load this file. \quit
+
+-- Test merge scan with integer columns
+CREATE FUNCTION test_btree_merge_scan_int(
+ table_name text,
+ index_name text,
+ prefix_values int4[],
+ suffix_start int4,
+ limit_count int4
+) RETURNS TABLE (
+ tuples_returned int4,
+ tuples_accessed int4,
+ maximum_required_fetches int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
+-- Fetch actual rows from merge scan (for correctness verification)
+CREATE FUNCTION test_btree_merge_fetch_int(
+ table_name text,
+ index_name text,
+ prefix_values int4[],
+ suffix_start int4,
+ limit_count int4
+) RETURNS TABLE (
+ prefix_col int4,
+ suffix_col int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
+-- Test merge scan with timestamp suffix
+CREATE FUNCTION test_btree_merge_scan_ts(
+ table_name text,
+ index_name text,
+ prefix_values int4[],
+ suffix_start timestamp,
+ limit_count int4
+) RETURNS TABLE (
+ tuples_returned int4,
+ tuples_accessed int4,
+ maximum_required_fetches int4
+) AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+
diff --git a/src/test/modules/test_btree_merge/test_btree_merge.c b/src/test/modules/test_btree_merge/test_btree_merge.c
new file mode 100644
index 00000000000..78b22130ecf
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge.c
@@ -0,0 +1,389 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_btree_merge.c
+ * Unit tests for B-tree Merge Scan implementation
+ *
+ * This module provides SQL-callable functions to directly test the
+ * merge scan algorithm without going through the planner.
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/nbtree.h"
+#include "access/table.h"
+#include "catalog/namespace.h"
+#include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
+#include "commands/defrem.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+#define MAX_RESULTS 10000
+
+/*
+ * MergeScanResult - holds results from a merge scan execution
+ */
+typedef struct MergeScanResult
+{
+ int tuples_returned;
+ int64 tuples_accessed;
+ int num_prefixes;
+ int limit_count;
+ /* For fetch function: collected row data */
+ int32 *prefixes;
+ int32 *suffixes;
+} MergeScanResult;
+
+/*
+ * do_merge_scan - common merge scan execution
+ *
+ * Performs a merge scan with the given parameters and collects results.
+ * If collect_rows is true, fetches and stores actual row data.
+ */
+static void
+do_merge_scan(const char *table_name,
+ const char *index_name,
+ Datum *prefix_values,
+ bool *prefix_nulls,
+ int num_prefixes,
+ Datum suffix_start,
+ Oid suffix_type,
+ RegProcedure suffix_eq_proc,
+ RegProcedure suffix_ge_proc,
+ int limit_count,
+ bool collect_rows,
+ MergeScanResult *result)
+{
+ Oid table_oid;
+ Oid index_oid;
+ Relation heap_rel;
+ Relation index_rel;
+ IndexScanDesc scan;
+ BTScanOpaque so;
+ BTMergeScanState *merge_state;
+ Snapshot snapshot;
+ Oid suffix_cmp_oid;
+ Oid opfamily;
+ const char *opfamily_name;
+ int tuples_returned = 0;
+ int max_results;
+
+ /* Determine operator family based on suffix type */
+ if (suffix_type == INT4OID)
+ opfamily_name = "integer_ops";
+ else if (suffix_type == TIMESTAMPOID)
+ opfamily_name = "datetime_ops";
+ else
+ elog(ERROR, "unsupported suffix type: %u", suffix_type);
+
+ /* Look up table and index */
+ table_oid = RelnameGetRelid(table_name);
+ if (!OidIsValid(table_oid))
+ elog(ERROR, "table \"%s\" does not exist", table_name);
+
+ index_oid = RelnameGetRelid(index_name);
+ if (!OidIsValid(index_oid))
+ elog(ERROR, "index \"%s\" does not exist", index_name);
+
+ /* Open relations */
+ heap_rel = table_open(table_oid, AccessShareLock);
+ index_rel = index_open(index_oid, AccessShareLock);
+
+ /* Get comparison function for suffix type */
+ opfamily = get_opfamily_oid(BTREE_AM_OID,
+ list_make1(makeString(pstrdup(opfamily_name))),
+ false);
+ suffix_cmp_oid = get_opfamily_proc(opfamily, suffix_type, suffix_type,
+ BTORDER_PROC);
+ if (!OidIsValid(suffix_cmp_oid))
+ elog(ERROR, "could not find comparison function for type %u", suffix_type);
+
+ /* Begin index scan */
+ snapshot = GetActiveSnapshot();
+ scan = index_beginscan(heap_rel, index_rel, snapshot, NULL, 2, 0);
+
+ /* Set up scan keys */
+ {
+ ScanKeyData keys[2];
+
+ ScanKeyInit(&keys[0], 1, BTEqualStrategyNumber, suffix_eq_proc,
+ prefix_values[0]);
+ ScanKeyInit(&keys[1], 2, BTGreaterEqualStrategyNumber, suffix_ge_proc,
+ suffix_start);
+ index_rescan(scan, keys, 2, NULL, 0);
+ }
+
+ so = (BTScanOpaque) scan->opaque;
+
+ /* Initialize merge scan */
+ merge_state = bt_merge_init(scan, prefix_values, prefix_nulls,
+ num_prefixes, 1, 2, suffix_cmp_oid, InvalidOid);
+ so->mergeState = merge_state;
+
+ /* Execute scan */
+ max_results = (limit_count > 0) ? limit_count : MAX_RESULTS;
+
+ while (tuples_returned < max_results)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (!bt_merge_getnext(scan, ForwardScanDirection))
+ break;
+
+ if (collect_rows && result->prefixes != NULL)
+ {
+ /* Fetch heap tuple to get actual values */
+ HeapTupleData heapTuple;
+ Buffer heapBuffer;
+ bool isnull;
+
+ heapTuple.t_self = scan->xs_heaptid;
+ if (heap_fetch(heap_rel, snapshot, &heapTuple, &heapBuffer, false))
+ {
+ result->prefixes[tuples_returned] =
+ DatumGetInt32(heap_getattr(&heapTuple, 1,
+ RelationGetDescr(heap_rel), &isnull));
+ result->suffixes[tuples_returned] =
+ DatumGetInt32(heap_getattr(&heapTuple, 2,
+ RelationGetDescr(heap_rel), &isnull));
+ ReleaseBuffer(heapBuffer);
+ }
+ }
+
+ tuples_returned++;
+
+ if (tuples_returned >= MAX_RESULTS)
+ {
+ elog(WARNING, "merge scan hit safety limit of %d tuples", MAX_RESULTS);
+ break;
+ }
+ }
+
+ /* Collect results before cleanup */
+ result->tuples_returned = tuples_returned;
+ result->tuples_accessed = merge_state->tuples_accessed;
+ result->num_prefixes = num_prefixes;
+ result->limit_count = limit_count;
+
+ /* Clean up */
+ bt_merge_end(merge_state);
+ so->mergeState = NULL;
+ index_endscan(scan);
+ index_close(index_rel, AccessShareLock);
+ table_close(heap_rel, AccessShareLock);
+}
+
+/*
+ * build_stats_result - build the stats result tuple
+ */
+static Datum
+build_stats_result(FunctionCallInfo fcinfo, MergeScanResult *result)
+{
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false, false, false};
+ HeapTuple tuple;
+ int max_required_fetches;
+
+ /* Calculate expected max fetches */
+ if (result->tuples_returned < result->limit_count)
+ max_required_fetches = result->tuples_returned;
+ else
+ max_required_fetches = result->limit_count + result->num_prefixes - 1;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function returning record called in context "
+ "that cannot accept type record")));
+
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ values[0] = Int32GetDatum(result->tuples_returned);
+ values[1] = Int32GetDatum((int32) result->tuples_accessed);
+ values[2] = Int32GetDatum(max_required_fetches);
+
+ tuple = heap_form_tuple(tupdesc, values, nulls);
+ return HeapTupleGetDatum(tuple);
+}
+
+
+/*
+ * test_btree_merge_scan_int - test merge scan with integer columns
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_scan_int);
+
+Datum
+test_btree_merge_scan_int(PG_FUNCTION_ARGS)
+{
+ text *table_name = PG_GETARG_TEXT_PP(0);
+ text *index_name = PG_GETARG_TEXT_PP(1);
+ ArrayType *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+ int32 suffix_start = PG_GETARG_INT32(3);
+ int32 limit_count = PG_GETARG_INT32(4);
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ MergeScanResult result = {0};
+
+ deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ if (num_prefixes == 0)
+ elog(ERROR, "prefix_values array cannot be empty");
+
+ do_merge_scan(text_to_cstring(table_name),
+ text_to_cstring(index_name),
+ prefix_values, prefix_nulls, num_prefixes,
+ Int32GetDatum(suffix_start), INT4OID,
+ F_INT4EQ, F_INT4GE,
+ limit_count, false, &result);
+
+ return build_stats_result(fcinfo, &result);
+}
+
+
+/*
+ * test_btree_merge_scan_ts - test merge scan with timestamp suffix
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_scan_ts);
+
+Datum
+test_btree_merge_scan_ts(PG_FUNCTION_ARGS)
+{
+ text *table_name = PG_GETARG_TEXT_PP(0);
+ text *index_name = PG_GETARG_TEXT_PP(1);
+ ArrayType *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+ Timestamp suffix_start = PG_GETARG_TIMESTAMP(3);
+ int32 limit_count = PG_GETARG_INT32(4);
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ MergeScanResult result = {0};
+
+ deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ if (num_prefixes == 0)
+ elog(ERROR, "prefix_values array cannot be empty");
+
+ do_merge_scan(text_to_cstring(table_name),
+ text_to_cstring(index_name),
+ prefix_values, prefix_nulls, num_prefixes,
+ TimestampGetDatum(suffix_start), TIMESTAMPOID,
+ F_INT4EQ, F_TIMESTAMP_GE,
+ limit_count, false, &result);
+
+ return build_stats_result(fcinfo, &result);
+}
+
+
+/*
+ * test_btree_merge_fetch_int - fetch actual rows from merge scan
+ */
+PG_FUNCTION_INFO_V1(test_btree_merge_fetch_int);
+
+Datum
+test_btree_merge_fetch_int(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+
+ typedef struct
+ {
+ int32 *prefixes;
+ int32 *suffixes;
+ int num_results;
+ int current_idx;
+ } FetchContext;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ text *table_name = PG_GETARG_TEXT_PP(0);
+ text *index_name = PG_GETARG_TEXT_PP(1);
+ ArrayType *prefix_array = PG_GETARG_ARRAYTYPE_P(2);
+ int32 suffix_start = PG_GETARG_INT32(3);
+ int32 limit_count = PG_GETARG_INT32(4);
+ Datum *prefix_values;
+ bool *prefix_nulls;
+ int num_prefixes;
+ MemoryContext oldcontext;
+ FetchContext *fctx;
+ MergeScanResult result = {0};
+ TupleDesc tupdesc;
+ int max_results;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ deconstruct_array(prefix_array, INT4OID, sizeof(int32), true, TYPALIGN_INT,
+ &prefix_values, &prefix_nulls, &num_prefixes);
+
+ if (num_prefixes == 0)
+ elog(ERROR, "prefix_values array cannot be empty");
+
+ /* Allocate result storage */
+ max_results = (limit_count > 0) ? limit_count : MAX_RESULTS;
+ fctx = palloc(sizeof(FetchContext));
+ fctx->prefixes = palloc(max_results * sizeof(int32));
+ fctx->suffixes = palloc(max_results * sizeof(int32));
+ fctx->current_idx = 0;
+
+ /* Point result to our storage */
+ result.prefixes = fctx->prefixes;
+ result.suffixes = fctx->suffixes;
+
+ do_merge_scan(text_to_cstring(table_name),
+ text_to_cstring(index_name),
+ prefix_values, prefix_nulls, num_prefixes,
+ Int32GetDatum(suffix_start), INT4OID,
+ F_INT4EQ, F_INT4GE,
+ limit_count, true, &result);
+
+ fctx->num_results = result.tuples_returned;
+
+ /* Build result tuple descriptor */
+ tupdesc = CreateTemplateTupleDesc(2);
+ TupleDescInitEntry(tupdesc, 1, "prefix_col", INT4OID, -1, 0);
+ TupleDescInitEntry(tupdesc, 2, "suffix_col", INT4OID, -1, 0);
+ funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+ funcctx->user_fctx = fctx;
+
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+
+ {
+ FetchContext *fctx = funcctx->user_fctx;
+
+ if (fctx->current_idx < fctx->num_results)
+ {
+ Datum values[2];
+ bool nulls[2] = {false, false};
+ HeapTuple tuple;
+
+ values[0] = Int32GetDatum(fctx->prefixes[fctx->current_idx]);
+ values[1] = Int32GetDatum(fctx->suffixes[fctx->current_idx]);
+ fctx->current_idx++;
+
+ tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+ SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+ }
+ else
+ {
+ SRF_RETURN_DONE(funcctx);
+ }
+ }
+}
diff --git a/src/test/modules/test_btree_merge/test_btree_merge.control b/src/test/modules/test_btree_merge/test_btree_merge.control
new file mode 100644
index 00000000000..f8146bd0f74
--- /dev/null
+++ b/src/test/modules/test_btree_merge/test_btree_merge.control
@@ -0,0 +1,5 @@
+# test_btree_merge extension
+comment = 'Unit tests for B-tree merge scan'
+default_version = '1.0'
+module_pathname = '$libdir/test_btree_merge'
+relocatable = true
--
2.40.0
* Re: New access method for b-tree.
From: Alexandre Felipe @ 2026-02-23 22:08 UTC (permalink / raw)
To: pgsql-hackers; [email protected]; [email protected]; [email protected] <[email protected]>; +Cc: Ants Aasma <[email protected]>; Tomas Vondra <[email protected]>; Alexandre Felipe <[email protected]>; Michał Kłeczek <[email protected]>; [email protected]
Hi Hackers,
Do you think that MERGE-SCAN was a terrible name? I wanted a name that
wouldn't require much explanation. I named it this way because it relies on
a k-way merge to combine several segments of an index into one result. But
we already have a MERGE statement. Even in the example plan above we can see
an external merge that has nothing to do with the new feature. Now that I am
working on joins, I started implementing this in the NestedLoop, trying to
follow the same conditions that lead to a Memoize. But I added so many
fields to the NestedLoop state that I think it is better to have a separate
structure, and maybe a separate node, and MergeScan of course is taken,
hehe. I was thinking of IndexPrefixMerge. We could use Ants' nickname,
TimeLineScan, but of course it is not limited to timelines (even though
realistically, that will probably be the most common use of this). Another
one I considered was TransposedIndexScan, because it orders output on
(suffix, prefix) instead of (prefix, suffix).
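
To make the k-way merge idea concrete, here is a minimal sketch in Python.
It is not the patch's C code: `index` is a hypothetical stand-in for a btree
on (prefix, suffix), modelled as a dict that maps each prefix to its
suffixes in ascending order, and `prefix_merge` is a name made up for this
illustration. It only shows why the output comes out in (suffix, prefix)
order while keeping a single head tuple per prefix.

```python
import heapq

_DONE = object()  # sentinel marking an exhausted per-prefix segment


def prefix_merge(index, prefixes, limit):
    """Yield up to `limit` (suffix, prefix) pairs in (suffix, prefix) order.

    `index` is an assumed toy model: {prefix: [suffixes in ascending order]}.
    """
    heap = []
    # Deduplicate prefixes so that each heap entry has a unique (suffix, prefix).
    for p in set(prefixes):
        it = iter(index.get(p, ()))
        head = next(it, _DONE)
        if head is not _DONE:
            heap.append((head, p, it))
    heapq.heapify(heap)

    emitted = 0
    while heap and emitted < limit:
        suffix, p, it = heapq.heappop(heap)
        yield suffix, p
        emitted += 1
        if emitted < limit:
            # Advance only the segment that just produced a row.
            nxt = next(it, _DONE)
            if nxt is not _DONE:
                heapq.heappush(heap, (nxt, p, it))


index = {2: [1, 2, 3], 3: [1, 2, 3], 5: [1, 2, 3]}
print(list(prefix_merge(index, [5, 2, 3], 4)))
# [(1, 2), (1, 3), (1, 5), (2, 2)]
```

The heap never holds more than one entry per prefix, which is where the
O(Nx) space and O(M * log(Nx)) compute from the first email come from.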
On Fri, Feb 6, 2026 at 10:52 AM Alexandre Felipe <
[email protected]> wrote:
> Hello again hackers!
>
> [email protected] <[email protected]>: That seems to be the one that is probably the
> most familiar with the index scan (based on the commits).
> [email protected] <[email protected]> , [email protected]
> <[email protected]> , [email protected] <[email protected]> as the
> top 3 committers to nbtree over the last ~6 months.
>
> I have made substantial progress on adding a few features. I have
> questions, but I will let you go first :)
>
> Motivation:
> *In technical terms:* this proposal is to take advantage of a btree index
> when the query is filtered by a few distinct prefixes and ordered by a
> suffix and has a limit.
> *In non technical:* This could help to efficiently render a social
> network feed, where each user can select a list of users whose posts they
> want to see, and the posts must be ordered from newest to oldest.
>
>
> *Performance Comparison*
> I did a test with a toy table, please find more details below.
>
> With limit 100
>
> | Method | Shared Hit | Shared Read | Exec Time |
> |------------|-----------:|------------:|----------:|
> | Merge | 13 | 119 | 13 ms |
> | Sequential |     15,308 |     525,310 |  3,409 ms |
>
> With limit 1,000,000
>
> | Method     | Shared Hit | Shared Read | Temp Read | Temp Written | Exec Time |
> |------------|-----------:|------------:|----------:|-------------:|----------:|
> | Merge      |    980,318 |      19,721 |         0 |            0 |  2,128 ms |
> | Sequential |     15,208 |     525,410 |    20,207 |       35,384 |  3,762 ms |
> | Bitmap     |        629 |     113,759 |    20,207 |       35,385 |  5,487 ms |
> | IndexScan  |  7,880,619 |     126,706 |    20,945 |       35,386 |  5,874 ms |
>
> Sequential scans and bitmap scans in this case significantly reduce the
> number of accessed buffers because the table has only four integer columns,
> and these methods can read all the rows on a given page at once.
>
> However, that comes at the cost of resorting to an on-disk sort.
> For the query with limit 100 we get no temp files, as we are using a
> top-N heapsort (N = 100).
>
> make check passes
>
>
> *Experiment details*
>
> Consider a 100M row table formed (a,b,c,d) \in 100 x 100 x 100 x 100
>
>
> ```sql
> CREATE TABLE grid AS (
>   SELECT a, b, c, d FROM
> generate_series(1, 100) AS a,
> generate_series(1, 100) AS b,
> generate_series(1, 100) AS c,
> generate_series(1, 100) AS d
> );
>
> CREATE INDEX grid_a_b_c_idx ON grid (a, b, c);
> ANALYSE grid;
> ```
>
> Now let's say that we need to find a certain number of rows filtered by a
> and ordered by b:
> ```sql
> PREPARE grid_query(int) AS
> SELECT sum(d) FROM (
> SELECT * FROM grid
> WHERE a IN (2,3,5,8,13,21,34,55) AND b >= 0
> ORDER BY b
> LIMIT $1) t;
> ```
>
> ---
>
>
> Now with limit 100, with index merge scan (notice Index Prefixes in the
> plan).
>
> ```sql
> SET enable_indexmergescan = on;
> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
> ```
>
> ```text
> Buffers: shared hit=13 read=119
> -> Limit (cost=0.57..87.29 rows=100 width=16) (actual
> time=5.528..12.999 rows=100.00 loops=1)
> Buffers: shared hit=13 read=119
> -> Index Scan using grid_a_b_c_idx on grid (cost=0.57..93.36
> rows=107 width=16) (actual time=5.528..12.994 rows=100.00 loops=1)
> Index Cond: (b >= 0)
> *Index Prefixes: *(a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[]))
> Index Searches: 8
> Buffers: shared hit=13 read=119
> Planning:
> Buffers: shared hit=59 read=23
> Planning Time: 4.619 ms
> Execution Time: 13.055 ms
> ```
>
>
> ```sql
> SET enable_indexmergescan = off;
> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
> ```
>
> ```text
> Aggregate (cost=1603588.06..1603588.07 rows=1 width=8) (actual
> time=3406.624..3408.710 rows=1.00 loops=1)
> Buffers: shared hit=15308 read=525310
> -> Limit (cost=1603575.17..1603586.81 rows=100 width=16) (actual
> time=3406.601..3408.702 rows=100.00 loops=1)
> Buffers: shared hit=15308 read=525310
> -> Gather Merge (cost=1603575.17..2514342.92 rows=7819999
> width=16) (actual time=3406.598..3408.695 rows=100.00 loops=1)
> Workers Planned: 2
> Workers Launched: 2
> Buffers: shared hit=15308 read=525310
> -> Sort (cost=1602575.14..1610720.98 rows=3258333
> width=16) (actual time=3393.782..3393.784 rows=100.00 loops=3)
> Sort Key: grid.b
> Sort Method: top-N heapsort Memory: 32kB
> Buffers: shared hit=15308 read=525310
> Worker 0: Sort Method: top-N heapsort Memory: 32kB
> Worker 1: Sort Method: top-N heapsort Memory: 32kB
> -> *Parallel Seq Scan* on grid
> (cost=0.00..1478044.00 rows=3258333 width=16) (actual time=0.944..3129.896
> rows=2666666.67 loops=3)
> Filter: ((b >= 0) AND (a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])))
> Rows Removed by Filter: 30666667
> Buffers: shared hit=15234 read=525310
> Planning Time: 0.370 ms
> Execution Time: 3409.134 ms
> ```
>
> Now queries with limit 1,000,000
>
> ```sql
> SET enable_indexmergescan = on;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
>
> Query executed with the proposed access method. Notice in the plan Index
> Prefixes and Index Cond.
> ```text
> Buffers: shared hit=980318 read=19721
> -> Limit (cost=0.57..867259.84 rows=1000000 width=16) (actual
> time=2.854..2103.438 rows=1000000.00 loops=1)
> Buffers: shared hit=980318 read=19721
> -> Index Scan using grid_a_b_c_idx on grid
> (cost=0.57..867265.91 rows=1000007 width=16) (actual time=2.852..2066.205
> rows=1000000.00 loops=1)
> Index Cond: (b >= 0)
> *Index Prefixes:* (a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[]))
> Index Searches: 8
> Buffers: shared hit=980318 read=19721
> Planning Time: 0.328 ms
> Execution Time: 2127.811 ms
> ```
>
> If we disable enable_indexmergescan, we naturally fall back to a sequential scan.
>
> ```sql
> SET enable_indexmergescan = off;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
> ```text
> Buffers: shared hit=15208 read=525410, temp read=20207 written=35384
> -> Limit (cost=1942895.64..2059362.12 rows=1000000 width=16) (actual
> time=3467.012..3712.044 rows=1000000.00 loops=1)
> Buffers: shared hit=15208 read=525410, temp read=20207
> written=35384
> -> Gather Merge (cost=1942895.64..2853663.39 rows=7819999
> width=16) (actual time=3467.010..3671.220 rows=1000000.00 loops=1)
> Workers Planned: 2
> Workers Launched: 2
> Buffers: shared hit=15208 read=525410, temp read=20207
> written=35384
> -> Sort (cost=1941895.62..1950041.45 rows=3258333
> width=16) (actual time=3455.852..3476.358 rows=334576.33 loops=3)
> Sort Key: grid.b
> Sort Method: *external merge Disk: 47016kB*
> Buffers: shared hit=15208 read=525410, temp
> read=20207 written=35384
> Worker 0: Sort Method: external merge Disk: 46976kB
> Worker 1: Sort Method: external merge Disk: 47000kB
> -> *Parallel Seq Scan* on grid
> (cost=0.00..1478044.00 rows=3258333 width=16) (actual time=2.789..2779.483
> rows=2666666.67 loops=3)
> Filter: ((b >= 0) AND (a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])))
> Rows Removed by Filter: 30666667
> Buffers: shared hit=15134 read=525410
> Planning Time: 0.332 ms
> Execution Time: 3761.866 ms
> ```
>
> If we disable sequential scans, then we get a bitmap scan
>
> ```sql
> SET enable_seqscan = off;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
> ```text
> Buffers: shared hit=629 read=113759 written=2, temp read=20207
> written=35385
> -> Limit (cost=1998199.78..2114666.26 rows=1000000 width=16) (actual
> time=5170.456..5453.433 rows=1000000.00 loops=1)
> Buffers: shared hit=629 read=113759 written=2, temp read=20207
> written=35385
> -> Gather Merge (cost=1998199.78..2908967.53 rows=7819999
> width=16) (actual time=5170.455..5413.254 rows=1000000.00 loops=1)
> Workers Planned: 2
> Workers Launched: 2
> Buffers: shared hit=629 read=113759 written=2, temp
> read=20207 written=35385
> -> Sort (cost=1997199.75..2005345.59 rows=3258333
> width=16) (actual time=5156.929..5177.507 rows=334500.67 loops=3)
> Sort Key: grid.b
> Sort Method: external merge Disk: 47032kB
> Buffers: shared hit=629 read=113759 written=2, temp
> read=20207 written=35385
> Worker 0: Sort Method: external merge Disk: 47280kB
> Worker 1: Sort Method: external merge Disk: 46680kB
> -> Parallel Bitmap Heap Scan on grid
> (cost=107691.54..1533348.13 rows=3258333 width=16) (actual
> time=299.891..4489.787 rows=2666666.67 loops=3)
> Recheck Cond: ((a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
> Rows *Removed by Index Recheck*: 2410242
> Heap Blocks: exact=13100 lossy=22639
> Buffers: shared hit=615 read=113759 written=2
> Worker 0: Heap Blocks: exact=13077 lossy=22755
> Worker 1: Heap Blocks: exact=13036 lossy=22421
> -> *Bitmap Index Scan* on grid_a_b_c_idx
> (cost=0.00..105736.54 rows=7820000 width=0) (actual time=297.651..297.651
> rows=8000000.00 loops=1)
> Index Cond: ((a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
> Index Searches: 7
> Buffers: shared hit=13 read=7293 written=2
> Planning Time: 0.165 ms
> Execution Time: 5487.213 ms
> ```
>
> If we disable bitmap scans we finally get an index scan
>
> ```sql
> SET enable_bitmapscan = off;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
> ```
> Buffers: shared hit=7883221 read=124111, temp read=20699 written=35385
> -> Limit (cost=7201203.08..7317669.55 rows=1000000 width=16) (actual
> time=4414.478..4674.400 rows=1000000.00 loops=1)
> Buffers: shared hit=7883221 read=124111, temp read=20699
> written=35385
> -> Gather Merge (cost=7201203.08..8111970.83 rows=7819999
> width=16) (actual time=4414.476..4633.982 rows=1000000.00 loops=1)
> Workers Planned: 2
> Workers Launched: 2
> Buffers: shared hit=7883221 read=124111, temp read=20699
> written=35385
> -> Sort (cost=7200203.05..7208348.88 rows=3258333
> width=16) (actual time=4390.625..4411.896 rows=334567.00 loops=3)
> Sort Key: grid.b
> Sort Method: *external merge Disk: 47304kB*
> Buffers: shared hit=7883221 read=124111, temp
> read=20699 written=35385
> Worker 0: Sort Method: external merge Disk: 47304kB
> Worker 1: Sort Method: external merge Disk: 46384kB
> -> *Parallel Index Scan* using grid_a_b_c_idx on
> grid (cost=0.57..6736351.43 rows=3258333 width=16) (actual
> time=46.925..3796.915 rows=2666666.67 loops=3)
> Index Cond: ((a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
> Index Searches: 7
> Buffers: shared hit=7883208 read=124110
> Planning Time: 0.385 ms
> Execution Time: 4713.325 ms
> ```
>
>
>
>
>
>
> On Thu, Feb 5, 2026 at 6:59 AM Alexandre Felipe <
> [email protected]> wrote:
>
>> Thank you for looking into this.
>>
>> Now we can execute a (still narrow) family of queries!
>>
>> Maybe it helps to see this as a *social network feed*. Imagine a social
>> network, you have a few friends, or follow a few people, and you want to
>> see their updates ordered by date. For each user we have a different
>> combination of users that we have to display. But maybe, even having
>> hundreds of users we will only show the first 10.
>>
>> There is low-hanging fruit in the skip scan: if we need M rows and one
>> group already has M rows, we could stop reading that group there.
>> Let Nx be the number of friends, M the number of posts to show, and N
>> the number of posts per friend.
>> This reads O(Nx * M) rows, followed by an O(Nx * M)-row sort,
>> instead of O(Nx * N) rows followed by an O(Nx * N)-row sort.
>> Where M = 10 and N is 1000 this is a significant improvement.
>> But if M ~ N, the merge scan is better: it runs with M + Nx row accesses
>> and (M + Nx) heap operations.
>> If everything is on the same page the skip scan would win.
>>
>> The cost estimation is probably far off.
>> I am also not considering the filters applied after this operator, and I
>> don't know if the planner infrastructure is able to adjust it by itself.
>> This is where I would like reviewers' feedback. I think that the planner
>> costs are something to be determined experimentally.
>>
>> Next I will make it slightly more general, handling:
>> * More index columns: Index (a, b, s...) could support WHERE a IN (...)
>> ORDER BY b LIMIT N (ignoring s...)
>> * Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
>> * Non-leading prefix: WHERE b IN (...) AND a = const ORDER BY c on index
>> (a, b, c)
>>
>> ---
>> Kind Regards,
>> Alexandre
>>
>> On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <[email protected]> wrote:
>>
>>>
>>>
>>> On 3 Feb 2026, at 22:42, Ants Aasma <[email protected]> wrote:
>>>
>>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <[email protected]> wrote:
>>>
>>> I'm also wondering how common is the targeted query pattern? How common
>>> it is to have an IN condition on the leading column in an index, and
>>> ORDER BY on the second one?
>>>
>>>
>>> I have seen this pattern multiple times. My nickname for it is the
>>> timeline view. Think of the social media timeline, showing posts from
>>> all followed accounts in timestamp order, returned in reasonably sized
>>> batches. The naive SQL query will have to scan all posts from all
>>> followed accounts and pass them through a top-N sort. When the total
>>> number of posts is much larger than the batch size this is much slower
>>> than what is proposed here (assuming I understand it correctly) -
>>> effectively equivalent to running N index scans through Merge Append.
>>>
>>>
>>> My workarounds I have proposed users have been either to rewrite the
>>> query as a UNION ALL of a set of single value prefix queries wrapped
>>> in an order by limit. This gives the exact needed merge append plan
>>> shape. But repeating the query N times can get unwieldy when the
>>> number of values grows, so the fallback is:
>>>
>>> SELECT * FROM unnest(:friends) id, LATERAL (
>>> SELECT * FROM posts
>>> WHERE user_id = id
>>> ORDER BY tstamp DESC LIMIT 100)
>>> ORDER BY tstamp DESC LIMIT 100;
>>>
>>> The downside of this formulation is that we still have to fetch a
>>> batch worth of items from scans where we otherwise would have only had
>>> to look at one index tuple.
>>>
>>>
>>> GIST can be used to handle this kind of queries as it supports multiple
>>> sort orders.
>>> The only problem is that GIST does not support ORDER BY column.
>>> One possible workaround is [1] but as described there it does not play
>>> well with partitioning.
>>> I’ve started drafting support for ORDER BY column in GIST - see [2].
>>> I think it would be easier to implement and maintain than a new IAM (but
>>> I don’t have enough knowledge and experience to implement it myself)
>>>
>>> [1]
>>> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org
>>> [2]
>>> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org
>>>
>>> —
>>> Kind regards,
>>> Michal
>>>
>>
* Re: New access method for b-tree.
From: Alexandre Felipe @ 2026-03-17 12:37 UTC (permalink / raw)
To: pgsql-hackers; [email protected]; [email protected]; [email protected] <[email protected]>; +Cc: Ants Aasma <[email protected]>; Tomas Vondra <[email protected]>; Alexandre Felipe <[email protected]>; Michał Kłeczek <[email protected]>; Andres Freund <[email protected]>
Happy St. Patrick's day!
Based on what I said in previous emails, I see three alternative proposals:
#1 Make it simpler by not changing the index access methods.
#2 Make it optimal in some sense by not using generic index searches
and not keeping multiple open index scans.
#3 Follow the pragmatic approach: the objective is to minimize the number
of heap fetches, staying as high level as possible and reusing existing
functions instead of writing custom code when possible.
Ants Aasma & Tomas Vondra
> > My workarounds I have proposed users have been either to rewrite the
> > query as a UNION ALL of a set of single value prefix queries wrapped
> > in an order by limit. This gives the exact needed merge append plan
> > shape. But repeating the query N times can get unwieldy when the
> > number of values grows, so the fallback is:
> >
> > SELECT * FROM unnest(:friends) id, LATERAL (
> > SELECT * FROM posts
> > WHERE user_id = id
> > ORDER BY tstamp DESC LIMIT 100)
> > ORDER BY tstamp DESC LIMIT 100;
> >
> > The downside of this formulation is that we still have to fetch a
> > batch worth of items from scans where we otherwise would have only had
> > to look at one index tuple.
> >
> True. It's useful to think about the query this way, and it may be
> better than full select + sort, but it has issues too.
>
An issue with this query is generality: if it is joined with other
queries, we can't determine the limit in advance.
> > The main problem I can see is that at planning time the cardinality of
> > the prefix array might not be known, and in theory could be in the
> > millions. Having millions of index scans open at the same time is not
> > viable, so the method needs to somehow degrade gracefully. The idea I
> > had is to pick some limit, based on work_mem and/or benchmarking, and
> > once the limit is hit, populate the first batch and then run the next
> > batch of index scans, merging with the first result. Or something like
> > that, I can imagine a few different ways to handle it with different
> > tradeoffs.
> >
>
> Doesn't the proposed merge scan have a similar issue? Because that will
> also have to keep all the index scans open (even if only internally).
> Indeed, it needs to degrade gracefully, in some way.
It is true, but I think we can trust the planner.
This problem scales similarly in a memoize node.
Is ~24kB for each open index scan a good guess?
ALTERNATIVE #1 - More efficient
Or, to avoid having N open index scans, we could (??):
(1) Find the index page for the head of each prefix.
(2) For each prefix:
(2.a) load tuples from its head page;
(2.b) if we consume the last tuple in a page, save a pointer
to the next page;
(2.c) check whether tuples for the next prefix are in the same page;
(2.d) release the page.
(3) Produce tuples in suffix order:
(3.a) when the tuples for a prefix are exhausted, load the
page saved in (2.b).
Step 2.a would possibly waste time extracting tuples that are
not needed, and memory storing them. I am not sure how efficient
this can be compared to having an open index scan.
Matthias van de Meent, Feb 3
> btree index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are directly
> adjacent, while merge scan would generally do at least one index
> descent for each of its N scan heads (*) - which in the proposed
> prototype patch guarantees O(index depth * num scan heads) buffer
> accesses.
This could also be addressed if we do this custom descent.
I didn't bother about that depth factor because, with a few random prefixes,
doing so would probably save accesses only for the top level.
I would prefer to start with a very conceptual implementation
that can already provide a 1000x speedup, but if you think this
way is better, I am open to trying it. I think this can be done
without affecting the planner logic and the PrefixJoin node.
> I'm afraid the
> proposed batches execution will be rather complex, so I'd say v1 should
> simply have a threshold, and do the full scan + sort for more items.
Do you mean an executor node that performs the query as if it were written
like this?
ALTERNATIVE #2 - Simpler(??)
    for each _prefix of prefixes:
        result += (SELECT FROM table
                   WHERE prefix = _prefix AND qual(*)
                   ORDER BY suffix
                   LIMIT N)
    return SELECT * FROM result
           ORDER BY suffix
           LIMIT N
This query may have to produce N * len(prefixes) rows, while the
original proposal would produce only N + len(prefixes) - 1.
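
As a rough illustration of those two counts, the following Python sketch
compares how many index tuples each strategy reads on a toy model. It is
only a sketch under assumptions: `index` is a hypothetical dict mapping each
prefix to its ascending suffixes, the prefixes are assumed distinct and
present in `index`, and `batch_fetches` / `merge_fetches` are names invented
here, not functions from the patch.

```python
import heapq


def batch_fetches(index, prefixes, limit):
    """Per-prefix LIMIT followed by one final sort (Alternative #2)."""
    rows = []
    for p in prefixes:
        rows.extend((s, p) for s in index[p][:limit])
    fetched = len(rows)  # up to limit * len(prefixes) tuples read
    return sorted(rows)[:limit], fetched


def merge_fetches(index, prefixes, limit):
    """K-way merge keeping a single head tuple per prefix."""
    pos = {p: 0 for p in prefixes}
    heap, fetched = [], 0
    for p in prefixes:
        if index[p]:
            heap.append((index[p][0], p))
            pos[p] = 1
            fetched += 1
    heapq.heapify(heap)

    out = []
    while heap and len(out) < limit:
        s, p = heapq.heappop(heap)
        out.append((s, p))
        # Read the next tuple of this prefix only if more output is needed.
        if len(out) < limit and pos[p] < len(index[p]):
            heapq.heappush(heap, (index[p][pos[p]], p))
            pos[p] += 1
            fetched += 1
    return out, fetched


index = {p: list(range(100)) for p in (2, 3, 5, 8, 13, 21, 34, 55)}
rows_b, n_b = batch_fetches(index, list(index), 10)
rows_m, n_m = merge_fetches(index, list(index), 10)
print(rows_b == rows_m, n_b, n_m)  # True 80 17
```

With 8 prefixes and a limit of 10, the batch strategy reads 8 * 10 = 80
tuples and the merge reads 10 + 8 - 1 = 17, while both return the same rows.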
Comparing to the previous results,
> | Method | Shared Hit | Shared Read | Exec Time |
> |------------|-----------:|------------:|----------:|
> | Merge | 13 | 119 | 13 ms |
> | Sequential |     15,308 |     525,310 |  3,409 ms |
This Prefix Batch Scan approach gives
hit=62 read=773, Execution Time: 80.815 ms
With 8 prefixes, the execution time increased by a factor of 80.8/13 ~ 6.2
and the number of buffers by a factor of (62 + 773)/(13 + 119) = 835/132 ~ 6.3
> I can imagine that this would really nicely benefit from
> ReadStream'ification.
> >
>
> Not sure, maybe.
>
Actually as I was watching the index prefetch development I was
quite uncertain about how this would play with that, but we can
probably simply give a budget for each stream.
> One other connection I see is with block nested loops. In a perfect
> > future PostgreSQL could run the following as a set of merged index
> > scans that terminate early:
> >
> > SELECT posts.*
> > FROM follows f
> > JOIN posts p ON f.followed_id = p.user_id
> > WHERE f.follower_id = :userid
> > ORDER BY p.tstamp DESC LIMIT 100;
> >
> > In practice this is not a huge issue - it's not that hard to transform
> > this to array_agg and = ANY subqueries.
> >
> Automating that transformation seems quite non-trivial (to me).
>
Well, not trivial. To give a rough idea.
wc -l *.patch
113 v2-0001-Test-the-baseline.patch
614 v2-0002-Access-method.patch
850 v2-0003-Planner-integration.patch
1958 v2-0004-Multi-column.patch
2439 v2-0005-Joins.patch
It is missing some important details like prefix deduplication,
but for the scenario where the values on the other table
are known to be unique it works well.
The multi-column patch accepts conditions like A IN (...) AND B IN (...),
for which it computes the Cartesian product, or (A, B) IN (...).
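
As a rough illustration of the multi-column case (a sketch, not the patch's code), the two accepted forms can be reduced to one list of prefix tuples. The helper name `expand_prefixes` is hypothetical, and sorting and de-duplicating the result is a choice made for this sketch only.

```python
from itertools import product

def expand_prefixes(*column_values):
    """Expand A IN (...) AND B IN (...) into sorted, distinct (A, B) prefixes."""
    return sorted(set(product(*column_values)))

# A IN (1, 2) AND B IN (10, 20, 20): one prefix per combination.
prefixes = expand_prefixes([1, 2], [10, 20, 20])
print(prefixes)  # [(1, 10), (1, 20), (2, 10), (2, 20)]

# (A, B) IN ((1, 10), (2, 20)) already is a prefix list; no product needed.
row_value_prefixes = sorted({(1, 10), (2, 20)})
```

Each resulting tuple would then drive one index segment in the merge, so the number of segments grows with the product of the list lengths in the first form.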
Regards,
Alexandre
On Mon, Feb 23, 2026 at 10:08 PM Alexandre Felipe <
[email protected]> wrote:
> Hi Hackers,
>
> Do you think that MERGE-SCAN was a terrible name? I wanted a name that
> wouldn't require much explanation. I named it like this because it relies
> on a k-way merge to combine several segments of an index in one result.
> But we already have a MERGE statement. Even in the example plan above we
> can see an external merge that has nothing to do with the new feature, and
> now as I am doing joins, I started doing it on the NestedLoop trying to
> follow the same conditions that lead to a memoize. But I added so many
> fields to the NestedLoop state that I think it is good to have a separate
> structure, and maybe a separate node, and MergeScan of course is taken
> hehe. I was thinking of IndexPrefixMerge. We could use the Ants nickname
> TimeLineScan, but of course it is not limited to time lines (even though
> realistically, that will probably be the most common use of this). Another
> one I considered was TransposedIndexScan, because it orders output on
> (suffix, prefix) instead of (prefix, suffix).
>
>
> On Fri, Feb 6, 2026 at 10:52 AM Alexandre Felipe <
> [email protected]> wrote:
>
>> Hello again hackers!
>>
>> [email protected] <[email protected]>: That seems to be the one that is probably the
>> most familiar with the index scan (based on the commits).
>> [email protected] <[email protected]> , [email protected]
>> <[email protected]> , [email protected] <[email protected]> as
>> the top 3 committers to nbtree over the last ~6 months.
>>
>> I have made substantial progress on adding a few features. I have
>> questions, but I will let you go first :)
>>
>> Motivation:
>> *In technical terms:* this proposal is to take advantage of a btree
>> index when the query is filtered by a few distinct prefixes and ordered by
>> a suffix and has a limit.
>> *In non technical:* This could help to efficiently render a social
>> network feed, where each user can select a list of users whose posts they
>> want to see, and the posts must be ordered from newest to oldest.
>>
>>
>> *Performance Comparison*
>> I did a test with a toy table, please find more details below.
>>
>> With limit 100
>>
>> | Method | Shared Hit | Shared Read | Exec Time |
>> |------------|-----------:|------------:|----------:|
>> | Merge | 13 | 119 | 13 ms |
>> | IndexScan | 15,308 | 525,310 | 3,409 ms |
>>
>> With limit 1,000,000
>>
>> | Method | Shared Hit | Shared Read | Temp Read | Temp Written | Exec Time |
>> |------------|-----------:|------------:|----------:|-------------:|----------:|
>> | Merge | 980,318 | 19,721 | 0 | 0 | 2,128 ms |
>> | Sequential | 15,208 | 525,410 | 20,207 | 35,384 | 3,762 ms |
>> | Bitmap | 629 | 113,759 | 20,207 | 35,385 | 5,487 ms |
>> | IndexScan | 7,880,619 | 126,706 | 20,945 | 35,386 | 5,874 ms |
>>
>> Sequential scans and bitmap scans in this case significantly reduce the
>> number of accessed buffers because the table has only four integer
>> columns, and these methods can read all the rows on a given page at once.
>>
>> However, that comes at the cost of resorting to an on-disk sort.
>> For the query with limit 100 we get no temp files because a top-N
>> heapsort (with N = 100) is used.
>>
>> make check passes
>>
>>
>> *Experiment details*
>>
>> Consider a 100M-row table formed by all (a, b, c, d) in 100 x 100 x 100 x 100
>>
>>
>> ```sql
>> CREATE TABLE grid AS (
>> SELECT a, b, c, d FROM
>> generate_series(1, 100) AS a,
>> generate_series(1, 100) AS b,
>> generate_series(1, 100) AS c,
>> generate_series(1, 100) AS d
>> );
>>
>> CREATE INDEX grid_a_b_c_idx ON grid (a, b, c);
>> ANALYSE grid;
>> ```
>>
>> Now let's say that we need to find a certain number of rows filtered by a
>> and ordered by b:
>> ```sql
>> PREPARE grid_query(int) AS
>> SELECT sum(d) FROM (
>> SELECT * FROM grid
>> WHERE a IN (2,3,5,8,13,21,34,55) AND b >= 0
>> ORDER BY b
>> LIMIT $1) t;
>> ```
>>
>> ---
>>
>>
>> Now with limit 100, with index merge scan (notice Index Prefixes in the
>> plan).
>>
>> ```sql
>> SET enable_indexmergescan = on;
>> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
>> ```
>>
>> ```text
>> Buffers: shared hit=13 read=119
>> -> Limit (cost=0.57..87.29 rows=100 width=16) (actual
>> time=5.528..12.999 rows=100.00 loops=1)
>> Buffers: shared hit=13 read=119
>> -> Index Scan using grid_a_b_c_idx on grid (cost=0.57..93.36
>> rows=107 width=16) (actual time=5.528..12.994 rows=100.00 loops=1)
>> Index Cond: (b >= 0)
>> *Index Prefixes: *(a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[]))
>> Index Searches: 8
>> Buffers: shared hit=13 read=119
>> Planning:
>> Buffers: shared hit=59 read=23
>> Planning Time: 4.619 ms
>> Execution Time: 13.055 ms
>> ```
>>
>>
>> ```sql
>> SET enable_indexmergescan = off;
>> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
>> ```
>>
>> ```text
>> Aggregate (cost=1603588.06..1603588.07 rows=1 width=8) (actual
>> time=3406.624..3408.710 rows=1.00 loops=1)
>> Buffers: shared hit=15308 read=525310
>> -> Limit (cost=1603575.17..1603586.81 rows=100 width=16) (actual
>> time=3406.601..3408.702 rows=100.00 loops=1)
>> Buffers: shared hit=15308 read=525310
>> -> Gather Merge (cost=1603575.17..2514342.92 rows=7819999
>> width=16) (actual time=3406.598..3408.695 rows=100.00 loops=1)
>> Workers Planned: 2
>> Workers Launched: 2
>> Buffers: shared hit=15308 read=525310
>> -> Sort (cost=1602575.14..1610720.98 rows=3258333
>> width=16) (actual time=3393.782..3393.784 rows=100.00 loops=3)
>> Sort Key: grid.b
>> Sort Method: top-N heapsort Memory: 32kB
>> Buffers: shared hit=15308 read=525310
>> Worker 0: Sort Method: top-N heapsort Memory: 32kB
>> Worker 1: Sort Method: top-N heapsort Memory: 32kB
>> -> *Parallel Seq Scan* on grid
>> (cost=0.00..1478044.00 rows=3258333 width=16) (actual time=0.944..3129.896
>> rows=2666666.67 loops=3)
>> Filter: ((b >= 0) AND (a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])))
>> Rows Removed by Filter: 30666667
>> Buffers: shared hit=15234 read=525310
>> Planning Time: 0.370 ms
>> Execution Time: 3409.134 ms
>> ```
>>
>> Now queries with limit 1,000,000
>>
>> ```sql
>> SET enable_indexmergescan = on;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>>
>> Query executed with the proposed access method. Notice in the plan Index
>> Prefixes and Index Cond.
>> ```text
>> Buffers: shared hit=980318 read=19721
>> -> Limit (cost=0.57..867259.84 rows=1000000 width=16) (actual
>> time=2.854..2103.438 rows=1000000.00 loops=1)
>> Buffers: shared hit=980318 read=19721
>> -> Index Scan using grid_a_b_c_idx on grid
>> (cost=0.57..867265.91 rows=1000007 width=16) (actual time=2.852..2066.205
>> rows=1000000.00 loops=1)
>> Index Cond: (b >= 0)
>> *Index Prefixes:* (a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[]))
>> Index Searches: 8
>> Buffers: shared hit=980318 read=19721
>> Planning Time: 0.328 ms
>> Execution Time: 2127.811 ms
>> ```
>>
>> If we disable enable_indexmergescan, we naturally fall back to a sequential scan.
>>
>> ```sql
>> SET enable_indexmergescan = off;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>> ```text
>> Buffers: shared hit=15208 read=525410, temp read=20207 written=35384
>> -> Limit (cost=1942895.64..2059362.12 rows=1000000 width=16) (actual
>> time=3467.012..3712.044 rows=1000000.00 loops=1)
>> Buffers: shared hit=15208 read=525410, temp read=20207
>> written=35384
>> -> Gather Merge (cost=1942895.64..2853663.39 rows=7819999
>> width=16) (actual time=3467.010..3671.220 rows=1000000.00 loops=1)
>> Workers Planned: 2
>> Workers Launched: 2
>> Buffers: shared hit=15208 read=525410, temp read=20207
>> written=35384
>> -> Sort (cost=1941895.62..1950041.45 rows=3258333
>> width=16) (actual time=3455.852..3476.358 rows=334576.33 loops=3)
>> Sort Key: grid.b
>> Sort Method: *external merge Disk: 47016kB*
>> Buffers: shared hit=15208 read=525410, temp
>> read=20207 written=35384
>> Worker 0: Sort Method: external merge Disk: 46976kB
>> Worker 1: Sort Method: external merge Disk: 47000kB
>> -> *Parallel Seq Scan* on grid
>> (cost=0.00..1478044.00 rows=3258333 width=16) (actual time=2.789..2779.483
>> rows=2666666.67 loops=3)
>> Filter: ((b >= 0) AND (a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])))
>> Rows Removed by Filter: 30666667
>> Buffers: shared hit=15134 read=525410
>> Planning Time: 0.332 ms
>> Execution Time: 3761.866 ms
>> ```
>>
>> If we disable sequential scans, then we get a bitmap scan
>>
>> ```sql
>> SET enable_seqscan = off;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>> ```text
>> Buffers: shared hit=629 read=113759 written=2, temp read=20207
>> written=35385
>> -> Limit (cost=1998199.78..2114666.26 rows=1000000 width=16) (actual
>> time=5170.456..5453.433 rows=1000000.00 loops=1)
>> Buffers: shared hit=629 read=113759 written=2, temp read=20207
>> written=35385
>> -> Gather Merge (cost=1998199.78..2908967.53 rows=7819999
>> width=16) (actual time=5170.455..5413.254 rows=1000000.00 loops=1)
>> Workers Planned: 2
>> Workers Launched: 2
>> Buffers: shared hit=629 read=113759 written=2, temp
>> read=20207 written=35385
>> -> Sort (cost=1997199.75..2005345.59 rows=3258333
>> width=16) (actual time=5156.929..5177.507 rows=334500.67 loops=3)
>> Sort Key: grid.b
>> Sort Method: external merge Disk: 47032kB
>> Buffers: shared hit=629 read=113759 written=2, temp
>> read=20207 written=35385
>> Worker 0: Sort Method: external merge Disk: 47280kB
>> Worker 1: Sort Method: external merge Disk: 46680kB
>> -> Parallel Bitmap Heap Scan on grid
>> (cost=107691.54..1533348.13 rows=3258333 width=16) (actual
>> time=299.891..4489.787 rows=2666666.67 loops=3)
>> Recheck Cond: ((a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
>> Rows *Removed by Index Recheck*: 2410242
>> Heap Blocks: exact=13100 lossy=22639
>> Buffers: shared hit=615 read=113759 written=2
>> Worker 0: Heap Blocks: exact=13077 lossy=22755
>> Worker 1: Heap Blocks: exact=13036 lossy=22421
>> -> *Bitmap Index Scan* on grid_a_b_c_idx
>> (cost=0.00..105736.54 rows=7820000 width=0) (actual time=297.651..297.651
>> rows=8000000.00 loops=1)
>> Index Cond: ((a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
>> Index Searches: 7
>> Buffers: shared hit=13 read=7293
>> written=2
>> Planning Time: 0.165 ms
>> Execution Time: 5487.213 ms
>> ```
>>
>> If we disable bitmap scans we finally get an index scan
>>
>> ```sql
>> SET enable_bitmapscan = off;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>> ```text
>> Buffers: shared hit=7883221 read=124111, temp read=20699 written=35385
>> -> Limit (cost=7201203.08..7317669.55 rows=1000000 width=16) (actual
>> time=4414.478..4674.400 rows=1000000.00 loops=1)
>> Buffers: shared hit=7883221 read=124111, temp read=20699
>> written=35385
>> -> Gather Merge (cost=7201203.08..8111970.83 rows=7819999
>> width=16) (actual time=4414.476..4633.982 rows=1000000.00 loops=1)
>> Workers Planned: 2
>> Workers Launched: 2
>> Buffers: shared hit=7883221 read=124111, temp read=20699
>> written=35385
>> -> Sort (cost=7200203.05..7208348.88 rows=3258333
>> width=16) (actual time=4390.625..4411.896 rows=334567.00 loops=3)
>> Sort Key: grid.b
>> Sort Method: *external merge Disk: 47304kB*
>> Buffers: shared hit=7883221 read=124111, temp
>> read=20699 written=35385
>> Worker 0: Sort Method: external merge Disk: 47304kB
>> Worker 1: Sort Method: external merge Disk: 46384kB
>> -> *Parallel Index Scan* using grid_a_b_c_idx on
>> grid (cost=0.57..6736351.43 rows=3258333 width=16) (actual
>> time=46.925..3796.915 rows=2666666.67 loops=3)
>> Index Cond: ((a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
>> Index Searches: 7
>> Buffers: shared hit=7883208 read=124110
>> Planning Time: 0.385 ms
>> Execution Time: 4713.325 ms
>> ```
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 5, 2026 at 6:59 AM Alexandre Felipe <
>> [email protected]> wrote:
>>
>>> Thank you for looking into this.
>>>
>>> Now we can execute a (still narrow) family of queries!
>>>
>>> Maybe it helps to see this as a *social network feed*. Imagine a
>>> social network, you have a few friends, or follow a few people, and you
>>> want to see their updates ordered by date. For each user we have a
>>> different combination of users that we have to display. But maybe, even
>>> having hundreds of users we will only show the first 10.
>>>
>>> There is a low-hanging fruit on the skip scan: if we need M rows and
>>> one group has already produced M rows, we could stop reading that group.
>>> Let Nx be the number of friends, N the number of posts per friend, and
>>> M the number of posts to show.
>>> The skip scan would then read (Nx * M) rows, followed by an (Nx * M) sort,
>>> instead of (Nx * N) rows followed by an (Nx * N) sort.
>>> Where M = 10 and N is 1000 this is a significant improvement.
>>> But if M ~ N the shortcut saves little, while the merge scan still runs
>>> with M + Nx row accesses and (M + Nx) heap operations.
>>> If everything is on the same page the skip scan would win.
>>>
>>> The cost estimation is probably far off.
>>> I am also not considering the filters applied after this operator, and I
>>> don't know if the planner infrastructure is able to adjust it by itself.
>>> This is where I would like reviewers' feedback. I think that the planner
>>> costs are something to be determined experimentally.
>>>
>>> Next I will make it slightly more general, handling:
>>> * More index columns: Index (a, b, s...) could support WHERE a IN (...)
>>> ORDER BY b LIMIT N (ignoring s...)
>>> * Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
>>> * Non-leading prefix: WHERE b IN (...) AND a = const ORDER BY c on index
>>> (a, b, c)
>>>
>>> ---
>>> Kind Regards,
>>> Alexandre
>>>
>>> On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <[email protected]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On 3 Feb 2026, at 22:42, Ants Aasma <[email protected]> wrote:
>>>>
>>>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <[email protected]> wrote:
>>>>
>>>> I'm also wondering how common is the targeted query pattern? How common
>>>> it is to have an IN condition on the leading column in an index, and
>>>> ORDER BY on the second one?
>>>>
>>>>
>>>> I have seen this pattern multiple times. My nickname for it is the
>>>> timeline view. Think of the social media timeline, showing posts from
>>>> all followed accounts in timestamp order, returned in reasonably sized
>>>> batches. The naive SQL query will have to scan all posts from all
>>>> followed accounts and pass them through a top-N sort. When the total
>>>> number of posts is much larger than the batch size this is much slower
>>>> than what is proposed here (assuming I understand it correctly) -
>>>> effectively equivalent to running N index scans through Merge Append.
>>>>
>>>>
>>>> The workarounds I have proposed to users have been either to rewrite the
>>>> query as a UNION ALL of a set of single value prefix queries wrapped
>>>> in an order by limit. This gives the exact needed merge append plan
>>>> shape. But repeating the query N times can get unwieldy when the
>>>> number of values grows, so the fallback is:
>>>>
>>>> SELECT * FROM unnest(:friends) id, LATERAL (
>>>> SELECT * FROM posts
>>>> WHERE user_id = id
>>>> ORDER BY tstamp DESC LIMIT 100)
>>>> ORDER BY tstamp DESC LIMIT 100;
>>>>
>>>> The downside of this formulation is that we still have to fetch a
>>>> batch worth of items from scans where we otherwise would have only had
>>>> to look at one index tuple.
>>>>
>>>>
>>>> GIST can be used to handle this kind of queries as it supports multiple
>>>> sort orders.
>>>> The only problem is that GIST does not support ORDER BY column.
>>>> One possible workaround is [1] but as described there it does not play
>>>> well with partitioning.
>>>> I’ve started drafting support for ORDER BY column in GIST - see [2].
>>>> I think it would be easier to implement and maintain than a new IAM
>>>> (but I don’t have enough knowledge and experience to implement it myself)
>>>>
>>>> [1]
>>>> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org
>>>> [2]
>>>> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org
>>>>
>>>> —
>>>> Kind regards,
>>>> Michal
>>>>
>>>
end of thread, other threads:[~2026-03-20 13:44 UTC | newest]
Thread overview: 12+ messages (download: mbox mbox.gz follow: Atom feed)
2026-02-01 10:02 New access method for b-tree. Alexandre Felipe <[email protected]>
2026-02-01 23:54 ` Tomas Vondra <[email protected]>
2026-02-03 16:01 ` Matthias van de Meent <[email protected]>
2026-02-03 22:25 ` Tomas Vondra <[email protected]>
2026-02-03 21:42 ` Ants Aasma <[email protected]>
2026-02-03 22:41 ` Tomas Vondra <[email protected]>
2026-03-20 13:44 ` Alexandre Felipe <[email protected]>
2026-02-04 07:13 ` Michał Kłeczek <[email protected]>
2026-02-05 06:59 ` Alexandre Felipe <[email protected]>
2026-02-06 10:52 ` Alexandre Felipe <[email protected]>
2026-02-23 22:08 ` Alexandre Felipe <[email protected]>
2026-03-17 12:37 ` Alexandre Felipe <[email protected]>