public inbox for [email protected]
help / color / mirror / Atom feedFrom: Amit Langote <[email protected]>
To: Daniil Davydov <[email protected]>
Cc: cca5507 <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Cc: Tomas Vondra <[email protected]>
Subject: Re: Batching in executor
Date: Thu, 29 Jan 2026 08:35:13 +0100
Message-ID: <CA+HiwqH-2GmTKLW9kHdnqV4KdFiPfuAdVK2TgqOM2JaaeUYXnw@mail.gmail.com> (raw)
In-Reply-To: <CA+HiwqGufj46_4kT6z5nQrCOxp3rcMDR8fn_t33n2pDKokNRQg@mail.gmail.com>
References: <CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com>
<[email protected]>
<CA+HiwqGqeS_94wxiYa8VpymiR_OtFPKDSpX+Me=MYWO45f5yig@mail.gmail.com>
<[email protected]>
<CA+HiwqGM0ZTeVHicSkGnCp-2U-jvU-KBQCkPJ0N7nAj_c2LjZg@mail.gmail.com>
<CA+HiwqHyE7-oOvtZ+OC-4N7DvKSr8Jbu75erMLQ7O4d6gfkBhg@mail.gmail.com>
<CA+HiwqEPMwhg6pUE4XML2rG4fRqKMpeGTtMwRPGes90f9iOqtg@mail.gmail.com>
<CA+HiwqEZja5rJ78p3FBDZNvynWsHwanxyt6h0YaK_r84NemXng@mail.gmail.com>
<[email protected]>
<CAJDiXggP41+-HrRzT+BmtgmS8wkoZM7b4skkQA5NAe+NFEMPSQ@mail.gmail.com>
<CA+HiwqGufj46_4kT6z5nQrCOxp3rcMDR8fn_t33n2pDKokNRQg@mail.gmail.com>
Hi,
Here is v5 of the patch series.
Patches 0001-0003 add the core batching infrastructure. 0001 adds the
batch table AM API with heapam implementation, 0002 wires up SeqScan
to use it (still returning one slot at a time), and 0003 adds EXPLAIN
(BATCHES). I'd love to hear people's thoughts around TupleBatch
structure added in 0002. I thought about making it a separate patch so
that 0002 will still populate the single ScanState.ss_scanTupleSlot,
but that means we'd still have to call the TAM callback to populate
the tuple in the TAM's batch struct into the slot, defeating the whole
point. With TupleBatch, you have executor_batch_rows number of slots
which are filled in one TAM callback (materialize_all) call. So I
decided to keep the TupleBatch and related things in 0002.
For scans without quals, batching shows 20-30% improvement with no
visible regressions when batching is disabled (batch_rows=0):
SELECT * FROM t LIMIT n (no qual)
Rows Master batch=0 %diff batch=64 %diff
------ -------- ------- ----- -------- -----
1M 12.42 ms 11.96 ms 3.7% 8.56 ms 31.0%
3M 38.95 ms 38.92 ms 0.1% 28.59 ms 26.6%
10M 153.64 ms 150.28 ms 2.2% 112.95 ms 26.5%
(%diff: positive = faster than master, negative = slower)
Patches 0004-0005 add batched qual evaluation and are more
experimental (see below on why 0005 exists). For quals referencing
early columns, the improvement is significant:
SELECT * FROM t WHERE a = 0 ... OFFSET n (qual on 1st col)
Rows Master batch=64 %diff
------ -------- -------- -----
1M 30.19 ms 15.55 ms 48.5%
3M 92.47 ms 50.01 ms 45.9%
10M 325.58 ms 211.83 ms 34.9%
However, for quals on later columns (e.g., 15th), batching provides no
benefit - deformation dominates and batching doesn't help:
SELECT * FROM t WHERE o = 0 ... OFFSET n (qual on 15th col)
Rows Master batch=64 %diff
------ -------- -------- -----
1M 44.14 ms 44.56 ms -0.9%
3M 133.89 ms 137.77 ms -2.9%
10M 503.33 ms 528.88 ms -5.1%
I don't have a satisfactory explanation for why batching doesn't help
the deform-heavy case at all. One would expect at least some benefit
from reduced per-tuple overhead, but that's not materializing.
I've also been struggling to understand why 0004 affects the per-tuple
path even when batch_rows=0. For quals with 0% selectivity (all rows
fail the qual), perf shows ExecInterpExpr is noticeably hotter with
the patched code compared to master, even though batching is disabled:
SELECT * FROM t WHERE a = 0 ... OFFSET n (0% selectivity)
Rows Master batch=0 %diff batch=64 %diff
------ -------- ------- ----- -------- -----
1M 24.37 ms 28.67 ms -17.6% 12.46 ms 48.9%
3M 73.95 ms 85.07 ms -15.0% 41.64 ms 43.7%
10M 287.63 ms 316.81 ms -10.1% 188.01 ms 34.6%
Compare that to 100% selectivity (all rows pass), where there's no regression:
SELECT * FROM t WHERE a > 0 ... OFFSET n (100% selectivity)
Rows Master batch=0 %diff batch=64 %diff
------ -------- ------- ----- -------- -----
1M 29.44 ms 29.10 ms 1.2% 16.61 ms 43.6%
3M 91.22 ms 90.28 ms 1.0% 54.10 ms 40.7%
10M 360.77 ms 331.25 ms 8.2% 224.00 ms 37.9%
I tried moving batch opcodes to a separate interpreter (0005) thinking
it might be register pressure or jump table effects from adding cases
to ExecInterpExpr's switch. With 0005, the generated assembly for
ExecInterpExpr looks identical to master (same stack frame size, same
epilogue), yet the performance still differs. Specifically, the ldp
instruction in the function epilogue shows 53% hotness in patched vs
35% in master. We still need placeholder entries in the dispatch
table, so it's unclear if this fully isolates the per-tuple path. I'll
continue looking at perf, but I feel like at a bit of a loss here and
would appreciate any insights.
Other changes worth noting:
- I removed the BatchVector intermediate representation that copied
Datums into columnar arrays before qual evaluation (it used to be in
the batched qual patch 0004). Now quals access batch slots' tts_values
directly. This simplifies the code and the copy overhead wasn't paying
off. If we pursue serious vectorization later, this may need to be
revisited, but removing it doesn't degrade performance.
--
Thanks, Amit Langote
Attachments:
[application/octet-stream] v5-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.0K, 2-v5-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
download | inline diff:
From f772043e2104bf67964418dc80c3abb56bdb069d Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 29 Jan 2026 00:57:04 +0900
Subject: [PATCH v5 1/5] Add batch table AM API and heapam implementation
Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.
Define a HeapBatch structure and supporting code in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.
Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.
Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 221 +++++++++++++++++++++++
src/backend/access/heap/heapam_handler.c | 4 +
src/include/access/heapam.h | 18 ++
src/include/access/tableam.h | 58 ++++++
src/include/pgstat.h | 5 +
5 files changed, 306 insertions(+)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f30a56ecf55..d8d1bdf5191 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1151,6 +1151,134 @@ continue_page:
scan->rs_inited = false;
}
+/*
+ * heapgettup_pagemode_batch
+ * Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ * number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ * - Ensures rs_cbuf pins the page from which tuples were produced.
+ * - Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ * heapgettup_pagemode’s inner-loop effects).
+ * - Does *not* change buffer pin counts except through normal page
+ * transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+ ScanDirection dir,
+ int nkeys, ScanKey key,
+ HeapTupleData *tdata,
+ int maxitems)
+{
+ Page page;
+ uint32 lineindex;
+ uint32 linesleft;
+ int nout = 0;
+ Relation rel = scan->rs_base.rs_rd;
+ TupleDesc tupdesc = RelationGetDescr(rel);
+
+ /*
+ * Current batching limitations (may be relaxed in future):
+ *
+ * - Forward scans only: backward scan support would require changes to
+ * batch iteration and page advancement logic.
+ *
+ * - Pagemode required: batching relies on the pre-built rs_vistuples[]
+ * array from heap_prepare_pagescan(). This is guaranteed by
+ * ScanCanUseBatching() which only enables batching when SO_ALLOW_PAGEMODE
+ * is set. Unlike heap_getnextslot, we don't support dynamic fallback to
+ * tuple-at-a-time mode since the batch execution path is selected at
+ * ExecInit time.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(maxitems > 0);
+
+ /*
+ * Loop until we find tuples that pass the scan key, or reach end of scan.
+ * We never cross page boundaries within a single batch.
+ */
+ for (;;)
+ {
+ /*
+ * Advance to a page with visible tuples if needed.
+ */
+ if (BufferIsValid(scan->rs_cbuf))
+ {
+ lineindex = scan->rs_cindex + 1;
+ linesleft = (lineindex <= scan->rs_ntuples) ?
+ (scan->rs_ntuples - lineindex) : 0;
+ }
+ else
+ linesleft = 0;
+
+ while (linesleft == 0)
+ {
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* End of scan */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ return 0;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+ heap_prepare_pagescan((TableScanDesc) scan);
+
+ lineindex = 0;
+ linesleft = scan->rs_ntuples;
+ }
+
+ /*
+ * Walk rs_vistuples[] copying headers into tdata[] until the page
+ * is exhausted or batch capacity is reached.
+ */
+ page = BufferGetPage(scan->rs_cbuf);
+
+ for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex++)
+ {
+ OffsetNumber lineoff;
+ ItemId lpp;
+ HeapTupleData *dst = &tdata[nout];
+
+ Assert(lineindex < scan->rs_ntuples);
+ lineoff = scan->rs_vistuples[lineindex];
+ lpp = PageGetItemId(page, lineoff);
+ Assert(ItemIdIsNormal(lpp));
+
+ dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+ dst->t_len = ItemIdGetLength(lpp);
+ Assert(dst->t_tableOid == RelationGetRelid(rel));
+ ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+ if (key != NULL && !HeapKeyTest(dst, tupdesc, nkeys, key))
+ continue;
+
+ scan->rs_cindex = lineindex;
+ nout++;
+ }
+
+ /* Return if we found any tuples; otherwise try next page */
+ if (nout > 0)
+ return nout;
+
+ /* Mark page exhausted so we advance on next iteration */
+ scan->rs_cindex = scan->rs_ntuples;
+ }
+
+ pg_unreachable();
+ return 0;
+}
/* ----------------------------------------------------------------
* heap access method interface
@@ -1483,6 +1611,99 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
return true;
}
+/*---------- Batching support -----------*/
+
+/*
+ * heap_scan_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ HeapBatch *hb;
+ Oid relid;
+ Size alloc_size;
+
+ Assert(maxitems > 0);
+
+ /* Single allocation for HeapBatch header + tupdata array */
+ alloc_size = sizeof(HeapBatch) + sizeof(HeapTupleData) * maxitems;
+ hb = palloc(alloc_size);
+ hb->tupdata = (HeapTupleData *) ((char *) hb + sizeof(HeapBatch));
+ hb->maxitems = maxitems;
+ hb->nitems = 0;
+ hb->buf = InvalidBuffer;
+
+ /* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+ relid = RelationGetRelid(sscan->rs_rd);
+ for (int i = 0; i < maxitems; i++)
+ hb->tupdata[i].t_tableOid = relid;
+
+ return hb;
+}
+
+/*
+ * heap_scan_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapBatch *hb = (HeapBatch *) am_batch;
+ Buffer curbuf;
+ int n;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(hb->maxitems > 0);
+
+ /* Drop prior batch pin, if any. */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ hb->nitems = 0;
+
+ /* One call per batch, never crosses a page. */
+ n = heapgettup_pagemode_batch(scan, dir,
+ sscan->rs_nkeys, sscan->rs_key,
+ hb->tupdata, hb->maxitems);
+
+ if (n == 0)
+ return 0; /* end of scan */
+
+ /* Hold a shared pin for the batch lifetime so t_data stays valid. */
+ curbuf = scan->rs_cbuf;
+ IncrBufferRefCount(curbuf);
+ hb->buf = curbuf;
+
+ /* Per-tuple stats (can be collapsed into a future _multi() call). */
+ pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+ hb->nitems = n;
+ return n;
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cbef73e5d4b..e4cf7fc296b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2637,6 +2637,10 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3c0961ab36b..e2417650c5f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -101,6 +101,19 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+ HeapTupleData *tupdata; /* len = maxitems; headers only */
+ int nitems; /* tuples produced in last getnextbatch() */
+ int maxitems; /* fixed capacity set at begin_batch() */
+ Buffer buf; /* single pinned buffer for this batch */
+} HeapBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -337,6 +350,11 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e2ec5289d4d..584b580f7a1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+ int (*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+ ScanDirection dir);
+ void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ * Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ /* Only forward scans are supported in the batched mode. */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ if (am_batch == NULL)
+ return;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, am_batch);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff7ecc2533..48e4e034a33 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -697,6 +697,11 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.tuples_returned++; \
} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n) \
+ do { \
+ if (pgstat_should_count_relation(rel)) \
+ (rel)->pgstat_info->counts.tuples_returned += n; \
+ } while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
if (pgstat_should_count_relation(rel)) \
--
2.47.3
[application/octet-stream] v5-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.6K, 3-v5-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
download | inline diff:
From 94d0f92c807895e6edadf583a06bb39c5dc52a4c Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 27 Jan 2026 14:07:55 +0900
Subject: [PATCH v5 2/5] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per tuple AM crossings without
changing the node interface seen by parents.
Add TupleBatch and supporting code in execBatch.c/h to hold executor
side batching state. PlanState gains ps_Batch to carry the active
TupleBatch when a node supports batching.
Add executor_batch_rows GUC to specify the maximum number of rows
that can be added into a batch.
Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batch_rows > 1, EPQ is
inactive, the scan is not backward, and the relation supports
batching, ps.ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.
Notes / current limits:
- With the current heapam, batches are composed from a single page, so
the batch may not always be full. Future work may let SeqScan and/or
AMs top up batches across pages when safe to do so.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 28 ++++
src/backend/access/heap/heapam_handler.c | 16 ++
src/backend/access/table/tableam.c | 11 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execBatch.c | 112 ++++++++++++++
src/backend/executor/execScan.c | 31 ++++
src/backend/executor/meson.build | 1 +
src/backend/executor/nodeSeqscan.c | 176 +++++++++++++++++++++-
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 9 ++
src/include/access/heapam.h | 1 +
src/include/access/tableam.h | 27 ++++
src/include/executor/execBatch.h | 99 ++++++++++++
src/include/executor/execScan.h | 69 +++++++++
src/include/executor/executor.h | 4 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 4 +
17 files changed, 592 insertions(+), 1 deletion(-)
create mode 100644 src/backend/executor/execBatch.c
create mode 100644 src/include/executor/execBatch.h
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d8d1bdf5191..db91085b07c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1644,6 +1644,34 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
return hb;
}
+/*
+ * heap_scan_materialize_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ Assert(n <= hb->nitems);
+
+ for (int i = 0; i < n; i++)
+ {
+ HeapTupleData *tuple = &hb->tupdata[i];
+ HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+ /* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+ slot->tuple = tuple;
+ slot->off = 0;
+ slot->base.tts_nvalid = 0;
+ slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+ slot->base.tts_tid = tuple->t_self;
+ slot->base.tts_tableOid = tuple->t_tableOid;
+ }
+}
+
/*
* heap_scan_end_batch
*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e4cf7fc296b..0f6bda7b69f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,21 @@ heapam_slot_callbacks(Relation relation)
return &TTSOpsBufferHeapTuple;
}
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps =
+{
+ .materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+ return &TupleBatchHeapOps;
+}
/* ------------------------------------------------------------------------
* Index Scan Callbacks for heap AM
@@ -2631,6 +2646,7 @@ static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
.slot_callbacks = heapam_slot_callbacks,
+ .batch_callbacks = heapam_batch_callbacks,
.scan_begin = heap_beginscan,
.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 87491796523..ffb3b738f6a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
return slot;
}
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}
/* ----------------------------------------------------------------------------
* Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..1ef4117b87c
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ * Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ * Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+ TupleBatch *b;
+ TupleTableSlot **inslots,
+ **outslots;
+ Size alloc_size;
+
+ /* Single allocation for TupleBatch + inslots + outslots arrays */
+ alloc_size = sizeof(TupleBatch) + 2 * sizeof(TupleTableSlot *) * capacity;
+ b = palloc(alloc_size);
+ inslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch));
+ outslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch) +
+ sizeof(TupleTableSlot *) * capacity);
+
+ for (int i = 0; i < capacity; i++)
+ inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+ /* Initial state: empty envelope */
+ b->am_payload = NULL;
+ b->ntuples = 0;
+ b->inslots = inslots;
+ b->outslots = outslots;
+ b->activeslots = NULL;
+ b->maxslots = capacity;
+
+ b->nvalid = 0;
+ b->next = 0;
+
+ return b;
+}
+
+/*
+ * TupleBatchReset
+ * Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+ Assert(b != NULL);
+
+ for (int i = 0; i < b->maxslots; i++)
+ {
+ ExecClearTuple(b->inslots[i]);
+ if (drop_slots)
+ ExecDropSingleTupleTableSlot(b->inslots[i]);
+ }
+
+ b->ntuples = 0;
+ b->nvalid = 0;
+ b->next = 0;
+ b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->inslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->outslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+ return b != NULL &&
+ b->maxslots > 0 &&
+ b->inslots != NULL &&
+ b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+ b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+ return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 9f68be17b99..5023eb6756a 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
*/
#include "postgres.h"
+#include "access/tableam.h"
#include "executor/executor.h"
#include "executor/execScan.h"
#include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss_currentRelation;
+
+ return executor_batch_rows > 1 &&
+ (scanstate->ps.state->es_epq_active == NULL) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+ TupleBatch *b = scanstate->ps.ps_Batch;
+
+ if (b)
+ {
+ TupleBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ table_scan_end_batch(scanstate->ss_currentScanDesc,
+ b->am_payload);
+ b->am_payload = NULL;
+ }
+ if (drop)
+ pfree(b);
+ }
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index dc45be0b2ce..e5af90e3a0f 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index af3c788ce8b..08d93e6f0be 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+static bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+
+ Assert(node->ss.ps.ps_Batch != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(ScanDirectionIsForward(direction));
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (node->ss.ps.ps_Batch->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ node->ss.ps.ps_Batch->am_payload =
+ table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+ node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ }
+
+ node->ss.ps.ps_Batch->ntuples =
+ table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+ node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+ node->ss.ps.ps_Batch->materialized = false;
+
+ return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+ if (SeqNextBatch(node))
+ {
+ TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+ return true;
+ }
+
+ return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtended().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScan() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScan() but when qual evaluation and projection are
+ * required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+ const int cap = executor_batch_rows;
+ TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+ /* Choose batch variant to preserve your specialization matrix */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
+ bool use_batching;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
node->scan.scanrelid,
eflags);
+ use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
RelationGetDescr(scanstate->ss.ss_currentRelation),
+ use_batching ? &TTSOpsHeapTuple :
table_slot_callbacks(scanstate->ss.ss_currentRelation));
/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (use_batching)
+ SeqScanInitBatching(scanstate, eflags);
+
return scanstate;
}
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ ScanResetBatching(&node->ss, true);
+
/*
* close heap scan
*/
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
if (scan != NULL)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
-
+ ScanResetBatching(&node->ss, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..535e29d7823 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+int executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index f0260e6e412..4c422c854d0 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1004,6 +1004,15 @@
boot_val => 'true',
},
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Number of rows to include in batches during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batch_rows',
+ boot_val => '64',
+ min => '0',
+ max => '1024',
+},
+
{ name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Terminate session on any error.',
variable => 'ExitOnAnyError',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e2417650c5f..d6154d5ab15 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -354,6 +354,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 584b580f7a1..bdf733c8b22 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "commands/vacuum.h"
+#include "executor/execBatch.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
typedef struct IndexInfo IndexInfo;
typedef struct SampleScanState SampleScanState;
typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
/*
* Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
* Return slot implementation suitable for storing a tuple of this AM.
*/
const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+ const TupleBatchOps *(*batch_callbacks)(Relation rel);
/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
ScanDirection dir);
void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
*/
extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
/* ----------------------------------------------------------------------------
* Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ return tam->scan_getnextbatch != NULL;
+}
+
/*
* table_scan_begin_batch
* Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2128,5 +2154,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
*/
extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
#endif /* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..2d0066103ce
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ * Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+ void (*materialize_all)(void *am_payload,
+ TupleTableSlot **dst,
+ int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+ void *am_payload;
+ const TupleBatchOps *ops;
+ int ntuples; /* number of tuples in am_payload */
+ bool materialized; /* tuples in slots valid? */
+ struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+ struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+ * batch */
+ struct TupleTableSlot **activeslots;
+ int maxslots;
+
+ int nvalid; /* number of returnable tuples in outslots */
+ int next; /* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+ return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+ Assert(index < b->nvalid);
+ return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+ Assert(TupleBatchIsValid(b));
+ b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+ return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+ if (b->materialized)
+ return;
+
+ if (b->ops == NULL || b->ops->materialize_all == NULL)
+ elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+ b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+ TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif /* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 028edb8d9fd..d9185331e22 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -251,4 +251,73 @@ ExecScanExtended(ScanState *node,
}
}
+/*
+ * ExecScanExtendedBatchSlot
+ * Batch-driven variant of ExecScanExtended.
+ *
+ * Returns one tuple at a time to callers, but internally fetches tuples
+ * in batches from the AM via accessBatchMtd. This reduces per-tuple AM
+ * call overhead while preserving the single-slot interface expected by
+ * parent nodes.
+ *
+ * The batch is refilled when exhausted by calling accessBatchMtd, which
+ * returns false at end-of-scan.
+ *
+ * Note: EPQ is not supported in the batch path; callers must ensure
+ * es_epq_active is NULL before using this function.
+ */
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!TupleBatchHasMore(b))
+ {
+ if (!accessBatchMtd(node))
+ return NULL;
+ }
+
+ in = TupleBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Qual only */
+ if (projInfo == NULL)
+ {
+ if (qual == NULL || ExecQual(qual, econtext))
+ return in;
+ else
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Projection (with or without qual) */
+ if (qual == NULL || ExecQual(qual, econtext))
+ return ExecProject(projInfo);
+ else
+ InstrCountFiltered1(node, 1);
+ /* else try next tuple */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5929aabc353..e82fd6c0c8a 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -578,12 +578,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
*/
typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd);
+
extern void ExecAssignScanProjectionInfo(ScanState *node);
extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
/*
* prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index db559b39c4d..f6bd59f2af1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT int executor_batch_rows;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f8053d9e572..6a191202ced 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -31,6 +31,7 @@
#include "access/skey.h"
#include "access/tupconvert.h"
+#include "executor/execBatch.h"
#include "executor/instrument.h"
#include "executor/instrument_node.h"
#include "fmgr.h"
@@ -1206,6 +1207,9 @@ typedef struct PlanState
ExprContext *ps_ExprContext; /* node's expression-evaluation context */
ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */
+ /* Batching state if node supports it. */
+ TupleBatch *ps_Batch;
+
bool async_capable; /* true if node is async-capable */
/*
--
2.47.3
[application/octet-stream] v5-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (14.0K, 4-v5-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
download | inline diff:
From f282f5dde3b4bc58b2cd7b66e55803df26e357aa Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v5 3/5] Add EXPLAIN (BATCHES) option for tuple batching
statistics
Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.
For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.
Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/commands/explain.c | 30 ++++++++++++++
src/backend/commands/explain_state.c | 2 +
src/backend/executor/execBatch.c | 31 +++++++++++++-
src/backend/executor/nodeSeqscan.c | 24 ++++++-----
src/include/commands/explain_state.h | 1 +
src/include/executor/execBatch.h | 16 +++++++-
src/include/executor/instrument.h | 1 +
src/test/regress/expected/explain.out | 58 +++++++++++++++++++++++++++
src/test/regress/sql/explain.sql | 27 +++++++++++++
9 files changed, 177 insertions(+), 13 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7bb111688c..f3d521e1f93 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
+#include "executor/execBatch.h"
#include "foreign/fdwapi.h"
#include "jit/jit.h"
#include "libpq/pqformat.h"
@@ -517,6 +518,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_BUFFERS;
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
+ if (es->batches)
+ instrument_option |= INSTRUMENT_BATCHES;
/*
* We always collect timing for the entire statement, even when node-level
@@ -2294,6 +2297,33 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_buffer_usage(es, &planstate->instrument->bufusage);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->walusage);
+ if (es->batches && planstate->ps_Batch)
+ {
+ TupleBatch *b = planstate->ps_Batch;
+
+ if (b->stat_batches > 0)
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str,
+ "Batches: %lld Avg Rows: %.1f Max: %d Min: %d\n",
+ (long long) b->stat_batches,
+ TupleBatchAvgRows(b),
+ b->stat_max_rows,
+ b->stat_min_rows == INT_MAX ? 0 : b->stat_min_rows);
+ }
+ else
+ {
+ ExplainPropertyInteger("Batches", NULL, b->stat_batches, es);
+ ExplainPropertyFloat("Average Batch Rows", NULL,
+ TupleBatchAvgRows(b), 1, es);
+ ExplainPropertyInteger("Max Batch Rows", NULL, b->stat_max_rows, es);
+ ExplainPropertyInteger("Min Batch Rows", NULL,
+ b->stat_min_rows == INT_MAX ? 0 : b->stat_min_rows, es);
+ }
+ }
+ }
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 803c74dd178..ad5b223ede7 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
"EXPLAIN", opt->defname, p),
parser_errposition(pstate, opt->location)));
}
+ else if (strcmp(opt->defname, "batches") == 0)
+ es->batches = defGetBoolean(opt);
else if (!ApplyExtensionExplainOption(es, opt, pstate))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
index 1ef4117b87c..ed54e3165c8 100644
--- a/src/backend/executor/execBatch.c
+++ b/src/backend/executor/execBatch.c
@@ -19,7 +19,7 @@
* Allocate and initialize a new TupleBatch envelope.
*/
TupleBatch *
-TupleBatchCreate(TupleDesc scandesc, int capacity)
+TupleBatchCreate(TupleDesc scandesc, int capacity, bool track_stats)
{
TupleBatch *b;
TupleTableSlot **inslots,
@@ -47,6 +47,12 @@ TupleBatchCreate(TupleDesc scandesc, int capacity)
b->nvalid = 0;
b->next = 0;
+ b->track_stats = track_stats;
+ b->stat_batches = 0;
+ b->stat_rows = 0;
+ b->stat_max_rows = 0;
+ b->stat_min_rows = INT_MAX;
+
return b;
}
@@ -110,3 +116,26 @@ TupleBatchGetNumValid(TupleBatch *b)
{
return b->nvalid;
}
+
+void
+TupleBatchRecordStats(TupleBatch *b, int rows)
+{
+ if (!b->track_stats)
+ return;
+
+ b->stat_batches++;
+ b->stat_rows += rows;
+ if (rows > b->stat_max_rows)
+ b->stat_max_rows = rows;
+ if (rows < b->stat_min_rows && rows > 0)
+ b->stat_min_rows = rows;
+}
+
+double
+TupleBatchAvgRows(TupleBatch *b)
+{
+ if (b->stat_batches == 0)
+ return 0.0;
+
+ return (double) b->stat_rows / b->stat_batches;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 08d93e6f0be..f36b31d4fbb 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -213,8 +213,9 @@ SeqNextBatch(SeqScanState *node)
TableScanDesc scandesc;
EState *estate;
ScanDirection direction;
+ TupleBatch *b = node->ss.ps.ps_Batch;
- Assert(node->ss.ps.ps_Batch != NULL);
+ Assert(b != NULL);
/*
* get information from the estate and scan state
@@ -237,22 +238,21 @@ SeqNextBatch(SeqScanState *node)
}
/* Lazily create the AM batch payload. */
- if (node->ss.ps.ps_Batch->am_payload == NULL)
+ if (b->am_payload == NULL)
{
const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
Assert(tam && tam->scan_begin_batch);
- node->ss.ps.ps_Batch->am_payload =
- table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
- node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ b->am_payload = table_scan_begin_batch(scandesc, b->maxslots);
+ b->ops = table_batch_callbacks(node->ss.ss_currentRelation);
}
- node->ss.ps.ps_Batch->ntuples =
- table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
- node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
- node->ss.ps.ps_Batch->materialized = false;
+ b->ntuples = table_scan_getnextbatch(scandesc, b->am_payload, direction);
+ b->nvalid = b->ntuples;
+ b->materialized = false;
+ TupleBatchRecordStats(b, b->ntuples);
- return node->ss.ps.ps_Batch->ntuples > 0;
+ return b->ntuples > 0;
}
static bool
@@ -340,8 +340,10 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
const int cap = executor_batch_rows;
TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+ EState *estate = scanstate->ss.ps.state;
+ bool track_stats = estate->es_instrument && (estate->es_instrument & INSTRUMENT_BATCHES);
- scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap, track_stats);
/* Choose batch variant to preserve your specialization matrix */
if (scanstate->ss.ps.qual == NULL)
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 0b695f7d812..0a99f0f2341 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -55,6 +55,7 @@ typedef struct ExplainState
bool memory; /* print planner's memory usage information */
bool settings; /* print modified settings */
bool generic; /* generate a generic plan */
+ bool batches; /* print batch statistics */
ExplainSerializeOption serialize; /* serialize the query's output? */
ExplainFormat format; /* output format */
/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 2d0066103ce..1efc194d8ff 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -13,6 +13,8 @@
#ifndef EXECBATCH_H
#define EXECBATCH_H
+#include <limits.h>
+
#include "executor/tuptable.h"
/*
@@ -45,11 +47,18 @@ typedef struct TupleBatch
int nvalid; /* number of returnable tuples in outslots */
int next; /* 0-based index of next tuple to be returned */
+
+ /* Statistics (populated when EXPLAIN ANALYZE BATCHES) */
+ bool track_stats; /* whether to collect stats */
+ int64 stat_batches; /* total number of batches fetched */
+ int64 stat_rows; /* total tuples across all batches */
+ int stat_max_rows; /* max rows in any single batch */
+ int stat_min_rows; /* min rows in any single batch (non-zero) */
} TupleBatch;
/* Helpers */
-extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity, bool track_stats);
extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
@@ -96,4 +105,9 @@ TupleBatchMaterializeAll(TupleBatch *b)
TupleBatchUseInput(b, b->ntuples);
}
+/* === Batching stats. ===*/
+
+extern void TupleBatchRecordStats(TupleBatch *b, int rows);
+extern double TupleBatchAvgRows(TupleBatch *b);
+
#endif /* EXECBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..bee69b4ac8f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
INSTRUMENT_BUFFERS = 1 << 1, /* needs buffer usage */
INSTRUMENT_ROWS = 1 << 2, /* needs row count */
INSTRUMENT_WAL = 1 << 3, /* needs WAL usage */
+ INSTRUMENT_BATCHES = 1 << 4, /* needs batches */
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..1bec59eea9e 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,61 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: (a > N)
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT - partial scan shows fewer batches
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+ explain_filter
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+ -> Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Batching disabled - no batch line
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+reset executor_batch_rows;
+-- JSON format
+select explain_filter_to_json('explain (analyze, batches, buffers off, format json) select * from batch_test where a < 1000') #> '{0,Plan,Batches}';
+ ?column?
+----------
+ 0
+(1 row)
+
+drop table batch_test;
+reset executor_batch_rows;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..7881c674495 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,30 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With LIMIT - partial scan shows fewer batches
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Batching disabled - no batch line
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+reset executor_batch_rows;
+
+-- JSON format
+select explain_filter_to_json('explain (analyze, batches, buffers off, format json) select * from batch_test where a < 1000') #> '{0,Plan,Batches}';
+
+drop table batch_test;
+reset executor_batch_rows;
--
2.47.3
[application/octet-stream] v5-0004-WIP-Add-ExecQualBatch-for-batched-qual-evaluation.patch (32.2K, 5-v5-0004-WIP-Add-ExecQualBatch-for-batched-qual-evaluation.patch)
download | inline diff:
From e155dc70e0370435061da70362175255d83a36ea Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 26 Jan 2026 11:01:44 +0900
Subject: [PATCH v5 4/5] WIP: Add ExecQualBatch() for batched qual evaluation
Introduce batched qual evaluation for SeqScan when quals are simple
AND-trees of Var op Const, Var op Var, or NullTest expressions.
The batch is evaluated using a bitmask, avoiding per-tuple ExecQual()
overhead.
Only leakproof operators are eligible for batching, since batching
changes evaluation order which could otherwise leak data through
side channels before security barrier quals filter rows.
Add supporting infrastructure: EEOP_SCAN_FETCHSOME_BATCH to deform
all tuples in a batch and ExprContext.scan_batch field.
The postgres_fdw regression test is updated to disable batching for
a query with LIMIT, since batching processes entire batches before
checking LIMIT, resulting in different "Rows Removed by Filter"
counts in EXPLAIN ANALYZE output.
---
.../postgres_fdw/expected/postgres_fdw.out | 1 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 1 +
src/backend/executor/execExpr.c | 335 ++++++++++++++++++
src/backend/executor/execExprInterp.c | 224 ++++++++++++
src/backend/executor/execTuples.c | 32 ++
src/backend/executor/nodeSeqscan.c | 28 +-
src/backend/jit/llvm/llvmjit_expr.c | 35 ++
src/backend/jit/llvm/llvmjit_types.c | 3 +
src/include/executor/execExpr.h | 84 ++++-
src/include/executor/execScan.h | 46 +++
src/include/executor/executor.h | 3 +
src/include/executor/tuptable.h | 2 +
src/include/nodes/execnodes.h | 11 +-
13 files changed, 802 insertions(+), 3 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 6066510c7c0..67df4233235 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -12208,6 +12208,7 @@ SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
Filter: (t1_3.b === 505)
(14 rows)
+SET executor_batch_rows = 1;
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF, BUFFERS OFF)
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
QUERY PLAN
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4f7ab2ed0ac..daffc545a5c 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -4126,6 +4126,7 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+SET executor_batch_rows = 1;
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF, BUFFERS OFF)
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 088eca24021..cc76b760ee7 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -104,6 +104,16 @@ static void ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
bool exists_coerce,
Datum *resv, bool *resnull);
+/* private context for the walker */
+typedef struct QualBatchContext
+{
+ List *leaves; /* List<Node*> of accepted leaves */
+ Bitmapset *attnos; /* Vars referenced by accepted leaves */
+ bool ok; /* stays true if batchable */
+ AttrNumber last_scan; /* last needed attribute in scan slot */
+} QualBatchContext;
+
+static bool qual_batchable_walker(Node *node, void *context);
/*
* ExecInitExpr: prepare an expression tree for execution
@@ -5064,3 +5074,328 @@ ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
DomainHasConstraints(returning->typid);
ExprEvalPushStep(state, &scratch);
}
+
+/*
+ * Extract Var attno from expression, unwrapping RelabelType/TargetEntry.
+ * Returns attno > 0 on success, 0 on failure (not a Var, or system column).
+ */
+static AttrNumber
+extract_var_attno(Expr *expr)
+{
+ if (expr == NULL)
+ return 0;
+ if (IsA(expr, TargetEntry))
+ return extract_var_attno(((TargetEntry *) expr)->expr);
+ if (IsA(expr, RelabelType))
+ return extract_var_attno((Expr *) ((RelabelType *) expr)->arg);
+ if (IsA(expr, Var) && ((Var *) expr)->varattno > 0)
+ return ((Var *) expr)->varattno;
+ return 0;
+}
+
+/*
+ * qual_batchable_walker
+ * Check if a qual tree is eligible for batched evaluation.
+ *
+ * Walks the qual tree and validates that it consists only of:
+ * - AND expressions (OR/NOT disqualify)
+ * - NullTest on simple Vars
+ * - Binary OpExpr with Var op Const or Var op Var arguments
+ *
+ * For OpExpr, the operator must be:
+ * - Strict: ensures NULL inputs produce NULL/false, matching WHERE semantics
+ * - Leakproof: required because batching evaluates all rows before filtering,
+ * which could leak data to non-leakproof operators before security barrier
+ * quals have a chance to filter rows
+ *
+ * On success, populates cxt->leaves with the leaf nodes and cxt->attnos with
+ * the referenced attribute numbers. Sets cxt->ok = false if any node fails
+ * validation.
+ */
+static bool
+qual_batchable_walker(Node *node, void *context)
+{
+ QualBatchContext *cxt = (QualBatchContext *) context;
+
+ if (node == NULL || !cxt->ok)
+ return false;
+
+ switch (nodeTag(node))
+ {
+ case T_List:
+ return expression_tree_walker(node, qual_batchable_walker, cxt);
+
+ case T_BoolExpr:
+ {
+ BoolExpr *b = (BoolExpr *) node;
+
+ /* Only AND trees are allowed */
+ if (b->boolop != AND_EXPR)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /* Recurse normally over children */
+ return expression_tree_walker(node, qual_batchable_walker, cxt);
+ }
+
+ case T_NullTest:
+ {
+ NullTest *nt = (NullTest *) node;
+ AttrNumber attno = extract_var_attno(nt->arg);
+
+ if (attno == 0)
+ {
+ cxt->ok = false;
+ return true;
+ }
+
+ cxt->attnos = bms_add_member(cxt->attnos, attno);
+ if (attno > cxt->last_scan)
+ cxt->last_scan = attno;
+ cxt->leaves = lappend(cxt->leaves, node);
+
+ /* Do NOT recurse into leaf */
+ return false;
+ }
+
+ case T_OpExpr:
+ {
+ OpExpr *op = (OpExpr *) node;
+ List *args = op->args;
+ AttrNumber lattno,
+ rattno;
+
+ /* Only binary operators */
+ if (list_length(args) != 2)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /* Must be strict (NULL input -> NULL/false result) */
+ if (!func_strict(op->opfuncid))
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /*
+ * Must be leakproof. Batching changes evaluation order, which
+ * could leak data through side channels before security barrier
+ * quals filter rows.
+ */
+ if (!get_func_leakproof(op->opfuncid))
+ {
+ cxt->ok = false;
+ return true;
+ }
+
+ /* Left arg must be a Var */
+ lattno = extract_var_attno(linitial(op->args));
+ if (lattno == 0)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ cxt->attnos = bms_add_member(cxt->attnos, lattno);
+ if (lattno > cxt->last_scan)
+ cxt->last_scan = lattno;
+
+ /* Right arg must be Const or Var */
+ if (!IsA(lsecond(op->args), Const))
+ {
+ rattno = extract_var_attno(lsecond(op->args));
+ if (rattno == 0)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ cxt->attnos = bms_add_member(cxt->attnos, rattno);
+ if (rattno > cxt->last_scan)
+ cxt->last_scan = rattno;
+ }
+
+ cxt->leaves = lappend(cxt->leaves, node);
+
+ return false; /* leaf; don't recurse */
+ }
+
+ /* Unhandled node type; fall back to per-tuple evaluation */
+ default:
+ cxt->ok = false;
+ break;
+ }
+
+ return true;
+}
+
+/* build a BatchQualTerm from a validated leaf */
+static BatchQualTerm *
+build_term_from_leaf(Node *n)
+{
+ BatchQualTerm *term;
+ BatchQualTermKind kind;
+ bool strict;
+ AttrNumber l_attno;
+ AttrNumber r_attno;
+ Datum r_const = (Datum) 0;
+ bool r_isnull = false;
+ FmgrInfo *finfo = NULL;
+ Oid collation;
+
+ if (IsA(n, NullTest))
+ {
+ NullTest *nt = (NullTest *) n;
+
+ kind = nt->nulltesttype == IS_NULL ? BQTK_IS_NULL : BQTK_IS_NOT_NULL;
+ l_attno = extract_var_attno(nt->arg);
+ r_attno = 0;
+ strict = false;
+ collation = InvalidOid;
+
+ if (l_attno == 0)
+ return NULL;
+ }
+ else if (IsA(n, OpExpr))
+ {
+ OpExpr *op = (OpExpr *) n;
+ Expr *l = linitial(op->args);
+ Expr *r = lsecond(op->args);
+
+ l_attno = extract_var_attno(l);
+ if (l_attno == 0)
+ return NULL;
+
+ if (IsA(r, Const))
+ {
+ Const *c = (Const *) r;
+
+ kind = BQTK_VAR_CONST;
+ r_const = c->constvalue;
+ r_isnull = c->constisnull;
+ r_attno = 0;
+ }
+ else
+ {
+ r_attno = extract_var_attno(r);
+ if (r_attno == 0)
+ return NULL;
+ kind = BQTK_VAR_VAR;
+ }
+
+ strict = func_strict(op->opfuncid);
+ collation = exprInputCollation((Node *) op);
+ finfo = palloc(sizeof(FmgrInfo));
+ fmgr_info(op->opfuncid, finfo);
+ }
+ else
+ return NULL;
+
+ term = palloc(sizeof(BatchQualTerm));
+ term->kind = kind;
+ term->strict = strict;
+ term->l_attno = l_attno;
+ term->r_attno = r_attno;
+ term->r_const = r_const;
+ term->r_isnull = r_isnull;
+ term->finfo = finfo;
+ term->collation = collation;
+
+ return term;
+}
+
+/*
+ * ExecInitQualBatch
+ * Build a batched-qual ExprState for evaluating scan quals over a TupleBatch.
+ *
+ * Returns a dedicated ExprState that evaluates the plan's quals in batch mode,
+ * or NULL if the quals are not eligible for batching. The caller should retain
+ * the regular ps->qual for fallback when batching is not used.
+ *
+ * Batching is only possible when the qual tree consists of:
+ * - Top-level AND of simple clauses (no OR, NOT)
+ * - NullTest on a simple Var
+ * - Binary OpExpr with (Var op Const) or (Var op Var), where the operator
+ * is both strict (for proper NULL handling) and leakproof (to avoid
+ * leaking data when evaluation order changes vs. security barrier quals)
+ *
+ * The generated EEOP program:
+ * 1. EEOP_SCAN_FETCHSOME_BATCH - deforms all slots in the batch
+ * 2. EEOP_QUAL_BATCH_INITMASK - initializes bitmask to all-pass
+ * 3. EEOP_QUAL_BATCH_TERM (per leaf) - evaluates term, clears failing bits
+ *
+ * The result bitmask is stored in BatchQualRuntime (via ExprState.batch_private)
+ * for the caller to use when populating output slots.
+ */
+ExprState *
+ExecInitQualBatch(PlanState *ps)
+{
+ Node *qual = (Node *) ps->plan->qual;
+ QualBatchContext cxt = {NIL, NULL, true, 0};
+ BatchQualRuntime *rt;
+ ExprState *state;
+ int maxrows = executor_batch_rows;
+ uint64 *mask;
+ int mask_words;
+ ListCell *lc;
+ ExprEvalStep scratch = {0};
+
+ if (qual == NULL)
+ return NULL;
+
+ /*
+ * Check if qual tree is batchable; collect leaf nodes and referenced
+ * attnos.
+ */
+ (void) qual_batchable_walker(qual, &cxt);
+ if (!cxt.ok || cxt.leaves == NIL || bms_is_empty(cxt.attnos))
+ return NULL;
+
+ /* Allocate bitmask: one bit per row, rounded up to 64-bit words */
+ mask_words = (maxrows + 63) >> 6;
+ mask = (uint64 *) palloc0(sizeof(uint64) * mask_words);
+
+ /* Bundle runtime state; attached to ExprState for access during execution */
+ rt = palloc0(sizeof(BatchQualRuntime));
+ rt->mask = mask;
+ rt->mask_words = mask_words;
+
+ /* Create ExprState for the batched program */
+ state = makeNode(ExprState);
+ state->expr = (Expr *) qual;
+ state->parent = ps;
+ state->ext_params = NULL;
+ state->flags = EEO_FLAG_IS_QUAL;
+ state->batch_private = (void *) rt;
+
+ /* Step 1: deform all slots in batch up to highest referenced attribute */
+ scratch.opcode = EEOP_SCAN_FETCHSOME_BATCH;
+ scratch.d.fetch_batch.last_var = cxt.last_scan;
+ ExprEvalPushStep(state, &scratch);
+
+ /* Step 2 initialize mask to all-ones (all rows pass initially) */
+ scratch.opcode = EEOP_QUAL_BATCH_INITMASK;
+ scratch.d.qualbatch_init.mask = mask;
+ scratch.d.qualbatch_init.mask_words = mask_words;
+ ExprEvalPushStep(state, &scratch);
+
+ /* Step 3: one TERM per qual leaf; each clears mask bits for failing rows */
+ foreach(lc, cxt.leaves)
+ {
+ BatchQualTerm *term = build_term_from_leaf((Node *) lfirst(lc));
+
+ if (term == NULL)
+ return NULL;
+
+ scratch.opcode = EEOP_QUAL_BATCH_TERM;
+ scratch.d.qualbatch_term.term = term; /* by value */
+ ExprEvalPushStep(state, &scratch);
+ }
+
+ /* Done; mask now indicates which rows survived all quals */
+ scratch.opcode = EEOP_DONE_NO_RETURN;
+ ExprEvalPushStep(state, &scratch);
+
+ ExecReadyExpr(state);
+
+ return state;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index a7a5ac1e83b..304c7f4e0fb 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -59,6 +59,7 @@
#include "access/heaptoast.h"
#include "catalog/pg_type.h"
#include "commands/sequence.h"
+#include "executor/execBatch.h"
#include "executor/execExpr.h"
#include "executor/nodeSubplan.h"
#include "funcapi.h"
@@ -466,6 +467,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
TupleTableSlot *scanslot;
TupleTableSlot *oldslot;
TupleTableSlot *newslot;
+ TupleBatch *scanbatch;
/*
* This array has to be in the same order as enum ExprEvalOp.
@@ -592,6 +594,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
+ &&CASE_EEOP_SCAN_FETCHSOME_BATCH,
+ &&CASE_EEOP_QUAL_BATCH_INITMASK,
+ &&CASE_EEOP_QUAL_BATCH_TERM,
&&CASE_EEOP_LAST
};
@@ -612,6 +617,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
scanslot = econtext->ecxt_scantuple;
oldslot = econtext->ecxt_oldtuple;
newslot = econtext->ecxt_newtuple;
+ scanbatch = econtext->scan_batch;
#if defined(EEO_USE_COMPUTED_GOTO)
EEO_DISPATCH();
@@ -2265,6 +2271,28 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, scanslot);
+
+ Assert(scanbatch);
+ slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_QUAL_BATCH_INITMASK)
+ {
+ ExecQualBatchInitMask(state, op, econtext);
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_QUAL_BATCH_TERM)
+ {
+ ExecQualBatchTerm(state, op, econtext);
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -5914,3 +5942,199 @@ ExecAggPlainTransByRef(AggState *aggstate, AggStatePerTrans pertrans,
MemoryContextSwitchTo(oldContext);
}
+
+/* set mask bits [0..nvalid_bits) to 1; clear padding in the last word */
+static inline void
+mask_init_all_ones(uint64 *a, int nwords, int nvalid_bits)
+{
+ for (int i = 0; i < nwords; i++)
+ a[i] = ~UINT64CONST(0);
+
+ if ((nvalid_bits & 63) != 0)
+ {
+ int rem = nvalid_bits & 63;
+
+ a[nwords - 1] &= (~UINT64CONST(0)) >> (64 - rem);
+ }
+}
+
+static inline void
+mask_clear_bit(uint64 *a, int i)
+{
+ a[i >> 6] &= ~(UINT64CONST(1) << (i & 63));
+}
+
+static inline bool
+mask_is_empty(const uint64 *mask, int nwords)
+{
+ for (int i = 0; i < nwords; i++)
+ {
+ if (mask[i] != 0)
+ return false;
+ }
+ return true;
+}
+
+void
+ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ TupleBatch *b = econtext->scan_batch;
+ uint64 *mask = op->d.qualbatch_init.mask;
+ int nwords = op->d.qualbatch_init.mask_words;
+ int n = b->ntuples;
+
+ /* initialize to all-pass for current batch size */
+ mask_init_all_ones(mask, nwords, n);
+}
+
+void
+ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ BatchQualRuntime *rt = ExecGetBatchQualRuntime(state);
+ TupleBatch *b = econtext->scan_batch;
+ TupleTableSlot **slots = b->activeslots;
+ uint64 *mask = rt->mask;
+ int mask_words = rt->mask_words;
+ BatchQualTerm *t = op->d.qualbatch_term.term;
+ int n = b->ntuples;
+
+ /* Early exit if no rows remain */
+ if (mask_is_empty(mask, mask_words))
+ return;
+
+ switch (t->kind)
+ {
+ case BQTK_IS_NULL:
+ {
+ /* keep bit set only if value IS NULL; clear otherwise */
+ for (int i = 0; i < n; i++)
+ {
+ if (!slots[i]->tts_isnull[t->l_attno-1])
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_IS_NOT_NULL:
+ {
+ /* keep bit set only if value IS NOT NULL; clear if NULL */
+ for (int i = 0; i < n; i++)
+ {
+ if (slots[i]->tts_isnull[t->l_attno-1])
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_VAR_CONST:
+ {
+ const bool r_isnull = t->r_isnull;
+ const Datum r_const = t->r_const;
+ const bool strict = t->strict;
+ const Oid coll = t->collation;
+ FmgrInfo *finfo = t->finfo;
+
+ for (int i = 0; i < n; i++)
+ {
+ bool ln = slots[i]->tts_isnull[t->l_attno-1];
+ bool pass;
+
+ /* WHERE treats NULL as false; strict ops short-circuit */
+ if (strict && (ln || r_isnull))
+ pass = false;
+ else
+ {
+ Datum lv = slots[i]->tts_values[t->l_attno-1];
+
+ pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, r_const));
+ }
+
+ if (!pass)
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_VAR_VAR:
+ {
+ const bool strict = t->strict;
+ const Oid coll = t->collation;
+ FmgrInfo *finfo = t->finfo;
+
+ for (int i = 0; i < n; i++)
+ {
+ bool ln = slots[i]->tts_isnull[t->l_attno-1];
+ bool rn = slots[i]->tts_isnull[t->r_attno-1];
+ bool pass;
+
+ if (strict && (ln || rn))
+ pass = false;
+ else
+ {
+ Datum lv = slots[i]->tts_values[t->l_attno-1];
+ Datum rv = slots[i]->tts_values[t->r_attno-1];
+
+ pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, rv));
+ }
+
+ if (!pass)
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ default:
+ /* should not happen; leave mask unchanged */
+ break;
+ }
+}
+
+/*
+ * ExecQualBatch
+ * Evaluate a batched qual over all rows in a TupleBatch.
+ *
+ * Runs the EEOP program built by ExecInitQualBatch, which produces a bitmask
+ * indicating which rows pass the qual. Rows that pass are copied to the
+ * batch's output slots (b->outslots).
+ *
+ * Returns the number of qualifying rows. The caller should then call
+ * TupleBatchUseOutput(b, qualified) to switch the batch to return from
+ * outslots.
+ *
+ * The batch must be materialized (slots populated) before calling this.
+ */
+int
+ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b)
+{
+ int i;
+ uint64 *mask;
+ int kept = 0;
+ BatchQualRuntime *rt = ExecGetBatchQualRuntime(state);
+
+ /* verify that expression was compiled using ExecInitQualBatch */
+ Assert(state->flags & EEO_FLAG_IS_QUAL);
+ Assert(rt && rt->mask && rt->mask_words);
+
+ /* run the batched EEOP program once */
+ econtext->scan_batch = b;
+ ExecEvalExprNoReturn(state, econtext);
+
+ mask = rt->mask;
+ if (mask_is_empty(mask, rt->mask_words))
+ return 0;
+
+ /* Add survivors into outslots */
+ TupleBatchRewind(b);
+ i = 0;
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* mask bit set => row survives */
+ if (mask[i >> 6] & (UINT64CONST(1) << (i & 63)))
+ TupleBatchStoreInOut(b, kept++, slot);
+ i++;
+ }
+
+ return kept;
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index b768eae9e53..5082d8ecd3b 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -2111,6 +2111,38 @@ slot_getsomeattrs_int(TupleTableSlot *slot, int attnum)
}
}
+void
+slot_getsomeattrs_batch(struct TupleBatch *b, int attnum)
+{
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* Check for caller errors */
+ Assert(attnum > 0);
+
+ if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
+ elog(ERROR, "invalid attribute number %d", attnum);
+
+ /* XXX - there should perhaps also be a batch-level att_nvalid */
+ if (attnum < slot->tts_nvalid)
+ continue;
+
+ /* Fetch as many attributes as possible from the underlying tuple. */
+ slot->tts_ops->getsomeattrs(slot, attnum);
+
+ /*
+ * If the underlying tuple doesn't have enough attributes, tuple
+ * descriptor must have the missing attributes.
+ */
+ if (unlikely(slot->tts_nvalid < attnum))
+ {
+ slot_getmissingattrs(slot, slot->tts_nvalid, attnum);
+ slot->tts_nvalid = attnum;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecTypeFromTL
*
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index f36b31d4fbb..16f15ed68aa 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -281,6 +281,28 @@ ExecSeqScanBatchSlot(PlanState *pstate)
NULL, NULL);
}
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithBatchQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+ TupleBatch *b = pstate->ps_Batch;
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtended().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual_batch != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ if (!TupleBatchHasMore(b))
+ b = ExecScanExtendedBatch(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual_batch, NULL);
+
+ return b ? TupleBatchGetNextSlot(b) : NULL;
+}
+
static TupleTableSlot *
ExecSeqScanBatchSlotWithQual(PlanState *pstate)
{
@@ -344,6 +366,7 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
bool track_stats = estate->es_instrument && (estate->es_instrument & INSTRUMENT_BATCHES);
scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap, track_stats);
+ scanstate->ss.ps.qual_batch = ExecInitQualBatch((PlanState *) scanstate);
/* Choose batch variant to preserve your specialization matrix */
if (scanstate->ss.ps.qual == NULL)
@@ -361,7 +384,10 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
{
- scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ if (scanstate->ss.ps.qual_batch == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithBatchQual;
}
else
{
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 650f1d42a93..847f265df3b 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -109,6 +109,9 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_newslot;
LLVMValueRef v_resultslot;
+ /* batches */
+ LLVMValueRef v_scanbatch;
+
/* nulls/values of slots */
LLVMValueRef v_innervalues;
LLVMValueRef v_innernulls;
@@ -221,6 +224,11 @@ llvm_compile_expr(ExprState *state)
v_state,
FIELDNO_EXPRSTATE_RESULTSLOT,
"v_resultslot");
+ v_scanbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_SCANBATCH,
+ "v_scanbatch");
/* build global values/isnull pointers */
v_scanvalues = l_load_struct_gep(b,
@@ -2940,6 +2948,33 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_SCAN_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_scanbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
+ case EEOP_QUAL_BATCH_INITMASK:
+ build_EvalXFunc(b, mod, "ExecQualBatchInitMask",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ case EEOP_QUAL_BATCH_TERM:
+ build_EvalXFunc(b, mod, "ExecQualBatchTerm",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 4636b90cd0f..5ba9920f3fd 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -179,7 +179,10 @@ void *referenced_functions[] =
MakeExpandedObjectReadOnlyInternal,
slot_getmissingattrs,
slot_getsomeattrs_int,
+ slot_getsomeattrs_batch,
strlen,
varsize_any,
ExecInterpExprStillValid,
+ ExecQualBatchInitMask,
+ ExecQualBatchTerm,
};
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index aa9b361fa31..2672d2674cc 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -292,11 +292,29 @@ typedef enum ExprEvalOp
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
+ /*
+ * Batched qual evaluation opcodes
+ *
+ * These opcodes implement batch-mode qual evaluation where an entire
+ * TupleBatch is processed at once rather than tuple-by-tuple.
+ *
+ * EEOP_SCAN_FETCHSOME_BATCH: Call slot_getsomeattrs() on all slots in
+ * the batch to ensure needed attributes are deformed.
+ *
+ * EEOP_QUAL_BATCH_INITMASK: Initialize the result bitmask to all-ones
+ * (all rows initially pass).
+ *
+ * EEOP_QUAL_BATCH_TERM: Evaluate one qual leaf (NullTest or OpExpr) over
+ * all rows, clearing mask bits for rows that fail.
+ */
+ EEOP_SCAN_FETCHSOME_BATCH,
+ EEOP_QUAL_BATCH_INITMASK,
+ EEOP_QUAL_BATCH_TERM,
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
-
typedef struct ExprEvalStep
{
/*
@@ -331,6 +349,12 @@ typedef struct ExprEvalStep
const TupleTableSlotOps *kind;
} fetch;
+ struct
+ {
+ /* attribute number up to which to fetch (inclusive) */
+ int last_var;
+ } fetch_batch;
+
/* for EEOP_INNER/OUTER/SCAN/OLD/NEW_[SYS]VAR */
struct
{
@@ -769,6 +793,17 @@ typedef struct ExprEvalStep
void *json_coercion_cache;
ErrorSaveContext *escontext;
} jsonexpr_coercion;
+
+ struct
+ {
+ uint64 *mask; /* shared mask buffer for this program */
+ int mask_words; /* ceil(es_max_batch/64) */
+ } qualbatch_init; /* EEOP_QUAL_BATCH_INITMASK */
+
+ struct
+ {
+ struct BatchQualTerm *term; /* compiled leaf */
+ } qualbatch_term; /* EEOP_QUAL_BATCH_TERM */
} d;
} ExprEvalStep;
@@ -917,4 +952,51 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
ExprContext *econtext);
+/* See ExecQualBatchTerm(). */
+typedef enum BatchQualTermKind
+{
+ BQTK_VAR_CONST,
+ BQTK_VAR_VAR,
+ BQTK_IS_NULL,
+ BQTK_IS_NOT_NULL,
+} BatchQualTermKind;
+
+typedef struct BatchQualTerm
+{
+ BatchQualTermKind kind;
+ bool strict; /* follow strict NULL semantics if true */
+ AttrNumber l_attno; /* left VAR column */
+ AttrNumber r_attno; /* right VAR column, or -1 if Const */
+ Datum r_const; /* for VAR_CONST */
+ bool r_isnull; /* for VAR_CONST */
+ FmgrInfo *finfo; /* fmgr for generic binary ops */
+ Oid collation; /* op collation */
+} BatchQualTerm;
+
+/*
+ * BatchQualRuntime - execution state for batched qual evaluation
+ *
+ * Attached to ExprState.batch_private for the batched qual program.
+ * Contains the bitmask that tracks which rows pass the qual (bit set = pass),
+ * and references to the BatchVector for EEOP_QUAL_BATCH_TERM to use.
+ *
+ * The mask uses standard bit operations: word = i/64, bit = i%64.
+ * Initialized to all-ones by EEOP_QUAL_BATCH_INITMASK, then each
+ * EEOP_QUAL_BATCH_TERM clears bits for failing rows.
+ */
+typedef struct BatchQualRuntime
+{
+ uint64 *mask;
+ int mask_words;
+} BatchQualRuntime;
+
+static inline BatchQualRuntime *
+ExecGetBatchQualRuntime(ExprState *batch_qual)
+{
+ return (BatchQualRuntime *) batch_qual->batch_private;
+}
+
+extern void ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index d9185331e22..008780ea230 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -320,4 +320,50 @@ ExecScanExtendedBatchSlot(ScanState *node,
}
}
+/*
+ * ExecScanExtendedBatch
+ * Batch-driven scan with batched qual evaluation.
+ *
+ * Unlike ExecScanExtendedBatchSlot which evaluates quals tuple-at-a-time,
+ * this function uses ExecQualBatch() to evaluate the entire batch at once
+ * using a bitmask. Qualifying tuples are collected into b->outslots.
+ *
+ * Returns the TupleBatch with nvalid set to the number of qualifying rows,
+ * or NULL at end-of-scan. Caller iterates b->outslots[0..nvalid-1].
+ *
+ * Note: EPQ is not supported; projection is not yet implemented.
+ */
+static inline TupleBatch *
+ExecScanExtendedBatch(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual_batch, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+ int qualified;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next batch from the AM */
+ if (!accessBatchMtd(node))
+ return NULL;
+
+ ResetExprContext(econtext);
+ qualified = ExecQualBatch(qual_batch, econtext, b);
+ InstrCountFiltered1(node, b->nvalid - qualified);
+ /* Update count and start using b->outslots. */
+ TupleBatchUseOutput(b, qualified);
+
+ if (qualified > 0)
+ return b;
+ /* else get the next batch from the AM */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index e82fd6c0c8a..8cded15dec6 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -326,6 +326,7 @@ ExecProcNode(PlanState *node)
extern ExprState *ExecInitExpr(Expr *node, PlanState *parent);
extern ExprState *ExecInitExprWithParams(Expr *node, ParamListInfo ext_params);
extern ExprState *ExecInitQual(List *qual, PlanState *parent);
+extern ExprState *ExecInitQualBatch(PlanState *ps);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
@@ -553,6 +554,8 @@ ExecQualAndReset(ExprState *state, ExprContext *econtext)
}
#endif
+extern int ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b);
+
extern bool ExecCheck(ExprState *state, ExprContext *econtext);
/*
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index a2dfd707e78..b06be83b141 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -346,6 +346,8 @@ extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot);
extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
int lastAttNum);
extern void slot_getsomeattrs_int(TupleTableSlot *slot, int attnum);
+struct TupleBatch;
+extern void slot_getsomeattrs_batch(struct TupleBatch *b, int attnum);
#ifndef FRONTEND
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6a191202ced..c79ee965372 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -148,6 +148,9 @@ typedef struct ExprState
* ExecInitExprRec().
*/
ErrorSaveContext *escontext;
+
+ /* batched-program runtime (e.g., BatchQualRuntime) */
+ void *batch_private;
} ExprState;
@@ -314,6 +317,10 @@ typedef struct ExprContext
#define FIELDNO_EXPRCONTEXT_NEWTUPLE 15
TupleTableSlot *ecxt_newtuple;
+ /* For batched evaluation using batch-aware EEOPs */
+#define FIELDNO_EXPRCONTEXT_SCANBATCH 16
+ TupleBatch *scan_batch;
+
/* Link to containing EState (NULL if a standalone ExprContext) */
struct EState *ecxt_estate;
@@ -1186,7 +1193,9 @@ typedef struct PlanState
* state trees parallel links in the associated plan tree (except for the
* subPlan list, which does not exist in the plan tree).
*/
- ExprState *qual; /* boolean qual condition */
+ ExprState *qual; /* boolean qual condition (per tuple) */
+ ExprState *qual_batch; /* batched qual program, NULL if qual not
+ * batchable */
PlanState *lefttree; /* input plan tree(s) */
PlanState *righttree;
--
2.47.3
[application/octet-stream] v5-0005-WIP-Use-dedicated-interpreter-for-batched-qual-ev.patch (5.9K, 6-v5-0005-WIP-Use-dedicated-interpreter-for-batched-qual-ev.patch)
download | inline diff:
From 4916a0891b2e7176dee3c2a3a8018a4d174dd373 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 29 Jan 2026 05:03:55 +0900
Subject: [PATCH v5 5/5] WIP: Use dedicated interpreter for batched qual
evaluation
Move batch-related opcodes (EEOP_SCAN_FETCHSOME_BATCH,
EEOP_QUAL_BATCH_INITMASK, EEOP_QUAL_BATCH_TERM) out of the main
ExecInterpExpr switch and into a dedicated ExecInterpQualBatch
function.
Adding opcodes to ExecInterpExpr may affect performance even when
they are not executed, possibly due to changes in register allocation,
jump table layout, or code size. Use a separate interpreter to avoid
any risk of impacting the existing per-tuple evaluation path.
The batched qual program has a simple linear structure (fetch ->
initmask -> term* -> done) that doesn't need computed goto dispatch
anyway.
---
src/backend/executor/execExprInterp.c | 72 +++++++++++++++++----------
src/backend/executor/nodeSeqscan.c | 6 +--
2 files changed, 46 insertions(+), 32 deletions(-)
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 304c7f4e0fb..04a40ec932c 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -189,6 +189,8 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
int setno);
static char *ExecGetJsonValueItemString(JsonbValue *item, bool *resnull);
+static Datum ExecInterpQualBatch(ExprState *state, ExprContext *econtext);
+
/*
* ScalarArrayOpExprHashEntry
* Hash table entry type used during EEOP_HASHED_SCALARARRAYOP
@@ -266,6 +268,12 @@ ExecReadyInterpretedExpr(ExprState *state)
*/
state->evalfunc = ExecInterpExprStillValid;
+ if (state->batch_private)
+ {
+ state->evalfunc_private = (void *) ExecInterpQualBatch;
+ return;
+ }
+
/* DIRECT_THREADED should not already be set */
Assert((state->flags & EEO_FLAG_DIRECT_THREADED) == 0);
@@ -467,7 +475,6 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
TupleTableSlot *scanslot;
TupleTableSlot *oldslot;
TupleTableSlot *newslot;
- TupleBatch *scanbatch;
/*
* This array has to be in the same order as enum ExprEvalOp.
@@ -594,9 +601,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
- &&CASE_EEOP_SCAN_FETCHSOME_BATCH,
- &&CASE_EEOP_QUAL_BATCH_INITMASK,
- &&CASE_EEOP_QUAL_BATCH_TERM,
+ &&CASE_EEOP_BATCH_UNREACHABLE, /* EEOP_SCAN_FETCHSOME_BATCH */
+ &&CASE_EEOP_BATCH_UNREACHABLE, /* EEOP_QUAL_BATCH_INITMASK */
+ &&CASE_EEOP_BATCH_UNREACHABLE, /* EEOP_QUAL_BATCH_TERM */
&&CASE_EEOP_LAST
};
@@ -617,7 +624,6 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
scanslot = econtext->ecxt_scantuple;
oldslot = econtext->ecxt_oldtuple;
newslot = econtext->ecxt_newtuple;
- scanbatch = econtext->scan_batch;
#if defined(EEO_USE_COMPUTED_GOTO)
EEO_DISPATCH();
@@ -2271,34 +2277,18 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
- EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
- {
- CheckOpSlotCompatibility(op, scanslot);
-
- Assert(scanbatch);
- slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
-
- EEO_NEXT();
- }
-
- EEO_CASE(EEOP_QUAL_BATCH_INITMASK)
- {
- ExecQualBatchInitMask(state, op, econtext);
- EEO_NEXT();
- }
-
- EEO_CASE(EEOP_QUAL_BATCH_TERM)
- {
- ExecQualBatchTerm(state, op, econtext);
- EEO_NEXT();
- }
-
EEO_CASE(EEOP_LAST)
{
/* unreachable */
Assert(false);
goto out_error;
}
+
+ EEO_CASE(EEOP_BATCH_UNREACHABLE)
+ {
+ Assert(false && "batch opcodes use dedicated interpreter");
+ pg_unreachable();
+ }
}
out_error:
@@ -6089,6 +6079,34 @@ ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
}
}
+static Datum
+ExecInterpQualBatch(ExprState *state, ExprContext *econtext)
+{
+ ExprEvalStep *op = state->steps;
+ TupleBatch *scanbatch = econtext->scan_batch;
+
+ /* Step 1: fetch/deform all slots */
+ Assert(ExecEvalStepOp(state, op) == EEOP_SCAN_FETCHSOME_BATCH);
+ slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+ op++;
+
+ /* Step 2: initialize mask */
+ Assert(ExecEvalStepOp(state, op) == EEOP_QUAL_BATCH_INITMASK);
+ ExecQualBatchInitMask(state, op, econtext);
+ op++;
+
+ /* Step 3: process all TERM steps */
+ while (ExecEvalStepOp(state, op) == EEOP_QUAL_BATCH_TERM)
+ {
+ ExecQualBatchTerm(state, op, econtext);
+ op++;
+ }
+
+ Assert(ExecEvalStepOp(state, op) == EEOP_DONE_NO_RETURN);
+
+ return (Datum) 0;
+}
+
/*
* ExecQualBatch
* Evaluate a batched qual over all rows in a TupleBatch.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 16f15ed68aa..4a76108bd2f 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -404,7 +404,6 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
- bool use_batching;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -435,12 +434,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
node->scan.scanrelid,
eflags);
- use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
-
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
RelationGetDescr(scanstate->ss.ss_currentRelation),
- use_batching ? &TTSOpsHeapTuple :
table_slot_callbacks(scanstate->ss.ss_currentRelation));
/*
@@ -477,7 +473,7 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
- if (use_batching)
+ if (ScanCanUseBatching(&scanstate->ss, eflags))
SeqScanInitBatching(scanstate, eflags);
return scanstate;
--
2.47.3
view thread (9+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected]
Subject: Re: Batching in executor
In-Reply-To: <CA+HiwqH-2GmTKLW9kHdnqV4KdFiPfuAdVK2TgqOM2JaaeUYXnw@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox